[ https://issues.apache.org/jira/browse/YARN-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hu Ziqian updated YARN-9947:
----------------------------
    Attachment: YARN-9947.001.patch

> lazy init appLogAggregatorImpl when log aggregation
> ---------------------------------------------------
>
>                 Key: YARN-9947
>                 URL: https://issues.apache.org/jira/browse/YARN-9947
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 3.1.3
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Major
>         Attachments: YARN-9947.001.patch
>
>
> This issue introduces a way to lazily initialize AppLogAggregatorImpl so that it accesses HDFS as late as possible (usually when the app finishes). This avoids all NodeManagers accessing HDFS at the same time when the NMs in a cluster are restarted, and so reduces pressure on HDFS. The details are below.
>
> In the current version, the app log aggregator checks HDFS and tries to create the app's remote log directory when the app is initialized. This causes a problem when restarting NMs in a large cluster backed by a heavily loaded HDFS: restarting an NM re-initializes every app on that NM, and the NM then tries to connect to HDFS. If HDFS is heavily loaded and many NMs restart at the same time, HDFS stops responding. Each NM blocks waiting for HDFS's response, so the RM stops receiving NM heartbeats and treats all containers as timed out.
>
> In our production environment with 3500+ NMs, we found that restarting NMs puts heavy pressure on HDFS and that the init-app operation blocks on HDFS access (stack trace attached below), which causes all containers to fail (the container count on an NM drops to zero).
>
> !https://teambition-file.alibaba-inc.com/storage/011mcaf1aebf84f02a5d2c2c5fa85af80f5b?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW1jYWYxYWViZjg0ZjAyYTVkMmMyYzVmYTg1YWY4MGY1YiJ9.JJQoQvjWdAQItQkjtdxa1SnkqJWuij_w2xq2Unoaktg!
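The lazy-initialization change described above can be sketched as follows. This is a hypothetical simplification, not the actual Hadoop source: only verifyAndCreateRemoteLogDir() is a method name taken from the issue description, and the class, fields, and counter are stand-ins added for illustration (the counter replaces the real HDFS check/create call so the deferral is observable).

```java
// Sketch of deferring the remote-log-dir check from app init to the
// start of log aggregation. Names other than verifyAndCreateRemoteLogDir()
// are illustrative assumptions, not Hadoop's real API.
public class LazyAppLogAggregator {
    private boolean remoteDirInitialized = false;
    private int hdfsAccessCount = 0; // stand-in for real HDFS calls

    /** Called when the NM (re)initializes the app: no HDFS access here.
     *  Before this change, verifyAndCreateRemoteLogDir() ran at this
     *  point, so every app init during an NM restart hit HDFS. */
    public void initApp() {
        // intentionally does not touch HDFS
    }

    /** Called when log aggregation actually starts (usually when the
     *  app finishes), so the first HDFS access happens here instead. */
    public void startLogAggregation() {
        if (!remoteDirInitialized) {
            verifyAndCreateRemoteLogDir();
            remoteDirInitialized = true;
        }
        // ... upload container logs ...
    }

    private void verifyAndCreateRemoteLogDir() {
        // in the real code: check/create the app's remote log dir on HDFS
        hdfsAccessCount++;
    }

    public int getHdfsAccessCount() {
        return hdfsAccessCount;
    }
}
```

Because apps finish at different times, these deferred verifyAndCreateRemoteLogDir() calls are naturally spread out, instead of being concentrated at the moment all NMs restart.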
> !https://teambition-file.alibaba-inc.com/storage/011m873079212ee7fe507ddbe163a0c07fb1?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW04NzMwNzkyMTJlZTdmZTUwN2RkYmUxNjNhMGMwN2ZiMSJ9.kH73n6bdx8ETXsrWcBGgXGay2WP3z9nzuDlE8-RvQzs!
>
> We solve this problem by introducing lazy initialization in AppLogAggregatorImpl. When an app is initialized, we only create the AppLogAggregatorImpl object, without calling verifyAndCreateRemoteLogDir(). We call verifyAndCreateRemoteLogDir() when the app actually starts aggregating logs. Because apps generally do not all finish, and therefore do not all aggregate logs, at the same time, the verifyAndCreateRemoteLogDir() calls are spread out over time, which means the NMs do not all access HDFS at once even when they restart at the same time.
>
> YARN-8418 solved the leaked container-log-directory problem by adding a way to update the NM's credentials. If we lazily initialize AppLogAggregatorImpl, YARN-8418's logic is no longer needed, because the lazy initialization happens after the addCredentials logic, so the credentials are always refreshed before they are used.
>
> In summary, this issue does two things:
> # Introduces lazy initialization in AppLogAggregatorImpl to avoid concentrated HDFS access when all NMs in a cluster are restarted.
> # Reverts YARN-8418, because the lazy initialization logic guarantees the credentials are refreshed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org