[ 
https://issues.apache.org/jira/browse/YARN-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Ziqian updated YARN-9947:
----------------------------
    Description: 
This issue introduce an method to lazy init 

 

In current version, app log aggregator will check HDFS and try to create log 
app when init an app. This cause a problem when restart NMs in a  large cluster 
with a heavy hdfs. Restart NM will init all app on a NM and the NM will try to 
connect HDFS. If the HDFS is heavily loaded, many NMs restart at same time will 
let the hdfs not respond. The NM will wait for HDFS's response and RM can't get 
NM's heartbeat and treat all containers as timeout.

In our product environment with 3500+ NMs, we find the NMs restart will put 
heavy pressure on HDFS and the init app's operation is blocked on accessing 
hdfs (stack attached blow), which causes all the  container failed (we can find 
the container number in one NM fall to zero).

!https://teambition-file.alibaba-inc.com/storage/011mcaf1aebf84f02a5d2c2c5fa85af80f5b?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW1jYWYxYWViZjg0ZjAyYTVkMmMyYzVmYTg1YWY4MGY1YiJ9.JJQoQvjWdAQItQkjtdxa1SnkqJWuij_w2xq2Unoaktg!

!https://teambition-file.alibaba-inc.com/storage/011m873079212ee7fe507ddbe163a0c07fb1?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW04NzMwNzkyMTJlZTdmZTUwN2RkYmUxNjNhMGMwN2ZiMSJ9.kH73n6bdx8ETXsrWcBGgXGay2WP3z9nzuDlE8-RvQzs!

 

  was:In current version, app log aggregator will check


> lazy init appLogAggregatorImpl when log aggregation
> ---------------------------------------------------
>
>                 Key: YARN-9947
>                 URL: https://issues.apache.org/jira/browse/YARN-9947
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 3.1.3
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Major
>
> This issue introduce an method to lazy init 
>  
> In current version, app log aggregator will check HDFS and try to create log 
> app when init an app. This cause a problem when restart NMs in a  large 
> cluster with a heavy hdfs. Restart NM will init all app on a NM and the NM 
> will try to connect HDFS. If the HDFS is heavily loaded, many NMs restart at 
> same time will let the hdfs not respond. The NM will wait for HDFS's response 
> and RM can't get NM's heartbeat and treat all containers as timeout.
> In our product environment with 3500+ NMs, we find the NMs restart will put 
> heavy pressure on HDFS and the init app's operation is blocked on accessing 
> hdfs (stack attached blow), which causes all the  container failed (we can 
> find the container number in one NM fall to zero).
> !https://teambition-file.alibaba-inc.com/storage/011mcaf1aebf84f02a5d2c2c5fa85af80f5b?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW1jYWYxYWViZjg0ZjAyYTVkMmMyYzVmYTg1YWY4MGY1YiJ9.JJQoQvjWdAQItQkjtdxa1SnkqJWuij_w2xq2Unoaktg!
> !https://teambition-file.alibaba-inc.com/storage/011m873079212ee7fe507ddbe163a0c07fb1?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW04NzMwNzkyMTJlZTdmZTUwN2RkYmUxNjNhMGMwN2ZiMSJ9.kH73n6bdx8ETXsrWcBGgXGay2WP3z9nzuDlE8-RvQzs!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to