[ https://issues.apache.org/jira/browse/YARN-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570805#comment-16570805 ]

Jason Lowe commented on YARN-8609:
----------------------------------

bq. As far as I know, there are two kinds of diagnostics info, one is fixed 
string, such as "Container is killed before being launched.\n", the other is 
exception message which may be very large, so I think we should just truncate 
exception message rather than the entire string made by for loop.

There should be only one way to store/update a container's diagnostics for 
recovery, and that's NMStateStoreService#storeContainerDiagnostics.  That 
method does not append but replaces the diagnostics.  The only caller of that 
method is ContainerImpl#addDiagnostics, which after YARN-3998 trims the 
diagnostics to the maximum configured length, keeping the most recently added 
characters.  The for loop just adds all the messages, since the method is 
implemented with variable arguments.  The most memory this method could take is 
diagnosticsMaxSize + size_of_new_diagnostics, which is then truncated to 
diagnosticsMaxSize at the end.  It will not persist diagnostics beyond 
diagnosticsMaxSize, either in memory or in the state store.
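For illustration, the append-then-trim behavior described above can be sketched roughly as follows. This is a minimal sketch, not the actual ContainerImpl code; the class name, method name, and the tiny limit are hypothetical, but it follows the same pattern: varargs messages are appended in a loop, then the buffer is trimmed to the configured maximum, keeping the most recently added characters.

```java
// Hedged sketch of the append-then-trim semantics (names are illustrative,
// not the real ContainerImpl/NMStateStoreService API).
public class DiagnosticsTrimSketch {
    // Illustrative limit; the real configured maximum is far larger.
    static final int DIAGNOSTICS_MAX_SIZE = 10;

    // Append new messages (the "for loop" over varargs), then keep only the
    // most recently added characters up to the configured limit.  Peak memory
    // is existing-size + new-message-size before the trim at the end.
    static String appendDiagnostics(String existing, String... messages) {
        StringBuilder sb = new StringBuilder(existing);
        for (String m : messages) {
            sb.append(m);
        }
        if (sb.length() > DIAGNOSTICS_MAX_SIZE) {
            // Drop the oldest characters from the front, keeping the tail.
            sb.delete(0, sb.length() - DIAGNOSTICS_MAX_SIZE);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 8 existing + 6 new = 14 chars, trimmed to the last 10.
        System.out.println(appendDiagnostics("abcdefgh", "ijklmn"));
    }
}
```

Whatever string this produces is what would be handed to the state store, so nothing beyond the limit can be persisted.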

If you're not running with YARN-3998 in your build then it appears the 
necessary changes are already addressed by that JIRA.  It certainly looks as if 
YARN-3998 should have addressed your issue when the limit is configured to the 
default or another reasonable value.  Are you running on a version that 
contains that change?  If so then I'm wondering how you were able to get a 27MB 
diagnostic message into the state store.


> NM oom because of large container statuses
> ------------------------------------------
>
>                 Key: YARN-8609
>                 URL: https://issues.apache.org/jira/browse/YARN-8609
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Xianghao Lu
>            Priority: Major
>         Attachments: YARN-8609.001.patch, contain_status.jpg, oom.jpeg
>
>
> Sometimes, NodeManager will send large container statuses to ResourceManager 
> when NodeManager starts with recovery; as a result, NodeManager will fail to 
> start because of an OOM.
>  In my case, the container statuses totaled 135M across 11 container 
> statuses, and I found the diagnostics of 5 containers were very 
> large (27M), so I truncate the container diagnostics in the patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
