[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844308#comment-16844308
 ] 

Tan, Wangda commented on YARN-4946:
-----------------------------------

Thanks [~snemeth], [~ccondit] for commenting. 

The question I was trying to answer is: should we backport this patch to 
older releases? 

After digging into the details, I'm wondering whether we should do this at 
all. YARN-7952 should solve part of the problem: log aggregation status is 
saved on the NM as well. So the only issue this Jira could solve is: if the 
number of apps grows beyond the configured ZK state store limit, we will keep 
the apps whose log aggregation is not finished yet. I agree with what 
[~ccondit] mentioned: this exception (keeping the app in the state store) 
seems safe. However, if something bad happens, such as a log aggregation bug 
or slowness of the HDFS cluster used for log aggregation, it will bring down 
the RM.
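
To make the concern concrete, here is a rough sketch of the trade-off. It is 
not the actual RM code: the class, the enum (a simplified stand-in for the 
real log aggregation status) and the evict() method are all hypothetical 
names made up for illustration. It only shows how an exception for apps 
whose log aggregation is not terminal can let the completed-app list stay 
above the configured limit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CompletedAppEvictionSketch {
    // Hypothetical, simplified stand-in for the log aggregation status.
    enum LogAggregationStatus {
        RUNNING, SUCCEEDED, FAILED, TIME_OUT;

        boolean isTerminal() {
            return this != RUNNING;
        }
    }

    // Hypothetical record of a completed app kept in the state store.
    static final class CompletedApp {
        final String id;
        final LogAggregationStatus status;

        CompletedApp(String id, LogAggregationStatus status) {
            this.id = id;
            this.status = status;
        }
    }

    /**
     * Removes the oldest completed apps until at most maxApps remain.
     * If keepWhileAggregating is true, apps whose log aggregation is not
     * in a terminal state are skipped, so the list may stay above the limit.
     * Returns the ids of the removed apps.
     */
    static List<String> evict(List<CompletedApp> oldestFirst, int maxApps,
                              boolean keepWhileAggregating) {
        List<String> removed = new ArrayList<>();
        int size = oldestFirst.size();
        Iterator<CompletedApp> it = oldestFirst.iterator();
        while (it.hasNext() && size > maxApps) {
            CompletedApp app = it.next();
            if (keepWhileAggregating && !app.status.isTerminal()) {
                continue; // the exception discussed above: keep this app
            }
            it.remove();
            removed.add(app.id);
            size--;
        }
        return removed;
    }

    public static void main(String[] args) {
        // Assumed failure scenario: log aggregation is stuck for every app.
        List<CompletedApp> stuck = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            stuck.add(new CompletedApp("app_" + i, LogAggregationStatus.RUNNING));
        }
        List<String> removed = evict(stuck, 1, true);
        System.out.println("removed=" + removed + ", still stored=" + stuck.size());
    }
}
```

With the exception enabled and aggregation stuck, nothing is evicted and the 
store keeps growing past the limit. Calling evict() with 
keepWhileAggregating=false on the same input trims the list back to the 
limit regardless of aggregation status.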

My understanding of this problem is: if RM recovery is enabled (I believe it 
is on most prod clusters), by the time an app is removed from the state store 
there should already have been a long time/buffer for log aggregation. If log 
aggregation is still not finished by then, we should still remove the app 
from the RM state store and move on.

The description of the Jira: 
{quote}When the RM "forgets" about an older completed Application (e.g. RM 
failover, enough time has passed, etc), the tool won't find the Application in 
the RM and will just assume that its log aggregation succeeded, even if it 
actually failed or is still running.
{quote}
This seems to be the right behavior when completed apps are forgotten by the 
RM. 

Thoughts?

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

