[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844308#comment-16844308 ]
Tan, Wangda commented on YARN-4946:
-----------------------------------

Thanks [~snemeth], [~ccondit] for commenting.

The question I was trying to answer is: should we backport this patch to older releases? After digging into the details, I'm wondering whether we should do this at all.

YARN-7952 should solve part of the problem: log aggregation status is saved on the NM as well. So the only issue this Jira could solve is: if the number of apps grows beyond the configured ZK state store limits, we will keep the apps whose log aggregation has not finished yet.

I agree with what [~ccondit] mentioned: this exception (to keep the app in the state store) seems safe. However, if something bad happens, like a log aggregation bug or slowness of the HDFS cluster used for log aggregation, it will bring down the RM.

My understanding of this problem is: if RM recovery is enabled (I believe it is on most prod clusters), a long time has passed by the time an app is removed from the state store, which should be enough of a buffer for log aggregation. If log aggregation is still not finished at that point, we should still remove the app from the RM state store and move on.

The description of the Jira:
{quote}When the RM "forgets" about an older completed Application (e.g. RM failover, enough time has passed, etc), the tool won't find the Application in the RM and will just assume that its log aggregation succeeded, even if it actually failed or is still running.
{quote}
This seems like the right behavior when completed apps are forgotten by the RM.

Thoughts?
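To make the trade-off above concrete, here is a minimal, self-contained Java sketch of the two removal policies being compared. It is not the YARN-4946 patch and does not use any Hadoop classes: the class, the enum, and the `selectForRemoval` method are hypothetical names invented for illustration, and the status values only mirror the terminal states named in the Jira description (SUCCEEDED, FAILED, TIME_OUT).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not RM code: models which completed apps get dropped
// from a state store once a configured limit is exceeded.
class CompletedAppEvictionSketch {

    // Assumed, simplified status set; terminal states follow the Jira description.
    enum AggStatus {
        NOT_START, RUNNING, SUCCEEDED, FAILED, TIME_OUT;

        boolean isTerminal() {
            return this == SUCCEEDED || this == FAILED || this == TIME_OUT;
        }
    }

    static final class App {
        final String id;
        final AggStatus status;

        App(String id, AggStatus status) {
            this.id = id;
            this.status = status;
        }
    }

    /**
     * Picks the apps to remove so the store shrinks toward maxCompleted.
     * With waitForAggregation = true, apps whose log aggregation is not in a
     * terminal state are skipped, so the store can stay above the limit.
     */
    static List<String> selectForRemoval(List<App> completedOldestFirst,
                                         int maxCompleted,
                                         boolean waitForAggregation) {
        int excess = completedOldestFirst.size() - maxCompleted;
        List<String> removed = new ArrayList<>();
        for (App app : completedOldestFirst) {
            if (removed.size() >= excess) {
                break; // back under the limit (or never over it)
            }
            if (waitForAggregation && !app.status.isTerminal()) {
                continue; // kept in the store despite the limit
            }
            removed.add(app.id);
        }
        return removed;
    }
}
```

With `waitForAggregation` set to true and every excess app stuck in RUNNING, the method returns an empty list, so nothing is removed and the store keeps growing past the limit. That is the failure mode raised above. With it set to false, the oldest apps are removed regardless of aggregation status.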
> RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch,
>                      YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each
> Yarn App into a HAR file. When run, it seeds the list by looking at the
> aggregated logs directory, and then filters out ineligible apps. One of the
> criteria involves checking with the RM that an Application's log aggregation
> status is not still running and has not failed. When the RM "forgets" about
> an older completed Application (e.g. RM failover, enough time has passed,
> etc), the tool won't find the Application in the RM and will just assume that
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed
> from its history) until the aggregation status has reached a terminal state
> (e.g. SUCCEEDED, FAILED, TIME_OUT).