[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555893#comment-16555893
 ] 

Robert Kanter commented on YARN-4946:
-------------------------------------

AFAIK, nothing has changed in this area.  However, I think the flag file is 
going to be a no-go.  I've gotten a _lot_ of pushback in the past when trying 
to have the RM write information to HDFS.  So I think we need to come up with a 
different approach.

The RM remembers X number of applications in order to save on memory and 
RMStateStore space.  This is controlled by 
{{yarn.resourcemanager.max-completed-applications}} and 
{{yarn.resourcemanager.state-store.max-completed-applications}}, respectively; 
and you usually would set them to the same value (in fact, I believe the 
state-store one is set to the other one by default).  For example, if set to 
1000, then when you run 1001 applications, the RM will forget the oldest 
application that is no longer running (i.e. completed, failed), so that it 
never remembers more than 1000 applications - that's what I mean by 
"forgetting."  Those applications can still be looked up in the JHS, Spark HS, etc.
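The retention behavior described above can be sketched as a bounded FIFO of completed apps.  This is a hypothetical illustration, not the RM's actual code (the real logic lives in RMAppManager), but the idea is the same:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: once more than maxCompletedApps applications have
// completed, the oldest completed app is "forgotten" by the RM.
class CompletedAppBuffer {
    // corresponds to yarn.resourcemanager.max-completed-applications
    private final int maxCompletedApps;
    private final Deque<String> completed = new ArrayDeque<>();

    CompletedAppBuffer(int maxCompletedApps) {
        this.maxCompletedApps = maxCompletedApps;
    }

    /** Record a completed app, evicting the oldest if over the limit. */
    void appCompleted(String appId) {
        completed.addLast(appId);
        if (completed.size() > maxCompletedApps) {
            completed.removeFirst();  // the RM "forgets" this app
        }
    }

    boolean remembers(String appId) {
        return completed.contains(appId);
    }
}
```

With the limit set to 1000, completing a 1001st application evicts the oldest completed one from the buffer.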

No need to do a failover or HA (though we should test that once at the end to 
be thorough).  You can test this with 
{{yarn.resourcemanager.max-completed-applications}} by setting it to a low 
value like 3 or something.  The RM should not remember more than 3 completed 
applications, so simply run 4 jobs, wait for them to complete, and you'll see 
it.
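For that test, the relevant yarn-site.xml override might look like this (values here are just the low test setting suggested above):

```xml
<!-- yarn-site.xml: keep only 3 completed apps in RM memory (test setting) -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>3</value>
</property>
<property>
  <!-- usually kept equal to the in-memory limit -->
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>3</value>
</property>
```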

The issue this JIRA is trying to solve is when you run the tool from 
MAPREDUCE-6415, if it can't find the App in the RM (because the RM forgot it) 
when getting the log aggregation status, it assumes that the aggregation 
completed successfully 
(https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-archive-logs/src/main/java/org/apache/hadoop/tools/HadoopArchiveLogs.java#L350).
  Assuming your cluster and job are working correctly, that's a good assumption, 
but if not, it'll be wrong.  IIRC, that's actually okay if log aggregation has 
reached a terminal state like succeeded or even failed; but is more of a 
problem if it's still in the middle of aggregating because we're going to 
process partial logs.  So I think we can leave that if we can ensure that the 
RM only forgets apps once they've reached a terminal log aggregation status.  
In other words, the RM shouldn't consider an App truly finished (and thus 
remove it from its history) until the aggregation status has reached a 
terminal state (i.e. DISABLED, SUCCEEDED, FAILED, TIME_OUT).  This should be a 
simpler fix and doesn't require writing anything to HDFS.
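A minimal sketch of the proposed gating condition (the status values here mirror the real {{org.apache.hadoop.yarn.api.records.LogAggregationStatus}} enum; the helper names are assumptions for illustration, not existing RM methods):

```java
// Local copy of the status values for a self-contained sketch; the real enum
// is org.apache.hadoop.yarn.api.records.LogAggregationStatus.
enum LogAggregationStatus {
    DISABLED, NOT_START, RUNNING, RUNNING_WITH_FAILURE,
    SUCCEEDED, FAILED, TIME_OUT
}

class AppRetentionGate {
    /** Terminal states after which it is safe for the RM to forget the app. */
    static boolean isTerminal(LogAggregationStatus s) {
        switch (s) {
            case DISABLED:
            case SUCCEEDED:
            case FAILED:
            case TIME_OUT:
                return true;
            default:
                return false;
        }
    }

    /** Proposed rule: only evict a completed app once aggregation is terminal. */
    static boolean mayForget(boolean appCompleted, LogAggregationStatus s) {
        return appCompleted && isTerminal(s);
    }
}
```

With this rule, an app whose aggregation is still RUNNING stays in the RM's history, so the MAPREDUCE-6415 tool's "not found means succeeded" assumption can never see partial logs.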

> RM should write out Aggregated Log Completion file flag next to logs
> --------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> # When the RM sees that an Application has successfully finished aggregation 
> its logs, it will write a flag file next to that Application's log files
> # The tool no longer talks to the RM at all.  When looking at the FileSystem, 
> it now uses that flag file to determine if it should process those log files. 
>  If the file is there, it archives, otherwise it does not.
> # As part of the archiving process, it will delete the flag file
> # (If you don't run the tool, the flag file will eventually be cleaned up by 
> the JHS when it cleans up the aggregated logs because it's in the same 
> directory)
> This improvement has several advantages:
> # The edge case about "forgotten" Applications is fixed
> # The tool no longer has to talk to the RM; it only has to consult HDFS.  
> This is simpler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
