[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569917#comment-16569917 ]

Szilard Nemeth commented on YARN-4946:
--------------------------------------

Thanks [~rkanter] for your quick review!
1. Removed {{recordLogAggregationStartTime}}. For the other small methods, I 
think they make {{FinalTransition#transition}} more readable, so I left them 
as is. Please let me know if you still object to this.
2. Good point, I didn't know log aggregation could delay the finish time of 
the application by minutes, so I removed the condition around the 
{{APP_COMPLETED}} event in {{FinalTransition}}.

The majority of the code changes are now in 
{{RMAppManager#checkAppNumCompletedLimit}}.
For both the state store and the in-memory completed application checks, I 
modified the logic to compute the difference between the number of completed 
apps and the configured max, and then try to delete that many applications.
An application is skipped (not removed) if it has log aggregation enabled and 
its aggregation has not finished yet.

An example: suppose the configured max for both the state store and in-memory 
is 2, and we have 10 completed apps.
{{RMAppManager#checkAppNumCompletedLimit}} will determine that it needs to 
remove 8 apps, so it starts from index 0 of the completed apps and tries to 
delete them sequentially.
If any of these apps has not finished its log aggregation, it won't be 
removed (its index is skipped), so the configured max is no longer a hard 
limit.
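A minimal Java sketch of the pruning pass described above, under stated assumptions: the class {{CompletedAppPruner}}, the record {{CompletedApp}} and its fields are hypothetical stand-ins, not the actual {{RMAppManager}} code. Like the description, it only inspects the first (completed - max) entries, oldest first, and skips any app whose log aggregation is enabled but not finished, without looking further for replacement candidates.

```java
import java.util.ArrayList;
import java.util.List;

public class CompletedAppPruner {

    // Hypothetical stand-in for a completed application and its log aggregation state.
    public record CompletedApp(String id, boolean logAggregationEnabled,
                               boolean logAggregationFinished) {}

    // An app may be dropped only if it does not aggregate logs or its aggregation has finished.
    public static boolean shouldRemove(CompletedApp app) {
        return !app.logAggregationEnabled() || app.logAggregationFinished();
    }

    /**
     * Inspects the first (completedApps.size() - maxCompleted) apps, starting at index 0.
     * Apps still aggregating logs are skipped and not replaced by later candidates,
     * so the list may remain above maxCompleted. Returns the IDs of removed apps.
     */
    public static List<String> prune(List<CompletedApp> completedApps, int maxCompleted) {
        List<String> removed = new ArrayList<>();
        int numToCheck = completedApps.size() - maxCompleted;
        int index = 0;
        for (int i = 0; i < numToCheck; i++) {
            CompletedApp app = completedApps.get(index);
            if (shouldRemove(app)) {
                completedApps.remove(index); // later entries shift left, index stays put
                removed.add(app.id());
            } else {
                index++; // skip this app; it is kept until a later check
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // 10 completed apps, configured max = 2; app_3 and app_5 are still aggregating logs.
        List<CompletedApp> apps = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            boolean stillAggregating = (i == 3 || i == 5);
            apps.add(new CompletedApp("app_" + i, true, !stillAggregating));
        }
        List<String> removed = prune(apps, 2);
        System.out.println("removed=" + removed.size() + " remaining=" + apps.size());
    }
}
```

Running {{main}} prints {{removed=6 remaining=4}}: apps 3 and 5 are skipped inside the window of 8, and apps 8 and 9 are never inspected, so 4 apps remain against a max of 2.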

Please check the test case added specifically for the above scenario: 
{{TestAppManager#testStateStoreAppLimitSomeAppsHaveNotFinishedLogAggregation}}
I also extended 2 test cases to check the removed / completed application 
IDs.

Please also keep in mind that {{checkAppNumCompletedLimit}} is only invoked 
when an app completes (the {{APP_COMPLETED}} event dispatched from 
{{FinalTransition}}), so the state store and memory may hold more applications 
than the configured max until another app finishes. I can't think of a better 
solution currently.
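To illustrate this limitation, here is a hypothetical Java model ({{CompletedAppStore}} and its methods are illustrative, not YARN APIs, and every app is assumed to have log aggregation enabled). The limit check runs only inside {{onAppCompleted}}, mirroring the point that the check is tied to the {{APP_COMPLETED}} event; an app finishing its log aggregation later does not trigger a check.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompletedAppStore {
    // Completed app ID -> whether its log aggregation has finished (insertion order = completion order).
    private final Map<String, Boolean> completed = new LinkedHashMap<>();
    private final int maxCompleted;

    public CompletedAppStore(int maxCompleted) {
        this.maxCompleted = maxCompleted;
    }

    public int size() {
        return completed.size();
    }

    // Called when an app completes: the only place where the limit is enforced.
    public void onAppCompleted(String appId, boolean aggregationFinished) {
        completed.put(appId, aggregationFinished);
        checkCompletedLimit();
    }

    // Aggregation finishing later only updates state; it does NOT re-check the limit.
    public void onLogAggregationFinished(String appId) {
        completed.replace(appId, true);
    }

    private void checkCompletedLimit() {
        int numToCheck = completed.size() - maxCompleted;
        if (numToCheck <= 0) {
            return;
        }
        // Copy the oldest numToCheck IDs, then drop those whose aggregation is done.
        List<String> candidates = new ArrayList<>(completed.keySet()).subList(0, numToCheck);
        for (String appId : candidates) {
            if (completed.get(appId)) {
                completed.remove(appId);
            }
        }
    }
}
```

With a max of 2, completing app_1 (still aggregating), app_2 and app_3 leaves 3 apps stored; marking app_1's aggregation as finished keeps it at 3, and only the next completion (app_4) brings the store back down to 2.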




> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).
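The retention rule proposed in the description can be sketched as a single predicate. This is a minimal Java sketch under assumptions: the enum {{AggregationState}} and the method {{mayForget}} are hypothetical; SUCCEEDED, FAILED and TIME_OUT are the terminal examples named in the issue, while NOT_STARTED and RUNNING are assumed non-terminal states (YARN's own status enum may differ).

```java
import java.util.EnumSet;
import java.util.Set;

public class AggregationTerminalCheck {

    // Hypothetical aggregation states; only SUCCEEDED, FAILED and TIME_OUT are treated as terminal.
    public enum AggregationState { NOT_STARTED, RUNNING, SUCCEEDED, FAILED, TIME_OUT }

    private static final Set<AggregationState> TERMINAL = EnumSet.of(
        AggregationState.SUCCEEDED, AggregationState.FAILED, AggregationState.TIME_OUT);

    // The RM may forget an app only if it has completed AND its aggregation is terminal.
    public static boolean mayForget(boolean appCompleted, AggregationState state) {
        return appCompleted && TERMINAL.contains(state);
    }
}
```

Keeping FAILED and TIME_OUT in the terminal set matters for the HAR tool: the RM still forgets such apps eventually, but only after the failure has been recorded rather than while aggregation is still in progress.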



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
