[ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280176#comment-15280176
 ] 

Jason Lowe commented on YARN-4325:
----------------------------------

Yes, what I'm proposing is to have the log handlers always respond to the 
APPLICATION_FINISHED event.  We can look at this problem in two ways: either 
the bug is in the ApplicationImpl because it doesn't track that log handling 
failed and sometimes needs to clean up the app in other states, or the bug is 
in the log handlers because they failed to respond to the APPLICATION_FINISHED 
event when the application terminated.  If the log handlers always responded to 
the APPLICATION_FINISHED event with an APPLICATION_LOG_HANDLING_FAILED or 
APPLICATION_LOG_HANDLING_FINISHED event, wouldn't that also solve the problem?  
Then ApplicationImpl can simply wait until it reaches the terminal FINISHED 
state, receive one of the log handling replies there, and clean up the app in 
_one_ place rather than several places depending upon the special case being 
handled.
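
A minimal sketch of that idea, not taken from any of the attached patches. 
The types below (LogHandlingSketch, LogHandler, App, LogReply) are hypothetical 
stand-ins rather than the real NodeManager classes, and the state store is 
modeled as a plain Set of app ids. It only illustrates that if the log handler 
replies on every path, the app needs a single cleanup point:

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified model; none of these types are the real NodeManager classes. */
final class LogHandlingSketch {

  /** The two replies a log handler can send back for APPLICATION_FINISHED. */
  enum LogReply { APPLICATION_LOG_HANDLING_FINISHED, APPLICATION_LOG_HANDLING_FAILED }

  enum AppState { RUNNING, FINISHED }

  /** Hypothetical log handler that answers APPLICATION_FINISHED on every path. */
  static final class LogHandler {
    private final boolean aggregationFails;

    LogHandler(boolean aggregationFails) {
      this.aggregationFails = aggregationFails;
    }

    LogReply onApplicationFinished(String appId) {
      try {
        aggregate(appId);
        return LogReply.APPLICATION_LOG_HANDLING_FINISHED;
      } catch (RuntimeException e) {
        // The failure path still produces a reply instead of going silent.
        return LogReply.APPLICATION_LOG_HANDLING_FAILED;
      }
    }

    // Stand-in for the real aggregation work; fails on demand for the demo.
    private void aggregate(String appId) {
      if (aggregationFails) {
        throw new IllegalStateException("aggregation failed for " + appId);
      }
    }
  }

  /** Hypothetical application whose only cleanup point is the log handling reply. */
  static final class App {
    private final String id;
    private final Set<String> stateStore;
    private AppState state = AppState.RUNNING;

    App(String id, Set<String> stateStore) {
      this.id = id;
      this.stateStore = stateStore;
      stateStore.add(id); // the app is recorded while it is alive
    }

    void finish(LogHandler handler) {
      state = AppState.FINISHED;
      // Stand-in for sending APPLICATION_FINISHED and receiving the reply.
      onLogHandlingReply(handler.onApplicationFinished(id));
    }

    void onLogHandlingReply(LogReply reply) {
      if (state != AppState.FINISHED) {
        throw new IllegalStateException("reply " + reply + " before FINISHED");
      }
      // Single cleanup place: both replies purge the app from the store.
      stateStore.remove(id);
    }
  }

  /** Runs one app to completion and returns whatever is left in the store. */
  static Set<String> run(boolean aggregationFails) {
    Set<String> stateStore = new HashSet<>();
    App app = new App("app_1", stateStore);
    app.finish(new LogHandler(aggregationFails));
    return stateStore;
  }

  public static void main(String[] args) {
    System.out.println("left after success: " + run(false));
    System.out.println("left after failure: " + run(true));
  }
}
```

In this model the store ends up empty whether aggregation succeeds or fails, 
because the handler never skips its reply. The real event dispatch is 
asynchronous, which the sketch deliberately leaves out.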


> Purge app state from NM state-store should cover more LOG_HANDLING cases
> ------------------------------------------------------------------------
>
>                 Key: YARN-4325
>                 URL: https://issues.apache.org/jira/browse/YARN-4325
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: ApplicationImpl.PNG, YARN-4325-v1.1.patch, 
> YARN-4325-v1.patch, YARN-4325.patch
>
>
> On a long-running cluster, we found tens of thousands of stale apps still 
> being recovered during NM restart recovery. 
> After investigating, we found three issues that cause app state to leak in 
> the NM state-store:
> 1. APPLICATION_LOG_HANDLING_FAILED is not handled by removing the app from 
> NMStateStore.
> 2. The APPLICATION_LOG_HANDLING_FAILED event is not sent when the 
> aggregator's doAppLogAggregation() hits an exception.
> 3. Only an application in the FINISHED state that receives 
> APPLICATION_LOG_HANDLING_FINISHED has a transition that removes the app from 
> the NM state store. An application in another state, such as 
> APPLICATION_RESOURCES_CLEANUP, ignores the event and is then never removed 
> from the NM state store, even after the app has finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
