[ 
https://issues.apache.org/jira/browse/YARN-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095682#comment-16095682
 ] 

Vrushali C commented on YARN-6323:
----------------------------------

Ping on this jira. To summarize:

- A new NM fails to recover apps because the timeline flow context is missing 
for old apps on the NM. This patch puts in a default flow context so that the 
NM can proceed. 
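
For illustration only, the fallback described above could look roughly like 
the sketch below. This is not the code in YARN-6323.001.patch: the class 
FlowContextSketch, the method recoverOrDefault, and the fallbackRunId 
parameter are hypothetical names, and the default flow version of "1" is an 
assumption. The only part taken from this thread is that the NM falls back to 
the app id as the flow name.

```java
import java.util.Objects;

/** Hypothetical stand-in for the per-application flow context kept by the NM. */
final class FlowContextSketch {
    final String flowName;
    final String flowVersion;
    final long flowRunId;

    FlowContextSketch(String flowName, String flowVersion, long flowRunId) {
        this.flowName = Objects.requireNonNull(flowName);
        this.flowVersion = Objects.requireNonNull(flowVersion);
        this.flowRunId = flowRunId;
    }

    /**
     * Returns the flow context recovered from the NM state store, or a
     * default one when nothing was stored (an app started before timeline
     * v2 was enabled). Recovery then proceeds instead of failing.
     */
    static FlowContextSketch recoverOrDefault(
            FlowContextSketch stored, String appId, long fallbackRunId) {
        if (stored != null) {
            return stored;
        }
        // Assumed defaults: app id as flow name (as discussed in this
        // thread), "1" as flow version, caller-supplied run id.
        return new FlowContextSketch(appId, "1", fallbackRunId);
    }
}
```

With a fallback like this, an app recovered without a stored context is 
published under its app id, which is where the RM/NM naming mismatch 
discussed below comes from.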

To answer Rohith's questions:

bq. Application is NOT submitted with tags. So default values are created by 
YARN.
RM creates default FlowContext with FlowName as appName. On NM restart, we are 
creating FlowContext with appId. So, there will be inconsistencies when 
entities are published during rolling upgrade.
Yes, there would be inconsistencies, but it is not possible to upgrade the RM 
and all the NMs at exactly the same time unless we take downtime. 

bq. Assume that Application is submitted with some tags. RM recovers the 
application and starts publishing with tags as flow context. Again there are 
inconsistencies in published entities.
Yes, but how do we synchronize the RM and NM across restarts? We could use 
the app id in both cases, but that makes for strange default data. 

This patch will ensure the NM does not fail to start up. I thought of adding 
some default values that would mark the data to be dropped, but checking for 
them would be expensive on every write to the backend. 

Ping [~rohithsharma] [~varun_saxena] [~haibo.chen]: any other ideas? At the 
very least, the NM must not crash during an upgrade because of a missing flow 
context. 


> Rolling upgrade/config change is broken on timeline v2. 
> --------------------------------------------------------
>
>                 Key: YARN-6323
>                 URL: https://issues.apache.org/jira/browse/YARN-6323
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Li Lu
>            Assignee: Vrushali C
>              Labels: yarn-5355-merge-blocker
>         Attachments: YARN-6323.001.patch
>
>
> Found this issue when deploying on real clusters. If there are apps running 
> when we enable timeline v2 (with work preserving restart enabled), node 
> managers will fail to start due to missing app context data. We should 
> probably assign some default names to these "left over" apps. I believe it's 
> suboptimal to let users clean up the whole cluster before enabling timeline 
> v2. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
