[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203096#comment-16203096 ]
Vinod Kumar Vavilapalli commented on YARN-7272:
-----------------------------------------------

bq. In the 1st case, there will be outstanding unflushed entities in the app collector buffer. If the NM is restarted, it will lose all the outstanding entities from the app collector buffer. So, the scope of fault tolerance is restricted to NM JVM restart only.

bq. In the 2nd case, since the NM machine itself is down, all the running master containers are lost. The RM will launch these master containers on a different machine as a second attempt.

This assumes that the collector lives inside the NM. One of the design goals for large-scale apps is to fork the collector into its own container. When that is implemented, the above assumptions will be invalidated. We will have new fault scenarios: the collector and AM may run on different machines, only the collector may die and restart on a different machine, etc.

bq. Since it is a fresh attempt, old attempt data is not very important to the end user. Considering this behavior, the 2nd case can be eliminated when considering fault tolerance of app collectors.

If our goal is to take care of entity/event data in transit for 1 min (assuming the collector flush interval is 1 min), we should be equally concerned about data loss due to NM failure, machine failure, or HBase failure. Granted, an HBase client buffer solution is faster/cheaper than a LevelDB solution, which is in turn faster/cheaper than writing a JobHistory-like WAL to HDFS. But the last one will encompass all those faults collectively, no?
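
To make the WAL option concrete, here is a minimal sketch of a collector buffer that logs each entity before buffering it and replays the log on restart. Everything in it is hypothetical and not taken from YARN: the class name `WalBackedBuffer`, the JSON-lines log format, and the plain list standing in for the HBase writer are all assumptions, and a local temp file stands in for a file on HDFS.

```python
import json
import os
import tempfile


class WalBackedBuffer:
    """Hypothetical collector buffer backed by a write-ahead log (not YARN code)."""

    def __init__(self, wal_path, backend):
        self.wal_path = wal_path
        self.backend = backend  # a list standing in for the HBase writer
        self.pending = []
        self._recover()

    def _recover(self):
        # Replay entities that a previous instance logged but never flushed.
        if os.path.exists(self.wal_path):
            with open(self.wal_path, "r", encoding="utf-8") as f:
                self.pending = [json.loads(line) for line in f if line.strip()]

    def put(self, entity):
        # Log first, then buffer, so a crash after put() returns loses nothing.
        with open(self.wal_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entity) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.pending.append(entity)

    def flush(self):
        # Hand the buffered entities to the backend, then truncate the log.
        self.backend.extend(self.pending)
        self.pending = []
        open(self.wal_path, "w", encoding="utf-8").close()


def demo():
    backend = []
    wal = os.path.join(tempfile.mkdtemp(), "collector.wal")

    first = WalBackedBuffer(wal, backend)
    first.put({"id": "e1"})
    first.put({"id": "e2"})
    del first  # simulate the collector dying before its flush interval

    # A replacement collector, possibly on another machine, reads the same log.
    second = WalBackedBuffer(wal, backend)
    recovered = list(second.pending)
    second.flush()
    return recovered, backend
```

Because the log lives outside the collector process, the same recovery path covers a JVM restart, a machine failure, and a backend outage, provided the log is on shared storage. A crash between the backend write and the truncation would replay entities already written, so this sketch gives at-least-once delivery and relies on idempotent writes.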

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)