[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203096#comment-16203096
 ] 

Vinod Kumar Vavilapalli commented on YARN-7272:
-----------------------------------------------

bq. In the 1st case, there will be outstanding unflushed entities in the app 
collector buffer. If the NM is restarted, it will lose all the outstanding 
entities in the app collector buffer. So the scope of fault tolerance is 
restricted to NM JVM restart only.
bq. In the 2nd case, the NM machine itself is down, so all the running master 
containers are lost. The RM will launch these master containers on a different 
machine as a second attempt.
This assumes that the collector lives inside the NM. One of the design goals 
for large-scale apps is to fork the collector into its own container. When that 
is implemented, the above assumptions will be invalidated. We will have new 
fault scenarios: the collector and AM may run on different machines, only the 
collector may die and restart on a different machine, etc.

bq. Since it is a fresh attempt, the old attempt's data is not very important 
to the end user. Given this behavior, the 2nd case can be left out when 
considering fault tolerance of app collectors. 
If our goal is to take care of entity/event data in transit for 1 min (assuming 
the collector flush interval is 1 min), we should be equally concerned about 
data loss due to NM failure, machine failure, or HBase failure.

Granted, an HBase client buffer solution is faster / cheaper than a LevelDB 
solution, which is in turn faster / cheaper than writing a JobHistory-like WAL 
to HDFS. But the last one will encompass all those faults collectively, no?
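
To make the WAL option concrete, here is a minimal sketch in Python. It is not 
the actual collector code: the class name WalBackedCollector, the JSON-lines 
log format, and the list used as a stand-in for the HBase writer are all 
assumptions for illustration. A local file plays the role of the log; only a 
log on HDFS (or other replicated storage) would survive losing the machine.

```python
import json
import os
import tempfile


class WalBackedCollector:
    """Hypothetical collector that logs each entity before buffering it."""

    def __init__(self, wal_path, backend):
        self.wal_path = wal_path
        self.backend = backend  # stand-in for the HBase writer (assumption)
        self.buffer = []
        self._recover()

    def _recover(self):
        # Replay entities that were logged but never flushed.
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as wal:
                self.buffer = [json.loads(line) for line in wal if line.strip()]

    def put(self, entity):
        # Append to the log and force it to disk before buffering in memory.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps(entity) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.buffer.append(entity)

    def flush(self):
        # Write to the backend first; drop the log only after that succeeds.
        self.backend.extend(self.buffer)
        self.buffer = []
        if os.path.exists(self.wal_path):
            os.remove(self.wal_path)


def demo():
    wal_path = os.path.join(tempfile.mkdtemp(), "app.wal")
    store = []
    first = WalBackedCollector(wal_path, store)
    first.put({"id": "entity_1"})
    first.put({"id": "entity_2"})
    # Simulate a collector restart before the flush interval elapses.
    second = WalBackedCollector(wal_path, store)
    second.flush()
    return store
```

If the backend write fails, the log is left in place, so a restarted collector 
replays the same entities. A crash between the backend write and the log 
removal would replay entities that were already written, so this gives 
at-least-once delivery and the backend writes need to be idempotent.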

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>
> If an NM goes down and takes with it the timeline collector aux service for 
> a running YARN app, we would like that YARN app to re-establish a connection 
> with a new timeline collector. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
