[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240754#comment-16240754
 ] 

Varun Saxena commented on YARN-7272:
------------------------------------

Sorry for coming in a little late on this discussion, although we did discuss 
it during the call.
The primary objective of fault tolerance is to ensure that the entities which 
are guaranteed to be written by timeline service v2 are not lost. 
But writing every entity to some sort of WAL implementation would be expensive.

Now, we have 2 kinds of entity writes, sync and async.
Sync entities are guaranteed either to be written to the backend via the 
collector or to fail with an exception returned to the client, even for 
server-side failures, i.e. we indicate to the client that an entity could not 
be written all the way to the backend so that it can retry or take some other 
suitable action.
Async entities, as the name suggests, are written asynchronously. They are not 
guaranteed to be written to the backend, by design. We initially cache them on 
the client side for some time or till a sync entity arrives, combine them and 
then send them to the collector. Moreover, if any exception occurs in writing 
to the backend, the result is not propagated back to the client. We only throw 
exceptions for client-side failures.
Async entities are later cached in the HBase writer implementation too, inside 
the collector, before being flushed to HBase.
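
To make the two write paths concrete, here is a minimal sketch of the 
client-side behaviour described above. It is not the actual timeline client 
code: Collector, BufferingTimelineClient and both method names are 
hypothetical, entities are plain strings, and the time-based flush of cached 
async entities is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the collector; not the real timeline service API. */
interface Collector {
    /** Writes a batch to the backend; throws if the backend write fails. */
    void write(List<String> entities) throws Exception;
}

/** Illustrative client: caches async entities and sends them with the next sync write. */
class BufferingTimelineClient {
    private final Collector collector;
    private final List<String> pending = new ArrayList<>();

    BufferingTimelineClient(Collector collector) {
        this.collector = collector;
    }

    /** Async write: only cached on the client side, so no backend error can surface here. */
    void putEntityAsync(String entity) {
        pending.add(entity);
    }

    /** Sync write: combine with the cached async entities, send, and propagate any failure. */
    void putEntitySync(String entity) throws Exception {
        List<String> batch = new ArrayList<>(pending);
        batch.add(entity);
        // Assumption: cached async entities are dropped even if the write fails,
        // since they carry no delivery guarantee.
        pending.clear();
        collector.write(batch); // an exception here reaches the caller, who may retry
    }

    /** Number of async entities still cached on the client side. */
    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch a caller only learns about a backend failure through 
putEntitySync, which is why anything that must not be lost would go through 
the sync path.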

Sync writes should hence be used for publishing important events, while async 
writes should be used for less important events whose loss in case of a 
failure is acceptable. For instance, publishing metric values every N seconds 
can be an asynchronous write, unless the metric is very important, say, used 
for billing.

Keeping this in mind, a client can potentially do synchronous writes if it 
cares about durability of entity data.
Furthermore, asynchronous writes can have other points of failure too. For 
instance, the collector can crash while writing the async entity to the WAL. 
In this case, we currently do not propagate the error to the timeline client, 
i.e. the client would not know which entity writes have failed.

Another case worth handling is when storage is down: instead of making the 
sync entity call wait, the entity could be committed to a WAL until the 
backend becomes available again. We can potentially explore this option, say, 
for cases where the HBase cluster runs separately from the cluster where ATS 
is running.
For HBase, would HBaseAdmin#checkHBaseAvailable be sufficient to check if HBase 
storage is down?
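
A rough sketch of the storage-down idea, under stated assumptions: 
StorageBackend and WalFallbackWriter are hypothetical names, the WAL is an 
in-memory queue standing in for a durable log, and the availability probe is 
an injected function. For HBase that probe might be backed by 
HBaseAdmin#checkHBaseAvailable, but whether that check is sufficient is 
exactly the open question above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BooleanSupplier;

/** Hypothetical backend writer; stands in for the HBase writer. */
interface StorageBackend {
    void write(String entity) throws Exception;
}

/** Illustrative writer that parks sync entities in a WAL while the backend is down. */
class WalFallbackWriter {
    private final StorageBackend backend;
    private final BooleanSupplier backendAvailable; // hypothetical availability probe
    private final Deque<String> wal = new ArrayDeque<>(); // in-memory stand-in for a durable log

    WalFallbackWriter(StorageBackend backend, BooleanSupplier backendAvailable) {
        this.backend = backend;
        this.backendAvailable = backendAvailable;
    }

    /** Returns true if the entity reached the backend, false if it was parked in the WAL. */
    boolean writeSync(String entity) throws Exception {
        if (!backendAvailable.getAsBoolean()) {
            wal.addLast(entity); // do not block the caller while storage is down
            return false;
        }
        replay(); // drain older entries first to preserve write order
        backend.write(entity);
        return true;
    }

    /** Replays parked entities in order; an entry stays in the WAL if its write throws. */
    void replay() throws Exception {
        while (!wal.isEmpty()) {
            backend.write(wal.peekFirst());
            wal.removeFirst();
        }
    }

    /** Number of entities currently parked in the WAL. */
    int walSize() {
        return wal.size();
    }
}
```

A real implementation would also need a durable WAL and some way to tell the 
client that the entity was accepted but not yet stored; the boolean return 
value here is only a placeholder for that.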

Thoughts?

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a 
> running yarn app, we would like that yarn app to re-establish connection with 
> a new timeline collector. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
