[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240754#comment-16240754 ]
Varun Saxena commented on YARN-7272:
------------------------------------

Sorry for coming in a little late on this discussion, although we did discuss it during the call.

The primary objective of fault tolerance is to ensure that the entities which timeline service v2 guarantees to write are not lost. However, writing every entity to some sort of WAL implementation would be expensive.

We have two kinds of entity writes, sync and async. Sync entities are guaranteed to be written to the backend via the collector, or an exception is returned (even for server-side failures), i.e. we indicate to the client that an entity could not be written all the way to the backend so that it can retry or take some other suitable action. Async entities, as the name suggests, are written asynchronously and, by design, are not guaranteed to be written to the backend. We initially cache them on the client side for some time, or until a sync entity arrives, combine them and then send them to the collector. Moreover, if any exception occurs while writing to the backend, the result is not propagated back to the client; we only throw exceptions for client-side failures. Async entities are also cached in the HBase writer implementation inside the collector before being flushed to HBase.

Sync writes should hence be used for publishing important events, while async writes should be used for less important events, losing which is acceptable in case of a failure. For instance, publishing metric values every N seconds can be an asynchronous write, unless the metric is very important, say, used for billing. Keeping this in mind, a client can do synchronous writes if it cares about the durability of its entity data.

Furthermore, asynchronous writes have other points of failure too. For instance, the collector can crash while writing an async entity to the WAL. In that case, we currently do not propagate the error to the timeline client, i.e. the client would not know which entity writes have failed.

Another case to handle is the one where the storage itself is down: instead of making a sync entity call wait, the entity could be committed to the WAL until the backend becomes available again. We can potentially explore this option, say, for deployments where the HBase cluster runs separately from the cluster where ATS is running. For HBase, would HBaseAdmin#checkHBaseAvailable be sufficient to check whether the HBase storage is down?

I have put a couple of rough sketches of what I mean (the client-side write paths and the availability check) at the end of this mail. Thoughts?

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.
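
To make the sync vs. async distinction above concrete, here is a minimal client-side sketch using TimelineV2Client#putEntities for the blocking path and putEntitiesAsync for the fire-and-forget path. The entity types, ids and metric names are only illustrative, and wiring up the collector address (normally taken from the RM allocate response) is omitted.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineMetric;
import org.apache.hadoop.yarn.client.api.TimelineV2Client;

public class TimelinePublishSketch {
  public static void main(String[] args) throws Exception {
    ApplicationId appId = ApplicationId.newInstance(System.currentTimeMillis(), 1);

    TimelineV2Client client = TimelineV2Client.createTimelineClient(appId);
    client.init(new Configuration());
    client.start();
    try {
      // Important event (e.g. something billing relies on): publish synchronously.
      // putEntities blocks until the entity has gone all the way to the backend
      // and throws IOException/YarnException on failure, so the caller can retry.
      TimelineEntity billingEvent = new TimelineEntity();
      billingEvent.setType("CONTAINER_BILLING_EVENT");   // illustrative type
      billingEvent.setId("container_e01_000001");        // illustrative id
      billingEvent.setCreatedTime(System.currentTimeMillis());
      client.putEntities(billingEvent);

      // Periodic metric sample: publish asynchronously.
      // putEntitiesAsync only surfaces client-side failures; the entity may be
      // buffered on the client until a sync entity arrives and again in the
      // HBase writer inside the collector, so a crash can silently drop it.
      TimelineEntity metricSample = new TimelineEntity();
      metricSample.setType("CONTAINER_METRICS");          // illustrative type
      metricSample.setId("container_e01_000001");
      TimelineMetric cpu = new TimelineMetric();
      cpu.setId("CPU_USAGE");
      cpu.addValue(System.currentTimeMillis(), 42);
      metricSample.addMetric(cpu);
      client.putEntitiesAsync(metricSample);
    } finally {
      client.stop();
    }
  }
}
{code}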
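
And on the last question, a rough sketch of the availability check I had in mind, assuming the collector has an HBase Configuration at hand. HBaseAdmin#checkHBaseAvailable contacts the cluster and throws if it is unreachable, so the collector could use it to decide between writing straight to HBase and spooling to the WAL; since it is a blocking call, it would probably have to run periodically in the background rather than per entity. The class and method names other than checkHBaseAvailable are hypothetical.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class StorageAvailabilitySketch {

  /** Translate HBaseAdmin#checkHBaseAvailable into a simple up/down answer. */
  static boolean isHBaseUp(Configuration hbaseConf) {
    try {
      // Throws (MasterNotRunningException, ZooKeeperConnectionException, ...)
      // if the HBase cluster backing ATSv2 cannot be reached.
      HBaseAdmin.checkHBaseAvailable(hbaseConf);
      return true;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Configuration hbaseConf = HBaseConfiguration.create();
    if (isHBaseUp(hbaseConf)) {
      System.out.println("HBase reachable: write the sync entity directly.");
    } else {
      System.out.println("HBase down: commit the entity to the WAL for later replay.");
    }
  }
}
{code}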