[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143401#comment-14143401 ]
bc Wong commented on YARN-1530:
-------------------------------

Hi [~zjshen]. First, glad to see that we're discussing approaches. You seem to agree with the premise that *ATS write path should not slow down apps*.

bq. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, ...

In theory, you can make the ATS *always-up*. In practice, we both know what real life distributed systems do. "Always-up" isn't the only thing. The write path needs to have good uptime and latency regardless of what's happening to the read path or the backing store. HDFS is a good default for the write channel because:
* We don't have to design an ATS that is always-up. If you really want to, I'm sure you can eventually build something with good uptime. But it took other projects (HDFS, ZK) lots of hard work to get to that point.
* If we reuse HDFS, cluster admins know how to operate HDFS and get good uptime from it. But it'll take training and hard-learned lessons for operators to figure out how to get good uptime from ATS, even after you build an always-up ATS.
* All the popular YARN app frameworks (MR, Spark, etc.) already rely on HDFS by default. So do most of the 3rd party applications that I know of. Architecturally, it seems easier for admins to accept that ATS write path depends on HDFS for reliability, instead of a new component that (we claim) will be as reliable as HDFS/ZK.

bq. given the whole roadmap of the timeline service, let's think critically of work that can improve the timeline service most significantly.

Exactly. Strong +1. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort. ATS doesn't need to be as reliable as HDFS. We don't need to worry about insulating the write path from the read path. We don't need to worry about occasional hiccups in HBase (or whatever the store is).
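
To make the write-path argument concrete, here is a minimal sketch of the decoupling, not anything taken from the attached design docs. Everything in it is an assumption for illustration: a local file stands in for an HDFS file, `FlakyStore` is a hypothetical stand-in for HBase (or whatever the store is), and `write_event` / `ingest` are made-up names, not ATS APIs. The point is only that the app-side write never touches the store, so a store outage delays ingestion but does not block or fail the app.

```python
import json
import os
import tempfile


class FlakyStore:
    """Hypothetical stand-in for the ATS backing store; not a real ATS class."""

    def __init__(self):
        self.up = True
        self.events = []

    def put(self, event):
        if not self.up:
            raise ConnectionError("store unavailable")
        self.events.append(event)


def write_event(log_path, event):
    """App-side write path: append one JSON line to the durable log.

    A local file stands in for an HDFS file here. The store is never touched.
    """
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def ingest(log_path, store, offset=0):
    """ATS-side ingestion: replay log lines into the store, starting at `offset`.

    Returns the number of lines ingested so far so a later call can resume.
    """
    with open(log_path, encoding="utf-8") as f:
        lines = f.readlines()
    for line in lines[offset:]:
        try:
            store.put(json.loads(line))
        except ConnectionError:
            break  # store is down; stop here and retry later from this offset
        offset += 1
    return offset


def demo():
    store = FlakyStore()
    with tempfile.TemporaryDirectory() as d:
        log_path = os.path.join(d, "timeline.log")

        # The store is down, but the app's writes still succeed.
        store.up = False
        for i in range(3):
            write_event(log_path, {"entity": "app_1", "seq": i})
        stalled = ingest(log_path, store)  # nothing ingested yet

        # The store comes back; ingestion catches up from the saved offset.
        store.up = True
        final = ingest(log_path, store, stalled)
    return stalled, final, store.events
```

Calling `demo()` writes three events while the store is down, then ingests them once it is back. In a real deployment the log would need HDFS-level durability and some cleanup policy, which this sketch leaves out.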
And at the end of all this, everybody sleeps better because "ATS service going down" isn't a catastrophic failure.

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
>                 Key: YARN-1530
>                 URL: https://issues.apache.org/jira/browse/YARN-1530
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: ATS-Write-Pipeline-Design-Proposal.pdf,
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf,
> application timeline design-20140116.pdf, application timeline
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to do store, and serve per-framework
> data all by itself as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t interpretation and serving.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)