[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143594#comment-14143594 ]
Zhijie Shen commented on YARN-1530:
-----------------------------------

Hi, [~bcwalrus]. Thanks for your further comments.

bq. You seem to agree with the premise that ATS write path should not slow down apps.

Definitely. The arguable point is whether the current timeline client is going to slow down the app, given we have a scalable and reliable timeline server.

bq. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort.

I'm not sure such fundamental requirements can be dropped from the timeline service. Looking ahead, scalable and highly available timeline servers have multiple benefits and enable different use cases. For example:

1. We can use it to serve realtime or near-realtime data, such that we can go to the timeline server to see what is happening to an application. It's particularly useful for long-running services, which never shut down.

2. We can build checkpoints on the timeline server for the app to do recovery once it crashes. It's pretty much like what we've done for MR jobs.

I bundled "scalable" and "reliable" together because a multiple-instance solution will improve the timeline server in both dimensions.

Moreover, no matter how scalable and reliable the channel could be, we eventually want to get the timeline data accommodated into the timeline server, right? Otherwise, it is not going to be accessible by users (of course, tricks can be played to fetch it directly from HDFS, but that's a completely different story from the timeline server). If the apps are publishing 10GB of data per hour, while the server can only process 1GB per hour, the 9GB of outstanding data per hour that resides in some temp location on HDFS is going to be useless writes.

We have narrowed the discussion down to the reliability of the write path, but if we look at the big picture, *the timeline server is not just a place to store data, but also serves it to users* (e.g., YARN-2513).
In terms of use cases, users may want to monitor completed apps as well as running apps and the cluster. If the timeline server doesn't have the capacity to serve the data for a particular use case, the cost of aggregating it is wasted. IMHO, a scalable and reliable timeline server is going to be *the eventual solution to satisfy multiple stakeholders*, regardless of whether the use case is read intensive, write intensive, or both. That's why I think improving the timeline server could be high-margin work. It may be hard work, but we should definitely pick it up.

> [Umbrella] Store, manage and serve per-framework application-timeline data
> --------------------------------------------------------------------------
>
>                 Key: YARN-1530
>                 URL: https://issues.apache.org/jira/browse/YARN-1530
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: ATS-Write-Pipeline-Design-Proposal.pdf,
> ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf,
> application timeline design-20140116.pdf, application timeline
> design-20140130.pdf, application timeline design-20140210.pdf
>
>
> This is a sibling JIRA for YARN-321.
> Today, each application/framework has to store and serve per-framework
> data all by itself, as YARN doesn't have a common solution. This JIRA attempts
> to solve the storage, management and serving of per-framework data from
> various applications, both running and finished. The aim is to change YARN to
> collect and store data in a generic manner with plugin points for frameworks
> to do their own thing w.r.t. interpretation and serving.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)