[jira] [Updated] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-2194: -- Description: In previous versions of RedHat, we could build custom cgroup hierarchies using the cgconfig command from the libcgroup package. From RedHat 7, the libcgroup package is deprecated and its use is not recommended since it can easily create conflicts with the default cgroup hierarchy. systemd is provided and recommended for cgroup management. We need to add support for this. (was: In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. The systemd is provided and recommended for cgroup management. We need to add support for this.) > Add Cgroup support for RedHat 7 > --- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2194-1.patch > > > In previous versions of RedHat, we could build custom cgroup hierarchies > using the cgconfig command from the libcgroup package. From RedHat 7, > the libcgroup package is deprecated and its use is not recommended since it > can easily create conflicts with the default cgroup hierarchy. systemd is > provided and recommended for cgroup management. We need to add support for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2931) PublicLocalizer may fail with FileNotFoundException until directory gets initialized by LocalizeRunner
[ https://issues.apache.org/jira/browse/YARN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238734#comment-14238734 ] bc Wong commented on YARN-2931: --- Thanks for the fix! Some nits. ResourceLocalizationService.java * Instead of commenting out code, would just remove it. TestResourceLocalizationService.java * L950: Remove the code that is commented out. > PublicLocalizer may fail with FileNotFoundException until directory gets > initialized by LocalizeRunner > -- > > Key: YARN-2931 > URL: https://issues.apache.org/jira/browse/YARN-2931 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > Attachments: YARN-2931.001.patch, YARN-2931.002.patch, > YARN-2931.002.patch > > > When the data directory is cleaned up and NM is started with existing > recovery state, because of YARN-90, it will not recreate the local dirs. > This causes a PublicLocalizer to fail until getInitializedLocalDirs is called > due to some LocalizeRunner for private localization. > Example error > {noformat} > 2014-12-02 22:57:32,629 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Failed to download rsrc { { hdfs:/ machine>:8020/tmp/hive-hive/hive_2014-12-02_22-56-58_741_2045919883676051996-3/-mr-10004/8060c9dd-54b6-42fc-9d77-34b655fa5e82/reduce.xml, > 1417589819618, FILE, null > },pending,[(container_1417589109512_0001_02_03)],119413444132127,DOWNLOADING} > java.io.FileNotFoundException: File /data/yarn/nm/filecache does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1051) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:162) > at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:724) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:720) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:720) > at org.apache.hadoop.yarn.util.FSDownload.createDir(FSDownload.java:104) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:351) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2014-12-02 22:57:32,629 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1417589109512_0001_02_03 transitioned from > LOCALIZING to LOCALIZATION_FAILED > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml
[ https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186448#comment-14186448 ] bc Wong commented on YARN-2669: --- Replacing "." with "\_dot\_" sounds fine here. While it doesn't eliminate collisions, it makes them unlikely. Again, I'd leave it for another patch to do the real fix, which is more involved. > FairScheduler: queueName shouldn't allow periods the allocation.xml > --- > > Key: YARN-2669 > URL: https://issues.apache.org/jira/browse/YARN-2669 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2669-1.patch, YARN-2669-2.patch, YARN-2669-3.patch > > > For an allocation file like: > {noformat} > <allocations> > <queue name="root.q1"> > <minResources>4096mb,4vcores</minResources> > </queue> > </allocations> > {noformat} > Users may wish to config minResources for a queue with full path "root.q1". > However, right now, fair scheduler will treat this configuration for the > queue with full name "root.root.q1". We need to print out a warning msg to > notify users about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185368#comment-14185368 ] bc Wong commented on YARN-2194: --- container-executor.c * L1188: If initialize_user() fails, do you not need to cleanup? * L1194: Same for create_log_dirs(). Seems that goto cleanup is still warranted. * L1207: Missing space before S_IRWXU. * L1243: Nit. Hardcoding 55 here is error-prone. You could allocate a 4K buffer here, and use snprintf. * L1244: You need to check the return value from malloc(). Since you're running as root here, everything has to be extra careful. * L1255: On failure, would log the command being executed. > Add Cgroup support for RedHat 7 > --- > > Key: YARN-2194 > URL: https://issues.apache.org/jira/browse/YARN-2194 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2194-1.patch > > > In previous versions of RedHat, we can build custom cgroup hierarchies with > use of the cgconfig command from the libcgroup package. From RedHat 7, > package libcgroup is deprecated and it is not recommended to use it since it > can easily create conflicts with the default cgroup hierarchy. The systemd is > provided and recommended for cgroup management. We need to add support for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml
[ https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184540#comment-14184540 ] bc Wong commented on YARN-2669: --- To qualify what I wrote: bq. You'd need some escaping rule, like replacing any naturally occurring single underscore with two underscores, and then replacing a dot with a single underscore. That seems to be out of scope here, and could use more discussion and feedback. > FairScheduler: queueName shouldn't allow periods the allocation.xml > --- > > Key: YARN-2669 > URL: https://issues.apache.org/jira/browse/YARN-2669 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2669-1.patch, YARN-2669-2.patch > > > For an allocation file like: > {noformat} > <allocations> > <queue name="root.q1"> > <minResources>4096mb,4vcores</minResources> > </queue> > </allocations> > {noformat} > Users may wish to config minResources for a queue with full path "root.q1". > However, right now, fair scheduler will treat this configuration for the > queue with full name "root.root.q1". We need to print out a warning msg to > notify users about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml
[ https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184419#comment-14184419 ] bc Wong commented on YARN-2669: --- AllocationFileLoaderService.java * When throwing an error, would also output the offending queueName. QueuePlacementRule.java * Would log when you convert the username to one that doesn't have a dot. * I'm worried about username conflicts after the conversion, e.g. "eric.koffee" == "erick.offee". Replacing the dot with something else helps, but doesn't eliminate the problem. You'd need some escaping rule, like replacing any naturally occurring single underscore with two underscores, and then replacing a dot with a single underscore. > FairScheduler: queueName shouldn't allow periods the allocation.xml > --- > > Key: YARN-2669 > URL: https://issues.apache.org/jira/browse/YARN-2669 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2669-1.patch, YARN-2669-2.patch > > > For an allocation file like: > {noformat} > <allocations> > <queue name="root.q1"> > <minResources>4096mb,4vcores</minResources> > </queue> > </allocations> > {noformat} > Users may wish to config minResources for a queue with full path "root.q1". > However, right now, fair scheduler will treat this configuration for the > queue with full name "root.root.q1". We need to print out a warning msg to > notify users about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
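The escaping rule sketched in the comment above can be made concrete with a few lines of Java. This is purely illustrative (the class and method names are not from the YARN-2669 patch), and, as noted elsewhere in this thread, it makes collisions unlikely rather than impossible:
{code:java}
/**
 * Illustrative only: converts a username containing dots into a queue-safe
 * name by doubling existing underscores and then mapping '.' to '_', per the
 * escaping rule suggested above. This makes collisions unlikely, though not
 * strictly impossible for names that mix adjacent dots and underscores.
 */
public final class UserNameToQueueName {

  private UserNameToQueueName() {
  }

  /** "eric.koffee" -> "eric_koffee", "a_b.c" -> "a__b_c" */
  public static String toQueueName(String userName) {
    return userName.replace("_", "__").replace(".", "_");
  }

  public static void main(String[] args) {
    System.out.println(toQueueName("eric.koffee"));  // eric_koffee
    System.out.println(toQueueName("erick.offee"));  // erick_offee -- no longer equal
  }
}
{code}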
[jira] [Commented] (YARN-2669) FairScheduler: queueName shouldn't allow periods the allocation.xml
[ https://issues.apache.org/jira/browse/YARN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176327#comment-14176327 ] bc Wong commented on YARN-2669: --- Thanks for the patch, Wei! What if the username has a period in it, and FS is configured to take the username as queue name? Is there a separate jira tracking that? > FairScheduler: queueName shouldn't allow periods the allocation.xml > --- > > Key: YARN-2669 > URL: https://issues.apache.org/jira/browse/YARN-2669 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2669-1.patch > > > For an allocation file like: > {noformat} > <allocations> > <queue name="root.q1"> > <minResources>4096mb,4vcores</minResources> > </queue> > </allocations> > {noformat} > Users may wish to config minResources for a queue with full path "root.q1". > However, right now, fair scheduler will treat this configuration for the > queue with full name "root.root.q1". We need to print out a warning msg to > notify users about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
bc Wong created YARN-2605: - Summary: [RM HA] Rest api endpoints doing redirect incorrectly Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or returning a well-defined error message (json or xml) stating the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
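For illustration, the suggested behaviour could look roughly like the sketch below: the standby RM answers REST calls with an HTTP 303 plus a small machine-readable body instead of a meta-refresh page. The class name and JSON shape are invented for the example; this is not how the RM webapp is actually implemented.
{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

/** Illustrative sketch only; names and JSON shape are made up. */
public final class StandbyRestResponder {

  private StandbyRestResponder() {
  }

  /**
   * Answers a REST request on the standby RM: a 303 "See Other" pointing at the
   * active RM, plus a machine-readable body for clients that do not follow it.
   */
  public static void sendSeeOther(HttpServletResponse resp, String activeUrl)
      throws IOException {
    resp.setStatus(HttpServletResponse.SC_SEE_OTHER);   // 303
    resp.setHeader("Location", activeUrl);
    resp.setContentType("application/json");
    resp.getWriter().write(
        "{\"haState\":\"standby\",\"activeRMUrl\":\"" + activeUrl + "\"}");
  }
}
{code}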
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143401#comment-14143401 ] bc Wong commented on YARN-1530: --- Hi [~zjshen]. First, glad to see that we're discussing approaches. You seem to agree with the premise that *ATS write path should not slow down apps*. bq. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, ... In theory, you can make the ATS *always-up*. In practice, we both know what real life distributed systems do. "Always-up" isn't the only thing. The write path needs to have good uptime and latency regardless of what's happening to the read path or the backing store. HDFS is a good default for the write channel because: * We don't have to design an ATS that is always-up. If you really want to, I'm sure you can eventually build something with good uptime. But it took other projects (HDFS, ZK) lots of hard work to get to that point. * If we reuse HDFS, cluster admins know how to operate HDFS and get good uptime from it. But it'll take training and hard-learned lessons for operators to figure out how to get good uptime from ATS, even after you build an always-up ATS. * All the popular YARN app frameworks (MR, Spark, etc.) already rely on HDFS by default. So do most of the 3rd party applications that I know of. Architecturally, it seems easier for admins to accept that ATS write path depends on HDFS for reliability, instead of a new component that (we claim) will be as reliable as HDFS/ZK. bq. given the whole roadmap of the timeline service, let's think critically of work that can improve the timeline service most significantly. Exactly. Strong +1. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort. ATS doesn't need to be as reliable as HDFS. We don't need to worry about insulating the write path from the read path. We don't need to worry about occasional hiccups in HBase (or whatever the store is). And at the end of all this, everybody sleeps better because "ATS service going down" isn't a catastrophic failure. > [Umbrella] Store, manage and serve per-framework application-timeline data > -- > > Key: YARN-1530 > URL: https://issues.apache.org/jira/browse/YARN-1530 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli > Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, > ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, > application timeline design-20140116.pdf, application timeline > design-20140130.pdf, application timeline design-20140210.pdf > > > This is a sibling JIRA for YARN-321. > Today, each application/framework has to do store, and serve per-framework > data all by itself as YARN doesn't have a common solution. This JIRA attempts > to solve the storage, management and serving of per-framework data from > various applications, both running and finished. The aim is to change YARN to > collect and store data in a generic manner with plugin points for frameworks > to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133483#comment-14133483 ] bc Wong commented on YARN-1530: --- Hi [~zjshen]. My main concern with the write path is: *Does the ATS write path have the right reliability, robustness and scalability so that its failures would not affect my apps?* I'll try to explain it with specific scenarios and technology choices. Then maybe you can tell me if those are valid concerns. First, to make it easy for other readers here, I'm advocating that this event flow:\\ _Client/App -> Reliable channel where event is persisted (HDFS/Kafka) -> ATS_ \\ is a lot better than:\\ _Client/App -> RPC -> ATS_ h4. Scenario 1. ATS service goes down If we use a reliable channel (e.g. HDFS) for writes, then apps do not suffer at all even when the ATS goes down. The ATS service going down is a valid scenario, due to causes ranging from bugs to hardware failures. Having the write path decoupled from the ATS service being up all the time seems a clear win to me. Writing decoupled components is also a good distributed systems design principle. On the other hand, one may argue that _the ATS service will never go down entirely, or is not supposed to go down entirely_, just like we don't expect all the ZK nodes or all the RM nodes to go down. That argument then justifies using direct RPC for writes. Yes, you can design such an ATS service. To this I'll say: * YARN apps already depend on ZK/RM/HDFS being up. Every new service dependency we add will only increase the chances of YARN apps failing or slowing down. That's true even if the ATS service's uptime is as good as ZK or RM. * Realistically, getting the ATS service's uptime to the same level as ZK or HDFS is a long and winding road. Especially when most discussions here assume HBase as the backing store. HBase's uptime is lower than HDFS/ZK/RM because it's more complex to operate. If HBase going down means ATS service going down, then we certainly should guard against this failure scenario. h4. Scenario 2. ATS service partially down If the client writes directly to the ATS service using an unreliable channel (RPC), then the write path will do failover if one of the ATS nodes fails. This transient failure still affects the performance of YARN apps. One can argue that _non-blocking RPC writes resolve this issue_. To this I'll say: * Non-blocking RPC writes only work for *long-duration apps*. We already have short-lived applications, in the range of a few minutes. With Spark getting more popular, this will continue to happen. How short will the app duration get? The answer is a few seconds, if we want YARN to be the generic cluster scheduler. Google already sees that kind of job profile, if you look at their cluster traces. Of course, our scheduler and container allocation needs to get a lot better for that to happen. But I think that's the goal. Our ATS design here should consider short-lived applications. * It sucks if you're running an app that's supposed to finish under a minute, but then the ATS writes are stalled for an extra minute because one ATS node does a failover. Again, we can go back to the counter-argument in scenario #1, about how unlikely this is. I'll repeat that it's more likely than we think. And if we have a choice to decouple the write path from the ATS service, why not? h4. Scenario 3. ATS backing store fails By backing store, I mean the storage system where ATS persists the events, such as LevelDB and HBase. 
In a naive implementation, it seems that if the backing store fails, then the ATS service will be unavailable. Does that mean the event write path will fail, and the YARN apps will stall or fail? I hope not. It's not an issue if we use HDFS as the default write channel, because most YARN apps already depend on HDFS. One may argue that _the ATS service will buffer writes (persist them elsewhere) if the backing store fails_. To this I'll say: * If we have an alternate code path to persist events before they hit the final backing store, why not do that all the time? Such a path will address scenarios #1 and #2 as well. * HBase has been mentioned as if it's the penicillin of event storage here. That is probably true for big shops like Twitter and Yahoo, who have the expertise to operate an HBase cluster well. But most enterprise users or startups don't. We should assume that those HBase instances will run suboptimally with occasional widespread failures. Using HBase for event storage is a poor fit for most people. And I think it's difficult to achieve good uptime for the ATS service as a whole. > [Umbrella] Store, manage and serve per-framework application-timeline data > -- > > Key: YARN-1530 > URL: https://issues.apache.org/jira/browse/Y
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127403#comment-14127403 ] bc Wong commented on YARN-1530: --- bq. The current writing channel allows the data to be available on the timeline server immediately Let's have reliability before speed. I think one of the requirements of ATS is: *The channel for writing events should be reliable.* I'm using *reliable* here in a strong sense, not the TCP-best-effort style reliability. HDFS is reliable. Kafka is reliable. (They are also scalable and robust.) A normal RPC connection is not. I don't want the ATS to be able to slow down my writes, and therefore, my applications, at all. For example, an ATS failover shouldn't pause all my applications for N seconds. A direct RPC to the ATS for writing seems a poor choice in general. Yes, you could make a distributed reliable scalable "ATS service" to accept event writes. But that seems like a lot of work, while we can leverage existing technologies. If the channel itself is pluggable, then we have lots of options. Kafka is a very good choice, for sites that already deploy Kafka and know how to operate it. Using HDFS as a channel is also a good default implementation, since people already know how to scale and manage HDFS. Embedding a Kafka broker with each ATS daemon is also an option, if we're ok with that dependency. > [Umbrella] Store, manage and serve per-framework application-timeline data > -- > > Key: YARN-1530 > URL: https://issues.apache.org/jira/browse/YARN-1530 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli > Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, > ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, > application timeline design-20140116.pdf, application timeline > design-20140130.pdf, application timeline design-20140210.pdf > > > This is a sibling JIRA for YARN-321. > Today, each application/framework has to do store, and serve per-framework > data all by itself as YARN doesn't have a common solution. This JIRA attempts > to solve the storage, management and serving of per-framework data from > various applications, both running and finished. The aim is to change YARN to > collect and store data in a generic manner with plugin points for frameworks > to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
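As a toy illustration of the reliable-channel idea argued for in the comments above, the sketch below appends serialized events to an append-only HDFS file and syncs every write, leaving the ATS to tail such files on its own schedule. The class, path layout, and JSON-lines format are invented for the example and are not part of any ATS API.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Toy sketch of an HDFS-backed write channel; everything here is illustrative. */
public class HdfsEventChannel implements AutoCloseable {

  private final FSDataOutputStream out;

  public HdfsEventChannel(Configuration conf, String appId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // One append-only file per application; the path layout is made up for the example.
    this.out = fs.create(new Path("/ats/events/" + appId + ".jsonl"), false);
  }

  /** Appends one serialized event and syncs it to the DataNode pipeline. */
  public void write(String eventJson) throws IOException {
    out.write((eventJson + "\n").getBytes(StandardCharsets.UTF_8));
    // The event is now durable even if every ATS daemon is down;
    // the ATS can consume the file whenever it comes back.
    out.hsync();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}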
[jira] [Updated] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-596: - Description: In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over its fair share is just as likely to have containers preempted as an application that is over its fair share. was: In the fair scheduler, containers are chosen for preemption in the following way: All containers for all apps that are in queues that are over their fair share are put in a list. The list is sorted in order of the priority that the container was requested in. This means that an application can shield itself from preemption by requesting it's containers at higher priorities, which doesn't really make sense. Also, an application that is not over its fair share, but that is in a queue that is over it's fair share is just as likely to have containers preempted as an application that is over its fair share. > Use scheduling policies throughout the queue hierarchy to decide which > containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Wei Yan > Fix For: 2.5.0 > > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the > following way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting its containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over its fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053457#comment-14053457 ] bc Wong commented on YARN-796: -- [~yufeldman] & [~sdaingade], just read your proposal (LabelBasedScheduling.pdf). I have a few comments: 1. *Would let each node report its own labels.* The current proposal specifies the node-label mapping in a centralized file. This seems operationally unfriendly, as the file is hard to maintain. * You need to get the DNS name right, which could be hard for a multi-homed setup. * The proposal uses regexes on FQDN, such as {{perfnode.*}}. This may work if the hostnames are set up by IT like that. But in reality, I've seen lots of sites where the FQDN is like {{stmp09wk0013.foobar.com}}, where "stmp" refers to the data center, and "wk0013" refers to "worker 13", and other weird stuff like that. Now imagine a centralized node-label mapping file with 2000 nodes with such names. It'd be a nightmare. Instead, each node can supply its own labels, via {{yarn.nodemanager.node.labels}} (which specifies labels directly) or {{yarn.nodemanager.node.labelFile}} (which points to a file that has a single line containing all the labels). It's easy to generate the label file for each node. The admin can have puppet push it out, or populate it when the VM is built, or compute it in a local script by inspecting /proc. (Oh I have 192GB, so add the label "largeMem".) There is little room for mistakes. The NM can still periodically refresh its own labels, and update the RM via the heartbeat mechanism. The RM should also expose a "node label report", which is the real-time information of all nodes and their labels. 2. *Labels are per-container, not per-app. Right?* The doc keeps mentioning "application label", "ApplicationLabelExpression", etc. Should those be "container label" instead? I just want to confirm that each container request can carry its own label expression. Example use case: Only the mappers need GPU, not the reducers. 3. *Can we fail container requests with no satisfying nodes?* In "Considerations, #5", you wrote that the app would be in waiting state. Seems that a fail-fast behaviour would be better. If no node can satisfy the label expression, then it's better to tell the client "no". Very likely somebody made a typo somewhere. > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, YARN-796.patch > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
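A toy version of the per-node "discovery script" from point 1, in Java for illustration. The output path and thresholds are arbitrary, and {{yarn.nodemanager.node.labelFile}} is the setting proposed in the comment, not an existing YARN configuration key:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Toy label-discovery "script"; thresholds and the output path are arbitrary. */
public class NodeLabelDiscovery {

  public static void main(String[] args) throws IOException {
    List<String> labels = new ArrayList<>();
    labels.add(System.getProperty("os.arch"));               // e.g. "amd64"

    // Inspect /proc to decide on capability labels, e.g. "largeMem" for big boxes.
    for (String line : Files.readAllLines(Paths.get("/proc/meminfo"),
                                          StandardCharsets.UTF_8)) {
      if (line.startsWith("MemTotal:")) {
        long memKb = Long.parseLong(line.replaceAll("\\D+", ""));
        if (memKb > 128L * 1024 * 1024) {                    // more than 128 GB of RAM
          labels.add("largeMem");
        }
      }
    }

    // Single line of labels; the NM would point at this file via the proposed
    // yarn.nodemanager.node.labelFile setting and re-read it periodically.
    Files.write(Paths.get("/etc/hadoop/node.labels"),
                String.join(",", labels).getBytes(StandardCharsets.UTF_8));
  }
}
{code}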
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049414#comment-14049414 ] bc Wong commented on YARN-941: -- I'm fine with [~xgong]'s solution. I'd still like to see something more generic to make tokens (HDFS token, HBase token, etc) work with long running apps though. Perhaps I'll pursue the "arbitrary expiration time" approach in another jira. {quote} RPC privacy is a very expensive solution for AM-RM communication. First, it needs setup so AM/RM have access to key infrastructure - having this burden on all applications is not reasonable. This is compounded by the fact that we use AMRMTokens in non-secure mode too. Second, AM - RM communication is a very chatty protocol, it's likely the overhead is huge.. {quote} True security is often costly. The web/consumer industry went through the same exercise with HTTP vs HTTPS. You can get at least 10x better performance with HTTP. But in the end, everybody decided that it's worth it. And passing tokens around without RPC privacy is just like sending passwords around on HTTP without SSL. {quote} Unfortunately with long running services (the focus of this JIRA), this attack and its success is not as unlikely. This is the very reason why we roll master-keys every so often in the first place. {quote} With the rolling master key, it's unlikely for the attacker to gather enough cipher text to mount that attack. Besides, a longer key would require so much computation to attack that it'd be infeasible. Anyway, appreciate your response, and I'll follow up in another jira. > RM Should have a way to update the tokens it has for a running application > -- > > Key: YARN-941 > URL: https://issues.apache.org/jira/browse/YARN-941 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Joseph Evans >Assignee: Xuan Gong > Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, > YARN-941.preview.4.patch, YARN-941.preview.patch > > > When an application is submitted to the RM it includes with it a set of > tokens that the RM will renew on behalf of the application, that will be > passed to the AM when the application is launched, and will be used when > launching the application to access HDFS to download files on behalf of the > application. > For long lived applications/services these tokens can expire, and then the > tokens that the AM has will be invalid, and the tokens that the RM had will > also not work to launch a new AM. > We need to provide an API that will allow the RM to replace the current > tokens for this application with a new set. To avoid any real race issues, I > think this API should be something that the AM calls, so that the client can > connect to the AM with a new set of tokens it got using kerberos, then the AM > can inform the RM of the new set of tokens and quickly update its tokens > internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045062#comment-14045062 ] bc Wong commented on YARN-941: -- Hi [~vinodkv], could you elaborate more on the point about master keys? {quote} Once we roll the master-keys, together with the fact that we want to support services that run for ever, the only way we can support not expiring tokens is by making ResourceManager remember master-keys for ever which is not feasible. {quote} The server side currently remembers the <token, password> pair for all valid tokens. The master key isn't involved in the token verification process, from what I understand. So if we allow arbitrary expiration time for AMRM tokens (to be specified by applications), then: * RM needs to persist the <token, password> pairs for all tokens that correspond to a running application. * Once an application finishes, RM can invalidate its token, and forget about it. * RM can keep rolling the master key, since it only affects how the "password" (aka token secret) is generated. Is my understanding of the password/master key interaction correct? This "arbitrary expiration time" idea is conceptually a lot simpler than the "token replacement patch" in this jira. So I feel it's worth a bit more discussion. There are a few obvious attack vectors: # *The attacker gains access to the persistence store, where the RM stores its <token, password> map.* In this case, all bets are off. Neither solution is more secure than the other. # *The attacker snoops an insecure RPC channel and reads valid tokens from the network.* The proper solution is to turn on RPC privacy. The token replacement patch does not offer any real protection. On the contrary, it may give people a _false sense of security_, which would be worse. # *The attacker mounts a cryptographic attack, or somehow manages to guess a valid <token, password> pair.* Token replacement is better because it limits the exposure. But this attack is very unlikely. And we can counter that by using a stronger hash function. To me, the "arbitrary expiration time" approach is a lot simpler, without compromising on security. What do you think? > RM Should have a way to update the tokens it has for a running application > -- > > Key: YARN-941 > URL: https://issues.apache.org/jira/browse/YARN-941 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Joseph Evans >Assignee: Xuan Gong > Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, > YARN-941.preview.4.patch, YARN-941.preview.patch > > > When an application is submitted to the RM it includes with it a set of > tokens that the RM will renew on behalf of the application, that will be > passed to the AM when the application is launched, and will be used when > launching the application to access HDFS to download files on behalf of the > application. > For long lived applications/services these tokens can expire, and then the > tokens that the AM has will be invalid, and the tokens that the RM had will > also not work to launch a new AM. > We need to provide an API that will allow the RM to replace the current > tokens for this application with a new set. To avoid any real race issues, I > think this API should be something that the AM calls, so that the client can > connect to the AM with a new set of tokens it got using kerberos, then the AM > can inform the RM of the new set of tokens and quickly update its tokens > internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
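A generic sketch of the password/master-key interaction described above, not the actual AMRMTokenSecretManager code: the password is derived from the current master key only when the token is issued, verification consults the remembered pairs, and the master key can keep rolling without invalidating live tokens.
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Generic illustration only; not the YARN AMRMTokenSecretManager. */
public class TokenStore {

  private volatile SecretKeySpec masterKey =
      new SecretKeySpec("initial-master-key".getBytes(StandardCharsets.UTF_8), "HmacSHA1");

  // The <token, password> map that would be persisted for running applications.
  private final Map<String, byte[]> passwords = new ConcurrentHashMap<>();

  /** The password is derived from the current master key only at issue time. */
  public byte[] issueToken(String tokenIdentifier) throws GeneralSecurityException {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(masterKey);
    byte[] password = mac.doFinal(tokenIdentifier.getBytes(StandardCharsets.UTF_8));
    passwords.put(tokenIdentifier, password);
    return password;
  }

  /** Verification only consults the stored pair, so the master key can keep rolling. */
  public boolean verify(String tokenIdentifier, byte[] presented) {
    byte[] expected = passwords.get(tokenIdentifier);
    return expected != null && MessageDigest.isEqual(expected, presented);
  }

  /** Rolling the key affects only how future passwords are generated. */
  public void rollMasterKey(byte[] newKey) {
    masterKey = new SecretKeySpec(newKey, "HmacSHA1");
  }

  /** Once an application finishes, its token can simply be forgotten. */
  public void revoke(String tokenIdentifier) {
    passwords.remove(tokenIdentifier);
  }
}
{code}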
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008277#comment-14008277 ] bc Wong commented on YARN-796: -- Having the NMs specify their own labels is probably better from an administrative point of view. It's harder for the labels to get out of sync. Each node can have a "discovery script" that updates its labels, which feeds into the NM. So an admin can take a bunch of nodes out for upgrade, and put them back in without having to carefully reconfigure any central mapping file. > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: YARN-796.patch > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003440#comment-14003440 ] bc Wong commented on YARN-941: -- Hi [~xgong], thanks for the patch! I'm interested in talking through the changes and their security implications, for everybody who's following along. I think the following are worth highlighting: # The token update mechanism is via the AM heartbeat. So if the previous AMRM token has been compromised, the attacker can get the new token. ** I don't think it's a big problem as the RM will only hand out the new token in _exactly_ one AllocateResponse (except for the case of RM restart). So if the attacker has the new token, the real AM won't, and it'll die and the token will get revoked. # How frequently a running AM gets an updated token is at the mercy of the configuration (the roll interval and activation delay). In addition, whenever the RM restarts, all AMs will get a new token on the next heartbeat. ** Should the RM check that the roll interval and activation delay are both shorter than the token expiration interval? # The client app is not responsible for renewing the token. The RM will renew it proactively and update the apps. ** The loss of control may be inconvenient to the app. The AM must also heartbeat frequently enough to catch the update in time. In practice, it's not an issue. But it still makes me slightly uncomfortable, since the client is usually the one renewing its credentials in all the other security protocols I know of. Here, the RM doesn't have any explicit logic to update an AMRM token before it expires. The math just generally works out if the admin sets the token expiry, roll interval and activation delay to the right values.\\ \\ Again, I think this is better than making it the AM's responsibility to get a new token, which is more burden on the AM. I just want to bring this up for discussion. > RM Should have a way to update the tokens it has for a running application > -- > > Key: YARN-941 > URL: https://issues.apache.org/jira/browse/YARN-941 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Joseph Evans >Assignee: Xuan Gong > Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, > YARN-941.preview.patch > > > When an application is submitted to the RM it includes with it a set of > tokens that the RM will renew on behalf of the application, that will be > passed to the AM when the application is launched, and will be used when > launching the application to access HDFS to download files on behalf of the > application. > For long lived applications/services these tokens can expire, and then the > tokens that the AM has will be invalid, and the tokens that the RM had will > also not work to launch a new AM. > We need to provide an API that will allow the RM to replace the current > tokens for this application with a new set. To avoid any real race issues, I > think this API should be something that the AM calls, so that the client can > connect to the AM with a new set of tokens it got using kerberos, then the AM > can inform the RM of the new set of tokens and quickly update its tokens > internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2010) RM can't transition to active if it can't recover an app attempt
bc Wong created YARN-2010: - Summary: RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1913) Cluster logjam when all resources are consumed by AM
bc Wong created YARN-1913: - Summary: Cluster logjam when all resources are consumed by AM Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a "maxApplicationMasterShare" config. -- This message was sent by Atlassian JIRA (v6.2#6252)
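One possible shape of the proposed "maxApplicationMasterShare" check, sketched for illustration only (names are invented and the real scheduler change may look quite different): before launching another AM container, the scheduler verifies that total AM resources stay under a configured fraction of the cluster.
{code:java}
/** Illustrative only: tracks AM resource usage against a configured cluster share. */
public class AmShareLimiter {

  private final double maxApplicationMasterShare;   // e.g. 0.5 = at most half the cluster
  private long amMemoryInUse;                       // memory held by running AM containers

  public AmShareLimiter(double maxApplicationMasterShare) {
    this.maxApplicationMasterShare = maxApplicationMasterShare;
  }

  /** Returns true if launching another AM would still leave room for worker containers. */
  public synchronized boolean canLaunchAm(long amMemoryRequested, long clusterMemory) {
    long limit = (long) (clusterMemory * maxApplicationMasterShare);
    return amMemoryInUse + amMemoryRequested <= limit;
  }

  public synchronized void amLaunched(long amMemory) {
    amMemoryInUse += amMemory;
  }

  public synchronized void amFinished(long amMemory) {
    amMemoryInUse -= amMemory;
  }
}
{code}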
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch Same patch with --no-prefix. > FairSchedule UI not showing apps table > -- > > Key: YARN-1790 > URL: https://issues.apache.org/jira/browse/YARN-1790 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: bc Wong > Attachments: > 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, > fs_ui_fixed.png > > > There is a running job, which shows up in the summary table in the > FairScheduler UI, the queue display, etc. Just not in the apps table at the > bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: (was: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch) > FairSchedule UI not showing apps table > -- > > Key: YARN-1790 > URL: https://issues.apache.org/jira/browse/YARN-1790 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: bc Wong > Attachments: > 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, > fs_ui_fixed.png > > > There is a running job, which shows up in the summary table in the > FairScheduler UI, the queue display, etc. Just not in the apps table at the > bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: fs_ui_fixed.png 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch Trivial fix. Also ported YARN-563 to FairScheduler UI. Tested manually (see screenshot). > FairSchedule UI not showing apps table > -- > > Key: YARN-1790 > URL: https://issues.apache.org/jira/browse/YARN-1790 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: bc Wong > Attachments: > 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, > fs_ui_fixed.png > > > There is a running job, which shows up in the summary table in the > FairScheduler UI, the queue display, etc. Just not in the apps table at the > bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922132#comment-13922132 ] bc Wong commented on YARN-1790: --- Seems that the fix of YARN-1407 forgot to change the FairSchedulerAppsBlock to use the user-facing app state. > FairSchedule UI not showing apps table > -- > > Key: YARN-1790 > URL: https://issues.apache.org/jira/browse/YARN-1790 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: bc Wong > Attachments: fs_ui.png > > > There is a running job, which shows up in the summary table in the > FairScheduler UI, the queue display, etc. Just not in the apps table at the > bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: fs_ui.png > FairSchedule UI not showing apps table > -- > > Key: YARN-1790 > URL: https://issues.apache.org/jira/browse/YARN-1790 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: bc Wong > Attachments: fs_ui.png > > > There is a running job, which shows up in the summary table in the > FairScheduler UI, the queue display, etc. Just not in the apps table at the > bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1790) FairSchedule UI not showing apps table
bc Wong created YARN-1790: - Summary: FairSchedule UI not showing apps table Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: fs_ui.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1785) FairScheduler treats app lookup failures as ERRORs
[ https://issues.apache.org/jira/browse/YARN-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1785: -- Attachment: 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch Attaching "0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch". Verified via manual testing. > FairScheduler treats app lookup failures as ERRORs > -- > > Key: YARN-1785 > URL: https://issues.apache.org/jira/browse/YARN-1785 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: bc Wong > Attachments: > 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch > > > When invoking the /ws/v1/cluster/apps endpoint, RM will eventually get to > RMAppImpl#createAndGetApplicationReport, which calls > RMAppAttemptImpl#getApplicationResourceUsageReport, which looks up the app in > the scheduler, which may or may not exist. So FairScheduler shouldn't log an > error for every lookup failure: > {noformat} > 2014-02-17 08:23:21,240 ERROR > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Request for appInfo of unknown attemptappattempt_1392419715319_0135_01 > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1785) FairScheduler treats app lookup failures as ERRORs
bc Wong created YARN-1785: - Summary: FairScheduler treats app lookup failures as ERRORs Key: YARN-1785 URL: https://issues.apache.org/jira/browse/YARN-1785 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: bc Wong When invoking the /ws/v1/cluster/apps endpoint, RM will eventually get to RMAppImpl#createAndGetApplicationReport, which calls RMAppAttemptImpl#getApplicationResourceUsageReport, which looks up the app in the scheduler, which may or may not exist. So FairScheduler shouldn't log an error for every lookup failure: {noformat} 2014-02-17 08:23:21,240 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1392419715319_0135_01 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)