[jira] [Commented] (YARN-1487) How to develop with Eclipse
[ https://issues.apache.org/jira/browse/YARN-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844068#comment-13844068 ] Yang Hao commented on YARN-1487: When I compile the plugin, there are some errors, as follows:
{noformat}
[ivy:resolve] ::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::
[ivy:resolve] :: org.apache.hadoop#hadoop-mapreduce-client-jobclient;2.2.0: not found
[ivy:resolve] :: org.apache.hadoop#hadoop-mapreduce-client-core;2.2.0: not found
[ivy:resolve] :: org.apache.hadoop#hadoop-mapreduce-client-common;2.2.0: not found
[ivy:resolve] :: org.apache.hadoop#hadoop-hdfs;2.2.0: not found
[ivy:resolve] :: org.apache.hadoop#hadoop-common;2.2.0: not found
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
{noformat}
> How to develop with Eclipse > --- > > Key: YARN-1487 > URL: https://issues.apache.org/jira/browse/YARN-1487 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications >Affects Versions: 2.2.0 > Environment: Linux, Hadoop 2 >Reporter: Yang Hao > Labels: eclipse, plugin, yarn > Fix For: 2.2.0 > > > We can develop an application in Eclipse, but an Eclipse plugin is not > provided for Hadoop 2. Will the new version provide an Eclipse plugin for > developers? -- This message was sent by Atlassian JIRA (v6.1.4#6159)
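[Editor's note] Unresolved-dependency errors like the ones above typically mean the Ivy configuration has no Maven-compatible resolver, while the Hadoop 2.x artifacts are published to Maven Central rather than an Ivy repository. A minimal sketch of an ivysettings.xml that adds one; the resolver name, and the assumption that the plugin build honors a custom ivysettings file, are illustrative rather than taken from the report:

```
<!-- Hypothetical ivysettings.xml sketch: resolve Hadoop 2.x artifacts
     from Maven Central via Ivy's Maven-compatible ibiblio resolver. -->
<ivysettings>
  <settings defaultResolver="chain"/>
  <resolvers>
    <chain name="chain">
      <!-- m2compatible lets Ivy translate org.apache.hadoop#hadoop-common
           into the Maven groupId/artifactId layout -->
      <ibiblio name="central" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
```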
[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844110#comment-13844110 ] Karthik Kambatla commented on YARN-1029: Manually testing the posted patch on a cluster showed that automatic failover works. However, automatic failover fails to take over after an explicit manual failover. To address this, RMActiveStandbyElector should implement ZKFCProtocol, and RMHAServiceTarget#getZKFCProxy should return a proxy to it. I will address this and other minor details in the next patch. > Allow embedding leader election into the RM > --- > > Key: YARN-1029 > URL: https://issues.apache.org/jira/browse/YARN-1029 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1029-approach.patch > > > It should be possible to embed the common ActiveStandbyElector into the RM such > that ZooKeeper-based leader election and notification is built in. In > conjunction with a ZK state store, this configuration will be a simple > deployment option. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844117#comment-13844117 ] Bikas Saha commented on YARN-1029: -- What are the pros and cons of embedding ZKFC vs. ActiveStandbyElector? If ActiveStandbyElector has to implement the ZKFC protocol, then are we better off just embedding ZKFC directly? > Allow embedding leader election into the RM > --- > > Key: YARN-1029 > URL: https://issues.apache.org/jira/browse/YARN-1029 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1029-approach.patch > > > It should be possible to embed the common ActiveStandbyElector into the RM such > that ZooKeeper-based leader election and notification is built in. In > conjunction with a ZK state store, this configuration will be a simple > deployment option. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1481) ResourceManager and AdminService interact in a convoluted manner after YARN-1318
[ https://issues.apache.org/jira/browse/YARN-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844127#comment-13844127 ] Karthik Kambatla commented on YARN-1481: Thanks [~vinodkv]. The patch looks good to me. One minor nit: {{AdminService#isRMActive()}} need not be synchronized. I am okay with addressing the nit in another HA JIRA - maybe YARN-1029. Otherwise, +1. I will wait for any comments until the end of the day and then commit it. > ResourceManager and AdminService interact in a convoluted manner after > YARN-1318 > > > Key: YARN-1481 > URL: https://issues.apache.org/jira/browse/YARN-1481 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > Attachments: YARN-1481-20131207.txt, YARN-1481-20131209.txt > > > This is something I found while reviewing YARN-1318, but didn't halt that > patch as many cycles went there already. Some top level issues > - Not easy to follow RM's service life cycle > -- RM adds only AdminService as its service directly. > -- Other services are added to RM when AdminService's init calls > RM.activeServices.init() > - Overall, AdminService shouldn't encompass all of RM's HA state management. > It was originally supposed to be the implementation of just the RPC server. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
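[Editor's note] On the nit above: a read-only check of HA state does not need method-level synchronization if the underlying flag is visible across threads. A minimal sketch of the idea; the field and method bodies here are illustrative, not the actual AdminService code:

```java
// Minimal illustration of the synchronization nit: a single boolean state
// check does not need a synchronized method when the field is volatile.
// Names and state handling are a hypothetical sketch, not YARN's code.
public class IsActiveSketch {
    private volatile boolean active; // volatile gives read visibility without locking

    // Instead of: public synchronized boolean isRMActive() { ... }
    public boolean isRMActive() {
        return active; // a plain read of a volatile field is safe
    }

    public void transitionToActive() {
        active = true;
    }

    public void transitionToStandby() {
        active = false;
    }
}
```

The volatile keyword guarantees that a transition performed by one thread is immediately visible to readers on other threads, which is all a boolean status getter needs.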
[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844182#comment-13844182 ] Karthik Kambatla commented on YARN-1029: Correction: actually, it is the AdminService that will have to implement ZKFCProtocol, not ActiveStandbyElector. bq. What are the pros and cons of using ZKFC embedded vs ActiveStandbyElector? Indeed, my first implementation embedded ZKFC. While it works fine, I found it roundabout, with some avoidable overhead. Embedding ActiveStandbyElector definitely seems like a simpler, cleaner approach.
Cons of ZKFC / Pros of ActiveStandbyElector:
# ZKFC communicates with the RM through RPC; when embedded, both are in the same process.
# In addition to ActiveStandbyElector, ZKFC has other overheads - health monitoring, fencing, etc. - which might not be required in a simple embedded option.
# ZKFC#formatZK() needs to be exposed through rmadmin, which complicates it further.
# Embedding ZKFC isn't very clean.
Cons of ActiveStandbyElector: AFAIK, the only drawback of ActiveStandbyElector is having AdminService implement ZKFCProtocol - two methods: cedeActive() and gracefulFailover(). These methods are simple and straightforward, and are needed only to be able to safely fail over manually when automatic failover is enabled. > Allow embedding leader election into the RM > --- > > Key: YARN-1029 > URL: https://issues.apache.org/jira/browse/YARN-1029 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1029-approach.patch > > > It should be possible to embed the common ActiveStandbyElector into the RM such > that ZooKeeper-based leader election and notification is built in. In > conjunction with a ZK state store, this configuration will be a simple > deployment option. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
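[Editor's note] The two ZKFCProtocol methods named above are small. A self-contained sketch of what the AdminService side might look like; the interface below is a stub stand-in for Hadoop's org.apache.hadoop.ha.ZKFCProtocol (the real one also declares thrown exceptions), and the state handling is hypothetical, not the actual patch:

```java
// Stub stand-in for org.apache.hadoop.ha.ZKFCProtocol; only the two
// methods discussed in the comment above.
interface ZKFCProtocol {
    // Quit the leader election for millisToCede ms so another node can
    // become active; a negative value cedes indefinitely.
    void cedeActive(int millisToCede);

    // Coordinate a manual failover while automatic failover stays enabled.
    void gracefulFailover();
}

// Hypothetical sketch of AdminService picking up that surface. The real
// implementation would drive an embedded ActiveStandbyElector; here a
// boolean stands in for election membership.
public class AdminServiceSketch implements ZKFCProtocol {
    private volatile boolean inElection = true;

    @Override
    public void cedeActive(int millisToCede) {
        // Drop out of the election; a real elector would rejoin
        // automatically once the cede period expires.
        inElection = false;
    }

    @Override
    public void gracefulFailover() {
        // Ask the current active to cede and re-enter the election so
        // leadership can transfer without disabling automatic failover.
        inElection = true;
    }

    public boolean isInElection() {
        return inElection;
    }
}
```

The point of the sketch is simply that the surface area is two methods of coordination logic, which is the "simpler, cleaner approach" argument in the comment above.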
[jira] [Commented] (YARN-1448) AM-RM protocol changes to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844189#comment-13844189 ] Hudson commented on YARN-1448: -- FAILURE: Integrated in Hadoop-Yarn-trunk #417 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/417/]) YARN-1448. AM-RM protocol changes to support container resizing (Wangda Tan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1549627)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateRequest.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateRequestPBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestAllocateRequest.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestAllocateResponse.java
> AM-RM protocol changes to support container resizing > > > Key: YARN-1448 > URL: https://issues.apache.org/jira/browse/YARN-1448 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.2.0 >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.4.0 > > Attachments: yarn-1448.1.patch, yarn-1448.2.patch, yarn-1448.3.patch > > > As described in YARN-1197, we need to add APIs in the RM to support: > 1) an increase request in AllocateRequest, and > 2) getting successfully increased/decreased containers back from the RM in > AllocateResponse -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844209#comment-13844209 ] Hudson commented on YARN-1053: -- FAILURE: Integrated in Hadoop-Hdfs-0.23-Build #816 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/816/]) svn merge -c 1543973 FIXES: YARN-1053. Diagnostic message from ContainerExitEvent is ignored in ContainerImpl. Contributed by Omkar Vinit Joshi (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1549691)
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java
> Diagnostic message from ContainerExitEvent is ignored in ContainerImpl > -- > > Key: YARN-1053 > URL: https://issues.apache.org/jira/browse/YARN-1053 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0, 2.3.0 >Reporter: Omkar Vinit Joshi >Assignee: Omkar Vinit Joshi >Priority: Blocker > Labels: newbie > Fix For: 2.4.0, 0.23.11 > > Attachments: YARN-1053.1.patch, YARN-1053.20130809.patch > > > If the container launch fails then we send ContainerExitEvent. This event > contains exitCode and diagnostic message. Today we are ignoring diagnostic > message while handling this event inside ContainerImpl. Fixing it as it is > useful in diagnosing the failure. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1448) AM-RM protocol changes to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844271#comment-13844271 ] Hudson commented on YARN-1448: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1608 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1608/]) YARN-1448. AM-RM protocol changes to support container resizing (Wangda Tan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1549627)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateRequest.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateRequestPBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestAllocateRequest.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestAllocateResponse.java
> AM-RM protocol changes to support container resizing > > > Key: YARN-1448 > URL: https://issues.apache.org/jira/browse/YARN-1448 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.2.0 >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.4.0 > > Attachments: yarn-1448.1.patch, yarn-1448.2.patch, yarn-1448.3.patch > > > As described in YARN-1197, we need to add APIs in the RM to support: > 1) an increase request in AllocateRequest, and > 2) getting successfully increased/decreased containers back from the RM in > AllocateResponse -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1448) AM-RM protocol changes to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844306#comment-13844306 ] Hudson commented on YARN-1448: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1634 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1634/]) YARN-1448. AM-RM protocol changes to support container resizing (Wangda Tan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1549627)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateRequest.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateRequestPBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestAllocateRequest.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestAllocateResponse.java
> AM-RM protocol changes to support container resizing > > > Key: YARN-1448 > URL: https://issues.apache.org/jira/browse/YARN-1448 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, resourcemanager >Affects Versions: 2.2.0 >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.4.0 > > Attachments: yarn-1448.1.patch, yarn-1448.2.patch, yarn-1448.3.patch > > > As described in YARN-1197, we need to add APIs in the RM to support: > 1) an increase request in AllocateRequest, and > 2) getting successfully increased/decreased containers back from the RM in > AllocateResponse -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844344#comment-13844344 ] Arun C Murthy commented on YARN-1404: - I've spent time thinking about this in the context of running a myriad of external systems in YARN such as Impala, HDFS Caching (HDFS-4949) and some others. The overarching goal is to allow YARN to act as a ResourceManager for the overall cluster *and* a Workload Manager for external systems i.e. this way Impala or HDFS can rely on YARN's queues for workload management, SLAs via preemption etc. Is that a good characterization of the problem at hand? I think it's a good goal to support - this will allow other external systems to leverage YARN's capabilities for both resource sharing and workload management. Now, if we all agree on this, we can figure out the best way to support it in a first-class manner. Ok, the core requirement is for an external system (Impala, HDFS, others) to leverage YARN's workload management capabilities (queues etc.) to acquire resources (cpu, memory) *on behalf* of a particular entity (user, queue) for completing a user's request (run a query, cache a dataset in RAM). The *key* is that these external systems need to acquire resources on behalf of the user and ensure that the chargeback is applied to the correct user, queue etc. This is a *brand new requirement* for YARN... so far, we have assumed that the entity acquiring the resource would also be actually utilizing the resource by launching a container etc. Here, it's clear that the requirement is that the entity acquiring the resource would like to *delegate* the resource to an external framework. For example:
# A user query would like to acquire cpu, memory etc. for appropriate accounting chargeback and then delegate it to Impala.
# A user request for caching data would like to acquire memory for appropriate accounting chargeback and then delegate to the Datanode.
In this scenario, I think explicitly allowing for *delegation* of a container would solve the problem in a first-class manner. We should add a new API to the NodeManager which would allow an application to *delegate* a container's resources to a different container:
{code:title=ContainerManagementProtocol.java|borderStyle=solid}
public interface ContainerManagementProtocol {
  // ...
  public DelegateContainerResponse delegateContainer(DelegateContainerRequest request);
  // ...
}
{code}
{code:title=DelegateContainerRequest.java|borderStyle=solid}
public abstract class DelegateContainerRequest {
  // ...
  public ContainerLaunchContext getSourceContainer();
  public ContainerId getTargetContainer();
  // ...
}
{code}
The implementation of this API would notify the NodeManager to change its monitoring of the recipient container (i.e. Impala or the Datanode) by modifying the cgroup of the recipient container. Similarly, the NodeManager could be instructed by the ResourceManager to preempt the resources of the source container to continue serving the global SLAs of the queues - again, implemented by modifying the cgroup of the recipient container. This will allow the ResourceManager/NodeManager to be explicitly in control of resources, even in the face of misbehaving AMs etc. The result of the above proposal is very similar to what is already being discussed, the only difference being that this is explicit (the NodeManager knows the source and recipient containers), and this allows all existing features such as preemption, over-allocation of resources to YARN queues etc. to continue to work as they do today. Thoughts?
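[Editor's note] The cgroup side of the proposal above can be sketched concretely. The snippet below illustrates a NodeManager-style delegation of CPU shares between two container cgroups using the cgroup-v1 cpu controller's cpu.shares file; the delegation semantics (adding the source's shares to the recipient and flooring the source at the cgroup minimum of 2), the class, and the configurable root directory are illustrative assumptions, not the proposed implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of delegating a container's CPU weight to another
// container by rewriting cpu.shares (cgroup-v1 cpu controller layout).
// The cgroup root is a parameter so the demo can run against a temp
// directory instead of /sys/fs/cgroup/cpu.
public class CgroupDelegationSketch {
    private final Path cgroupRoot;

    public CgroupDelegationSketch(Path cgroupRoot) {
        this.cgroupRoot = cgroupRoot;
    }

    private Path sharesFile(String containerId) {
        return cgroupRoot.resolve(containerId).resolve("cpu.shares");
    }

    public long readShares(String containerId) throws IOException {
        return Long.parseLong(Files.readString(sharesFile(containerId)).trim());
    }

    public void writeShares(String containerId, long shares) throws IOException {
        Files.createDirectories(cgroupRoot.resolve(containerId));
        Files.writeString(sharesFile(containerId), Long.toString(shares));
    }

    // Delegate: add the source container's CPU shares to the recipient and
    // drop the source to the cgroup-v1 minimum so it stops competing.
    public void delegate(String sourceId, String targetId) throws IOException {
        long src = readShares(sourceId);
        long dst = readShares(targetId);
        writeShares(targetId, dst + src);
        writeShares(sourceId, 2); // 2 is the minimum cpu.shares in cgroup-v1
    }
}
```

Preemption would run the same mechanism in reverse: the RM instructs the NM to shrink the recipient's cgroup, so enforcement stays with the NM rather than with the delegating AM.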
> Enable external systems/frameworks to share resources with Hadoop leveraging > Yarn resource scheduling > - > > Key: YARN-1404 > URL: https://issues.apache.org/jira/browse/YARN-1404 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.2.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Attachments: YARN-1404.patch > > > Currently Hadoop Yarn expects to manage the lifecycle of the processes its > applications run workload in. External frameworks/systems could benefit from > sharing resources with other Yarn applications while running their workload > within long-running processes owned by the external framework (in other > words, running their workload outside of the context of a Yarn container > process). > Because Yarn provides robust and scalable resource management, it is > desirable for some external systems to leverage the resource governance > capabilities of Yarn (queues, capacities, scheduling, access control) while > supplying their own resource enforcement. > Impala is an exam
[jira] [Created] (YARN-1488) Allow containers to delegate resources to another container
Arun C Murthy created YARN-1488: --- Summary: Allow containers to delegate resources to another container Key: YARN-1488 URL: https://issues.apache.org/jira/browse/YARN-1488 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy We should allow containers to delegate resources to another container. This would allow external frameworks to share not just YARN's resource-management capabilities but also its workload-management capabilities. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1488) Allow containers to delegate resources to another container
[ https://issues.apache.org/jira/browse/YARN-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844348#comment-13844348 ] Arun C Murthy commented on YARN-1488: - We should add a new API to the NodeManager which would allow an application to *delegate* a container's resources to a different container:
{code:title=ContainerManagementProtocol.java|borderStyle=solid}
public interface ContainerManagementProtocol {
  // ...
  public DelegateContainerResponse delegateContainer(DelegateContainerRequest request);
  // ...
}
{code}
{code:title=DelegateContainerRequest.java|borderStyle=solid}
public abstract class DelegateContainerRequest {
  // ...
  public ContainerLaunchContext getSourceContainer();
  public ContainerId getTargetContainer();
  // ...
}
{code}
The implementation of this API would notify the NodeManager to change its monitoring of the recipient container (i.e. Impala or the Datanode) by modifying the cgroup of the recipient container. Similarly, the NodeManager could be instructed by the ResourceManager to preempt the resources of the source container to continue serving the global SLAs of the queues - again, implemented by modifying the cgroup of the recipient container. This will allow the ResourceManager/NodeManager to be explicitly in control of resources, even in the face of misbehaving AMs etc. > Allow containers to delegate resources to another container > --- > > Key: YARN-1488 > URL: https://issues.apache.org/jira/browse/YARN-1488 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > > We should allow containers to delegate resources to another container. This > would allow external frameworks to share not just YARN's resource-management > capabilities but also its workload-management capabilities. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844352#comment-13844352 ] Arun C Murthy commented on YARN-1404: - I've opened YARN-1488 to track delegation of container resources. > Enable external systems/frameworks to share resources with Hadoop leveraging > Yarn resource scheduling > - > > Key: YARN-1404 > URL: https://issues.apache.org/jira/browse/YARN-1404 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.2.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Attachments: YARN-1404.patch > > > Currently Hadoop Yarn expects to manage the lifecycle of the processes its > applications run workload in. External frameworks/systems could benefit from > sharing resources with other Yarn applications while running their workload > within long-running processes owned by the external framework (in other > words, running their workload outside of the context of a Yarn container > process). > Because Yarn provides robust and scalable resource management, it is > desirable for some external systems to leverage the resource governance > capabilities of Yarn (queues, capacities, scheduling, access control) while > supplying their own resource enforcement. > Impala is an example of such a system. Impala uses Llama > (http://cloudera.github.io/llama/) to request resources from Yarn. > Impala runs an impalad process on every node of the cluster; when a user > submits a query, the processing is broken into 'query fragments' which are > run in multiple impalad processes, leveraging data locality (similar to > Map-Reduce Mappers processing a collocated HDFS block of input data). > The execution of a 'query fragment' requires an amount of CPU and memory in > the impalad, and the impalad shares the host with other services (HDFS > DataNode, Yarn NodeManager, HBase Region Server) and Yarn applications > (MapReduce tasks). > To ensure cluster utilization follows the Yarn scheduler policies and does > not overload the cluster nodes, before running a 'query fragment' on a > node, Impala requests the required amount of CPU and memory from Yarn. Once > the requested CPU and memory have been allocated, Impala starts running the > 'query fragment', taking care that the 'query fragment' does not use more > resources than have been allocated. Memory is bookkept per > 'query fragment', and the threads used for processing a 'query > fragment' are placed under a cgroup to contain CPU utilization. > Today, for every resource that has been requested from the Yarn RM, a (container) > process must be started via the corresponding NodeManager. Failing to do > this will result in the cancellation of the container allocation, > relinquishing the acquired resource capacity back to the pool of available > resources. To avoid this, Impala starts a dummy container process doing > 'sleep 10y'. > Using a dummy container process has its drawbacks: > * the dummy container process is in a cgroup with a given number of CPU > shares that are not used, and Impala re-issues those CPU shares to another > cgroup for the thread running the 'query fragment'. The cgroup CPU > enforcement works correctly because of the CPU controller implementation (but > the formally specified behavior is actually undefined). > * Impala may ask for CPU and memory independent of each other. Some requests > may be memory-only with no CPU, or vice versa. Because a container requires a > process, complete absence of memory or CPU is not possible even if the dummy > process is 'sleep'; a minimal amount of memory and CPU is required for the > dummy process. > Because of this, it is desirable to be able to have a container without a > backing process. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844354#comment-13844354 ] Arun C Murthy commented on YARN-1197: - Sorry to come in late - I'm +1 for the overall idea/approach. However, I feel we still have to work through the details on the scheduler side. So, I'd like to see this developed in a branch. This would allow the full picture to emerge before we commit it to a specific release (2.4 vs. 2.5, etc.). Thoughts? > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: mapreduce-project.patch.ver.1, > tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, > yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, > yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, > yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, > yarn-server-resourcemanager.patch.ver.1 > > > The current YARN resource management logic assumes the resources allocated to a > container are fixed during its lifetime. When users want to change the > resources of an allocated container, the only way is to release it and allocate a new > container with the expected size. > Allowing run-time changes to the resources of an allocated container will give us > better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844370#comment-13844370 ] Bikas Saha commented on YARN-1197: -- There are some plumbing/infra-related changes which we could commit to trunk safely. None of that would be executed until some scheduler actually supports this. When that happens, we could decide to move the code to branch-2 to target a release. I would prefer that to a branch, which would need maintenance. > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: mapreduce-project.patch.ver.1, > tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, > yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, > yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, > yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, > yarn-server-resourcemanager.patch.ver.1 > > > The current YARN resource management logic assumes the resources allocated to a > container are fixed during its lifetime. When users want to change the > resources of an allocated container, the only way is to release it and allocate a new > container with the expected size. > Allowing run-time changes to the resources of an allocated container will give us > better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844381#comment-13844381 ] Arun C Murthy commented on YARN-1197: - The problem is that we can't ship half of this feature in 2.4 - it's either in or out. So, a branch would be significantly better. > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: mapreduce-project.patch.ver.1, > tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, > yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, > yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, > yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, > yarn-server-resourcemanager.patch.ver.1 > > > The current YARN resource management logic assumes the resources allocated to a > container are fixed during its lifetime. When users want to change the > resources of an allocated container, the only way is to release it and allocate a new > container with the expected size. > Allowing run-time changes to the resources of an allocated container will give us > better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844398#comment-13844398 ] Bikas Saha commented on YARN-1404: -- Is the scenario having containers from multiple users asking for resources within their quota and then delegating them to a shared service to use on their behalf. The above would imply that datanode/impala/others would be running as yarn containers so that they can be targets for delegation. > Enable external systems/frameworks to share resources with Hadoop leveraging > Yarn resource scheduling > - > > Key: YARN-1404 > URL: https://issues.apache.org/jira/browse/YARN-1404 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.2.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Attachments: YARN-1404.patch > > > Currently Hadoop Yarn expects to manage the lifecycle of the processes its > applications run workload in. External frameworks/systems could benefit from > sharing resources with other Yarn applications while running their workload > within long-running processes owned by the external framework (in other > words, running their workload outside of the context of a Yarn container > process). > Because Yarn provides robust and scalable resource management, it is > desirable for some external systems to leverage the resource governance > capabilities of Yarn (queues, capacities, scheduling, access control) while > supplying their own resource enforcement. > Impala is an example of such system. Impala uses Llama > (http://cloudera.github.io/llama/) to request resources from Yarn. > Impala runs an impalad process in every node of the cluster, when a user > submits a query, the processing is broken into 'query fragments' which are > run in multiple impalad processes leveraging data locality (similar to > Map-Reduce Mappers processing a collocated HDFS block of input data). 
> The execution of a 'query fragment' requires an amount of CPU and memory in > the impalad. As the impalad shares the host with other services (HDFS > DataNode, Yarn NodeManager, Hbase Region Server) and Yarn Applications > (MapReduce tasks). > To ensure cluster utilization that follow the Yarn scheduler policies and it > does not overload the cluster nodes, before running a 'query fragment' in a > node, Impala requests the required amount of CPU and memory from Yarn. Once > the requested CPU and memory has been allocated, Impala starts running the > 'query fragment' taking care that the 'query fragment' does not use more > resources than the ones that have been allocated. Memory is book kept per > 'query fragment' and the threads used for the processing of the 'query > fragment' are placed under a cgroup to contain CPU utilization. > Today, for all resources that have been asked to Yarn RM, a (container) > process must be started via the corresponding NodeManager. Failing to do > this, will result on the cancelation of the container allocation > relinquishing the acquired resource capacity back to the pool of available > resources. To avoid this, Impala starts a dummy container process doing > 'sleep 10y'. > Using a dummy container process has its drawbacks: > * the dummy container process is in a cgroup with a given number of CPU > shares that are not used and Impala is re-issuing those CPU shares to another > cgroup for the thread running the 'query fragment'. The cgroup CPU > enforcement works correctly because of the CPU controller implementation (but > the formal specified behavior is actually undefined). > * Impala may ask for CPU and memory independent of each other. Some requests > may be only memory with no CPU or viceversa. Because a container requires a > process, complete absence of memory or CPU is not possible even if the dummy > process is 'sleep', a minimal amount of memory and CPU is required for the > dummy process. 
> Because of this it is desirable to be able to have a container without a > backing process. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844437#comment-13844437 ] Arun C Murthy commented on YARN-1404: - Yes, agreed. Sorry, I thought it was clear that this was what I was proposing with: {quote} The implementation of this api would notify the NodeManager to change its monitoring on the recipient container i.e. Impala or Datanode by modifying the cgroup of the recipient container. Similarly, the NodeManager could be instructed by the ResourceManager to preempt the resources of the source container for continuing to serve the global SLAs of the queues - again, this is implemented by modifying the cgroup of the recipient container. This will allow the ResourceManager/NodeManager to be explicitly in control of resources, even in the face of misbehaving AMs etc. {quote} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
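The delegation mechanics Arun describes (moving a source container's enforcement onto a recipient container, and preempting it back) can be modeled with a small bookkeeping sketch. All names here are illustrative - nothing below is a YARN API; a real NodeManager would additionally resize the recipient's cgroup at the marked points.

```java
import java.util.HashMap;
import java.util.Map;

// Purely illustrative model of the delegation idea (hypothetical names, not
// YARN code): the NM-side ledger moves a source container's locally enforced
// resources onto a recipient container, and preemption reverses the move.
public class DelegationLedger {
  // container id -> currently enforced memory in MB (CPU omitted for brevity)
  private final Map<String, Integer> enforcedMb = new HashMap<>();

  public void register(String containerId, int memoryMb) {
    enforcedMb.put(containerId, memoryMb);
  }

  public int enforcedMb(String containerId) {
    return enforcedMb.getOrDefault(containerId, 0);
  }

  // Delegate all of source's resources to recipient; the source keeps its
  // allocation at the RM but is enforced at zero locally.
  public void delegate(String source, String recipient) {
    int moved = enforcedMb.getOrDefault(source, 0);
    enforcedMb.put(source, 0);
    enforcedMb.merge(recipient, moved, Integer::sum);
    // A real NM would grow the recipient's cgroup limits here.
  }

  // Preemption moves resources back so the queue's SLAs can be served.
  public void preemptDelegation(String source, String recipient, int memoryMb) {
    enforcedMb.merge(recipient, -memoryMb, Integer::sum);
    enforcedMb.merge(source, memoryMb, Integer::sum);
    // A real NM would shrink the recipient's cgroup limits here.
  }
}
```

The point of the model is that the RM/NM stay the single source of truth for enforcement, even when an AM misbehaves.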
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844453#comment-13844453 ] Bikas Saha commented on YARN-1121: -- If the thread gets interrupted or otherwise has an unexpected exit then it does not look like drained will be set to true. And service stop will hang. {code} while (!stopped && !Thread.currentThread().isInterrupted()) { + drained = eventQueue.isEmpty(); {code} Also, it would probably be better if we signaled an object when we exit the above run() method and block on that signal instead of the following spin wait. {code} + while(!drained) { +Thread.yield(); + } {code} > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.4.0 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch, YARN-1121.7.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
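Bikas's two suggestions above - make sure the drained/exit state gets recorded even on an abnormal exit, and block on a signal rather than spinning on Thread.yield() - can be sketched as follows. This is a self-contained illustration with hypothetical names (DrainingDispatcher, exitSignal), not the actual AsyncDispatcher code from the patch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the suggestion: the handler thread drains the queue before
// exiting, signals a monitor from a finally block so that even an interrupt
// or unexpected exception cannot leave stop() waiting forever, and stop()
// blocks on that signal instead of spin-waiting.
public class DrainingDispatcher {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private final Object exitSignal = new Object();
  private volatile boolean stopped = false;
  private boolean exited = false; // guarded by exitSignal

  private final Thread handler = new Thread(() -> {
    try {
      // Keep draining until a stop was requested AND the queue is empty.
      while (!(stopped && eventQueue.isEmpty())) {
        Runnable event = eventQueue.poll(100, TimeUnit.MILLISECONDS);
        if (event != null) {
          event.run();
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      // Signal unconditionally so stop() cannot hang on an abnormal exit.
      synchronized (exitSignal) {
        exited = true;
        exitSignal.notifyAll();
      }
    }
  });

  public void start() {
    handler.start();
  }

  public void dispatch(Runnable event) {
    eventQueue.add(event);
  }

  public void stop() {
    stopped = true;
    synchronized (exitSignal) {
      while (!exited) {
        try {
          exitSignal.wait();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return; // give up waiting if we are interrupted ourselves
        }
      }
    }
  }
}
```

Because the handler only exits once the queue is empty, every event dispatched before stop() is processed, and because the signal lives in a finally block, stop() terminates even if the handler thread dies unexpectedly.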
[jira] [Assigned] (YARN-1413) [YARN-321] AHS WebUI should server aggregated logs as well
[ https://issues.apache.org/jira/browse/YARN-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-1413: --- Assignee: Mayank Bansal (was: Zhijie Shen) > [YARN-321] AHS WebUI should server aggregated logs as well > -- > > Key: YARN-1413 > URL: https://issues.apache.org/jira/browse/YARN-1413 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Mayank Bansal > -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1413) [YARN-321] AHS WebUI should server aggregated logs as well
[ https://issues.apache.org/jira/browse/YARN-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1413: Attachment: YARN-1413-1.patch Attaching the patch. Thanks, Mayank > [YARN-321] AHS WebUI should server aggregated logs as well > -- > > Key: YARN-1413 > URL: https://issues.apache.org/jira/browse/YARN-1413 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Mayank Bansal > Attachments: YARN-1413-1.patch > > -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1413) [YARN-321] AHS WebUI should server aggregated logs as well
[ https://issues.apache.org/jira/browse/YARN-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844567#comment-13844567 ] Hadoop QA commented on YARN-1413: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618085/YARN-1413-1.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2640//console This message is automatically generated. > [YARN-321] AHS WebUI should server aggregated logs as well > -- > > Key: YARN-1413 > URL: https://issues.apache.org/jira/browse/YARN-1413 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Mayank Bansal > Attachments: YARN-1413-1.patch > > -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844594#comment-13844594 ] Hitesh Shah commented on YARN-1040: --- Given the recent comments on YARN-1404, I believe that this should not be supported unless the resources are being delegated to another YARN container. Furthermore, if we are talking about container leases ( for multiple process launches and not doing any resource delegation ), a container lease should start when the first process is launched - thereby having an API that supports a null ContainerLaunchContext is moot. The lease aspects should probably be encoded into the container token so that the NM understands that a process exiting in a particular container need not signal the end of the container i.e. multipleProcesses should not be an explicit flag in the api. > De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844595#comment-13844595 ] Hitesh Shah commented on YARN-1040: --- Sorry - got my wires crossed on the different jiras going around. To clarify, I believe container leases for multiple processes is a good feature to have. Allowing a container to be launched without a process should be a no-no. Resource delegation as mentioned in YARN-1404 seems to be a decent approach at assigning resources to other containers - however, it should only be restricted to assigning resources to containers under the control of YARN. > De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1412) Allocating Containers on a particular Node in Yarn
[ https://issues.apache.org/jira/browse/YARN-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Weise updated YARN-1412: --- Affects Version/s: 2.2.0 > Allocating Containers on a particular Node in Yarn > -- > > Key: YARN-1412 > URL: https://issues.apache.org/jira/browse/YARN-1412 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 > Environment: centos, Hadoop 2.2.0 >Reporter: gaurav gupta > > Summary of the problem: > If I pass the node on which I want container and set relax locality default > which is true, I don't get back the container on the node specified even if > the resources are available on the node. It doesn't matter if I set rack or > not. > Here is the snippet of the code that I am using > AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient(); > String host = "h1"; > Resource capability = Records.newRecord(Resource.class); > capability.setMemory(memory); > nodes = new String[] {host}; > // in order to request a host, we also have to request the rack > racks = new String[] {"/default-rack"}; > List<ContainerRequest> containerRequests = new > ArrayList<ContainerRequest>(); > List<ContainerId> releasedContainers = new ArrayList<ContainerId>(); > containerRequests.add(new ContainerRequest(capability, nodes, racks, > Priority.newInstance(priority))); > if (containerRequests.size() > 0) { > LOG.info("Asking RM for containers: " + containerRequests); > for (ContainerRequest cr : containerRequests) { > LOG.info("Requested container: {}", cr.toString()); > amRmClient.addContainerRequest(cr); > } > } > for (ContainerId containerId : releasedContainers) { > LOG.info("Released container, id={}", containerId.getId()); > amRmClient.releaseAssignedContainer(containerId); > } > return amRmClient.allocate(0); -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844740#comment-13844740 ] Sandy Ryza commented on YARN-1404: -- Arun, I think I agree with most of the above and your proposal makes a lot of sense to me. There are numerous issues to tackle. On the YARN side: * YARN has assumed since its inception that a container's resources belong to a single application - we are likely to come across many subtle issues when rethinking this assumption. * While YARN has promise as a platform for deploying long-running services, that functionality currently isn't stable in the way that much of the rest of YARN is. * Currently preemption means killing a container process - we would need to change the way this mechanism works. On the Datanode/Impala side: * Rethink the way we deploy these services to allow them to run inside YARN containers. Stepping back a little, YARN does three things: * Central Scheduling - decides who gets to run and when and where they get to do so * Deployment - ships bits across the cluster and runs container processes * Enforcement - monitors container processes to make sure they stay within scheduled limits The central scheduling part is the most valuable to a framework like Impala because it allows it to truly share resources on a cluster with other processing frameworks. The second two are helpful - they allow us to standardize the way work is deployed on a Hadoop cluster - but they aren't enabling anything that is fundamentally impossible without them. While these will simplify things in the long term and create a more cohesive platform, Impala currently has little tangible to gain by doing deployment and enforcement inside YARN. So, to summarize, I like the idea and would be both happy to see YARN move in this direction and to help it do so. However, making Impala-YARN integration depend on this fairly involved work would unnecessarily set it back.
In the short term, we have proposed a minimally invasive change (making it possible to launch containers without starting processes) that would allow YARN to satisfy our use case. I am confident that the change poses no risk from a security perspective, from a stability perspective, or in terms of detracting from the longer-term vision. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1028) Add FailoverProxyProvider like capability to RMProxy
[ https://issues.apache.org/jira/browse/YARN-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844747#comment-13844747 ] Xuan Gong commented on YARN-1028: - small nit: Add {code} getRMAdminService(0).transitionToActive(req); getRMAdminService(1).transitionToStandby(req); {code} To {code} + @Test + public void testExplicitFailover() + throws YarnException, InterruptedException, IOException { +verifyNodeManagerConnected(); +verifyClientConnection(); + +// Failover to the second RM +getRMAdminService(0).transitionToStandby(req); +getRMAdminService(1).transitionToActive(req); + +verifyNodeManagerConnected(); +verifyClientConnection(); + +// Failover back to the first RM +verifyNodeManagerConnected(); +verifyClientConnection(); + } {code} to failover back to the first RM. Otherwise LGTM > Add FailoverProxyProvider like capability to RMProxy > > > Key: YARN-1028 > URL: https://issues.apache.org/jira/browse/YARN-1028 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1028-1.patch, yarn-1028-2.patch, yarn-1028-3.patch, > yarn-1028-4.patch, yarn-1028-5.patch, yarn-1028-draft-cumulative.patch > > > RMProxy layer currently abstracts RM discovery and implements it by looking > up service information from configuration. Motivated by HDFS and using > existing classes from Common, we can add failover proxy providers that may > provide RM discovery in extensible ways. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1412) Allocating Containers on a particular Node in Yarn
[ https://issues.apache.org/jira/browse/YARN-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844750#comment-13844750 ] Thomas Weise commented on YARN-1412: We implemented it in the AM, tracking resource requests made for a specific host with relaxLocality=false and then, if they are not filled by the scheduler after n heartbeats, dropping the host constraint and switching to relaxLocality=true. We would prefer to leave this to YARN with the combination of a specific host and relaxLocality=true, but it does not work. The requirement is not unique to our application, and instead of handling it in user land it would be great to see this working as expected in future YARN versions. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
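The AM-side fallback Thomas describes - keep a node-strict request (relaxLocality=false) for n heartbeats, then drop the host constraint - can be sketched as a small tracker. The class and method names are illustrative, not part of the AMRMClient API; for each node the tracker returns, the AM would remove the strict ContainerRequest and re-add one with relaxLocality=true.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper (hypothetical names, not a YARN API): tracks how many
// AM heartbeats each node-strict request has gone unfulfilled, and reports
// which requests should be re-issued with relaxed locality once the
// threshold is reached.
public class LocalityFallbackTracker {
  private final int maxHeartbeats;
  // node name -> heartbeats waited so far for the strict request on it
  private final Map<String, Integer> waiting = new HashMap<>();

  public LocalityFallbackTracker(int maxHeartbeats) {
    this.maxHeartbeats = maxHeartbeats;
  }

  // Called when a relaxLocality=false request is made for a node.
  public void requestedOnNode(String node) {
    waiting.put(node, 0);
  }

  // Called when the scheduler assigns a container on the node.
  public void fulfilled(String node) {
    waiting.remove(node);
  }

  // Called once per AM heartbeat; returns the nodes whose strict requests
  // should now be replaced by relaxLocality=true requests.
  public List<String> onHeartbeat() {
    List<String> relax = new ArrayList<>();
    waiting.replaceAll((node, beats) -> beats + 1);
    waiting.entrySet().removeIf(e -> {
      if (e.getValue() >= maxHeartbeats) {
        relax.add(e.getKey());
        return true;
      }
      return false;
    });
    return relax;
  }
}
```

This keeps the fallback policy in one place in the AM until YARN itself honors the specific-host-plus-relaxLocality combination.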
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844772#comment-13844772 ] Vinod Kumar Vavilapalli commented on YARN-1404: --- Re Tucu's reply: bq. Regarding ACLs and an on/off switch: IMO they are not necessary for the following reason. You need an external system installed and running in the node to use the resources of an unmanaged container. If you have direct access into the node to start the external system, you are 'trusted'. If you don't have direct access you cannot use the resources of an unmanaged container. Unfortunately that is not enough. We are exposing an API on the NodeManager that anybody can use. The ACL prevents that. bq. In the case of managed containers we don't have a liveliness 'report' and the container process could very well be hung. In such a scenario it is the responsibility of the AM to detect the liveliness of the container process and react if it is considered hung. Like I said, we do have an implicit liveliness report - process liveliness. And the NodeManager depends on that today to inform the app of container-finishes. bq. Regarding the NM assuming a whole lot of things about containers (3 bullet items): For my current use case none of this is needed. It could be relatively easy to enable such functionality if a use case that needs it arises. So, then we start off with the assumption that they are not needed? That creates two very different code paths for managed and unmanaged containers. If possible we should avoid that.
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844774#comment-13844774 ] Vinod Kumar Vavilapalli commented on YARN-1404: --- bq. In this scenario, I think explicitly allowing for delegation of a container would solve the problem in a first-class manner. This is an interesting solution that avoids the problems about trust, liveliness reporting and resource limitations' enforcement. +1 for considering something like this. > Enable external systems/frameworks to share resources with Hadoop leveraging > Yarn resource scheduling > - > > Key: YARN-1404 > URL: https://issues.apache.org/jira/browse/YARN-1404 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.2.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Attachments: YARN-1404.patch > > > Currently Hadoop Yarn expects to manage the lifecycle of the processes its > applications run workload in. External frameworks/systems could benefit from > sharing resources with other Yarn applications while running their workload > within long-running processes owned by the external framework (in other > words, running their workload outside of the context of a Yarn container > process). > Because Yarn provides robust and scalable resource management, it is > desirable for some external systems to leverage the resource governance > capabilities of Yarn (queues, capacities, scheduling, access control) while > supplying their own resource enforcement. > Impala is an example of such system. Impala uses Llama > (http://cloudera.github.io/llama/) to request resources from Yarn. > Impala runs an impalad process in every node of the cluster, when a user > submits a query, the processing is broken into 'query fragments' which are > run in multiple impalad processes leveraging data locality (similar to > Map-Reduce Mappers processing a collocated HDFS block of input data). 
> The execution of a 'query fragment' requires an amount of CPU and memory in > the impalad, as the impalad shares the host with other services (HDFS > DataNode, Yarn NodeManager, HBase Region Server) and Yarn applications > (MapReduce tasks). > To ensure cluster utilization follows the Yarn scheduler policies and does > not overload the cluster nodes, before running a 'query fragment' on a > node, Impala requests the required amount of CPU and memory from Yarn. Once > the requested CPU and memory have been allocated, Impala starts running the > 'query fragment', taking care that the 'query fragment' does not use more > resources than the ones that have been allocated. Memory is bookkept per > 'query fragment', and the threads used for the processing of the 'query > fragment' are placed under a cgroup to contain CPU utilization. > Today, for all resources that have been requested from the Yarn RM, a (container) > process must be started via the corresponding NodeManager. Failing to do > this will result in the cancellation of the container allocation, > relinquishing the acquired resource capacity back to the pool of available > resources. To avoid this, Impala starts a dummy container process doing > 'sleep 10y'. > Using a dummy container process has its drawbacks: > * the dummy container process is in a cgroup with a given number of CPU > shares that are not used, and Impala is re-issuing those CPU shares to another > cgroup for the threads running the 'query fragment'. The cgroup CPU > enforcement works correctly because of the CPU controller implementation (but > the formally specified behavior is actually undefined). > * Impala may ask for CPU and memory independently of each other. Some requests > may be memory only with no CPU, or vice versa. Because a container requires a > process, the complete absence of memory or CPU is not possible; even if the dummy > process is 'sleep', a minimal amount of memory and CPU is required for the > dummy process. 
> Because of this, it is desirable to be able to have a container without a > backing process. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
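The cgroup CPU-shares juggling described above can be sketched as a dry run. This is only an illustration under assumed names: the cgroup path, thread id, and share count below are hypothetical, not Impala's actual layout, and it uses the cgroup v1 cpu controller. Writing to cgroupfs requires root on a real node, so the commands are printed rather than executed.

```shell
# Dry-run sketch of re-issuing a dummy container's CPU shares to the
# threads running a 'query fragment' (cgroup v1; the path, thread id,
# and share count are assumptions, not Impala's actual values).
CG=/sys/fs/cgroup/cpu/impala/fragment-42   # hypothetical fragment cgroup
TID=12345                                  # hypothetical worker thread id
SHARES=1024                                # shares matching the YARN allocation

{
  echo "mkdir -p $CG"                      # create the fragment cgroup
  echo "echo $SHARES > $CG/cpu.shares"     # grant it the allocated shares
  echo "echo $TID > $CG/tasks"             # move the worker thread into it
} | tee /tmp/cgroup-plan.txt
```

On a real node the three printed commands would be run as root; the point is that the shares live on the fragment's cgroup, not on the dummy container's.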
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844781#comment-13844781 ] Vinod Kumar Vavilapalli commented on YARN-1404: --- {quote} Stepping back a little, YARN does three things: Central Scheduling - decides who gets to run, and when and where they get to do so. Deployment - ships bits across the cluster and runs container processes. Enforcement - monitors container processes to make sure they stay within scheduled limits. The central scheduling part is the most valuable to a framework like Impala because it allows it to truly share resources on a cluster with other processing frameworks. The latter two are helpful - they allow us to standardize the way work is deployed on a Hadoop cluster - but they aren't enabling things that are fundamentally impossible without them. While these will simplify things in the long term and create a more cohesive platform, Impala currently has little tangible to gain by doing deployment and enforcement inside YARN. {quote} I don't agree with that characterization. The thing is, to enable only central scheduling, YARN has to give up its control over liveliness & enforcement and needs to create a new level of trust. If there are alternative architectures that avoid losing that control, YARN will choose those options. The question is whether external systems want to take that option or not. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
Vinod Kumar Vavilapalli created YARN-1489: - Summary: [Umbrella] Work-preserving ApplicationMaster restart Key: YARN-1489 URL: https://issues.apache.org/jira/browse/YARN-1489 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Today if AMs go down:
- RM kills all the containers of that ApplicationAttempt
- the new ApplicationAttempt doesn't know where the previous containers are running
- old running containers don't know where the new AM is running.
We need to fix this to enable work-preserving AM restart. The latter two can potentially be done at the app level, but it is good to have a common solution for all apps wherever possible. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits
Vinod Kumar Vavilapalli created YARN-1490: - Summary: RM should optionally not kill all containers when an ApplicationMaster exits Key: YARN-1490 URL: https://issues.apache.org/jira/browse/YARN-1490 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli This is needed to enable work-preserving AM restart. Some apps may choose to reconnect to old running containers; some may not want to. This should be an option. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1041) RM to bind and notify a restarted AM of existing containers
[ https://issues.apache.org/jira/browse/YARN-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1041: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1489 > RM to bind and notify a restarted AM of existing containers > --- > > Key: YARN-1041 > URL: https://issues.apache.org/jira/browse/YARN-1041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran >Assignee: Jian He > > For long-lived containers we don't want the AM to be a SPOF. > When the RM restarts a (failed) AM, it should be given the list of containers > it had already been allocated. The AM should then be able to contact the NMs > to get details on them. NMs would also need to do any binding of the > containers needed to handle a moved/restarted AM. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1041) RM to bind and notify a restarted AM of existing containers
[ https://issues.apache.org/jira/browse/YARN-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1041: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-896) -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1136) Replace junit.framework.Assert with org.junit.Assert
[ https://issues.apache.org/jira/browse/YARN-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-1136: -- Assignee: Chen He > Replace junit.framework.Assert with org.junit.Assert > > > Key: YARN-1136 > URL: https://issues.apache.org/jira/browse/YARN-1136 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.1.0-beta >Reporter: Karthik Kambatla >Assignee: Chen He > Labels: newbie, test > > There are several places where we are using junit.framework.Assert instead of > org.junit.Assert. > {code}grep -rn "junit.framework.Assert" hadoop-yarn-project/ > --include=*.java{code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
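The mechanical part of this cleanup can be scripted. A hedged sketch follows: the sed invocation below is an assumption for illustration, not the actual patch, and it operates on a throwaway stand-in file rather than the real Hadoop source tree.

```shell
# Create a stand-in Java file using the deprecated JUnit 3 Assert import.
mkdir -p /tmp/yarn-1136-demo
cat > /tmp/yarn-1136-demo/TestFoo.java <<'EOF'
import junit.framework.Assert;
EOF

# Swap junit.framework.Assert for org.junit.Assert across the tree.
find /tmp/yarn-1136-demo -name '*.java' \
  -exec sed -i 's/junit\.framework\.Assert/org.junit.Assert/g' {} +

cat /tmp/yarn-1136-demo/TestFoo.java   # now imports org.junit.Assert
```

Note that `sed -i` as written is GNU sed; on BSD/macOS the equivalent is `sed -i ''`. Static-import call sites (`import static junit.framework.Assert.*;`) are caught by the same pattern.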
[jira] [Created] (YARN-1491) Upgrade JUnit3 TestCase to JUnit 4
Jonathan Eagles created YARN-1491: - Summary: Upgrade JUnit3 TestCase to JUnit 4 Key: YARN-1491 URL: https://issues.apache.org/jira/browse/YARN-1491 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jonathan Eagles Assignee: Chen He There are still four references to test classes that extend from junit.framework.TestCase:
hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestYarnVersionInfo.java
hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsResourceCalculatorPlugin.java
hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLinuxResourceCalculatorPlugin.java
hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
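Locating such classes is easy to script. A sketch under assumed paths (the stand-in tree below is illustrative, not the real Hadoop checkout):

```shell
# Build a stand-in source tree with one JUnit 3-style test class.
mkdir -p /tmp/yarn-1491-demo/util
cat > /tmp/yarn-1491-demo/util/TestExample.java <<'EOF'
import junit.framework.TestCase;

public class TestExample extends TestCase { }
EOF

# List every test class still extending the JUnit 3 TestCase.
grep -rl 'extends TestCase' /tmp/yarn-1491-demo --include='*.java'
```

The actual upgrade then replaces the `extends TestCase` inheritance with plain classes using JUnit 4's `@Test`/`@Before` annotations and `org.junit.Assert` static imports.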
[jira] [Updated] (YARN-408) Capacity Scheduler delay scheduling should not be disabled by default
[ https://issues.apache.org/jira/browse/YARN-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-408: --- Attachment: YARN-408-trunk-3.patch Fixing test. Thanks, Mayank > Capacity Scheduler delay scheduling should not be disabled by default > - > > Key: YARN-408 > URL: https://issues.apache.org/jira/browse/YARN-408 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Mayank Bansal >Assignee: Mayank Bansal >Priority: Minor > Attachments: YARN-408-trunk-2.patch, YARN-408-trunk-3.patch, > YARN-408-trunk.patch > > > Capacity Scheduler delay scheduling should not be disabled by default; > it should be enabled, with the delay set to the number of nodes in one rack. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1391) Lost node list contains many active nodes with different ports
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1391: -- Attachment: YARN-1391.v1.patch > Lost node list contains many active nodes with different ports > > > Key: YARN-1391 > URL: https://issues.apache.org/jira/browse/YARN-1391 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.5-alpha >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-1391.v1.patch > > > When restarting a node manager, the active node list in the webUI will contain > duplicate entries. The two entries have the same host name with different > port numbers. After the expiry interval, the older entry will expire and > transition to the lost node list, and stay there until this node gets restarted > again. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1391) Lost node list should be identified by NodeId
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1391: -- Description: In the case of multiple node managers on a single machine, each of them should be identified by NodeId, which is more precise than just the host name (was: When restarting node manager, the active node list in webUI will contain duplicate entries. Such two entries have the same host name with different port number. After expiry interval, the older entry will get expired and transitioned to lost node list, and stay there until this node gets restarted again.) > Lost node list should be identified by NodeId > --- > > Key: YARN-1391 > URL: https://issues.apache.org/jira/browse/YARN-1391 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.5-alpha >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-1391.v1.patch > > > In the case of multiple node managers on a single machine, each of them should be > identified by NodeId, which is more precise than just the host name -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1391) Lost node list should be identified by NodeId
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1391: -- Summary: Lost node list should be identified by NodeId (was: Lost node list contains many active nodes with different ports) -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-408) Capacity Scheduler delay scheduling should not be disabled by default
[ https://issues.apache.org/jira/browse/YARN-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844875#comment-13844875 ] Hadoop QA commented on YARN-408: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618141/YARN-408-trunk-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2641//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2641//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1391) Lost node list should be identified by NodeId
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844896#comment-13844896 ] Hadoop QA commented on YARN-1391: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618147/YARN-1391.v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2642//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2642//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844926#comment-13844926 ] Wangda Tan commented on YARN-1197: -- Agreed. I also think the scheduler part needs some time for review. I'll create a Jira for the scheduler part and upload a patch (updated against YARN-1447 and YARN-1448) and design doc ASAP. > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: mapreduce-project.patch.ver.1, > tools-project.patch.ver.1, yarn-1197-v2.pdf, yarn-1197-v3.pdf, > yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, > yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, > yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, > yarn-server-resourcemanager.patch.ver.1 > > > The current YARN resource management logic assumes the resource allocated to a > container is fixed during its lifetime. When users want to change a resource > of an allocated container, the only way is to release it and allocate a new > container with the expected size. > Allowing run-time changes to the resources of an allocated container will give us > better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Moved] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli moved HADOOP-9639 to YARN-1492: --- Component/s: (was: filecache) Affects Version/s: (was: 2.0.4-alpha) 2.0.4-alpha Key: YARN-1492 (was: HADOOP-9639) Project: Hadoop YARN (was: Hadoop Common) > truly shared cache for jars (jobjar/libjar) > --- > > Key: YARN-1492 > URL: https://issues.apache.org/jira/browse/YARN-1492 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.0.4-alpha >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, > shared_cache_design_v3.pdf, shared_cache_design_v4.pdf > > > Currently there is the distributed cache that enables you to cache jars and > files so that attempts from the same job can reuse them. However, sharing is > limited with the distributed cache because it is normally on a per-job basis. > On a large cluster, sometimes copying of jobjars and libjars becomes so > prevalent that it consumes a large portion of the network bandwidth, not to > speak of defeating the purpose of "bringing compute to where data is". This > is wasteful because in most cases code doesn't change much across many jobs. > I'd like to propose and discuss feasibility of introducing a truly shared > cache so that multiple jobs from multiple users can share and cache jars. > This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844930#comment-13844930 ] Vinod Kumar Vavilapalli commented on YARN-1492: --- Technical issue: this should be a YARN JIRA. As YARN handles the distributed cache, it makes sense to have this discussion here. I don't follow the common lists much and I almost missed this (it's possible others missed it too because of that). If/when we create a branch, let's create it with a YARN JIRA number. I just moved the JIRA to YARN. Let me know if you disagree. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844954#comment-13844954 ] Sandy Ryza commented on YARN-1404: -- bq. The thing is, to enable only central scheduling, YARN has to give up its control over liveliness & enforcement and needs to create a new level of trust. I'm not sure I entirely understand what you mean by creating a new level of trust. We are a long way from YARN managing all resources on a Hadoop cluster. YARN implicitly understands that other trusted processes will be running alongside it. The proposed change does not grant any users the ability to use any resources without going through a framework trusted by the cluster administrator. bq. Like I said, we do have an implicit liveliness report - process liveliness. And NodeManager depends on that today to inform the app of container-finishes. It depends on that or on the AM releasing the resources. Process liveliness is a very imperfect signifier - a process can stick around due to an accidentally-not-finished thread even when all its work is done. I have seen clusters where all MR task processes are killed by the AM without exiting naturally, and everything works fine. I've tried to think through situations where this could be harmful:
- A malicious application intentionally sits on cluster resources: it can do this already by running a process with sleep(infinity).
- An application unintentionally sits on cluster resources: this can already happen if a container process forgets to terminate a non-daemon thread.
In both cases, preemption will prohibit an application from sitting on resources above its fair share. Is there a scenario I'm missing here? bq. If there are alternative architectures that avoid losing that control, YARN will choose those options. YARN is not a power-hungry conscious entity that gets to make decisions for us. We as YARN committers and contributors get to decide what use cases we want to support, and we don't need to choose a single one. We should of course be careful with what we choose to support, but we should only be restrictive when there are concrete consequences of doing otherwise, not simply when a use case violates the abstract idea of YARN controlling everything. If the deeper concern is that Impala and similar frameworks will opt not to run fully inside YARN when that functionality is available, I think we would be happy to switch over when YARN supports this in a stable manner. However, I believe this is a long way away, and depending on that work is not an option for us. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
[ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845011#comment-13845011 ] Vinod Kumar Vavilapalli commented on YARN-1404: --- bq. I'm not sure I entirely understand what you mean by create a new level of trust. I thought that was already clear to everyone. See my comment [here|https://issues.apache.org/jira/browse/YARN-1404?focusedCommentId=13840905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13840905]. "YARN depends on the ability to enforce resource-usage restrictions". YARN enables both resource scheduling and enforcement of those scheduling decisions. If resources sit outside of YARN, YARN cannot enforce the limits on their usage. For e.g, YARN cannot enforce the memory usage of a datanode. People may work around it by setting up Cgroups on these daemons, but that defeats the purpose of YARN in the first place. That is why I earlier proposed that impala/datanode run under YARN. When I couldn't find a solution otherwise, I revised my proposal to restrict it to be used with a special ACL so that other apps don't abuse the cluster by requesting unmanaged containers and not using those resources. bq. It depends on that or the AM releasing the resources. Process liveliness is a very imperfect signifier ... We cannot trust AMs to always release containers. If it were so imperfect, we should change YARN as it is today to not depend on liveliness. I'd leave it as an exercise to see how, once we remove process-liveliness in general, apps will release containers and how clusters get utilized. Bonus points for trying it on a shared multi-tenant cluster with user-written YARN apps. My point is that Process liveliness + accounting based on that is a very understood model in the Hadoop land. The proposal for leases is to continue that. bq. Is there a scenario I'm missing here? One example that illustrates this. 
Today AMs can go away without releasing containers and YARN can kill the corresponding containers (as they are managed). If we don't have some kind of leases, and AMs holding unmanaged resources go away without explicit container-release, those resources are leaked. bq. YARN is not a power-hungry conscious entity that gets to make decisions for us. Not simply when a use case violates the abstract idea of YARN controlling everything. [...] Of course, when I say YARN, I mean the YARN community. You take it too literally. I was pointing out your statements about "Impala currently has little tangible to gain by doing deployment and enforcement inside YARN", "However, making Impala-YARN integration depend on this fairly involved work would unnecessarily set it back". The YARN community doesn't make decisions based on those things. Overall, I didn't originally have a complete solution for making it happen - so I came up with ACLs and leases. But delegation as proposed by Arun seems like one that solves all the problems. Other than saying you don't want to wait for impala-under-YARN integration, I haven't heard any technical reservations against this approach. > Enable external systems/frameworks to share resources with Hadoop leveraging > Yarn resource scheduling > - > > Key: YARN-1404 > URL: https://issues.apache.org/jira/browse/YARN-1404 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.2.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Attachments: YARN-1404.patch > > > Currently Hadoop Yarn expects to manage the lifecycle of the processes its > applications run their workload in. External frameworks/systems could benefit from > sharing resources with other Yarn applications while running their workload > within long-running processes owned by the external framework (in other > words, running their workload outside of the context of a Yarn container > process). 
> Because Yarn provides robust and scalable resource management, it is > desirable for some external systems to leverage the resource governance > capabilities of Yarn (queues, capacities, scheduling, access control) while > supplying their own resource enforcement. > Impala is an example of such a system. Impala uses Llama > (http://cloudera.github.io/llama/) to request resources from Yarn. > Impala runs an impalad process on every node of the cluster; when a user > submits a query, the processing is broken into 'query fragments' which are > run in multiple impalad processes leveraging data locality (similar to > Map-Reduce Mappers processing a collocated HDFS block of input data). > The execution of a 'query fragment' requires an amount of CPU and memory in > the impalad. As the impal
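The lease idea discussed above can be sketched in a few lines. This is a hypothetical illustration, not the actual YARN API: an AM holding unmanaged containers renews a lease on each heartbeat, and a periodic RM sweep reclaims containers whose lease expired (e.g. the AM crashed without an explicit container-release). All class and method names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of lease-based accounting for unmanaged containers.
 * The lease replaces process-liveness as the signal that resources are
 * still in use. Names are illustrative, not real YARN classes.
 */
public class LeaseTracker {
    private final long leaseTimeoutMs;
    // containerId -> timestamp of the last lease renewal
    private final Map<String, Long> leases = new HashMap<>();

    public LeaseTracker(long leaseTimeoutMs) {
        this.leaseTimeoutMs = leaseTimeoutMs;
    }

    /** Called on allocation and on every AM heartbeat that renews the lease. */
    public synchronized void renew(String containerId, long nowMs) {
        leases.put(containerId, nowMs);
    }

    /** Explicit release by a well-behaved AM. */
    public synchronized void release(String containerId) {
        leases.remove(containerId);
    }

    /** Periodic RM sweep: reclaim containers whose lease has expired. */
    public synchronized List<String> reclaimExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : leases.entrySet()) {
            if (nowMs - e.getValue() > leaseTimeoutMs) {
                expired.add(e.getKey());
            }
        }
        for (String id : expired) {
            leases.remove(id);
        }
        return expired;
    }
}
```

With this model, an AM that disappears simply stops renewing, and its unmanaged resources are reclaimed after the timeout instead of being leaked.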
[jira] [Reopened] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reopened YARN-1121: --- > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.4.0 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch, YARN-1121.7.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1121: -- Attachment: YARN-1121.8.patch Thanks for pointing out. Fixed the issue > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.4.0 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch, YARN-1121.7.patch, YARN-1121.8.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845060#comment-13845060 ] Hadoop QA commented on YARN-1121: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618174/YARN-1121.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2643//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2643//console This message is automatically generated. 
> RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.4.0 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch, YARN-1121.7.patch, YARN-1121.8.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
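The drain-before-stop behavior this issue asks for can be illustrated with a minimal dispatcher sketch. This is not the actual RMStateStore/AsyncDispatcher code, just an assumed shape: on stop, the event-handling thread keeps processing until the internal queue is empty, so no pending store events are dropped.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Illustrative sketch of draining pending events before stopping.
 * The worker thread exits only once stop was requested AND the queue
 * is empty, so events queued before stop() are never lost.
 */
public class DrainingDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopRequested = false;
    private final Thread worker;

    public DrainingDispatcher() {
        worker = new Thread(() -> {
            // Keep draining after stop is requested until the queue is empty.
            while (!stopRequested || !queue.isEmpty()) {
                Runnable event = queue.poll();
                if (event != null) {
                    event.run();
                } else {
                    try {
                        Thread.sleep(1);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        worker.start();
    }

    public void dispatch(Runnable event) {
        queue.add(event);
    }

    /** Analogous to serviceStop(): flag the stop, then wait for the drain. */
    public void stop() throws InterruptedException {
        stopRequested = true;
        worker.join();
    }
}
```

Without the `!queue.isEmpty()` check, a stop that races with queued store events would silently drop them, which is exactly the bug described.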
[jira] [Updated] (YARN-1311) Fix app specific scheduler-events' names to be app-attempt based
[ https://issues.apache.org/jira/browse/YARN-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1311: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1489 > Fix app specific scheduler-events' names to be app-attempt based > > > Key: YARN-1311 > URL: https://issues.apache.org/jira/browse/YARN-1311 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli >Priority: Trivial > Attachments: YARN-1311-20131015.txt > > > Today, APP_ADDED and APP_REMOVED are sent to the scheduler. They are > misnomers as schedulers only deal with AppAttempts today. This JIRA is for > fixing their names so that we can add App-level events in the near future, > notably for work-preserving RM-restart. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1493) Separate app-level handling logic in scheduler
[ https://issues.apache.org/jira/browse/YARN-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1493: -- Description: Today, scheduler is tied to attempt only. We can add new app-level events to the scheduler and separate the app-level logic out. This is good for work-preserving AM restart, RM restart, and also needed for differentiating app-level metrics and attempt-level metrics. > Separate app-level handling logic in scheduler > --- > > Key: YARN-1493 > URL: https://issues.apache.org/jira/browse/YARN-1493 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Jian He > > Today, scheduler is tied to attempt only. We can add new app-level events to > the scheduler and separate the app-level logic out. This is good for > work-preserving AM restart, RM restart, and also needed for differentiating > app-level metrics and attempt-level metrics. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (YARN-1493) Separate app-level handling logic in scheduler
Jian He created YARN-1493: - Summary: Separate app-level handling logic in scheduler Key: YARN-1493 URL: https://issues.apache.org/jira/browse/YARN-1493 Project: Hadoop YARN Issue Type: Sub-task Environment: Today, scheduler is tied to attempt only. We can add new app-level events to the scheduler and separate the app-level logic out. This is good for work-preserving AM restart, RM restart, and also needed for differentiating app-level metrics and attempt-level metrics. Reporter: Jian He Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1493) Separate app-level handling logic in scheduler
[ https://issues.apache.org/jira/browse/YARN-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1493: -- Environment: (was: Today, scheduler is tied to attempt only. We can add new app-level events to the scheduler and separate the app-level logic out. This is good for work-preserving AM restart, RM restart, and also needed for differentiating app-level metrics and attempt-level metrics.) > Separate app-level handling logic in scheduler > --- > > Key: YARN-1493 > URL: https://issues.apache.org/jira/browse/YARN-1493 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Jian He > -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845124#comment-13845124 ] Bikas Saha commented on YARN-1489: -- Would be good to see an overall design document, especially for the tricky pieces like reconnecting existing running containers to new app attempts. > [Umbrella] Work-preserving ApplicationMaster restart > > > Key: YARN-1489 > URL: https://issues.apache.org/jira/browse/YARN-1489 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > Today if AMs go down, > - RM kills all the containers of that ApplicationAttempt > - New ApplicationAttempt doesn't know where the previous containers are > running > - Old running containers don't know where the new AM is running. > We need to fix this to enable work-preserving AM restart. The latter two > can potentially be done at the app level, but it is good to have a common > solution for all apps wherever possible. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking
[ https://issues.apache.org/jira/browse/YARN-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1363: -- Attachment: YARN-1363.2.patch I've drafted an initial patch which only contains the production code. Here are some important changes: 1. Storing and removing a DT are changed to be async in RMStateStore, which notifies RMDelegationTokenSecretManager of operation completion. 2. Updating a DT is added to RMStateStore, so that RMStateStore can send a separate update-completion notification, not to be confused with the storing/removing completion notifications. 3. RMDelegationTokenSecretManager handles the completion notifications from RMStateStore. 4. RMStateStore maintains a map of outstanding DT operations. 5. ClientRMService is changed to check whether the operation is still in progress, and to poll the result only once the operation has finished. 6. Update the javadoc in ApplicationClientProtocol. 7. Update YarnClientImpl accordingly. One finding in YarnClient is that canceling/renewing a DT is not wrapped. > Get / Cancel / Renew delegation token api should be non blocking > > > Key: YARN-1363 > URL: https://issues.apache.org/jira/browse/YARN-1363 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Omkar Vinit Joshi >Assignee: Zhijie Shen > Attachments: YARN-1363.1.patch, YARN-1363.2.patch > > > Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are > all blocking APIs. > * As a part of these calls we try to update RMStateStore, and that may slow it > down. > * As we have a limited number of client request handlers, we may fill up the > client handlers quickly. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking
[ https://issues.apache.org/jira/browse/YARN-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845165#comment-13845165 ] Zhijie Shen commented on YARN-1363: --- 8. RMDelegationTokenSecretManager has a cleanup thread to clean up the outstanding DT operations that have finished but whose results have never been polled by the client, which is possible if the client crashes. > Get / Cancel / Renew delegation token api should be non blocking > > > Key: YARN-1363 > URL: https://issues.apache.org/jira/browse/YARN-1363 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Omkar Vinit Joshi >Assignee: Zhijie Shen > Attachments: YARN-1363.1.patch, YARN-1363.2.patch > > > Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are > all blocking APIs. > * As a part of these calls we try to update RMStateStore, and that may slow it > down. > * As we have a limited number of client request handlers, we may fill up the > client handlers quickly. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
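The non-blocking pattern described in these patch notes (an outstanding-operations map, async completion notifications, client polling, and a cleanup pass for never-polled results) can be sketched as follows. The class and method names are invented for illustration; the real patch touches RMStateStore, RMDelegationTokenSecretManager, and ClientRMService.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch, not the actual YARN classes: a DT operation is
 * recorded as in-progress, completed asynchronously by the state store,
 * and polled by the client-facing service. A periodic cleanup removes
 * finished results that were never polled (e.g. the client crashed).
 */
public class AsyncOpTracker {
    enum State { IN_PROGRESS, DONE }

    static final class Op {
        volatile State state = State.IN_PROGRESS;
        volatile long finishedAtMs;
    }

    private final Map<Long, Op> outstanding = new ConcurrentHashMap<>();

    /** Called when a store/remove/update DT operation is submitted. */
    public void begin(long opId) {
        outstanding.put(opId, new Op());
    }

    /** Completion notification from the (async) state store. */
    public void complete(long opId, long nowMs) {
        Op op = outstanding.get(opId);
        if (op != null) {
            op.finishedAtMs = nowMs;
            op.state = State.DONE;
        }
    }

    /** Client-facing poll: true once the operation finished; consumes it. */
    public boolean poll(long opId) {
        Op op = outstanding.get(opId);
        if (op != null && op.state == State.DONE) {
            outstanding.remove(opId); // result consumed by the client
            return true;
        }
        return false;
    }

    /** Cleanup thread: drop finished ops never polled within the timeout. */
    public void cleanup(long nowMs, long timeoutMs) {
        outstanding.values().removeIf(
            op -> op.state == State.DONE && nowMs - op.finishedAtMs > timeoutMs);
    }

    public int outstandingCount() {
        return outstanding.size();
    }
}
```

The point of the design is that the RPC handler returns immediately after `begin`, so slow state-store writes no longer pin down the limited pool of client request handlers.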