[jira] [Updated] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2856: - Attachment: YARN-2856.patch > Application recovery throw InvalidStateTransitonException: Invalid event: > ATTEMPT_KILLED at ACCEPTED > > > Key: YARN-2856 > URL: https://issues.apache.org/jira/browse/YARN-2856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Rohith >Assignee: Rohith > Attachments: YARN-2856.patch > > > It is observed that recovering an application with its attempt KILLED final > state throw below exception. And application remain in accepted state forever. > {code} > 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't > handle this event at current state | > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ATTEMPT_KILLED at ACCEPTED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207775#comment-14207775 ] Hadoop QA commented on YARN-2236: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681014/YARN-2236-trunk-v6.patch against trunk revision 53f64ee. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5823//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5823//console This message is automatically generated. > Shared Cache uploader service on the Node Manager > - > > Key: YARN-2236 > URL: https://issues.apache.org/jira/browse/YARN-2236 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, > YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch, > YARN-2236-trunk-v6.patch > > > Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207748#comment-14207748 ] Sangjin Lee commented on YARN-2236: --- Karthik, the v.6 patch should address all of your comments except #8. As for #8, it is true that the event handler is a bit extraneous. But from the code standpoint, it is pretty clean and elegant. We just initialize the SharedCacheUploadService, and ContainerImpl can simply publish the event when needed. It also keeps the coupling between SharedCacheUploadService and ContainerImpl loose. It is possible to have ContainerImpl use SharedCacheUploadService directly, but then the SharedCacheUploadService would need to be passed into the ContainerImpl constructor so it can be invoked directly. So all in all, I feel that the current approach is as clean as the alternative, if not cleaner. Let me know your thoughts. Thanks! > Shared Cache uploader service on the Node Manager > - > > Key: YARN-2236 > URL: https://issues.apache.org/jira/browse/YARN-2236 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, > YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch, > YARN-2236-trunk-v6.patch > > > Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207744#comment-14207744 ] Rohith commented on YARN-2856: -- It is possible for an ATTEMPT_KILLED event to reach RMApp while recovering an attempt whose final state is KILLED. This event needs to be handled. > Application recovery throw InvalidStateTransitonException: Invalid event: > ATTEMPT_KILLED at ACCEPTED > > > Key: YARN-2856 > URL: https://issues.apache.org/jira/browse/YARN-2856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Rohith >Assignee: Rohith > > It is observed that recovering an application with its attempt KILLED final > state throw below exception. And application remain in accepted state forever. > {code} > 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't > handle this event at current state | > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ATTEMPT_KILLED at ACCEPTED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
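A note on the fix direction: Rohith's comment amounts to registering a transition for the ATTEMPT_KILLED event while the app is still in ACCEPTED, so that recovering an attempt with a KILLED final state no longer trips the InvalidStateTransitonException in the stack trace above. The sketch below only illustrates that idea using the same StateMachineFactory API named in the trace; the toy enums, target state, and transition hook are simplified placeholders, not the actual YARN-2856 patch.
{code}
import org.apache.hadoop.yarn.state.InvalidStateTransitonException;
import org.apache.hadoop.yarn.state.SingleArcTransition;
import org.apache.hadoop.yarn.state.StateMachine;
import org.apache.hadoop.yarn.state.StateMachineFactory;

/** Toy state machine (NOT the real RMAppImpl) built with the StateMachineFactory from the trace. */
public class AttemptKilledRecoverySketch {
  enum AppState { ACCEPTED, KILLED }
  enum AppEventType { ATTEMPT_KILLED }

  public static void main(String[] args) {
    // Without this arc, doTransition(ATTEMPT_KILLED) at ACCEPTED throws
    // InvalidStateTransitonException -- the recovery failure reported here.
    // Registering a transition for the event is the general shape of the fix.
    StateMachineFactory<AttemptKilledRecoverySketch, AppState, AppEventType, Object> factory =
        new StateMachineFactory<AttemptKilledRecoverySketch, AppState, AppEventType, Object>(
            AppState.ACCEPTED)
        .addTransition(AppState.ACCEPTED, AppState.KILLED, AppEventType.ATTEMPT_KILLED,
            new SingleArcTransition<AttemptKilledRecoverySketch, Object>() {
              @Override
              public void transition(AttemptKilledRecoverySketch app, Object event) {
                // e.g. record the KILLED final state and finish app cleanup here
              }
            })
        .installTopology();

    StateMachine<AppState, AppEventType, Object> sm =
        factory.make(new AttemptKilledRecoverySketch());
    try {
      sm.doTransition(AppEventType.ATTEMPT_KILLED, new Object());
      System.out.println("Recovered to state: " + sm.getCurrentState()); // KILLED
    } catch (InvalidStateTransitonException e) {
      System.out.println("Unhandled event: " + e.getMessage());
    }
  }
}
{code}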
[jira] [Updated] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-2236: -- Attachment: YARN-2236-trunk-v6.patch v.6 patch posted. Again, to see the diff against the trunk, see https://github.com/ctrezzo/hadoop/compare/trunk...sharedcache-5-YARN-2236-uploader To see the diff between v.5 and v.6, see https://github.com/ctrezzo/hadoop/commit/a74f38cf3e3de824b3c6ced327acbe8e3937aef0 > Shared Cache uploader service on the Node Manager > - > > Key: YARN-2236 > URL: https://issues.apache.org/jira/browse/YARN-2236 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, > YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch, > YARN-2236-trunk-v6.patch > > > Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
Rohith created YARN-2856: Summary: Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith It is observed that recovering an application with its attempt KILLED final state throw below exception. And application remain in accepted state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash reassigned YARN-1964: -- Assignee: Ravi Prakash (was: Abin Shahab) > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Ravi Prakash > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207709#comment-14207709 ] Ravi Prakash commented on YARN-1964: I've committed this to trunk and branch-2. I wasn't sure whether to put the release notes under release 2.6 or 2.7, but on a leap of faith, I've put it under 2.6 right now. I'll fix it if Arun declines to respin an RC. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207706#comment-14207706 ] Hudson commented on YARN-1964: -- FAILURE: Integrated in Hadoop-trunk-Commit #6517 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6517/]) YARN-1964. Create Docker analog of the LinuxContainerExecutor in YARN (raviprak: rev 53f64ee516d03f6ec87b41d77c214aa2fe4fa0ed) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDockerContainerExecutorWithMocks.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/DockerContainerExecutor.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDockerContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207659#comment-14207659 ] Hadoop QA commented on YARN-2846: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680993/YARN-2846.patch against trunk revision 46f6f9d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5822//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5822//console This message is automatically generated. > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-2846-demo.patch, YARN-2846.patch > > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. 
The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting. > 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207639#comment-14207639 ] Junping Du commented on YARN-2846: -- Thanks [~jlowe] for the review and comments. The latest patch addresses your comments. bq. I'm curious why we're not seeing a similar issue with regular ContainerLaunch threads, as they should be interrupted as well. Are those threads silently swallowing the interrupt? Because otherwise I would expect us to log a container completion just like we were doing with a recovered container. I am not sure about this. But if a regular ContainerLaunch gets interrupted, we may not care about the running container's exit code, as those running containers should be killed soon anyway (because the NM daemon is stopping). Am I missing anything here? > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-2846-demo.patch, YARN-2846.patch > > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting.
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) > ... 6 more > {code} > In reacquireContainer() of ContainerExecutor.java, the while loop of checking > container process (AM container) will be interrupted by NM stop. The > IOException get thrown and failed to
[jira] [Updated] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2846: - Attachment: YARN-2846.patch > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-2846-demo.patch, YARN-2846.patch > > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting. 
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) > ... 6 more > {code} > In reacquireContainer() of ContainerExecutor.java, the while loop of checking > container process (AM container) will be interrupted by NM stop. The > IOException get thrown and failed to generate an ExitCodeFile for the running > container. Later, the IOException will be caught in upper call > (RecoveredContainerLaunch.call()) and the ExitCode (by default to be LOST > without any setting) get persistent in NMStateStore. > After NM restart again, this container is recovered as COMPLETE state but > exit code is LOST (154) - cause this (AM) container get killed later. > We should get rid of recording the exit code of running containers if > detecting process is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
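The conclusion in the description above — do not persist an exit code for a still-running container when the wait loop is interrupted — can be illustrated with a small, self-contained sketch. This is not the YARN-2846 patch; the method name and structure are simplified assumptions. The point is that letting InterruptedException propagate out of the poll loop gives the caller a way to distinguish "NM shutting down" from "process actually exited", so nothing like LOST gets recorded for a live container.
{code}
public class ReacquireInterruptSketch {
  /**
   * Minimal stand-in for the wait loop in reacquireContainer() (hypothetical,
   * simplified). If the NodeManager is being stopped, the sleeping thread is
   * interrupted; propagating InterruptedException lets the caller skip
   * persisting any exit code for a container that is in fact still running.
   */
  static int waitForExit(Process containerProcess) throws InterruptedException {
    while (true) {
      try {
        return containerProcess.exitValue();   // returns once the process has exited
      } catch (IllegalThreadStateException stillRunning) {
        Thread.sleep(100);                     // InterruptedException propagates on NM shutdown
      }
    }
  }
}
{code}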
[jira] [Updated] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Junjun updated YARN-2855: Fix Version/s: 2.7.0 > Wish yarn web app use local date format to show app date time > -- > > Key: YARN-2855 > URL: https://issues.apache.org/jira/browse/YARN-2855 > Project: Hadoop YARN > Issue Type: Wish > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Li Junjun >Priority: Minor > Fix For: 2.7.0 > > > in yarn.dt.plugins.js > function renderHadoopDate use toUTCString . > I'm in China, so I need to add 8 hours in my mind every time! > I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207624#comment-14207624 ] Li Junjun commented on YARN-2855: - yes! I closed it ! > Wish yarn web app use local date format to show app date time > -- > > Key: YARN-2855 > URL: https://issues.apache.org/jira/browse/YARN-2855 > Project: Hadoop YARN > Issue Type: Wish > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Li Junjun >Priority: Minor > Fix For: 2.7.0 > > > in yarn.dt.plugins.js > function renderHadoopDate use toUTCString . > I'm in China, so I need to add 8 hours in my mind every time! > I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Junjun resolved YARN-2855. - Resolution: Duplicate > Wish yarn web app use local date format to show app date time > -- > > Key: YARN-2855 > URL: https://issues.apache.org/jira/browse/YARN-2855 > Project: Hadoop YARN > Issue Type: Wish > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Li Junjun >Priority: Minor > Fix For: 2.7.0 > > > in yarn.dt.plugins.js > function renderHadoopDate use toUTCString . > I'm in China, so I need to add 8 hours in my mind every time! > I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207610#comment-14207610 ] Karthik Kambatla commented on YARN-2855: Duplicate of YARN-570? > Wish yarn web app use local date format to show app date time > -- > > Key: YARN-2855 > URL: https://issues.apache.org/jira/browse/YARN-2855 > Project: Hadoop YARN > Issue Type: Wish > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Li Junjun >Priority: Minor > > in yarn.dt.plugins.js > function renderHadoopDate use toUTCString . > I'm in China, so I need to add 8 hours in my mind every time! > I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207562#comment-14207562 ] Naganarasimha G R commented on YARN-2838: - Hi [~zjshen], I will go through YARN-2033, but I feel some of the issues are still valid even if the plan is to continue with the timeline server itself. {quote} # Whatever the CLI command user executes is historyserver or timelineserver it looks like ApplicationHistoryServer only run. So can we modify the name of the class ApplicationHistoryServer to TimelineHistoryServer (or any other suitable name as it seems like any command user runs ApplicationHistoryServer is started) # Instead of the "Starting the History Server anyway..." deprecated msg, can we have "Starting the Timeline History Server anyway...". # Based on start or stop, deprecated message should get modified to "Starting the Timeline History Server anyway..." or "Stopping the Timeline History Server anyway..." {quote} If you could comment on the individual issues/points, I would like to start fixing them as part of this JIRA. There is also a 4th issue which I mentioned: {quote} Missed to add point 4 : In YARNClientIMPL;history data can be either got from HistoryServer (old manager) or from TimeLineServer (new) So historyServiceEnabled flag needs to check for both Timeline server configurations and ApplicationHistoryServer configurations, as data can be got from either of them. {quote} I think this is also related to the issue you mentioned: ??We still didn't integrate TimelineClient and AHSClient, the latter of which is RPC interface of getting generic history information via RPC interface.?? But we need to fix this issue as well, right? Is there already a JIRA raised for it, or shall I work on it as part of this JIRA? Also, please let me know if this issue needs to be split into multiple JIRAs (apart from the documentation one you have already raised); I would like to split them up and work on them. I have already started looking into these issues and was also planning to work on the documentation. If you don't mind, could you assign that issue (YARN-2854) to me? > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Junjun updated YARN-2855: Summary: Wish yarn web app use local date format to show app date time (was: Use local date format to show app date time ,) > Wish yarn web app use local date format to show app date time > -- > > Key: YARN-2855 > URL: https://issues.apache.org/jira/browse/YARN-2855 > Project: Hadoop YARN > Issue Type: Wish > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Li Junjun >Priority: Minor > > in yarn.dt.plugins.js > function renderHadoopDate use toUTCString . > I'm in China, so I need to add 8 hours in my mind every time! > I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2855) Use local date format to show app date time ,
Li Junjun created YARN-2855: --- Summary: Use local date format to show app date time , Key: YARN-2855 URL: https://issues.apache.org/jira/browse/YARN-2855 Project: Hadoop YARN Issue Type: Wish Components: resourcemanager Affects Versions: 2.5.1 Reporter: Li Junjun Priority: Minor in yarn.dt.plugins.js function renderHadoopDate use toUTCString . I'm in China, so I need to add 8 hours in my mind every time! I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2848: -- Description: Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster level node additions and removals) will entail managing an application-level "slice" of the cluster resource available to the application for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will occur less frequently than the need to calculate headroom, userlimit, etc (which is a valid assumption given that occurs per-allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and when those have occurred, calculate an application specific cluster resource by comparing cluster nodes to it's own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations into this calculation as it will be efficient to do both at the same time and the single resource value reflecting both constraints could then be used for efficient frequent headroom and userlimit calculations while remaining highly accurate. The application would need to be made aware of nodelabel changes it is interested in (the application or removal of labels of interest to the application to/from nodes). For this purpose, the application submissions's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom (Cases where the application elected to request resources not using the application level label expression are out of scope for this - but for the common usecase of an application which uses a particular expression throughout, userlimit and headroom would be accurate) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future. (was: Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster level node additions and removals) will entail managing an application-level "slice" of the cluster resource available to the application for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will change less frequently than the need to calculate headroom, userlimit, etc (which is a valid assumption given that occurs per-allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and when those have occurred, calculate a application specific cluster resource by comparing cluster nodes to it's own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations into this calculation as it will be efficient to do both at the same time and the single resource value reflecting both constraints could then be used for efficient frequent headroom and userlimit calculations while remaining highly accurate. The application would need to be made aware of nodelabel changes it is interested in (the application or removal of labels of interest to the application to/from nodes). 
For this purpose, the application submissions's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom (Cases where application elected to request resources not using the application level label expression are out of scope for this - but for the common usecase of an application which uses a particular expression throughout, userlimit and headroom would be accurate) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future.) > (FICA) Applications should maintain an application specific 'cluster' > resource to calculate headroom and userlimit > -- > > Key: YARN-2848 > URL: https://issues.apache.org/jira/browse/YARN-2848 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Craig Welch >Assignee: Craig Welch > > Likely solutions to [YARN-1680] (properly handling node and rack blacklisting > with cluster level node additions and removals) will entail managing an > application-level "slice" of the cluster resource available to the > application for use in accurately calculating the application headroom and > user limit. There is an assumption that events which impact this resource > will occur less frequently than the need to c
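To make the YARN-2848 proposal concrete: the core step is recomputing an application-specific "cluster" resource whenever node membership (or, later, node labels) changes, and then using that value in headroom and user-limit calculations. The sketch below is a rough, hypothetical illustration of that step only; it is not code from any patch, and the NodeView type and its accessors are made up for the example (only the Resources utility calls are real YARN APIs).
{code}
import java.util.Collection;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class AppClusterResourceSketch {

  /** Hypothetical per-node view used only for this sketch (not a YARN class). */
  public interface NodeView {
    String getNodeName();
    String getRackName();
    Resource getTotalCapability();
  }

  /**
   * Start from the full cluster resource and subtract nodes the application has
   * blacklisted by host or rack. A fuller version would also drop nodes excluded
   * by the app's node-label expression, and would only re-run on node add/remove
   * (or label change) events rather than on every headroom calculation.
   */
  public static Resource appClusterResource(Resource clusterResource,
      Collection<NodeView> nodes, Set<String> blacklist) {
    Resource result = Resources.clone(clusterResource);
    for (NodeView node : nodes) {
      if (blacklist.contains(node.getNodeName())
          || blacklist.contains(node.getRackName())) {
        Resources.subtractFrom(result, node.getTotalCapability());
      }
    }
    return result;
  }
}
{code}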
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207486#comment-14207486 ] Hadoop QA commented on YARN-2853: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680948/YARN-2853.1.patch against trunk revision 163bb55. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5821//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5821//console This message is automatically generated. > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch, YARN-2853.1.patch > > > When killing an app, app first moves to KILLING state, If RMAppAttempt > receives the attempt_unregister event before attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > KILLED state and stays at KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207485#comment-14207485 ] Sangjin Lee commented on YARN-2236: --- Thanks Karthik! Let me review them, and see what I can do. Just a quick question, in 2, did you mean marking the entire class BuilderUtils as Private or only the methods that are touched by this JIRA? > Shared Cache uploader service on the Node Manager > - > > Key: YARN-2236 > URL: https://issues.apache.org/jira/browse/YARN-2236 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, > YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch > > > Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207464#comment-14207464 ] Karthik Kambatla commented on YARN-2236: Sorry for the delay on this, Sangjin. The patch looks generally good, but for some minor comments: # LocalResource - mark the methods Public-Unstable for now, we can mark them Public-Stable once the feature is complete. # Unrelated to this patch, can we mark BuilderUtils @Private for clarity. # Also, mark FSDownload#isPublic @Private # Rename ContainerImpl#storeSharedCacheUploadPolicies to storeSharedCacheUploadPolicy? Also, should use block comments instead of line comments. # LocalResourceRequest - LOG is unused, we should probably get rid of it along with its imports. # SharedCacheChecksumFactory ## In the map, can we use Class instead of String? ## getCheckSum should use conf.getClass for getting the class name, and ReflectionUtils.newInstance for instantiation to go with the rest of the YARN code. Refer to RMProxy for further information. # Nit: SharedCacheUploader#call - remove the TODOs # Instead of creating an event and submitting through the event-handler, would it be simpler to synchronously submit it since we are queueing it up to the executor anyway? > Shared Cache uploader service on the Node Manager > - > > Key: YARN-2236 > URL: https://issues.apache.org/jira/browse/YARN-2236 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, > YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch > > > Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
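Karthik's review point about getCheckSum refers to the standard Hadoop configuration-driven instantiation pattern. The sketch below illustrates that pattern only; the interface, config key, and class names are made up for the example and are not the ones in the YARN-2236 patch — only Configuration.getClass and ReflectionUtils.newInstance are real Hadoop APIs.
{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class ChecksumFactorySketch {

  /** Hypothetical checksum interface standing in for the one in the patch. */
  public interface SharedCacheChecksum {
    String computeChecksum(InputStream in) throws IOException;
  }

  // Assumed config key, for illustration only.
  public static final String CHECKSUM_IMPL_KEY = "yarn.sharedcache.checksum.impl";

  /**
   * conf.getClass resolves the configured implementation (falling back to the
   * supplied default), and ReflectionUtils.newInstance instantiates it, injecting
   * the Configuration if the class is Configurable -- the same pattern RMProxy
   * and much of the rest of YARN use.
   */
  public static SharedCacheChecksum getChecksum(Configuration conf,
      Class<? extends SharedCacheChecksum> defaultImpl) {
    Class<? extends SharedCacheChecksum> clazz =
        conf.getClass(CHECKSUM_IMPL_KEY, defaultImpl, SharedCacheChecksum.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}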
[jira] [Comment Edited] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206671#comment-14206671 ] Zhijie Shen edited comment on YARN-2838 at 11/12/14 12:44 AM: -- [~Naganarasimha], sorry for not responding you immediately as being busy on finalizing 2.6. A quick scan through your issue document. Here's my clarification: 1. While the entry point of the this sub-module is still called ApplicationHistoryServer, it is actually generalized to be TimelineServer right now (definitely we need to refactor the code at some point). The baseline service provided the the timeline server is to allow the cluster and its apps to store their history information, metrics and so on by complying with the defined timeline data model. Later on, users and admins can query this information to do the analysis. 2. Application history (or we prefer to call it generic history service) is now a built-in service in the timeline server to record the generic history information of YARN apps. It was on a separate store (on FS), but after YARN-2033, it has been moved to the timeline store too, as a payload. We replace the old storage layer, but keep the existing interfaces (web UI, services, CLI) not changed to be the analog of what RM provides for running apps. We still didn't integrate TimelineClient and AHSClient, the latter of which is RPC interface of getting generic history information via RPC interface. APPLICATION_HISTORY_ENABLED is the only remaining old config to control whether we also want to pull the app info from the generic history service inside the timeline server. You may want to take a look at YARN-2033 to get more context about the change. Moreover, as a number of limitation of the old history store, we're no longer going to support it. 3. The document is definitely staled. I'll file separate document Jira, however, it's too late for 2.6. Let's target 2.7 for an up-to-date document about timeline service and its built-in generic history service (YARN-2854). Does it sound good? was (Author: zjshen): [~Naganarasimha], sorry for not responding you immediately as being busy on finalizing 2.6. A quick scan through your issue document. Here's my clarification: 1. While the entry point of the this sub-module is still called ApplicationHistoryServer, it is actually generalized to be TimelineServer right now (definitely we need to refactor the code at some point). The baseline service provided the the timeline server is to allow the cluster and its apps to store their history information, metrics and so on by complying with the defined timeline data model. Later on, users and admins can query this information to do the analysis. 2. Application history (or we prefer to call it generic history service) is now a built-in service in the timeline server to record the generic history information of YARN apps. It was on a separate store (on FS), but after YARN-2033, it has been moved to the timeline store too, as a payload. We replace the old storage layer, but keep the existing interfaces (web UI, services, CLI) not changed to be the analog of what RM provides for running apps. We still didn't integrate TimelineClient and AHSClient, the latter of which is RPC interface of getting generic history information via RPC interface. APPLICATION_HISTORY_ENABLED is the only remaining old config to control whether we also want to pull the app info from the generic history service inside the timeline server. 
You may want to take a look at YARN-2033 to get more context about the change. Moreover, as a number of limitation of the old history store, we're no longer going to support it. 3. The document is definitely staled. I'll file separate document Jira, however, it's too late for 2.6. Let's target 2.7 for an up-to-date document about timeline service and its built-in generic history service. Does it sound good? > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2854) The document about timeline service and generic service needs to be updated
Zhijie Shen created YARN-2854: - Summary: The document about timeline service and generic service needs to be updated Key: YARN-2854 URL: https://issues.apache.org/jira/browse/YARN-2854 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is > minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207405#comment-14207405 ] Craig Welch commented on YARN-2637: --- I think the fix is fairly straightforward - there is an "amResource" property on the SchedulerApplicationAttempt / FiCaSchedulerApp, but it does not appear to be populated in the CapacityScheduler case (it should be, and the information is available in the submission / from the resource requests of the application). Populate this value, and then add a Resource property to LeafQueue which represents the resources used by active application masters. When an application starts, add its amResource value to the LeafQueue's active application master resource value; when an application ends, remove it. Before starting an application, compare the sum of the active application masters' resources + the new application's AM resource to the resource represented by the percentage of cluster resource allowed to be used by AMs in the queue (this can differ by queue...), and if it exceeds that value, do not start the application. The existing trickle-down logic based on the minimum allocation should be removed; the logic regarding how many applications can be running based on explicit configuration should be retained. {code} if ((queue.activeApplicationMasterResourceTotal + readyToStartApplication.applicationMasterResource) <= queue.portionOfClusterResourceAllowedForApplicationMaster * clusterResource && runningApplications + 1 <= maxAllowedApplications) { queue.startTheApp } {code} > maximum-am-resource-percent could be violated when resource of AM is > > minimumAllocation > > > Key: YARN-2637 > URL: https://issues.apache.org/jira/browse/YARN-2637 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Wangda Tan >Priority: Critical > > Currently, number of AM in leaf queue will be calculated in following way: > {code} > max_am_resource = queue_max_capacity * maximum_am_resource_percent > #max_am_number = max_am_resource / minimum_allocation > #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor > {code} > And when submit new application to RM, it will check if an app can be > activated in following way: > {code} > for (Iterator i=pendingApplications.iterator(); > i.hasNext(); ) { > FiCaSchedulerApp application = i.next(); > > // Check queue limit > if (getNumActiveApplications() >= getMaximumActiveApplications()) { > break; > } > > // Check user limit > User user = getUser(application.getUser()); > if (user.getActiveApplications() < > getMaximumActiveApplicationsPerUser()) { > user.activateApplication(); > activeApplications.add(application); > i.remove(); > LOG.info("Application " + application.getApplicationId() + > " from user: " + application.getUser() + > " activated in queue: " + getQueueName()); > } > } > {code} > An example is, > If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum > resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be > launched is 200, and if user uses 5M for each AM (> minimum_allocation). All > apps can still be activated, and it will occupy all resource of a queue > instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207390#comment-14207390 ] Ravi Prakash commented on YARN-1964: I'm a +1 on this patch. I'll commit it to trunk and branch-2 soon. Soon as I get confirmation from Arun, I'll commit it into branch-2.6 as well. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2853: -- Attachment: (was: YARN-2853.1.patch) > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch, YARN-2853.1.patch > > > When killing an app, the app first moves to the KILLING state. If RMAppAttempt > receives the attempt_unregister event before the attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2853: -- Attachment: YARN-2853.1.patch > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch, YARN-2853.1.patch > > > When killing an app, the app first moves to the KILLING state. If RMAppAttempt > receives the attempt_unregister event before the attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2853: -- Attachment: YARN-2853.1.patch > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch, YARN-2853.1.patch > > > When killing an app, the app first moves to the KILLING state. If RMAppAttempt > receives the attempt_unregister event before the attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2806) log container allocation requests
[ https://issues.apache.org/jira/browse/YARN-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207304#comment-14207304 ] Hadoop QA commented on YARN-2806: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680883/YARN-2806.patch against trunk revision 163bb55. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5819//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5819//console This message is automatically generated. > log container allocation requests > - > > Key: YARN-2806 > URL: https://issues.apache.org/jira/browse/YARN-2806 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Allen Wittenauer > Attachments: YARN-2806.patch > > > I might have missed it, but I don't see where we log application container > requests outside of the DEBUG context. Without this being logged, we have no > idea on a per application the lag an application might be having in the > allocation system. > We should probably add this as an event to the RM audit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207303#comment-14207303 ] Hadoop QA commented on YARN-2853: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680930/YARN-2853.1.patch against trunk revision 163bb55. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5820//console This message is automatically generated. > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch > > > When killing an app, the app first moves to the KILLING state. If RMAppAttempt > receives the attempt_unregister event before the attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207292#comment-14207292 ] Jian He commented on YARN-2853: --- Instead, we could get rid of the KILLING state completely, let the app stay at its original state, and change RMApp to handle the attempt_killed event at each possible state. This way, we could avoid race conditions like this one. I'll file a separate JIRA to do this. > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch > > > When killing an app, the app first moves to the KILLING state. If RMAppAttempt > receives the attempt_unregister event before the attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2853: -- Attachment: YARN-2853.1.patch Uploaded a patch to handle the possible attempt_unregistered, attempt_failed, and attempt_finished events at the app KILLING state. > Killing app may hang while AM is unregistering > -- > > Key: YARN-2853 > URL: https://issues.apache.org/jira/browse/YARN-2853 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2853.1.patch > > > When killing an app, the app first moves to the KILLING state. If RMAppAttempt > receives the attempt_unregister event before the attempt_kill event, it'll > ignore the later attempt_kill event. Hence, RMApp won't be able to move to > the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
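As a rough illustration of why handling these events matters, here is a small self-contained state-machine sketch in plain Java; the states, events, and transitions are simplified placeholders, not the real RMApp/RMAppImpl state machine. Without the extra cases at KILLING, an attempt event that arrives before ATTEMPT_KILLED is dropped and the app never leaves KILLING:
{code}
// Simplified sketch only; not the actual YARN state machine.
enum AppState { RUNNING, KILLING, FINAL_SAVING, KILLED }
enum AppEvent { KILL, ATTEMPT_UNREGISTERED, ATTEMPT_FINISHED, ATTEMPT_FAILED, ATTEMPT_KILLED, APP_SAVED }

class AppStateMachineSketch {
  AppState state = AppState.RUNNING;

  void handle(AppEvent event) {
    switch (state) {
      case RUNNING:
        if (event == AppEvent.KILL) {
          state = AppState.KILLING;
        }
        break;
      case KILLING:
        // Without these cases, an attempt event that races ahead of
        // ATTEMPT_KILLED would be ignored and the app would hang here forever.
        if (event == AppEvent.ATTEMPT_KILLED
            || event == AppEvent.ATTEMPT_UNREGISTERED
            || event == AppEvent.ATTEMPT_FINISHED
            || event == AppEvent.ATTEMPT_FAILED) {
          state = AppState.FINAL_SAVING;
        }
        break;
      case FINAL_SAVING:
        if (event == AppEvent.APP_SAVED) {
          state = AppState.KILLED;
        }
        break;
      default:
        break;
    }
  }
}
{code}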
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207268#comment-14207268 ] Wangda Tan commented on YARN-2729: -- Hi [~Naganarasimha], IIRC, the script-based patch should be based on YARN-2495, and we should create a script-based labels provider that extends NodeLabelsProviderService, correct? But I haven't seen much relationship between this and YARN-2495 besides the configuration options. Please let me know if I understood incorrectly. Thanks, Wangda > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup > --- > > Key: YARN-2729 > URL: https://issues.apache.org/jira/browse/YARN-2729 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, > YARN-2729.20141031-1.patch > > > Support script based NodeLabelsProvider Interface in Distributed Node Label > Configuration Setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2853) Killing app may hang while AM is unregistering
Jian He created YARN-2853: - Summary: Killing app may hang while AM is unregistering Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He When killing an app, the app first moves to the KILLING state. If RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to the KILLED state and stays at the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node
[ https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207251#comment-14207251 ] Karthik Kambatla commented on YARN-2604: Looks mostly good. Just want to confirm - when there are no nodes connected to the RM, the patch sets the max-allocation to the configured value and not zero. I think this is good, otherwise all apps will get rejected immediately after the RM (re)starts. Actually, I wonder if we should add a config to specify either (a) a particular number of NMs after which this behavior kicks in or (b) a minimum/floor value for the configurable maximum (min-max-allocation :P). [~jlowe] - do you think such a config would be useful? Few comments on the patch itself: # We should have tests similar to TestFifoScheduler#testMaximumAllocation for Capacity and FairSchedulers. # Nit: Rename AbstractYarnScheduler#realMaximumAllocation to configuredMaximumAllocation? And, in all the schedulers, we should set configuredMaximumAllocation first and then set maximumAllocation to that. Also, given both these fields are in AbstractYarnScheduler, I wouldn't refer to them using {{this.}} in the sub-classes. # Nit: With locks and unlocks, we follow the following convention in YARN. Mind updating accordingly? {code} lock.lock(); try { // do your thing } finally { lock.unlock(); } {code} > Scheduler should consider max-allocation-* in conjunction with the largest > node > --- > > Key: YARN-2604 > URL: https://issues.apache.org/jira/browse/YARN-2604 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Robert Kanter > Attachments: YARN-2604.patch, YARN-2604.patch, YARN-2604.patch > > > If the scheduler max-allocation-* values are larger than the resources > available on the largest node in the cluster, an application requesting > resources between the two values will be accepted by the scheduler but the > requests will never be satisfied. The app essentially hangs forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
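For illustration, one way the scheduler could derive an effective maximum allocation along the lines discussed above is sketched here; the class and method names are hypothetical, not the actual patch:
{code}
// Hypothetical sketch: cap the configured maximum allocation by the largest
// registered node, falling back to the configured value when no NMs are connected.
import org.apache.hadoop.yarn.api.records.Resource;

public class EffectiveMaxAllocationSketch {
  static Resource effectiveMaxAllocation(Resource configuredMaximumAllocation,
      Resource largestRegisteredNode /* null when no nodes have registered */) {
    if (largestRegisteredNode == null) {
      // Keep the configured value so apps are not rejected right after an RM (re)start.
      return configuredMaximumAllocation;
    }
    return Resource.newInstance(
        Math.min(configuredMaximumAllocation.getMemory(), largestRegisteredNode.getMemory()),
        Math.min(configuredMaximumAllocation.getVirtualCores(), largestRegisteredNode.getVirtualCores()));
  }
}
{code}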
[jira] [Commented] (YARN-2009) Priority support for preemption in ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207194#comment-14207194 ] Carlo Curino commented on YARN-2009: Sunil, a thing to keep in mind is that any preemption action you initiate has a significant delay, i.e., we will see its effects only a while later, likely under somewhat changed cluster conditions, app needs, etc. For this reason we decided to maximize the flexibility of the application being preempted (allowing for late binding on which containers to yield back), instead of constraining the requests with strict locality preferences. The intuition is that we have a better chance to be efficient on the preempted side than on the preempting side (already-running tasks and immediate impact, vs. hypotheses about task locality for future containers). I don't have any strong evidence to back those intuitions (which are likely to hold for some workloads but probably not all), but I suggest you consider these concerns, and maybe devise some good experiments to test whether locality-centric preemption gets you the benefit you hope for (it is otherwise an unnecessary complication that has hard-to-understand interactions with fairness/priorities, etc.). Similar thoughts apply to node labels; however, I believe in that context the needs are likely to be more "stable" over time, so preempting in a label-aware manner might be good. My 2 cents, Carlo > Priority support for preemption in ProportionalCapacityPreemptionPolicy > --- > > Key: YARN-2009 > URL: https://issues.apache.org/jira/browse/YARN-2009 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Devaraj K >Assignee: Sunil G > > While preempting containers based on the queue ideal assignment, we may need > to consider preempting the low priority application containers first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207172#comment-14207172 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], thanks for the update. Major comments: 1. Regarding the new requirement you added: bq. The current patch only allows the disable queue preemption flag to be set on leaf queues. However, after discussing this internally, we need to be able to have leaf queues inherit this property from their parent. I think the feature makes sense, but I don't think we can really achieve it today. In your example, let's say root has a and b, a has a1/a2, and b has b1/b2. It may be no problem to set a as non-preemptable while a1/a2 are preemptable; the existing ideal_capacity calculation algorithm will consider this and mark containers to be preempted in a1/a2 as you expected. However, you cannot say that, in this case, b cannot preempt resource from a, because when a container is preempted, the freed resource is available for everyone to use. For example, the resource freed after preempting a container from a2 is not dedicated to a1 only. So the statements are not always true: {code} A should not be preemptable A1 and A2 should be able to preempt each other {code} So my opinion is to not do it now, since we don't have the corresponding logic on the CS side for this feature. Minor comments: 1. The name getUnderservedQueues is a little confusing to me; it should be getMostUnderServedQueues. 2. I would suggest creating a method CapacitySchedulerConfiguration.getPreemptionEnabled(queue). 3. In {{getUnderservedQueues}}, you can simply use tqComparator.compare(q0, q1) instead of calculating idealPctGuaranteed. 4. It's better to rename idealPctGuaranteed -> calculate (or get)IdealPctGuaranteed. For the tests, I'll review them after we reach a decision about my concern in the *major comments*. Wangda > Disable preemption at Queue level > - > > Key: YARN-2056 > URL: https://issues.apache.org/jira/browse/YARN-2056 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal >Assignee: Eric Payne > Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, > YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, > YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, > YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, > YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, > YARN-2056.201410232244.txt, YARN-2056.201410311746.txt, > YARN-2056.201411041635.txt, YARN-2056.201411072153.txt > > > We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
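As a rough sketch of the suggested CapacitySchedulerConfiguration.getPreemptionEnabled(queue) helper, assuming a hypothetical property name and default (not the final YARN-2056 configuration keys):
{code}
// Rough sketch; the property name and default value are assumptions.
import org.apache.hadoop.conf.Configuration;

public class PreemptionConfigSketch {
  static final String PREFIX = "yarn.scheduler.capacity.";
  static final String DISABLE_PREEMPTION = ".disable_preemption";

  /** Returns true if preemption is enabled for the given queue path, e.g. "root.a". */
  static boolean getPreemptionEnabled(Configuration conf, String queuePath) {
    boolean disabled = conf.getBoolean(PREFIX + queuePath + DISABLE_PREEMPTION, false);
    return !disabled;
  }
}
{code}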
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207159#comment-14207159 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680880/YARN-1964.patch against trunk revision 99d9d0c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5817//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5817//console This message is automatically generated. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207140#comment-14207140 ] Siqi Li commented on YARN-2811: --- [~sandyr] Thank you for pointing out the hierarchical scenario. I have updated the patch that deals with that case. > Fair Scheduler is violating max memory settings in 2.4 > -- > > Key: YARN-2811 > URL: https://issues.apache.org/jira/browse/YARN-2811 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, > YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch > > > This has been seen on several queues showing the allocated MB going > significantly above the max MB and it appears to have started with the 2.4 > upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2806) log container allocation requests
[ https://issues.apache.org/jira/browse/YARN-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207122#comment-14207122 ] Hadoop QA commented on YARN-2806: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680883/YARN-2806.patch against trunk revision 456b973. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5818//console This message is automatically generated. > log container allocation requests > - > > Key: YARN-2806 > URL: https://issues.apache.org/jira/browse/YARN-2806 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Allen Wittenauer > Attachments: YARN-2806.patch > > > I might have missed it, but I don't see where we log application container > requests outside of the DEBUG context. Without this being logged, we have no > idea on a per application the lag an application might be having in the > allocation system. > We should probably add this as an event to the RM audit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207123#comment-14207123 ] Hadoop QA commented on YARN-2811: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680866/YARN-2811.v5.patch against trunk revision 061bc29. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5816//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5816//console This message is automatically generated. > Fair Scheduler is violating max memory settings in 2.4 > -- > > Key: YARN-2811 > URL: https://issues.apache.org/jira/browse/YARN-2811 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, > YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch > > > This has been seen on several queues showing the allocated MB going > significantly above the max MB and it appears to have started with the 2.4 > upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207114#comment-14207114 ] Hudson commented on YARN-570: - SUCCESS: Integrated in Hadoop-trunk-Commit #6514 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6514/]) YARN-570. Time strings are formated in different timezone. (Akira Ajisaka and Peng Zhang via kasha) (kasha: rev 456b973819904e9647dabad292d2d6205dd84399) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/yarn.dt.plugins.js * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Times.java > Time strings are formated in different timezone > --- > > Key: YARN-570 > URL: https://issues.apache.org/jira/browse/YARN-570 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.2.0 >Reporter: Peng Zhang >Assignee: Akira AJISAKA > Fix For: 2.7.0 > > Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, > YARN-570.3.patch, YARN-570.4.patch, YARN-570.5.patch > > > Time strings on different page are displayed in different timezone. > If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as > "Wed, 10 Apr 2013 08:29:56 GMT" > If it is formatted by format() in yarn.util.Times, it appears as "10-Apr-2013 > 16:29:56" > Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
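For reference, the underlying issue is simply that the same epoch timestamp was rendered with different time zones in different places. A minimal illustration (not the committed fix) of pinning the time zone when formatting:
{code}
// Illustration only: the same instant rendered consistently by fixing the time zone.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeZoneFormatSketch {
  static String format(long ts, TimeZone tz) {
    if (ts <= 0) {
      return "N/A";
    }
    SimpleDateFormat fmt = new SimpleDateFormat("EEE dd MMM yyyy HH:mm:ss Z");
    fmt.setTimeZone(tz);
    return fmt.format(new Date(ts));
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(format(now, TimeZone.getTimeZone("GMT"))); // GMT rendering
    System.out.println(format(now, TimeZone.getDefault()));       // server-local rendering
  }
}
{code}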
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207109#comment-14207109 ] Craig Welch commented on YARN-2848: --- There are a couple of different ways events at the cluster level (nodelabel additions/removals, node additions/removals) could be handled by the application to update its own resource: they could merely be a trigger that causes the application to recalculate the value from scratch (just a "last event" map/value set in the scheduler/etc. (topical, node add/remove + per label) (serial/vector clock value/ts, etc.)), or they could include sufficient information for the application to adjust its resource without necessarily having to look at a global view (per-node "labels added", "labels removed", the node (incl. rack) which was added to or removed from the cluster) (perhaps available "for a time" or "for N changes", with fallback to a global calculation - this may be more complex than is warranted). > (FICA) Applications should maintain an application specific 'cluster' > resource to calculate headroom and userlimit > -- > > Key: YARN-2848 > URL: https://issues.apache.org/jira/browse/YARN-2848 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Craig Welch >Assignee: Craig Welch > > Likely solutions to [YARN-1680] (properly handling node and rack blacklisting > with cluster-level node additions and removals) will entail managing an > application-level "slice" of the cluster resource available to the > application, for use in accurately calculating the application headroom and > user limit. There is an assumption that events which impact this resource > will change less frequently than the need to calculate headroom, userlimit, > etc. (which is a valid assumption given that the latter occurs per allocation heartbeat). > Given that, the application should (with assistance from cluster-level > code...) detect changes to the composition of the cluster (node addition, > removal) and, when those have occurred, calculate an application-specific > cluster resource by comparing cluster nodes to its own blacklist (both rack > and individual node). I think it makes sense to include nodelabel > considerations in this calculation, as it will be efficient to do both at > the same time, and the single resource value reflecting both constraints could > then be used for efficient, frequent headroom and userlimit calculations while > remaining highly accurate. The application would need to be made aware of > nodelabel changes it is interested in (the addition or removal of labels > of interest to the application to/from nodes). For this purpose, the > application submission's nodelabel expression would be used to determine the > nodelabel impact on the resource used to calculate userlimit and headroom. > (Cases where an application elected to request resources not using the > application-level label expression are out of scope for this - but for the > common use case of an application which uses a particular expression > throughout, userlimit and headroom would be accurate.) This could also provide > an overall mechanism for handling application-specific resource constraints > which might be added in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
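As an illustration of the first option (recalculating from scratch only when a cluster-change trigger has been seen), a minimal sketch with entirely hypothetical names follows; it is not the CapacityScheduler implementation:
{code}
// Hypothetical sketch: cache an application-specific cluster resource and only
// recompute it when a cluster-change "version" (last event marker) advances.
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class AppClusterResourceSketch {
  private long lastSeenClusterVersion = -1;
  private Resource appClusterResource = Resources.createResource(0, 0);

  Resource get(long clusterVersion, Map<String, Resource> nodeResources,
      Set<String> blacklistedNodes) {
    if (clusterVersion != lastSeenClusterVersion) {
      Resource total = Resources.createResource(0, 0);
      for (Map.Entry<String, Resource> e : nodeResources.entrySet()) {
        // Skip nodes this application has blacklisted (rack handling omitted).
        if (!blacklistedNodes.contains(e.getKey())) {
          Resources.addTo(total, e.getValue());
        }
      }
      appClusterResource = total;
      lastSeenClusterVersion = clusterVersion;
    }
    return appClusterResource;
  }
}
{code}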
[jira] [Commented] (YARN-2806) log container allocation requests
[ https://issues.apache.org/jira/browse/YARN-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207108#comment-14207108 ] Eric Wohlstadter commented on YARN-2806: Added a patch for AppSchedulingInfo.updateResourceRequests. > log container allocation requests > - > > Key: YARN-2806 > URL: https://issues.apache.org/jira/browse/YARN-2806 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Allen Wittenauer > Attachments: YARN-2806.patch > > > I might have missed it, but I don't see where we log application container > requests outside of the DEBUG context. Without this being logged, we have no > idea, on a per-application basis, of the lag an application might be experiencing in the > allocation system. > We should probably add this as an event to the RM audit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2806) log container allocation requests
[ https://issues.apache.org/jira/browse/YARN-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Wohlstadter updated YARN-2806: --- Attachment: YARN-2806.patch > log container allocation requests > - > > Key: YARN-2806 > URL: https://issues.apache.org/jira/browse/YARN-2806 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Allen Wittenauer > Attachments: YARN-2806.patch > > > I might have missed it, but I don't see where we log application container > requests outside of the DEBUG context. Without this being logged, we have no > idea, on a per-application basis, of the lag an application might be experiencing in the > allocation system. > We should probably add this as an event to the RM audit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2852) WebUI: Add disk I/O resource information to the web ui
Wei Yan created YARN-2852: - Summary: WebUI: Add disk I/O resource information to the web ui Key: YARN-2852 URL: https://issues.apache.org/jira/browse/YARN-2852 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2851) YarnClient: Add support for disk I/O resource/request information
Wei Yan created YARN-2851: - Summary: YarnClient: Add support for disk I/O resource/request information Key: YARN-2851 URL: https://issues.apache.org/jira/browse/YARN-2851 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207088#comment-14207088 ] Craig Welch commented on YARN-2848: --- To be clear wrt node labels - this is to enable accurate support, from a headroom and userlimit perspective, for more complex label expressions. At present I believe single-label expressions in relation to (up to) single-label nodes can be accurate; this should allow for accuracy in more sophisticated scenarios. > (FICA) Applications should maintain an application specific 'cluster' > resource to calculate headroom and userlimit > -- > > Key: YARN-2848 > URL: https://issues.apache.org/jira/browse/YARN-2848 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Craig Welch >Assignee: Craig Welch > > Likely solutions to [YARN-1680] (properly handling node and rack blacklisting > with cluster-level node additions and removals) will entail managing an > application-level "slice" of the cluster resource available to the > application, for use in accurately calculating the application headroom and > user limit. There is an assumption that events which impact this resource > will change less frequently than the need to calculate headroom, userlimit, > etc. (which is a valid assumption given that the latter occurs per allocation heartbeat). > Given that, the application should (with assistance from cluster-level > code...) detect changes to the composition of the cluster (node addition, > removal) and, when those have occurred, calculate an application-specific > cluster resource by comparing cluster nodes to its own blacklist (both rack > and individual node). I think it makes sense to include nodelabel > considerations in this calculation, as it will be efficient to do both at > the same time, and the single resource value reflecting both constraints could > then be used for efficient, frequent headroom and userlimit calculations while > remaining highly accurate. The application would need to be made aware of > nodelabel changes it is interested in (the addition or removal of labels > of interest to the application to/from nodes). For this purpose, the > application submission's nodelabel expression would be used to determine the > nodelabel impact on the resource used to calculate userlimit and headroom. > (Cases where an application elected to request resources not using the > application-level label expression are out of scope for this - but for the > common use case of an application which uses a particular expression > throughout, userlimit and headroom would be accurate.) This could also provide > an overall mechanism for handling application-specific resource constraints > which might be added in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2850) DistributedShell: Add support for disk I/O request
Wei Yan created YARN-2850: - Summary: DistributedShell: Add support for disk I/O request Key: YARN-2850 URL: https://issues.apache.org/jira/browse/YARN-2850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2849) MRAppMaster: Add support for disk I/O request
Wei Yan created YARN-2849: - Summary: MRAppMaster: Add support for disk I/O request Key: YARN-2849 URL: https://issues.apache.org/jira/browse/YARN-2849 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207083#comment-14207083 ] Karthik Kambatla commented on YARN-570: --- The patch looks reasonable. +1, relying on others' testing. Checking this in, will add one comment in Times.java in the process. > Time strings are formated in different timezone > --- > > Key: YARN-570 > URL: https://issues.apache.org/jira/browse/YARN-570 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.2.0 >Reporter: Peng Zhang >Assignee: Akira AJISAKA > Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, > YARN-570.3.patch, YARN-570.4.patch, YARN-570.5.patch > > > Time strings on different page are displayed in different timezone. > If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as > "Wed, 10 Apr 2013 08:29:56 GMT" > If it is formatted by format() in yarn.util.Times, it appears as "10-Apr-2013 > 16:29:56" > Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207080#comment-14207080 ] Swapnil Daingade commented on YARN-2791: Thanks Karthik Kambatla. Sure, let's make this a sub-task of YARN-2139. > Add Disk as a resource for scheduling > - > > Key: YARN-2791 > URL: https://issues.apache.org/jira/browse/YARN-2791 > Project: Hadoop YARN > Issue Type: New Feature > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Swapnil Daingade >Assignee: Yuliya Feldman > Attachments: DiskDriveAsResourceInYARN.pdf > > > Currently, the number of disks present on a node is not considered a factor > while scheduling containers on that node. Having a large amount of memory on a > node can lead to a high number of containers being launched on that node, all > of which compete for I/O bandwidth. This multiplexing of I/O across > containers can lead to slower overall progress and sub-optimal resource > utilization, as containers starved for I/O bandwidth hold on to other > resources like cpu and memory. This problem can be solved by considering disk > as a resource and including it in deciding how many containers can be > concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch Added docs to the site.xml > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207060#comment-14207060 ] Craig Welch commented on YARN-2848: --- There are avenues to enhance this later for multiple nodelabel expressions if so desired; likely the api for headroom, etc. would need to be broadened to include a label expression, and a "set" of application label expressions for calculating resources would need to be held. This is currently specific to the Capacity Scheduler, but might be applicable to other schedulers as well. > (FICA) Applications should maintain an application specific 'cluster' > resource to calculate headroom and userlimit > -- > > Key: YARN-2848 > URL: https://issues.apache.org/jira/browse/YARN-2848 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Craig Welch >Assignee: Craig Welch > > Likely solutions to [YARN-1680] (properly handling node and rack blacklisting > with cluster-level node additions and removals) will entail managing an > application-level "slice" of the cluster resource available to the > application, for use in accurately calculating the application headroom and > user limit. There is an assumption that events which impact this resource > will change less frequently than the need to calculate headroom, userlimit, > etc. (which is a valid assumption given that the latter occurs per allocation heartbeat). > Given that, the application should (with assistance from cluster-level > code...) detect changes to the composition of the cluster (node addition, > removal) and, when those have occurred, calculate an application-specific > cluster resource by comparing cluster nodes to its own blacklist (both rack > and individual node). I think it makes sense to include nodelabel > considerations in this calculation, as it will be efficient to do both at > the same time, and the single resource value reflecting both constraints could > then be used for efficient, frequent headroom and userlimit calculations while > remaining highly accurate. The application would need to be made aware of > nodelabel changes it is interested in (the addition or removal of labels > of interest to the application to/from nodes). For this purpose, the > application submission's nodelabel expression would be used to determine the > nodelabel impact on the resource used to calculate userlimit and headroom. > (Cases where an application elected to request resources not using the > application-level label expression are out of scope for this - but for the > common use case of an application which uses a particular expression > throughout, userlimit and headroom would be accurate.) This could also provide > an overall mechanism for handling application-specific resource constraints > which might be added in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207055#comment-14207055 ] Wangda Tan commented on YARN-2495: -- bq. ... hence shall i create a new class extending TestPBImplRecords in hadoop-yarn-server-common project. ? This is an issue; I suggest keeping your code as-is, but please add checks in your tests for the values you added. And in the future, PB objects in h-y-server-common should have an easier way to do testing, as in h-y-common. > Allow admin specify labels from each NM (Distributed configuration) > --- > > Key: YARN-2495 > URL: https://issues.apache.org/jira/browse/YARN-2495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, > YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, > YARN-2495_20141022.1.patch > > > The target of this JIRA is to allow the admin to specify labels on each NM; this covers > - User can set labels on each NM (by setting yarn-site.xml or using the script > suggested by [~aw]) > - NM will send labels to RM via the ResourceTracker API > - RM will set labels in NodeLabelManager when the NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2843) NodeLabels manager should trim all inputs for hosts and labels
[ https://issues.apache.org/jira/browse/YARN-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207057#comment-14207057 ] Wangda Tan commented on YARN-2843: -- Thanks for [~vinodkv]'s review and commit! > NodeLabels manager should trim all inputs for hosts and labels > -- > > Key: YARN-2843 > URL: https://issues.apache.org/jira/browse/YARN-2843 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sushmitha Sreenivasan >Assignee: Wangda Tan > Fix For: 2.7.0 > > Attachments: YARN-2843-1.patch, YARN-2843-2.patch > > > NodeLabels manager should trim all inputs for hosts and labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
Craig Welch created YARN-2848: - Summary: (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit Key: YARN-2848 URL: https://issues.apache.org/jira/browse/YARN-2848 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster-level node additions and removals) will entail managing an application-level "slice" of the cluster resource available to the application, for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will change less frequently than the need to calculate headroom, userlimit, etc. (which is a valid assumption given that the latter occurs per allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and, when those have occurred, calculate an application-specific cluster resource by comparing cluster nodes to its own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations in this calculation, as it will be efficient to do both at the same time, and the single resource value reflecting both constraints could then be used for efficient, frequent headroom and userlimit calculations while remaining highly accurate. The application would need to be made aware of nodelabel changes it is interested in (the addition or removal of labels of interest to the application to/from nodes). For this purpose, the application submission's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom. (Cases where an application elected to request resources not using the application-level label expression are out of scope for this - but for the common use case of an application which uses a particular expression throughout, userlimit and headroom would be accurate.) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207046#comment-14207046 ] Ravi Prakash commented on YARN-1964: Hi Karthik! That's fair. I'll ask Arun if he is willing to re-spin 2.6.0. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2843) NodeLabels manager should trim all inputs for hosts and labels
[ https://issues.apache.org/jira/browse/YARN-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207017#comment-14207017 ] Hudson commented on YARN-2843: -- FAILURE: Integrated in Hadoop-trunk-Commit #6511 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6511/]) YARN-2843. Fixed NodeLabelsManager to trim inputs for hosts and labels so as to make them work correctly. Contributed by Wangda Tan. (vinodkv: rev 0fd97f9c1989a793b882e6678285607472a3f75a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestCommonNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/NodeLabelTestBase.java > NodeLabels manager should trim all inputs for hosts and labels > -- > > Key: YARN-2843 > URL: https://issues.apache.org/jira/browse/YARN-2843 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sushmitha Sreenivasan >Assignee: Wangda Tan > Attachments: YARN-2843-1.patch, YARN-2843-2.patch > > > NodeLabels manager should trim all inputs for hosts and labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
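A minimal sketch of the kind of input normalization described here, assuming illustrative class and method names (not the committed CommonNodeLabelsManager code):
{code}
// Illustrative sketch: trim host/label inputs and drop empty entries.
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class LabelInputTrimSketch {
  static Set<String> normalize(Collection<String> inputs) {
    Set<String> trimmed = new HashSet<String>();
    if (inputs == null) {
      return trimmed;
    }
    for (String s : inputs) {
      if (s == null) {
        continue;
      }
      String t = s.trim();
      if (!t.isEmpty()) {
        trimmed.add(t);
      }
    }
    return trimmed;
  }
}
{code}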
[jira] [Commented] (YARN-2843) NodeLabels manager should trim all inputs for hosts and labels
[ https://issues.apache.org/jira/browse/YARN-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206999#comment-14206999 ] Vinod Kumar Vavilapalli commented on YARN-2843: --- +1, looks good. Checking this in. > NodeLabels manager should trim all inputs for hosts and labels > -- > > Key: YARN-2843 > URL: https://issues.apache.org/jira/browse/YARN-2843 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sushmitha Sreenivasan >Assignee: Wangda Tan > Attachments: YARN-2843-1.patch, YARN-2843-2.patch > > > NodeLabels manager should trim all inputs for hosts and labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206994#comment-14206994 ] Karthik Kambatla commented on YARN-2139: Thanks for the prototype, Wei. In light of the updates on YARN-2791 and YARN-2817, I propose we incorporate suggestions from [~sdaingade] and [~acmurthy] before posting patches for sub-tasks. Updated JIRA title, description, and marked it unassigned as this is an umbrella JIRA. > [Umbrella] Support for Disk as a Resource in YARN > -- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan > Attachments: Disk_IO_Scheduling_Design_1.pdf, > Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, > YARN-2139-prototype.patch > > > YARN should consider disk as another resource for (1) scheduling tasks on > nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206993#comment-14206993 ] Karthik Kambatla commented on YARN-2791: Thanks [~sdaingade] for sharing the design doc. Well articulated. The designs on YARN-2139 and YARN-2791 are very similar, except that the disk resources are called vdisks in YARN-2139 and spindles in YARN-2791. In addition to the items specified here, YARN-2139 talks about isolation as well. Other than that, do you see any major items YARN-2791 covers that YARN-2139 doesn't? The WebUI is good and very desirable; we should definitely include it. Also, I suggest we make this (as is - or split into multiple JIRAs) a sub-task of YARN-2139. Discussing the high-level details on one JIRA helps with aligning on one final design doc based on everyone's suggestions. > Add Disk as a resource for scheduling > - > > Key: YARN-2791 > URL: https://issues.apache.org/jira/browse/YARN-2791 > Project: Hadoop YARN > Issue Type: New Feature > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Swapnil Daingade >Assignee: Yuliya Feldman > Attachments: DiskDriveAsResourceInYARN.pdf > > > Currently, the number of disks present on a node is not considered a factor > while scheduling containers on that node. Having a large amount of memory on a > node can lead to a high number of containers being launched on that node, all > of which compete for I/O bandwidth. This multiplexing of I/O across > containers can lead to slower overall progress and sub-optimal resource > utilization, as containers starved for I/O bandwidth hold on to other > resources like cpu and memory. This problem can be solved by considering disk > as a resource and including it in deciding how many containers can be > concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2139: --- Description: YARN should consider disk as another resource for (1) scheduling tasks on nodes, (2) isolation at runtime, (3) spindle locality. (was: YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime.) > [Umbrella] Support for Disk as a Resource in YARN > -- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan > Attachments: Disk_IO_Scheduling_Design_1.pdf, > Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, > YARN-2139-prototype.patch > > > YARN should consider disk as another resource for (1) scheduling tasks on > nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2139: --- Summary: [Umbrella] Support for Disk as a Resource in YARN (was: Add support for disk IO isolation/scheduling for containers) > [Umbrella] Support for Disk as a Resource in YARN > -- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan > Attachments: Disk_IO_Scheduling_Design_1.pdf, > Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, > YARN-2139-prototype.patch > > > YARN should support considering disk for scheduling tasks on nodes, and > provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2139: --- Assignee: (was: Wei Yan) > Add support for disk IO isolation/scheduling for containers > --- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan > Attachments: Disk_IO_Scheduling_Design_1.pdf, > Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, > YARN-2139-prototype.patch > > > YARN should support considering disk for scheduling tasks on nodes, and > provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2811: -- Attachment: YARN-2811.v5.patch > Fair Scheduler is violating max memory settings in 2.4 > -- > > Key: YARN-2811 > URL: https://issues.apache.org/jira/browse/YARN-2811 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, > YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch > > > This has been seen on several queues showing the allocated MB going > significantly above the max MB and it appears to have started with the 2.4 > upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2817) Disk drive as a resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206981#comment-14206981 ] Karthik Kambatla commented on YARN-2817: Resolving this as a duplicate of YARN-2791 since that JIRA proposes the exact same thing. In any case, I think these should be subtasks for YARN-2139, looking at the prototype code posted there. > Disk drive as a resource in YARN > > > Key: YARN-2817 > URL: https://issues.apache.org/jira/browse/YARN-2817 > Project: Hadoop YARN > Issue Type: New Feature > Components: scheduler >Reporter: Arun C Murthy >Assignee: Arun C Murthy > > As YARN continues to cover new ground in terms of new workloads, disk is > becoming a very important resource to govern. > It might be prudent to start with something very simple - allow applications > to request entire drives (e.g. 2 drives out of the 12 available on a node), > we can then also add support for specific iops, bandwidth etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2817) Disk drive as a resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2817. Resolution: Duplicate > Disk drive as a resource in YARN > > > Key: YARN-2817 > URL: https://issues.apache.org/jira/browse/YARN-2817 > Project: Hadoop YARN > Issue Type: New Feature > Components: scheduler >Reporter: Arun C Murthy >Assignee: Arun C Murthy > > As YARN continues to cover new ground in terms of new workloads, disk is > becoming a very important resource to govern. > It might be prudent to start with something very simple - allow applications > to request entire drives (e.g. 2 drives out of the 12 available on a node), > we can then also add support for specific iops, bandwidth etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2423: Target Version/s: 2.7.0 > TimelineClient should wrap all GET APIs to facilitate Java users > > > Key: YARN-2423 > URL: https://issues.apache.org/jira/browse/YARN-2423 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Robert Kanter > Attachments: YARN-2423.patch, YARN-2423.patch, YARN-2423.patch > > > TimelineClient provides the Java method to put timeline entities. It's also > good to wrap over all GET APIs (both entity and domain), and deserialize the > json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206939#comment-14206939 ] Karthik Kambatla commented on YARN-1964: [~raviprak] - I would prefer backporting it to branch-2.6 only if it goes into 2.6.0 release, so we can avoid including features in point releases. In any case, the plan is to release 2.7.0 soon after 2.6.0. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206855#comment-14206855 ] Ravi Prakash commented on YARN-1964: Thanks Abin! The patch is looking really good now. However the documentation doesn't seem to be compiling for me. Once that is figured out, I'm a +1. I am looking to commit it EOD today to trunk, branch-2, branch-2.6. I'd like to commit it to 2.6 also and request a respin of the RC. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2847) Linux native container executor segfaults if default banned user detected
[ https://issues.apache.org/jira/browse/YARN-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206801#comment-14206801 ] Jason Lowe commented on YARN-2847: -- The problem is in this code: {code} char **banned_users = get_values(BANNED_USERS_KEY); char **banned_user = (banned_users == NULL) ? (char**) DEFAULT_BANNED_USERS : banned_users; for(; *banned_user; ++banned_user) { if (strcmp(*banned_user, user) == 0) { free(user_info); if (banned_users != (char**)DEFAULT_BANNED_USERS) { free_values(banned_users); } fprintf(LOGFILE, "Requested user %s is banned\n", user); return NULL; } } if (banned_users != NULL && banned_users != (char**)DEFAULT_BANNED_USERS) { free_values(banned_users); } {code} Note that in one case we check for banned_users != NULL and != DEFAULT_BANNED_USERS but in another case we're missing the NULL check. Lots of ways to fix it: - free_values could check for NULL - banned_users could always be non-NULL (i.e.: set it to DEFAULT_BANNED_USERS if get_values returns NULL) - add check for != NULL before calling free_values > Linux native container executor segfaults if default banned user detected > - > > Key: YARN-2847 > URL: https://issues.apache.org/jira/browse/YARN-2847 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe > > The check_user function in container-executor.c can cause a segmentation > fault if banned.users is not provided but the user is detected as one of the > default users. In that scenario it will call free_values on a NULL pointer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2847) Linux native container executor segfaults if default banned user detected
Jason Lowe created YARN-2847: Summary: Linux native container executor segfaults if default banned user detected Key: YARN-2847 URL: https://issues.apache.org/jira/browse/YARN-2847 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe The check_user function in container-executor.c can cause a segmentation fault if banned.users is not provided but the user is detected as one of the default users. In that scenario it will call free_values on a NULL pointer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206788#comment-14206788 ] Hudson commented on YARN-2735: -- FAILURE: Integrated in Hadoop-trunk-Commit #6510 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6510/]) YARN-2735. diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection. (Zhihai Xu via kasha) (kasha: rev 061bc293c8dd3e2605cf150568988bde18407af6) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java > diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are > initialized twice in DirectoryCollection > --- > > Key: YARN-2735 > URL: https://issues.apache.org/jira/browse/YARN-2735 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Trivial > Labels: newbie > Fix For: 2.7.0 > > Attachments: YARN-2735.000.patch > > > diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are > initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206768#comment-14206768 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680823/YARN-1964.patch against trunk revision 58e9bf4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5815//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5815//console This message is automatically generated. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206765#comment-14206765 ] Karthik Kambatla commented on YARN-2735: Trivial patch. +1. Checking this in. > diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are > initialized twice in DirectoryCollection > --- > > Key: YARN-2735 > URL: https://issues.apache.org/jira/browse/YARN-2735 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Trivial > Labels: newbie > Attachments: YARN-2735.000.patch > > > diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are > initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2735: --- Priority: Trivial (was: Minor) Labels: newbie (was: ) > diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are > initialized twice in DirectoryCollection > --- > > Key: YARN-2735 > URL: https://issues.apache.org/jira/browse/YARN-2735 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Trivial > Labels: newbie > Attachments: YARN-2735.000.patch > > > diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are > initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1680: -- Target Version/s: 2.7.0 (was: 2.6.0) > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Craig Welch > Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, > YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster > slow start is set to 1. > A job is running; its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 map tasks got killed), so the MRAppMaster blacklisted the unstable > NodeManager (NM-4). All reducer tasks are running in the cluster now. > The MRAppMaster does not preempt the reducers because, for the reducer preemption > calculation, the headroom includes the blacklisted node's memory. This makes > jobs hang forever (the ResourceManager does not assign any new containers on > blacklisted nodes but returns an availableResources value that counts the whole cluster's free > memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
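A toy, hedged calculation of the adjustment the YARN-1680 summary asks for, using only the numbers from the description above (4 NMs of 8 GB, 29 GB in use, NM-4 blacklisted). The variable names and the simple subtraction are illustrative assumptions, not the actual scheduler code.
{code:java}
// Hedged sketch, not YARN source: why raw headroom that counts a blacklisted
// node's free memory can leave the AM waiting forever.
public class HeadroomSketch {
  public static void main(String[] args) {
    long clusterTotalMb = 4 * 8 * 1024;            // 4 NodeManagers x 8 GB = 32 GB
    long usedMb = 29 * 1024;                       // reducers currently hold 29 GB
    long rawHeadroomMb = clusterTotalMb - usedMb;  // 3 GB: what the AM is told today

    // Assumption for illustration: the remaining free memory sits on blacklisted NM-4,
    // where the RM will never place new containers for this application.
    long freeOnBlacklistedMb = rawHeadroomMb;

    // Adjusted view proposed by the summary: exclude the blacklisted node's free memory.
    long adjustedHeadroomMb = rawHeadroomMb - freeOnBlacklistedMb;

    System.out.println("raw headroom (MB)      = " + rawHeadroomMb);       // 3072
    System.out.println("adjusted headroom (MB) = " + adjustedHeadroomMb);  // 0
  }
}
{code}
With the adjusted headroom at zero, the MRAppMaster would see that it must preempt reducers rather than wait for containers the blacklisted node will never provide.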
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch fixed imports. > Create Docker analog of the LinuxContainerExecutor in YARN > -- > > Key: YARN-1964 > URL: https://issues.apache.org/jira/browse/YARN-1964 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.2.0 >Reporter: Arun C Murthy >Assignee: Abin Shahab > Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, > yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, > yarn-1964-docker.patch, yarn-1964-docker.patch > > > Docker (https://www.docker.io/) is, increasingly, a very popular container > technology. > In context of YARN, the support for Docker will provide a very elegant > solution to allow applications to *package* their software into a Docker > container (entire Linux file system incl. custom versions of perl, python > etc.) and use it as a blueprint to launch all their YARN containers with > requisite software environment. This provides both consistency (all YARN > containers will have the same software environment) and isolation (no > interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2009) Priority support for preemption in ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206675#comment-14206675 ] Sunil G commented on YARN-2009: --- Yes. The idea is more or less the same. We had a prototype done on this, and along with ApplicationPriority, this can be brought in as a separate policy. However, points to discuss are: * Containers have to be selected from lower-priority applications based on the node locality constraints of the higher-priority application * Co-existence of this logic with node-labels * The AM container has to be spared, etc. Please share your thoughts. > Priority support for preemption in ProportionalCapacityPreemptionPolicy > --- > > Key: YARN-2009 > URL: https://issues.apache.org/jira/browse/YARN-2009 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Devaraj K >Assignee: Sunil G > > While preempting containers based on the queue ideal assignment, we may need > to consider preempting the low priority application containers first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206671#comment-14206671 ] Zhijie Shen commented on YARN-2838: --- [~Naganarasimha], sorry for not responding to you immediately; I've been busy finalizing 2.6. I did a quick scan through your issue document. Here's my clarification: 1. While the entry point of this sub-module is still called ApplicationHistoryServer, it is actually generalized to be TimelineServer right now (definitely we need to refactor the code at some point). The baseline service provided by the timeline server is to allow the cluster and its apps to store their history information, metrics and so on by complying with the defined timeline data model. Later on, users and admins can query this information to do their analysis. 2. Application history (or, as we prefer to call it, the generic history service) is now a built-in service in the timeline server to record the generic history information of YARN apps. It was on a separate store (on FS), but after YARN-2033, it has been moved to the timeline store too, as a payload. We replaced the old storage layer, but kept the existing interfaces (web UI, services, CLI) unchanged, to be the analog of what the RM provides for running apps. We still haven't integrated TimelineClient and AHSClient, the latter of which is the RPC interface for getting generic history information. APPLICATION_HISTORY_ENABLED is the only remaining old config to control whether we also want to pull the app info from the generic history service inside the timeline server. You may want to take a look at YARN-2033 to get more context about the change. Moreover, because of a number of limitations of the old history store, we're no longer going to support it. 3. The document is definitely stale. I'll file a separate documentation JIRA; however, it's too late for 2.6. Let's target 2.7 for an up-to-date document about the timeline service and its built-in generic history service. Does that sound good? > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2693) Priority Label Manager in RM to manage priority labels
[ https://issues.apache.org/jira/browse/YARN-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2693: -- Attachment: 0001-YARN-2693.patch Uploading a work-in-progress patch for the priority label manager. * Supports file system and memory stores * Handles 4 store events: add_labels_to_queue, remove_labels_from_queue, store_cluster_labels, remove_cluster_labels * Uses specific PB impls to store the label details to file * Design is similar to the node label manager, but with changes in event-specific handling * An RMPriorityLabelManager class has to sit on top of this as a wrapper, which we can bring up with the RM core changes Kindly review and provide major comments. I will keep updating this with tests as well. > Priority Label Manager in RM to manage priority labels > -- > > Key: YARN-2693 > URL: https://issues.apache.org/jira/browse/YARN-2693 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2693.patch > > > Focus of this JIRA is to have a centralized service to handle priority labels. > Support operations such as > * Add/Delete priority label to a specified queue > * Manage integer mapping associated with each priority label > * Support managing default priority label of a given queue > * ACL support in queue level for priority label > * Expose interface to RM to validate priority label > Storage for these labels will be done in FileSystem and in Memory, similar to > NodeLabel > * FileSystem Based : persistent across RM restart > * Memory Based: non-persistent across RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
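As a rough, hypothetical sketch of the event surface described in the comment above (the type and interface names are invented for illustration; the real code is in the attached 0001-YARN-2693.patch):
{code:java}
import java.io.IOException;
import java.util.Set;

// Hypothetical sketch only; names do not come from the attached patch.
public class PriorityLabelStoreSketch {

  // The four store events mentioned in the comment above.
  enum PriorityLabelStoreEventType {
    ADD_LABELS_TO_QUEUE,
    REMOVE_LABELS_FROM_QUEUE,
    STORE_CLUSTER_LABELS,
    REMOVE_CLUSTER_LABELS
  }

  // A store could be file-system backed (persistent across RM restart) or
  // in-memory (non-persistent), mirroring the NodeLabel manager design.
  interface PriorityLabelStore {
    void handle(PriorityLabelStoreEventType type, String queue, Set<String> labels)
        throws IOException;
  }
}
{code}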
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206634#comment-14206634 ] Naganarasimha G R commented on YARN-2838: - Hi [~zjshen], Can you please feedback on these issues ? As some issues requires discussion before rectifiction... > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206626#comment-14206626 ] Naganarasimha G R commented on YARN-2495: - {quote} The benefit are 1) You don't have to update test cases for that 2) The semanic are clear, create a register request with label or not. {quote} True, and I will be able to revert some unwanted testcase modifications. Have corrected it. bq. I suggest to have different option for script-based/config-based, even if we can combine them together. Ok, will have different config params for script-based and config-based. bq. IIUC, NM_NODE_LABELS_FROM_CONFIG is a list of labels, even if we want to separate the two properties, we cannot remove NM_NODE_LABELS_FROM_CONFIG, correct? I had searched for it wrongly, and as you mentioned, the name was not good enough for me to recollect it either. Corrected it. bq. I think it's better to leverage existing utility class instead of implement your own. For example, you have set values but not check them, which is incorrect, but using utility class can avoid such problem. Even if you added new fields, tests will cover them without any changes: The problem is that ??TestPBImplRecords?? is in the ??hadoop-yarn-common?? project while ??NodeHeartbeatRequestPBImpl?? and the others are in the ??hadoop-yarn-server-common?? project. Since we can't add a dependency on ??hadoop-yarn-server-common?? in ??hadoop-yarn-common??, shall I create a new class extending TestPBImplRecords in the ??hadoop-yarn-server-common?? project? > Allow admin specify labels from each NM (Distributed configuration) > --- > > Key: YARN-2495 > URL: https://issues.apache.org/jira/browse/YARN-2495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, > YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, > YARN-2495_20141022.1.patch > > > Target of this JIRA is to allow admin specify labels in each NM, this covers > - User can set labels in each NM (by setting yarn-site.xml or using script > suggested by [~aw]) > - NM will send labels to RM via ResourceTracker API > - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206616#comment-14206616 ] Jason Lowe commented on YARN-2846: -- Thanks for the report and patch, Junping! Nit: If reacquireContainer is going to allow InterruptedException to be thrown then I'd rather remove the try/catch around the Thread.sleep call and just let the exception be thrown directly from there. We can let the code catching the exception deal with any logging/etc as appropriate for that caller. In this case we can move the log message to RecoveredContainerLaunch when it fields the InterruptedException and chooses not to propagate it upwards. I'm curious why we're not seeing a similar issue with regular ContainerLaunch threads, as they should be interrupted as well. Are those threads silently swallowing the interrupt? Because otherwise I would expect us to log a container completion just like we were doing with a recovered container. > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-2846-demo.patch > > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting. 
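To make the nit above concrete, here is a minimal, hedged sketch of the suggested flow; the class and helper names are simplified stand-ins, not the actual ContainerExecutor or RecoveredContainerLaunch source:
{code:java}
import java.io.IOException;

// Hedged sketch only; names are stand-ins for the real NM classes.
public class ReacquireSketch {

  // Analogous to ContainerExecutor.reacquireContainer(): no try/catch around
  // Thread.sleep, so an interrupt from NM shutdown surfaces directly instead of
  // being wrapped into an IOException.
  static int waitForExit(String pid) throws IOException, InterruptedException {
    while (processAlive(pid)) {
      Thread.sleep(1000);
    }
    return readExitCodeFile(pid);
  }

  // Analogous to RecoveredContainerLaunch.call(): only a real failure (IOException)
  // records an exit code; an interrupt is logged and the container state is left
  // alone so the next recovery attempt can pick it up.
  static void recoverContainer(String pid) {
    try {
      persistExitCode(pid, waitForExit(pid));
    } catch (InterruptedException e) {
      System.err.println("Interrupted while reacquiring " + pid + "; not recording an exit code");
      Thread.currentThread().interrupt();
    } catch (IOException e) {
      persistExitCode(pid, 154 /* LOST, per the description above */);
    }
  }

  // Hypothetical helpers, stubbed to keep the sketch self-contained.
  static boolean processAlive(String pid) { return false; }
  static int readExitCodeFile(String pid) { return 0; }
  static void persistExitCode(String pid, int code) { }
}
{code}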
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) > ... 6 more > {code} > In reacquireCo
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206601#comment-14206601 ] Hadoop QA commented on YARN-2846: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680801/YARN-2846-demo.patch against trunk revision 58e9bf4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5814//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5814//console This message is automatically generated. > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-2846-demo.patch > > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. 
The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting. > 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) > at
[jira] [Updated] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2846: - Attachment: YARN-2846-demo.patch Upload the first demo patch to fix the problem. > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-2846-demo.patch > > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting. 
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) > ... 6 more > {code} > In reacquireContainer() of ContainerExecutor.java, the while loop of checking > container process (AM container) will be interrupted by NM stop. The > IOException get thrown and failed to generate an ExitCodeFile for the running > container. Later, the IOException will be caught in upper call > (RecoveredContainerLaunch.call()) and the ExitCode (by default to be LOST > without any setting) get persistent in NMStateStore. > After NM restart again, this container is recovered as COMPLETE state but > exit code is LOST (154) - cause this (AM) container get killed later. > We should get rid of recording the exit code of running containers if > detecting process is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du reassigned YARN-2846: Assignee: Junping Du > Incorrect persist exit code for running containers in reacquireContainer() > that interrupted by NodeManager restart. > --- > > Key: YARN-2846 > URL: https://issues.apache.org/jira/browse/YARN-2846 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > > The NM restart work preserving feature could make running AM container get > LOST and killed during stop NM daemon. The exception is like below: > {code} > 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for > container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB > physical memory used; 931.3 MB of 1.0 GB virtual memory used > 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager > (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM > 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 > 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1415666714233_0001] > 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping > server on 45454 > 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping > IPC Server listener on 45454 > 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping > IPC Server Responder > 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log > aggregation for application_1415666714233_0001 > 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for > application application_1415666714233_0001 > 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl > (ContainersMonitorImpl.java:run(476)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl > is interrupted. Exiting. 
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch > (RecoveredContainerLaunch.java:call(87)) - Unable to recover container > container_1415666714233_0001_01_01 > java.io.IOException: Interrupted while waiting for process 20001 to exit > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) > ... 6 more > {code} > In reacquireContainer() of ContainerExecutor.java, the while loop of checking > container process (AM container) will be interrupted by NM stop. The > IOException get thrown and failed to generate an ExitCodeFile for the running > container. Later, the IOException will be caught in upper call > (RecoveredContainerLaunch.call()) and the ExitCode (by default to be LOST > without any setting) get persistent in NMStateStore. > After NM restart again, this container is recovered as COMPLETE state but > exit code is LOST (154) - cause this (AM) container get killed later. > We should get rid of recording the exit code of running containers if > detecting process is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
Junping Du created YARN-2846: Summary: Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Priority: Blocker The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) ... 
6 more {code} In reacquireContainer() of ContainerExecutor.java, the while loop of checking container process (AM container) will be interrupted by NM stop. The IOException get thrown and failed to generate an ExitCodeFile for the running container. Later, the IOException will be caught in upper call (RecoveredContainerLaunch.call()) and the ExitCode (by default to be LOST without any setting) get persistent in NMStateStore. After NM restart again, this container is recovered as COMPLETE state but exit code is LOST (154) - cause this (AM) container get killed later. We should get rid of recording the exit code of running containers if detecting process is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206536#comment-14206536 ] Hudson commented on YARN-2841: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1954 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1954/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt > RMProxy should retry EOFException > -- > > Key: YARN-2841 > URL: https://issues.apache.org/jira/browse/YARN-2841 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jian He >Assignee: Jian He >Priority: Critical > Fix For: 2.7.0 > > Attachments: YARN-2841.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2780) Log aggregated resource allocation in rm-appsummary.log
[ https://issues.apache.org/jira/browse/YARN-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206532#comment-14206532 ] Jason Lowe commented on YARN-2780: -- +1 lgtm. Will commit this later today if there are no objections. > Log aggregated resource allocation in rm-appsummary.log > --- > > Key: YARN-2780 > URL: https://issues.apache.org/jira/browse/YARN-2780 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.1 >Reporter: Koji Noguchi >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-2780.v1.201411031728.txt, > YARN-2780.v2.201411061601.txt > > > YARN-415 added useful information about resource usage by applications. > Asking to log that info inside rm-appsummary.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206459#comment-14206459 ] Hudson commented on YARN-2841: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1930 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt > RMProxy should retry EOFException > -- > > Key: YARN-2841 > URL: https://issues.apache.org/jira/browse/YARN-2841 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jian He >Assignee: Jian He >Priority: Critical > Fix For: 2.7.0 > > Attachments: YARN-2841.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206347#comment-14206347 ] Hudson commented on YARN-2841: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #740 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/740/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt > RMProxy should retry EOFException > -- > > Key: YARN-2841 > URL: https://issues.apache.org/jira/browse/YARN-2841 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jian He >Assignee: Jian He >Priority: Critical > Fix For: 2.7.0 > > Attachments: YARN-2841.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206336#comment-14206336 ] Hudson commented on YARN-2841: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #2 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/2/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt > RMProxy should retry EOFException > -- > > Key: YARN-2841 > URL: https://issues.apache.org/jira/browse/YARN-2841 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Jian He >Assignee: Jian He >Priority: Critical > Fix For: 2.7.0 > > Attachments: YARN-2841.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2845) MicroZookeeperService used in Yarn Registry tests doesn't shut down cleanly on Windows
Steve Loughran created YARN-2845: Summary: MicroZookeeperService used in Yarn Registry tests doesn't shut down cleanly on Windows Key: YARN-2845 URL: https://issues.apache.org/jira/browse/YARN-2845 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Environment: Windows Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Fix For: 2.7.0 It's not surfacing in YARN's own tests, but we are seeing this in Slider's Windows testing ... two test methods, each setting up its own ZK micro cluster, see the previous test's data. The class needs the same cleanup logic as HBASE-6820, as perhaps does its origin, Twill's mini ZK cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
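Since the report is about test-to-test data leakage, here is a minimal, hypothetical sketch of the kind of per-test cleanup it asks for: each test method gets a fresh, uniquely named ZK data directory and removes it after stopping its mini cluster. The class and method names are illustrative only and are not the MicroZookeeperService API; the actual fix would live in the Yarn Registry / Slider test utilities.
{code:java}
import java.io.File;
import java.io.IOException;
import java.util.UUID;

import org.apache.commons.io.FileUtils;
import org.junit.After;
import org.junit.Before;

// Hypothetical base class: isolates every test method behind its own
// ZooKeeper data directory so a second micro cluster can never observe
// the previous test's snapshots and transaction logs.
public abstract class AbstractZKMicroClusterTest {
  private File zkDataDir;

  @Before
  public void createFreshZkDataDir() {
    // Unique directory per test method avoids reuse of stale ZK data.
    zkDataDir = new File(System.getProperty("java.io.tmpdir"),
        "zk-micro-" + UUID.randomUUID());
    zkDataDir.mkdirs();
  }

  @After
  public void deleteZkDataDir() throws IOException {
    // On Windows the directory can only be removed once the ZK server
    // has released its file handles, so stop the cluster first.
    stopZkCluster();
    FileUtils.deleteDirectory(zkDataDir);
  }

  protected File getZkDataDir() {
    return zkDataDir;
  }

  /** Subclasses shut down their mini ZK cluster here (illustrative hook). */
  protected abstract void stopZkCluster();
}
{code}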
[jira] [Commented] (YARN-2844) WebAppProxyServlet cannot handle urls which contain encoded characters
[ https://issues.apache.org/jira/browse/YARN-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206157#comment-14206157 ] Hadoop QA commented on YARN-2844: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680759/YARN-2844.patch against trunk revision 58e9bf4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5813//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5813//console This message is automatically generated. > WebAppProxyServlet cannot handle urls which contain encoded characters > -- > > Key: YARN-2844 > URL: https://issues.apache.org/jira/browse/YARN-2844 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Shixiong Zhu >Priority: Minor > Attachments: YARN-2844.patch > > > WebAppProxyServlet has a bug in its URL encoding/decoding. This was found when > running Spark on Yarn. > When a user accesses > "http://example.com:8088/proxy/application_1415344371838_0006/executors/threadDump/?executorId=%3Cdriver%3E", > WebAppProxyServlet will request > "http://example.com:36429/executors/threadDump/?executorId=%25253Cdriver%25253E". > But the Spark Web Server expects > "http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E". > Here are the problems I found in WebAppProxyServlet. > 1. java.net.URI.toString returns an encoded URL string. So the following code > in WebAppProxyServlet should use `true` instead of `false`. > {code:java} > org.apache.commons.httpclient.URI uri = > new org.apache.commons.httpclient.URI(link.toString(), false); > {code} > 2. > [HttpServletRequest.getPathInfo()|https://docs.oracle.com/javaee/6/api/javax/servlet/http/HttpServletRequest.html#getPathInfo()] > will return a decoded string. Therefore, if the link is > http://example.com:8088/proxy/application_1415344371838_0006/John%2FHunter, > pathInfo will be "/application_1415344371838_0006/John/Hunter". Then the URI > created in WebAppProxyServlet will be something like ".../John/Hunter", but > the correct link should be ".../John%2FHunter". We can use > [HttpServletRequest.getRequestURI()|https://docs.oracle.com/javaee/6/api/javax/servlet/http/HttpServletRequest.html#getRequestURI()] > to get the raw path. > {code:java} > final String pathInfo = req.getPathInfo(); > {code} > 3. The wrong URI constructor is used. 
[URI(String scheme, String authority, String > path, String query, String > fragment)|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)] > will encode the path and query which have already been encoded. Should use > [URI(String > str)|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#URI(java.lang.String)] > directly since the url has already been encoded. > {code:java} > URI toFetch = new URI(trackingUri.getScheme(), > trackingUri.getAuthority(), > StringHelper.ujoin(trackingUri.getPath(), rest), > req.getQueryString(), > null); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
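To make the double-encoding problem above concrete, here is a small, self-contained Java sketch (an illustration only, not the WebAppProxyServlet code or the attached patch) contrasting the multi-argument java.net.URI constructor, which re-quotes an already percent-encoded query, with the single-argument constructor, which takes the string as-is. The host, port and query are the hypothetical values from the description; the sketch shows one extra layer of encoding, whereas the servlet ends up adding it twice, hence the %25253C seen above.
{code:java}
import java.net.URI;
import java.net.URISyntaxException;

// Demonstrates why building a URI from pre-encoded components double-encodes.
public class ProxyEncodingDemo {
  public static void main(String[] args) throws URISyntaxException {
    // "<driver>" already percent-encoded once by the client.
    String encodedQuery = "executorId=%3Cdriver%3E";

    // The multi-argument constructor treats its arguments as decoded text
    // and always quotes '%', so %3C becomes %253C.
    URI reEncoded = new URI("http", "example.com:36429",
        "/executors/threadDump/", encodedQuery, null);
    System.out.println(reEncoded);
    // -> http://example.com:36429/executors/threadDump/?executorId=%253Cdriver%253E

    // The single-argument constructor parses the already-encoded string
    // as-is, preserving the original single round of encoding.
    URI asIs = new URI(
        "http://example.com:36429/executors/threadDump/?" + encodedQuery);
    System.out.println(asIs);
    // -> http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E
  }
}
{code}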