[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697629#comment-13697629 ] Devaraj K commented on YARN-353: The patch overall looks good; here are my observations on the patch.
1.
{code:xml}
+  <property>
+    <description>ACLs to be used for ZooKeeper znodes.
+    This may be supplied when using
+    org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
+    as the value for yarn.resourcemanager.store.class</description>
+    <name>yarn.resourcemanager.zk.rm-state-store.timeout.ms</name>
+    <!--<value>world:anyone:rwcda</value>-->
+  </property>
{code}
Here the configuration name should be yarn.resourcemanager.zk.rm-state-store.acl.
2.
{code}
+  // protected to mock for testing
+  protected synchronized ZooKeeper getNewZooKeeper() throws Exception {
{code}
Can we also annotate this method with @VisibleForTesting?
3.
{code}
+  /** HostPort of ZK server for ZKRMStateStore */
+    <description>HostPort of the ZooKeeper server when using
{code}
In these two places, can we use Host:Port instead of HostPort for the comment/description?
4.
{code}
+    zkHostPort = conf.get(YarnConfiguration.ZK_RM_STATE_STORE_ADDRESS);
{code}
Can we use a default value for this config, as is present for the other props:
{code:xml}
+    <!--<value>127.0.0.1:2181</value>-->
{code}
5.
{code}
+  public static final String DEFAULT_ZK_RM_STATE_STORE_PARENT_PATH = "";
{code}
Can we use a default value for this config instead of having it empty:
{code:xml}
+    <!--<value>/rmstore</value>-->
{code}
Add Zookeeper-based store implementation for RMStateStore - Key: YARN-353 URL: https://issues.apache.org/jira/browse/YARN-353 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Hitesh Shah Assignee: Bikas Saha Attachments: YARN-353.1.patch, YARN-353.2.patch Add store that writes RM state data to ZK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
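For clarity, here is a sketch of how the yarn-default.xml entry from item 1 might look once the suggested name change is applied; the layout follows the quoted snippet and should be treated as illustrative, not as the committed patch:
{code:xml}
<property>
  <description>ACLs to be used for ZooKeeper znodes.
  This may be supplied when using
  org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
  as the value for yarn.resourcemanager.store.class</description>
  <!-- name corrected from ...timeout.ms so it matches the ACL description -->
  <name>yarn.resourcemanager.zk.rm-state-store.acl</name>
  <!--<value>world:anyone:rwcda</value>-->
</property>
{code}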
[jira] [Resolved] (YARN-379) yarn [node,application] command print logger info messages
[ https://issues.apache.org/jira/browse/YARN-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash resolved YARN-379. --- Resolution: Not A Problem Fix Version/s: 2.2.0 This issue seems to have been fixed by YARN-530. I do not see the annoying message(s) any more. yarn [node,application] command print logger info messages -- Key: YARN-379 URL: https://issues.apache.org/jira/browse/YARN-379 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Assignee: Ravi Prakash Labels: usability Fix For: 2.2.0 Attachments: YARN-379.patch, YARN-379.patch Running the yarn node and yarn application commands results in annoying log info messages being printed:
$ yarn node -list
13/02/06 02:36:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/02/06 02:36:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
Total Nodes:1
        Node-Id    Node-State    Node-Http-Address    Health-Status(isNodeHealthy)    Running-Containers
       foo:8041       RUNNING             foo:8042                            true                     0
13/02/06 02:36:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is stopped.
$ yarn application
13/02/06 02:38:47 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/02/06 02:38:47 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
Invalid Command Usage :
usage: application
 -kill <arg>     Kills the application.
 -list           Lists all the Applications from RM.
 -status <arg>   Prints the status of the application.
13/02/06 02:38:47 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is stopped.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (YARN-395) RM should have a way to disable scheduling to a set of nodes
[ https://issues.apache.org/jira/browse/YARN-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha resolved YARN-395. - Resolution: Fixed YARN-750 covers most cases. RM should have a way to disable scheduling to a set of nodes Key: YARN-395 URL: https://issues.apache.org/jira/browse/YARN-395 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Arun C Murthy There should be a way to say schedule to A, B and C but never to D. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-395) RM should have a way to disable scheduling to a set of nodes
[ https://issues.apache.org/jira/browse/YARN-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698018#comment-13698018 ] Bikas Saha commented on YARN-395: - Not exactly. But it should be good enough for now. RM should have a way to disable scheduling to a set of nodes Key: YARN-395 URL: https://issues.apache.org/jira/browse/YARN-395 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Arun C Murthy There should be a way to say schedule to A, B and C but never to D. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM
[ https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-763: --- Attachment: YARN-763.2.patch
1. Remove the boolean stop.
2. Put the heartbeat thread interrupt logic inside the switch block.
AMRMClientAsync should stop heartbeating after receiving shutdown from RM - Key: YARN-763 URL: https://issues.apache.org/jira/browse/YARN-763 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-763.1.patch, YARN-763.2.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
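To make the second point concrete, a minimal sketch of the shape of that change, with hypothetical names (the real AMRMClientAsync heartbeat loop is more involved than this):
{code:java}
// Minimal sketch, hypothetical names: on AM_SHUTDOWN the heartbeat thread is
// interrupted from inside the switch itself, instead of setting a separate
// boolean stop flag that the loop would have to keep polling.
public class HeartbeatShutdownSketch {
  enum AMCommand { AM_RESYNC, AM_SHUTDOWN }

  private final Thread heartbeatThread;

  HeartbeatShutdownSketch(Thread heartbeatThread) {
    this.heartbeatThread = heartbeatThread;
  }

  void onCommand(AMCommand command) {
    switch (command) {
      case AM_SHUTDOWN:
        heartbeatThread.interrupt(); // stop heartbeating immediately
        break;
      case AM_RESYNC:
        // re-registration path, out of scope for this sketch
        break;
    }
  }
}
{code}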
[jira] [Updated] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-353: - Attachment: YARN-353.3.patch Devaraj, thanks for your review. New patch attached; fixed the findbugs warning and addressed the comments. Add Zookeeper-based store implementation for RMStateStore - Key: YARN-353 URL: https://issues.apache.org/jira/browse/YARN-353 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Hitesh Shah Assignee: Bikas Saha Attachments: YARN-353.1.patch, YARN-353.2.patch, YARN-353.3.patch Add store that writes RM state data to ZK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM
[ https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698071#comment-13698071 ] Hadoop QA commented on YARN-763: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590473/YARN-763.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.api.impl.TestNMClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1415//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1415//console This message is automatically generated. AMRMClientAsync should stop heartbeating after receiving shutdown from RM - Key: YARN-763 URL: https://issues.apache.org/jira/browse/YARN-763 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-763.1.patch, YARN-763.2.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698099#comment-13698099 ] Hadoop QA commented on YARN-353: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590474/YARN-353.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1416//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1416//console This message is automatically generated. Add Zookeeper-based store implementation for RMStateStore - Key: YARN-353 URL: https://issues.apache.org/jira/browse/YARN-353 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Hitesh Shah Assignee: Bikas Saha Attachments: YARN-353.1.patch, YARN-353.2.patch, YARN-353.3.patch Add store that writes RM state data to ZK -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth reassigned YARN-894: -- Assignee: Chris Nauroth (was: Chuan Liu) NodeHealthScriptRunner timeout checking is inaccurate on Windows Key: YARN-894 URL: https://issues.apache.org/jira/browse/YARN-894 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.1.0-beta Reporter: Chuan Liu Assignee: Chris Nauroth Priority: Minor Attachments: ReadProcessStdout.java, wait.cmd, wait.sh, YARN-894-trunk.patch In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status based on the Shell execution results. Some statuses are based on the exception thrown during the Shell script execution. Currently, we catch a non-ExitCodeException from ShellCommandExecutor, and if Shell has the timeout status set at the same time, we also set the HealthChecker status to timeout. We have the following execution sequence in Shell:
1) In the main thread, schedule a delayed timer task that will kill the original process upon timeout.
2) In the main thread, open a buffered reader over the process's standard output stream.
3) When the timeout happens, the timer task calls {{Process#destroy()}} to kill the main process.
On Linux, when the timeout happens and the process is killed, the buffered reader throws an IOException with the message "Stream closed" in the main thread. On Windows, we don't get the IOException; only -1 is returned from the reader, which indicates the stream is finished. As a result, the timeout status is not set on Windows, and {{TestNodeHealthService}} fails on Windows because of this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
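A standalone sketch of the platform difference described above; the "sleep 60" command is a stand-in for a health script that overruns its timeout, and the per-platform behavior in the comments is as reported in this issue:
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// After Process#destroy(), reading the process's stdout throws IOException
// ("Stream closed") on Linux, while on Windows the read simply returns -1
// (EOF), so timeout detection that relies on catching the exception never
// fires there.
public class ReadAfterDestroy {
  public static void main(String[] args) throws Exception {
    // Stand-in for a node health script that exceeds its timeout.
    Process p = new ProcessBuilder("sleep", "60").start();
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    p.destroy(); // what the delayed timer task does when the timeout fires
    try {
      int c = reader.read();
      System.out.println("read returned " + c); // -1 on Windows: looks like a normal EOF
    } catch (IOException e) {
      System.out.println("IOException: " + e.getMessage()); // "Stream closed" on Linux
    }
  }
}
{code}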
[jira] [Commented] (YARN-710) Add to ser/deser methods to RecordFactory
[ https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698232#comment-13698232 ] Siddharth Seth commented on YARN-710: - In the unit test, the setters on the ApplicationId aren't meant to be used (they will end up throwing exceptions - this is replaced by newInstance in ApplicationId). I don't think getProto() needs to be changed at all in RecordFactoryPBImpl - instead a new getBuilder method should be sufficient. Somewhere along the flow, it looks like the default proto ends up being created - possibly linked to the getProto changes. Add to ser/deser methods to RecordFactory - Key: YARN-710 URL: https://issues.apache.org/jira/browse/YARN-710 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Attachments: YARN-710.patch, YARN-710.patch, YARN-710-wip.patch In order to do things like AM failover and checkpointing I need to serialize app IDs, app attempt IDs, containers and/or IDs, resource requests, etc. Because we are wrapping/hiding the PB implementation from the APIs, we are hiding the built-in PB ser/deser capabilities. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
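For context, a hedged sketch of the kind of ser/deser being discussed, going through the record's underlying protobuf; the PB impl class and constructor follow YARN's conventions but should be treated as assumptions rather than a confirmed API:
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.impl.pb.ApplicationIdPBImpl;
import org.apache.hadoop.yarn.proto.YarnProtos.ApplicationIdProto;

// Serialize/deserialize an ApplicationId via its protobuf. The JIRA is about
// exposing this through RecordFactory (e.g. a getBuilder method) instead of
// requiring callers to cast to the PB impl as done here.
public class AppIdSerDe {
  public static byte[] serialize(ApplicationId appId) {
    return ((ApplicationIdPBImpl) appId).getProto().toByteArray();
  }

  public static ApplicationId deserialize(byte[] bytes) throws Exception {
    return new ApplicationIdPBImpl(ApplicationIdProto.parseFrom(bytes));
  }
}
{code}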
[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-814: - Attachment: YARN-814.4.patch New patch, accounting for both stdout and stderr. Difficult to diagnose a failed container launch when error due to invalid environment variable -- Key: YARN-814 URL: https://issues.apache.org/jira/browse/YARN-814 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.4.patch, YARN-814.patch The container's launch script sets up environment variables, symlinks etc. If there is any failure when setting up the basic context ( before the actual user's process is launched ), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var where the value contains characters that throw syntax errors in bash. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
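A rough, hypothetical illustration of the idea at the process level — redirect both streams of the launch script to files the NM can read back as diagnostics. File names and structure are invented for the sketch; this is not the patch itself:
{code:java}
import java.io.File;

// Hypothetical sketch: run the container launch script with stdout and stderr
// captured to files, so a failure in the set-up portion of the script (e.g. a
// bash syntax error from a bad env var value) leaves something to diagnose.
public class LaunchWithDiagnostics {
  public static int launch(File script, File stdout, File stderr) throws Exception {
    Process p = new ProcessBuilder("bash", script.getAbsolutePath())
        .redirectOutput(stdout) // set-up output lands here
        .redirectError(stderr)  // bash syntax errors land here
        .start();
    return p.waitFor(); // non-zero exit => surface stdout/stderr as diagnostics
  }
}
{code}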
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698250#comment-13698250 ] Jian He commented on YARN-814: -- Ran on a single node and saw the log messages. Difficult to diagnose a failed container launch when error due to invalid environment variable -- Key: YARN-814 URL: https://issues.apache.org/jira/browse/YARN-814 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.4.patch, YARN-814.patch The container's launch script sets up environment variables, symlinks etc. If there is any failure when setting up the basic context ( before the actual user's process is launched ), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var where the value contains characters that throw syntax errors in bash. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-871) Failed to run MR example against latest trunk
[ https://issues.apache.org/jira/browse/YARN-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698267#comment-13698267 ] Junping Du commented on YARN-871: - Hi [~zjshen], given YARN-874 is committed, shall we resolve it? Failed to run MR example against latest trunk - Key: YARN-871 URL: https://issues.apache.org/jira/browse/YARN-871 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Attachments: yarn-zshen-resourcemanager-ZShens-MacBook-Pro.local.log Built the latest trunk, deployed a single node cluster and ran examples, such as
{code}
hadoop jar hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar teragen 10 out1
{code}
The job failed with the following console message:
{code}
13/06/21 12:51:25 INFO mapreduce.Job: Running job: job_1371844267731_0001
13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 running in uber mode : false
13/06/21 12:51:31 INFO mapreduce.Job: map 0% reduce 0%
13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 failed with state FAILED due to: Application application_1371844267731_0001 failed 2 times due to AM Container for appattempt_1371844267731_0001_02 exited with exitCode: 127 due to: .Failing this attempt.. Failing the application.
13/06/21 12:51:31 INFO mapreduce.Job: Counters: 0
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698280#comment-13698280 ] Hadoop QA commented on YARN-814: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590515/YARN-814.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1417//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1417//console This message is automatically generated. Difficult to diagnose a failed container launch when error due to invalid environment variable -- Key: YARN-814 URL: https://issues.apache.org/jira/browse/YARN-814 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.4.patch, YARN-814.patch The container's launch script sets up environment variables, symlinks etc. If there is any failure when setting up the basic context ( before the actual user's process is launched ), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var where the value contains characters that throw syntax errors in bash. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (YARN-871) Failed to run MR example against latest trunk
[ https://issues.apache.org/jira/browse/YARN-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-871. -- Resolution: Cannot Reproduce Thanks, [~djp]! Closing it as cannot reproduce. Failed to run MR example against latest trunk - Key: YARN-871 URL: https://issues.apache.org/jira/browse/YARN-871 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Attachments: yarn-zshen-resourcemanager-ZShens-MacBook-Pro.local.log Built the latest trunk, deployed a single node cluster and ran examples, such as
{code}
hadoop jar hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar teragen 10 out1
{code}
The job failed with the following console message:
{code}
13/06/21 12:51:25 INFO mapreduce.Job: Running job: job_1371844267731_0001
13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 running in uber mode : false
13/06/21 12:51:31 INFO mapreduce.Job: map 0% reduce 0%
13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 failed with state FAILED due to: Application application_1371844267731_0001 failed 2 times due to AM Container for appattempt_1371844267731_0001_02 exited with exitCode: 127 due to: .Failing this attempt.. Failing the application.
13/06/21 12:51:31 INFO mapreduce.Job: Counters: 0
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-845) RM crash with NPE on NODE_UPDATE
[ https://issues.apache.org/jira/browse/YARN-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698326#comment-13698326 ] Mayank Bansal commented on YARN-845: I had an offline discussion with [~arpitgupta] and [~bikassaha]. We are not able to reproduce the issue; however, we can synchronize the application object in assignReservedContainer to make it consistent with the other calls. I am adding more logs to find the issue if we can get this crash again. I am also throwing YARN runtime exceptions if we get this null again. Thanks, Mayank RM crash with NPE on NODE_UPDATE Key: YARN-845 URL: https://issues.apache.org/jira/browse/YARN-845 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0, 2.1.0-beta Reporter: Arpit Gupta Assignee: Mayank Bansal Attachments: rm.log, YARN-845-trunk-draft.patch the following stack trace is generated in rm
{code}
n, service: 68.142.246.147:45454 }, ] resource=<memory:1536, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:44544, vCores:29>usedCapacity=0.90625, absoluteUsedCapacity=0.90625, numApps=1, numContainers=29 usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,655 INFO capacity.ParentQueue (ParentQueue.java:completedContainer(696)) - completedContainer queue=root usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(832)) - Application appattempt_1371448527090_0844_01 released container container_1371448527090_0844_01_05 on node: host: hostXX:45454 #containers=4 available=2048 used=6144 with event: FINISHED
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:nodeUpdate(661)) - Trying to fulfill reservation for application application_1371448527090_0844 on node: hostXX:45454
2013-06-17 12:43:53,656 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:unreserve(435)) - Application application_1371448527090_0844 unreserved on node host: hostXX:45454 #containers=4 available=2048 used=6144, currently has 4 at priority 20; currentReservation <memory:6144, vCores:4>
2013-06-17 12:43:53,656 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updateResourceRequests(168)) - checking for deactivate...
2013-06-17 12:43:53,657 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(422)) - Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:432)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.unreserve(LeafQueue.java:1416)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1346)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1221)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1180)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:939)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:803)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:665)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:727)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:83)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:413)
	at java.lang.Thread.run(Thread.java:662)
2013-06-17 12:43:53,659 INFO resourcemanager.ResourceManager (ResourceManager.java:run(426)) - Exiting, bbye..
2013-06-17 12:43:53,665 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@hostXX:8088
2013-06-17 12:43:53,765 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(513)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
2013-06-17 12:43:53,766 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager
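A tiny sketch, with stand-in types, of the synchronization Mayank describes; the real change lives in the CapacityScheduler/LeafQueue code paths:
{code:java}
// Stand-in types only: take the application's lock around reserved-container
// assignment so a concurrent completedContainer cannot null out the
// reservation mid-flight, matching the locking on the other scheduling calls.
public class AssignReservedSketch {
  static final class SchedulerApp { /* stand-in for FiCaSchedulerApp */ }

  void assignReservedContainer(SchedulerApp application) {
    synchronized (application) {
      // unreserve + re-assign happen atomically under the app lock
    }
  }
}
{code}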
[jira] [Updated] (YARN-845) RM crash with NPE on NODE_UPDATE
[ https://issues.apache.org/jira/browse/YARN-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-845: --- Attachment: YARN-845-trunk-1.patch Attaching updated patch and rebasing it. Thanks, Mayank RM crash with NPE on NODE_UPDATE Key: YARN-845 URL: https://issues.apache.org/jira/browse/YARN-845 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0, 2.1.0-beta Reporter: Arpit Gupta Assignee: Mayank Bansal Attachments: rm.log, YARN-845-trunk-1.patch, YARN-845-trunk-draft.patch the following stack trace is generated in rm
{code}
n, service: 68.142.246.147:45454 }, ] resource=<memory:1536, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:44544, vCores:29>usedCapacity=0.90625, absoluteUsedCapacity=0.90625, numApps=1, numContainers=29 usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,655 INFO capacity.ParentQueue (ParentQueue.java:completedContainer(696)) - completedContainer queue=root usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(832)) - Application appattempt_1371448527090_0844_01 released container container_1371448527090_0844_01_05 on node: host: hostXX:45454 #containers=4 available=2048 used=6144 with event: FINISHED
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:nodeUpdate(661)) - Trying to fulfill reservation for application application_1371448527090_0844 on node: hostXX:45454
2013-06-17 12:43:53,656 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:unreserve(435)) - Application application_1371448527090_0844 unreserved on node host: hostXX:45454 #containers=4 available=2048 used=6144, currently has 4 at priority 20; currentReservation <memory:6144, vCores:4>
2013-06-17 12:43:53,656 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updateResourceRequests(168)) - checking for deactivate...
2013-06-17 12:43:53,657 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(422)) - Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:432)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.unreserve(LeafQueue.java:1416)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1346)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1221)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1180)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:939)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:803)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:665)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:727)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:83)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:413)
	at java.lang.Thread.run(Thread.java:662)
2013-06-17 12:43:53,659 INFO resourcemanager.ResourceManager (ResourceManager.java:run(426)) - Exiting, bbye..
2013-06-17 12:43:53,665 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@hostXX:8088
2013-06-17 12:43:53,765 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(513)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
2013-06-17 12:43:53,766 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
2013-06-17 12:43:53,767 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
2013-06-17 12:43:53,767 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
2013-06-17
[jira] [Commented] (YARN-245) Node Manager gives InvalidStateTransitonException for FINISH_APPLICATION at FINISHED
[ https://issues.apache.org/jira/browse/YARN-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698337#comment-13698337 ] Mayank Bansal commented on YARN-245: I just tried this patch and it does not need rebasing. Thanks, Mayank Node Manager gives InvalidStateTransitonException for FINISH_APPLICATION at FINISHED Key: YARN-245 URL: https://issues.apache.org/jira/browse/YARN-245 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.2-alpha, 2.0.1-alpha Reporter: Devaraj K Assignee: Mayank Bansal Attachments: YARN-245-trunk-1.patch
{code:xml}
2012-11-25 12:56:11,795 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at FINISHED
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
	at java.lang.Thread.run(Thread.java:662)
2012-11-25 12:56:11,796 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1353818859056_0004 transitioned from FINISHED to null
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-295) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED for RMAppAttemptImpl
[ https://issues.apache.org/jira/browse/YARN-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698346#comment-13698346 ] Mayank Bansal commented on YARN-295: Latest patch does not need any rebasing. Thanks, Mayank Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED for RMAppAttemptImpl --- Key: YARN-295 URL: https://issues.apache.org/jira/browse/YARN-295 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.0.2-alpha, 2.0.1-alpha Reporter: Devaraj K Assignee: Mayank Bansal Attachments: YARN-295-trunk-1.patch, YARN-295-trunk-2.patch
{code:xml}
2012-12-28 14:03:56,956 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
	at java.lang.Thread.run(Thread.java:662)
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-845) RM crash with NPE on NODE_UPDATE
[ https://issues.apache.org/jira/browse/YARN-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698357#comment-13698357 ] Hadoop QA commented on YARN-845: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590532/YARN-845-trunk-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1418//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1418//console This message is automatically generated. RM crash with NPE on NODE_UPDATE Key: YARN-845 URL: https://issues.apache.org/jira/browse/YARN-845 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0, 2.1.0-beta Reporter: Arpit Gupta Assignee: Mayank Bansal Attachments: rm.log, YARN-845-trunk-1.patch, YARN-845-trunk-draft.patch the following stack trace is generated in rm
{code}
n, service: 68.142.246.147:45454 }, ] resource=<memory:1536, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:44544, vCores:29>usedCapacity=0.90625, absoluteUsedCapacity=0.90625, numApps=1, numContainers=29 usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,655 INFO capacity.ParentQueue (ParentQueue.java:completedContainer(696)) - completedContainer queue=root usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(832)) - Application appattempt_1371448527090_0844_01 released container container_1371448527090_0844_01_05 on node: host: hostXX:45454 #containers=4 available=2048 used=6144 with event: FINISHED
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:nodeUpdate(661)) - Trying to fulfill reservation for application application_1371448527090_0844 on node: hostXX:45454
2013-06-17 12:43:53,656 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:unreserve(435)) - Application application_1371448527090_0844 unreserved on node host: hostXX:45454 #containers=4 available=2048 used=6144, currently has 4 at priority 20; currentReservation <memory:6144, vCores:4>
2013-06-17 12:43:53,656 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updateResourceRequests(168)) - checking for deactivate...
2013-06-17 12:43:53,657 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(422)) - Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:432)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.unreserve(LeafQueue.java:1416)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1346)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1221)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1180)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:939)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:803)
[jira] [Created] (YARN-895) If NameNode is in safemode when RM restarts, RM should wait instead of crashing.
Jian He created YARN-895: Summary: If NameNode is in safemode when RM restarts, RM should wait instead of crashing. Key: YARN-895 URL: https://issues.apache.org/jira/browse/YARN-895 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-299) Node Manager throws org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE
[ https://issues.apache.org/jira/browse/YARN-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698414#comment-13698414 ] Mayank Bansal commented on YARN-299: This patch does not need rebasing. Thanks, Mayank Node Manager throws org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE --- Key: YARN-299 URL: https://issues.apache.org/jira/browse/YARN-299 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.0.1-alpha, 2.0.0-alpha Reporter: Devaraj K Assignee: Mayank Bansal Attachments: YARN-299-trunk-1.patch
{code:xml}
2012-12-31 10:36:27,844 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [DONE], eventType: [RESOURCE_FAILED]
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at DONE
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:819)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:71)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:504)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:497)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
	at java.lang.Thread.run(Thread.java:662)
2012-12-31 10:36:27,845 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1356792558130_0002_01_01 transitioned from DONE to null
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-649) Make container logs available over HTTP in plain text
[ https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698415#comment-13698415 ] Zhijie Shen commented on YARN-649: -- Read the patch quickly. It looks almost fine to me. One minor question: why does getLogs not support XML?
{code}
+  @GET
+  @Path("/containerlogs/{containerid}/{filename}")
+  @Produces({ MediaType.TEXT_PLAIN, MediaType.APPLICATION_JSON })
+  @Evolving
+  public Response getLogs(@PathParam("containerid") String containerIdStr,
+      @PathParam("filename") String filename) {
{code}
Here are some additional thoughts. Long running applications may have a big log file, such that it will take a long time to download the log file via the RESTful API. Consequently, the HTTP connection may time out before a complete log file is downloaded. Maybe it is good to zip the log file before sending it, and unzip it after receiving it. Moreover, it could be more advanced to query the part of the log which is recorded between timestamp1 and timestamp2. Just thinking out loud. Not sure it is required right now. Make container logs available over HTTP in plain text - Key: YARN-649 URL: https://issues.apache.org/jira/browse/YARN-649 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649-4.patch, YARN-649.patch, YARN-752-1.patch It would be good to make container logs available over the REST API for MAPREDUCE-4362 and so that they can be accessed programmatically in general. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
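A hypothetical usage sketch for the endpoint in the quoted patch; the NM host, port, web-service prefix, container id, and file name are all placeholders:
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetch a container's log file as plain text from the NM web service.
public class FetchContainerLog {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://nm-host:8042/ws/v1/node/containerlogs/"
        + "container_1371844267731_0001_01_000001/stderr"); // placeholders
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "text/plain");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
{code}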
[jira] [Commented] (YARN-502) RM crash with NPE on NODE_REMOVED event with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698418#comment-13698418 ] Mayank Bansal commented on YARN-502: Latest patch does not need rebasing. Thanks, Mayank RM crash with NPE on NODE_REMOVED event with FairScheduler -- Key: YARN-502 URL: https://issues.apache.org/jira/browse/YARN-502 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.3-alpha Reporter: Lohit Vijayarenu Assignee: Mayank Bansal Attachments: YARN-502-trunk-1.patch, YARN-502-trunk-2.patch While running some test and adding/removing nodes, we see RM crashed with the below exception. We are testing with fair scheduler and running hadoop-2.0.3-alpha
{noformat}
2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node :55680 as it is now LOST
2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: :55680 Node Transitioned from UNHEALTHY to LOST
2013-03-22 18:54:27,015 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
	at java.lang.Thread.run(Thread.java:662)
2013-03-22 18:54:27,016 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped SelectChannelConnector@:50030
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-649) Make container logs available over HTTP in plain text
[ https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698444#comment-13698444 ] Sandy Ryza commented on YARN-649: - Thanks for taking a look, Zhijie.
bq. why does getLogs not support XML?
Oops, leaving in MediaType.APPLICATION_JSON was a mistake. My intention was actually to have it only support plain text. Thoughts? Regarding the zip files and the time-based queries, these seem like useful features, but I think they would be better for a separate JIRA, and can be added in a backwards-compatible manner with additional request parameters. My goal here was to implement the minimum needed to work on MAPREDUCE-4362 and YARN-675. Make container logs available over HTTP in plain text - Key: YARN-649 URL: https://issues.apache.org/jira/browse/YARN-649 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649-4.patch, YARN-649.patch, YARN-752-1.patch It would be good to make container logs available over the REST API for MAPREDUCE-4362 and so that they can be accessed programmatically in general. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-7) Add support for DistributedShell to ask for CPUs along with memory
[ https://issues.apache.org/jira/browse/YARN-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-7: -- Attachment: YARN-7-v2.patch Sync up patch with latest trunk branch. Add support for DistributedShell to ask for CPUs along with memory -- Key: YARN-7 URL: https://issues.apache.org/jira/browse/YARN-7 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.3-alpha Reporter: Arun C Murthy Assignee: Junping Du Labels: patch Attachments: YARN-7.patch, YARN-7-v2.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
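For reference, what asking for CPUs alongside memory looks like with the standard YARN records API (values are illustrative; presumably this is what the patch wires into the DistributedShell client):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.Records;

// Build a container capability that carries vcores in addition to memory.
public class CpuMemoryCapability {
  public static Resource capability(int memoryMb, int vcores) {
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(memoryMb);     // e.g. 1024
    capability.setVirtualCores(vcores); // e.g. 2 -- what YARN-7 adds to the shell
    return capability;
  }
}
{code}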
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698500#comment-13698500 ] Robert Joseph Evans commented on YARN-896: -- During the most recent Hadoop Summit there was a developer meetup where we discussed some of these issues. This is to summarize what was discussed at that meeting and to add in a few things that have also been discussed on mailing lists and other places.
HDFS delegation tokens have a maximum lifetime. Currently, tokens submitted to the RM when the app master is launched will be renewed by the RM until the application finishes and the logs from the application have finished aggregating. The only token currently used by the YARN framework is the HDFS delegation token. This is used to read files from HDFS as part of the distributed cache and to write the aggregated logs out to HDFS. In order to support relaunching an app master after the maximum lifetime of the HDFS delegation token has expired, we either need to allow for tokens that do not expire or provide an API to allow the RM to replace the old token with a new one. Because removing the maximum lifetime of a token reduces the security of the cluster as a whole, I think it would be better to provide an API to replace the token with a new one. If we want to continue supporting log aggregation, we also need to provide a way for the Node Managers to get the new token too. It is assumed that each app master will also provide an API to get the new token so it can start using it.
Log aggregation is another issue, although not required for long lived applications to work. Logs are aggregated into HDFS when the application finishes. This is not really that useful for applications that are never intended to exit. Ideally the processing of logs by the node manager should be pluggable so that clusters and applications can select how and when logs are processed/displayed to the end user. Because many of these systems roll their logs to avoid filling up disks, we will probably need a protocol of some sort for the container to communicate with the Node Manager when logs are ready to be processed.
Another issue is to allow containers to outlive the app master that launched them and also to allow containers to outlive the node manager that launched them. This is especially critical for the stability of applications during rolling upgrades to YARN.
Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
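To make the token-replacement idea concrete, a purely hypothetical sketch of the API shape being argued for; none of these names exist in YARN today:
{code:java}
import org.apache.hadoop.security.token.Token;

// Hypothetical: let a long-lived AM hand the RM a fresh HDFS delegation token
// before the old one hits its maximum lifetime; the RM would in turn need to
// propagate it to the NMs so log aggregation keeps working.
public interface TokenReplacementProtocol {
  /** Replace the stored delegation token for this application with a renewed one. */
  void replaceDelegationToken(String applicationId, Token<?> newToken);
}
{code}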
[jira] [Commented] (YARN-896) Roll up for long lived YARN
[ https://issues.apache.org/jira/browse/YARN-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698505#comment-13698505 ] Robert Joseph Evans commented on YARN-896: -- Another issue that has been discussed in the past is the impact that long lived processes can have on resource scheduling. It is possible for a long lived process to grab lots of resources and then never release them, even though it is using more resources than it would be allowed to have when the cluster is full. Recent preemption changes should be able to prevent this from happening between different queues/pools, but we may need to think about whether we need more control over this within a queue. Roll up for long lived YARN --- Key: YARN-896 URL: https://issues.apache.org/jira/browse/YARN-896 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Joseph Evans YARN is intended to be general purpose, but it is missing some features to be able to truly support long lived applications and long lived containers. This ticket is intended to # discuss what is needed to support long lived processes # track the resulting JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698518#comment-13698518 ] Omkar Vinit Joshi commented on YARN-661:
* First part is very much straightforward: container-executor.c already had some code to do this; just modified ResourceLocalizationService.cleanUpFilesFromSubDir to trigger it (basically swapping subDir with baseDir).
* I am exposing the deletion task dependency to the user via DeletionService. Now the user can specify a multilevel deletion task DAG, and the deletion service will take care of it once all parent (root) deletion tasks are started by the user after defining the dependency. I tested this locally on a secured cluster, but will add test cases to verify that the DAG actually works. I will update the patch with test cases; attaching the initial patch.
NM fails to cleanup local directories for users --- Key: YARN-661 URL: https://issues.apache.org/jira/browse/YARN-661 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 0.23.8 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-661-20130701.patch YARN-71 added deletion of local directories on startup, but in practice it fails to delete the directories because of permission problems. The top-level usercache directory is owned by the user but is in a directory that is not writable by the user. Therefore the deletion of the user's usercache directory, as the user, fails due to lack of permissions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
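A hypothetical sketch of the dependency idea in the comment above — not the actual DeletionService API — where a deletion task becomes runnable only after all of its parents in the DAG complete:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical DAG of deletion tasks: each child counts its unfinished
// parents and runs when the last parent finishes.
public class DeletionTaskSketch implements Runnable {
  private final String path;
  private final AtomicInteger pendingParents = new AtomicInteger();
  private final List<DeletionTaskSketch> children = new ArrayList<DeletionTaskSketch>();

  public DeletionTaskSketch(String path) {
    this.path = path;
  }

  public void addChild(DeletionTaskSketch child) {
    child.pendingParents.incrementAndGet();
    children.add(child);
  }

  @Override
  public void run() {
    System.out.println("deleting " + path); // stand-in for the actual deletion
    for (DeletionTaskSketch child : children) {
      if (child.pendingParents.decrementAndGet() == 0) {
        child.run(); // last parent finished: child is now runnable
      }
    }
  }
}
{code}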
[jira] [Updated] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-661: --- Attachment: YARN-661-20130701.patch NM fails to cleanup local directories for users --- Key: YARN-661 URL: https://issues.apache.org/jira/browse/YARN-661 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 0.23.8 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-661-20130701.patch YARN-71 added deletion of local directories on startup, but in practice it fails to delete the directories because of permission problems. The top-level usercache directory is owned by the user but is in a directory that is not writable by the user. Therefore the deletion of the user's usercache directory, as the user, fails due to lack of permissions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-502) RM crash with NPE on NODE_REMOVED event with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698522#comment-13698522 ] Karthik Kambatla commented on YARN-502: --- Looks good to me. +1 RM crash with NPE on NODE_REMOVED event with FairScheduler -- Key: YARN-502 URL: https://issues.apache.org/jira/browse/YARN-502 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.3-alpha Reporter: Lohit Vijayarenu Assignee: Mayank Bansal Attachments: YARN-502-trunk-1.patch, YARN-502-trunk-2.patch While running some test and adding/removing nodes, we see RM crashed with the below exception. We are testing with fair scheduler and running hadoop-2.0.3-alpha
{noformat}
2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node :55680 as it is now LOST
2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: :55680 Node Transitioned from UNHEALTHY to LOST
2013-03-22 18:54:27,015 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
	at java.lang.Thread.run(Thread.java:662)
2013-03-22 18:54:27,016 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped SelectChannelConnector@:50030
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira