[jira] [Created] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
skrho created YARN-3758:
----------------------------
Summary: The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
Key: YARN-3758
URL: https://issues.apache.org/jira/browse/YARN-3758
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568667#comment-14568667 ] Hadoop QA commented on YARN-3749: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 29s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 33s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 49s | The applied patch generated 1 new checkstyle issues (total was 212, now 213). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 34s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 7m 1s | Tests passed in hadoop-yarn-client. | | {color:red}-1{color} | yarn tests | 60m 25s | Tests failed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 52s | Tests passed in hadoop-yarn-server-tests. | | | | 115m 5s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736732/YARN-3749.6.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8158/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8158/console | This message was automatically generated. 
We should make a copy of configuration when init MiniYARNCluster with multiple RMs
-----------------------------------------------------------------------------------
Key: YARN-3749
URL: https://issues.apache.org/jira/browse/YARN-3749
Project: Hadoop YARN
Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch

While writing a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I had initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that ClientRMService is where the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService:
{code}
clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}
Since we use the same instance of configuration in rm1 and rm2 and init both RM before
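The issue title points at the fix direction: give each RM in MiniYARNCluster its own copy of the configuration, so one RM's call to updateConnectAddr cannot overwrite the rm2 address seen by the other. A minimal sketch of that idea (variable names are assumed; this is not the attached patch):
{code}
// Sketch: copy the shared conf before handing it to each RM, so in-place updates
// such as updateConnectAddr() stay local to that RM instance.
for (int i = 0; i < numResourceManagers; i++) {
  Configuration rmConf = new YarnConfiguration(conf);   // a copy, not a shared reference
  resourceManagers[i].init(rmConf);
}
{code}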
[jira] [Updated] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

skrho updated YARN-3758:
------------------------
Description:
Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, the Capacity scheduler, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, the fair-scheduler, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

was:
Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
----------------------------------------------------------------------------------------------
Key: YARN-3758
URL: https://issues.apache.org/jira/browse/YARN-3758
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, the Capacity scheduler, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, the fair-scheduler, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
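One detail worth noting in the updated description above: the first cluster runs the Capacity scheduler and the second runs the fair-scheduler, and the two schedulers round container requests up using different step sizes. A rough illustration of that normalization (simplified, not the actual scheduler code; the FairScheduler step comes from yarn.scheduler.increment-allocation-mb, whose default is 1024):
{code}
// Simplified view of how YARN normalizes a requested container size:
//   normalized = min(maximum, roundUp(max(requested, minimum), increment))
static int normalizeMemory(int requestedMb, int minimumMb, int incrementMb, int maximumMb) {
  int base = Math.max(requestedMb, minimumMb);
  int rounded = ((base + incrementMb - 1) / incrementMb) * incrementMb;  // round up to a multiple of the increment
  return Math.min(rounded, maximumMb);
}

// CapacityScheduler effectively steps by yarn.scheduler.minimum-allocation-mb:
//   normalizeMemory(256, 256, 256, 8192)  -> 256   (what the first cluster shows)
// FairScheduler steps by yarn.scheduler.increment-allocation-mb (default 1024):
//   normalizeMemory(256, 256, 1024, 8192) -> 1024  (matches what the second cluster shows)
{code}
So on a fair-scheduler cluster, yarn.scheduler.increment-allocation-mb is one setting worth checking alongside the minimum.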
[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joep Rottinghuis updated YARN-3706:
-----------------------------------
Attachment: YARN-3726-YARN-2928.004.patch

YARN-3726-YARN-2928.004.patch:
- fixed bug in cleanse (found thanks to unit test)
- fixed value separator (was ! instead of ?).
- Added readResult and readResults to EntityColumnPrefix (still need to add signature in interface).
- Added initial unit test for TimeLineWriterUtils
- Added relationship checking to TestTimelineWriterImpl

Generalize native HBase writer for additional tables
-----------------------------------------------------
Key: YARN-3706
URL: https://issues.apache.org/jira/browse/YARN-3706
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
Attachments: YARN-3706-YARN-2928.001.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch

When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568789#comment-14568789 ]

Varun Saxena commented on YARN-2962:
------------------------------------
I was waiting for input from [~vinodkv] and [~asuresh] so that we reach a common understanding on what we will do about the backward-compatibility part. Anyway, in the coming week I plan to upload a patch implementing one of the approaches discussed.

ZKRMStateStore: Limit the number of znodes under a znode
---------------------------------------------------------
Key: YARN-2962
URL: https://issues.apache.org/jira/browse/YARN-2962
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch

We ran into this issue when we hit the default ZK server message size limits, primarily because the message had too many znodes even though individually they were all small.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
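For context on the size limit being described, a back-of-the-envelope illustration (assuming the limit in question is ZooKeeper's jute.maxbuffer, roughly 1 MB by default): a single getChildren() response has to carry every child znode name, so a parent with many small children can still exceed the buffer size.
{code}
// Rough arithmetic only; the numbers are assumptions, not measurements.
int apps = 20000;                       // znodes under one parent (e.g. stored applications)
int bytesPerChildName = 60;             // approximate length of one application znode name
int approxResponseBytes = apps * bytesPerChildName;   // ~1.2 MB, above a ~1 MB jute.maxbuffer
{code}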
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568670#comment-14568670 ] Hadoop QA commented on YARN-3753: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 14m 53s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 29s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 25s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 16s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 86m 33s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736741/YARN-3753.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8161/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8161/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8161/console | This message was automatically generated. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.1.patch, YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. 
{code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at
[jira] [Created] (YARN-3756) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
skrho created YARN-3756:
----------------------------
Summary: The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
Key: YARN-3756
URL: https://issues.apache.org/jira/browse/YARN-3756
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Environment: hadoop 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3753: -- Attachment: YARN-3753.2.patch RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568797#comment-14568797 ] Hadoop QA commented on YARN-3753: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 53s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 48s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 6s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 7s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736776/YARN-3753.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8164/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8164/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8164/console | This message was automatically generated. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. 
{code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568834#comment-14568834 ]

Lavkesh Lahngir commented on YARN-3591:
---------------------------------------
[~zxu]: Can we get away without storing this in the NMStateStore? The other changes seem to be okay. It's not a big change in terms of code, but adding it to the NM state could be debatable. [~vvasudev]: Thoughts?

Resource Localisation on a bad disk causes subsequent containers failure
-------------------------------------------------------------------------
Key: YARN-3591
URL: https://issues.apache.org/jira/browse/YARN-3591
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch

It happens when a resource has been localised on a disk and, after localisation, that disk goes bad. The NM keeps the paths of localised resources in memory. At resource-request time, isResourcePresent(rsrc) is called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true, but at read time the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find the inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which calls open() natively. If the disk is good it should return an array of paths with length at least 1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
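A minimal sketch of the proposed check described above (an illustration of the idea only; the method shape and names are assumed, not taken from the attached patches):
{code}
// file.exists() can answer from cached inode data even when the disk has gone bad;
// java.io.File.list() opens the parent directory natively and returns null on failure.
static boolean isResourcePresent(java.io.File localizedPath) {
  String[] siblings = localizedPath.getParentFile().list();
  return siblings != null && siblings.length >= 1 && localizedPath.exists();
}
{code}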
[jira] [Created] (YARN-3757) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
skrho created YARN-3757:
----------------------------
Summary: The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
Key: YARN-3757
URL: https://issues.apache.org/jira/browse/YARN-3757
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Environment: Hadoop 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568682#comment-14568682 ]

Rohith commented on YARN-3733:
------------------------------
Updated the summary as per the defect.

DominantRC#compare() does not work as expected if cluster resource is empty
-----------------------------------------------------------------------------
Key: YARN-3733
URL: https://issues.apache.org/jira/browse/YARN-3733
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Environment: Suse 11 SP3, 2 NM, 2 RM; one NM has 3 GB and 6 vcores
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
Attachments: YARN-3733.patch

Steps to reproduce
==================
1. Install HA with 2 RMs and 2 NMs (3072 MB * 2 total cluster memory).
2. Configure map and reduce size to 512 MB after changing the scheduler minimum size to 512 MB.
3. Configure the capacity scheduler with the AM limit set to .5 (DominantResourceCalculator is configured).
4. Submit 30 concurrent tasks.
5. Switch the RM.

Actual
======
AMs get allocated for 12 jobs and all 12 start running. No other YARN child is initiated; *all 12 jobs stay in the RUNNING state for ever*.

Expected
========
Only 6 should be running at a time, since the maximum AM allocation is .5 (3072 MB).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
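For background on why an empty cluster resource can trip up the comparison named in the summary, a simplified illustration follows (this is not the DominantResourceCalculator code; it only shows the float behaviour that dominant-share comparisons rely on):
{code}
// Dominant-share style comparisons divide by the cluster resource; with an empty
// cluster resource those divisions produce Infinity/NaN and ordering breaks down.
float clusterMemory = 0f, clusterVcores = 0f;                 // empty cluster resource
float lhsShare = 1024f / clusterMemory;                       // Infinity
float rhsShare = 0f / clusterVcores;                          // NaN
boolean ordered = lhsShare > rhsShare || lhsShare < rhsShare; // false: NaN never orders against anything
{code}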
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568853#comment-14568853 ] Brahma Reddy Battula commented on YARN-3170: Updated patch..Kindly review!! YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, YARN-3170-010.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568735#comment-14568735 ] Hadoop QA commented on YARN-3749: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 17s | The applied patch generated 1 new checkstyle issues (total was 212, now 213). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 6m 5s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 6m 58s | Tests passed in hadoop-yarn-client. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 60m 34s | Tests failed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 51s | Tests passed in hadoop-yarn-server-tests. | | | | 121m 2s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736753/YARN-3749.7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-client.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8163/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8163/console | This message was automatically generated. 
We should make a copy of configuration when init MiniYARNCluster with multiple RMs
-----------------------------------------------------------------------------------
Key: YARN-3749
URL: https://issues.apache.org/jira/browse/YARN-3749
Project: Hadoop YARN
Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch

While writing a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I had initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that ClientRMService is where the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService:
{code}
clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}
Since we
[jira] [Updated] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

skrho updated YARN-3758:
------------------------
Description:
Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

was:
Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
----------------------------------------------------------------------------------------------
Key: YARN-3758
URL: https://issues.apache.org/jira/browse/YARN-3758
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated YARN-3755:
-----------------------------
Attachment: YARN-3755-2.patch

Uploaded a new patch to address the checkstyle issue.

Log the command of launching containers
---------------------------------------
Key: YARN-3755
URL: https://issues.apache.org/jira/browse/YARN-3755
Project: Hadoop YARN
Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Attachments: YARN-3755-1.patch, YARN-3755-2.patch

In the ResourceManager log, YARN logs the command for launching the AM, which is very useful. But there is no such log in the NM log for launching containers, so it is difficult to diagnose containers that fail to launch because of some issue in the commands. Although a user can look at the commands in the container launch script file, that is an internal detail of YARN that users usually don't know about. From the user's perspective, they only know the commands they specified when building the YARN application.

{code}
2015-06-01 16:06:42,245 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1433145984561_0001_01_01 : $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
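For comparison, a sketch of the kind of NodeManager-side logging being asked for (not the attached patch; the surrounding variables are assumed to be whatever is in scope where the container launch context is available):
{code}
// Sketch only: mirror AMLauncher's "Command to launch container ..." log line on
// the NM side, using the commands carried in the container launch context.
List<String> commands = launchContext.getCommands();
LOG.info("Command to launch container " + containerId + " : "
    + StringUtils.join(" ", commands));
{code}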
[jira] [Resolved] (YARN-3757) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

skrho resolved YARN-3757.
-------------------------
Resolution: Duplicate

The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
----------------------------------------------------------------------------------------------
Key: YARN-3757
URL: https://issues.apache.org/jira/browse/YARN-3757
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Environment: Hadoop 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3758) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568790#comment-14568790 ]

Naganarasimha G R commented on YARN-3758:
-----------------------------------------
YARN-3756 and YARN-3757 are the same as this issue! Can you close them?

The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
----------------------------------------------------------------------------------------------
Key: YARN-3758
URL: https://issues.apache.org/jira/browse/YARN-3758
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, the Capacity scheduler, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, the fair-scheduler, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (YARN-3756) The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

skrho resolved YARN-3756.
-------------------------
Resolution: Duplicate

The minimum memory setting (yarn.scheduler.minimum-allocation-mb) is not working in container
----------------------------------------------------------------------------------------------
Key: YARN-3756
URL: https://issues.apache.org/jira/browse/YARN-3756
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.4.0
Environment: hadoop 2.4.0
Reporter: skrho

Hello there~~

I have two clusters. The first cluster has 5 nodes, one default application queue, and 8 GB of physical memory per node. The second cluster has 10 nodes, 2 application queues, and 230 GB of physical memory per node.

Whenever a MapReduce job runs, I want the ResourceManager to give each container a minimum of 256 MB of memory, so I changed the following configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m
mapreduce.reduce.java.opts : -Xms256m
mapreduce.map.memory.mb : 256
mapreduce.reduce.memory.mb : 256

On the first cluster, whenever a MapReduce job is running, I can see 256 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ). But on the second cluster, whenever a MapReduce job is running, I can see 1024 MB of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024 MB, so if the memory settings are not changed, the default value is used. I have been testing for two weeks, but I don't know why the minimum memory setting is not working on the second cluster. Why does this difference happen? Is my configuration wrong, or is there a bug?

Thank you for reading~~

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Assigned] (YARN-3761) Set delegation token service address at the server side
[ https://issues.apache.org/jira/browse/YARN-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Saxena reassigned YARN-3761:
----------------------------------
Assignee: Varun Saxena

Set delegation token service address at the server side
--------------------------------------------------------
Key: YARN-3761
URL: https://issues.apache.org/jira/browse/YARN-3761
Project: Hadoop YARN
Issue Type: Improvement
Components: security
Reporter: Zhijie Shen
Assignee: Varun Saxena

Nowadays, YARN components generate the delegation token without the service address set and leave it to the client to set it. With our Java client library this is usually fine. However, it becomes a problem for users of the REST API: the delegation token is returned as a URL string, and it is unfriendly to ask a thin client to deserialize the URL string, set the token service address, and serialize it again for further use. If we move the task of setting the service address to the server side, the client can be rid of this trouble.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
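For context, the client-side step that this proposal would make unnecessary looks roughly like the following (a sketch using the common Hadoop token helpers; variable names are assumed):
{code}
// What a thin REST client currently has to do with the token string it receives:
Token<?> token = new Token<>();
token.decodeFromUrlString(tokenUrlString);        // deserialize the URL-safe string
SecurityUtil.setTokenService(token, rmAddress);   // fill in the missing service address
String forwarded = token.encodeToUrlString();     // serialize again for further use
{code}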
[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3069: - Attachment: YARN-3069.011.patch Thanks Akira! New patch with the following changes: - Fix description for yarn.node-labels.fs-store.retry-policy-spec - Remove YARN registry entries from yarn-default.xml - Remove one outdated entry yarn.application.classpath.prepend.distcache - Add entry for yarn.intermediate-data-encryption.enable I'll also go through the yarn-default.xml file once more to make sure no default values will change. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval 
yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
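For option B above, the exceptions live in the configuration-consistency unit test. A sketch of what such an exclusion might look like (the member names follow the TestConfigurationFieldsBase pattern and are written from memory, so treat them as assumptions rather than the exact API):
{code}
// Sketch: mark a property as intentionally undocumented so the
// TestYarnConfigurationFields comparison against yarn-default.xml skips it.
@Override
public void initializeMemberVariables() {
  xmlFilename = "yarn-default.xml";
  configurationPropsToSkipCompare = new HashSet<String>();
  // internal/minicluster-only property, deliberately not documented:
  configurationPropsToSkipCompare.add("yarn.is.minicluster");
}
{code}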
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569548#comment-14569548 ] Sergey Shelukhin commented on YARN-1462: [~sseth] can you please comment on the above (use of Private API)? AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569646#comment-14569646 ] Hadoop QA commented on YARN-3069: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 46s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 2m 58s | Site still builds. | | {color:green}+1{color} | checkstyle | 1m 36s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 22s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 34s | Tests passed in hadoop-common. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | | | 72m 56s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736976/YARN-3069.011.patch | | Optional Tests | site javadoc javac unit findbugs checkstyle | | git revision | trunk / a2bd621 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8168/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8168/console | This message was automatically generated. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. 
org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569648#comment-14569648 ] Siddharth Seth commented on YARN-1462: -- ApplicationReport.newInstance is used by mapreduce and Tez, and potentially by other applications whose AMs may be modeled along the same lines. It'll be useful to make the API change here compatible. This is along the lines of newInstances being used for various constructs like ContainerId, AppId, etc. With the change, I don't believe MR2.6 will work with a 2.8 cluster - depending on how the classpath is set up. AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569773#comment-14569773 ] Jason Lowe commented on YARN-3585: -- +1 latest patch lgtm. Will commit this tomorrow if there are no objections. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3585.patch, YARN-3585.patch With NM recovery enabled, after decommission, the nodemanager log shows it stopping but the process cannot exit. Non-daemon threads:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and the JNI leveldb thread stack:
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2392: - Priority: Minor (was: Major) add more diags about app retry limits on AM failures Key: YARN-2392 URL: https://issues.apache.org/jira/browse/YARN-2392 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Attachments: YARN-2392-001.patch, YARN-2392-002.patch, YARN-2392-002.patch # when an app fails the failure count is shown, but not what the global + local limits are. If the two are different, they should both be printed. # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569727#comment-14569727 ] zhihai xu commented on YARN-3591: - Hi [~lavkesh], I think we can create a separate JIRA for storing local error directories in the NM state store, which will be a good enhancement. Thanks [~sunilg]! Adding a new API to get local error directories is also a good suggestion. But I think it will be enough to just check newErrorDirs instead of all errorDirs. To better support NM recovery and keep the DirsChangeListener interface simple, I propose the following changes: 1. In DirectoryCollection, notify the listener when any set of dirs (localDirs, errorDirs and fullDirs) is changed. The code change in {{DirectoryCollection#checkDirs}} looks like the following:
{code}
boolean needNotifyListener = setChanged;
for (String dir : preCheckFullDirs) {
  if (postCheckOtherDirs.contains(dir)) {
    needNotifyListener = true;
    LOG.warn("Directory " + dir + " error: " + dirsFailedCheck.get(dir).message);
  }
}
for (String dir : preCheckOtherErrorDirs) {
  if (postCheckFullDirs.contains(dir)) {
    needNotifyListener = true;
    LOG.warn("Directory " + dir + " error: " + dirsFailedCheck.get(dir).message);
  }
}
if (needNotifyListener) {
  for (DirsChangeListener listener : dirsChangeListeners) {
    listener.onDirsChanged();
  }
}
{code}
2. Add an API to get local error directories. As [~sunilg] suggested, we can add an API {{synchronized List<String> getErrorDirs()}} in DirectoryCollection.java. We also need to add an API {{public List<String> getLocalErrorDirs()}} in LocalDirsHandlerService.java, which will call {{DirectoryCollection#getErrorDirs}}. 3. Add a field {{Set<String> preLocalErrorDirs}} in ResourceLocalizationService.java to store the previous local error directories. {{ResourceLocalizationService#preLocalErrorDirs}} should be loaded from the state store at the beginning if we support storing local error directories in the NM state store. 4. The following is pseudo code for {{localDirsChangeListener#onDirsChanged}}:
{code}
Set<String> curLocalErrorDirs = new HashSet<String>(dirsHandler.getLocalErrorDirs());
List<String> newErrorDirs = new ArrayList<String>();
List<String> newRepairedDirs = new ArrayList<String>();
for (String dir : curLocalErrorDirs) {
  if (!preLocalErrorDirs.contains(dir)) {
    newErrorDirs.add(dir);
  }
}
for (String dir : preLocalErrorDirs) {
  if (!curLocalErrorDirs.contains(dir)) {
    newRepairedDirs.add(dir);
  }
}
for (String localDir : newRepairedDirs) {
  cleanUpLocalDir(lfs, delService, localDir);
}
if (!newErrorDirs.isEmpty()) {
  // As Sunil suggested, checkLocalizedResources will call removeResource on those
  // localized resources whose parent is present in newErrorDirs.
  publicRsrc.checkLocalizedResources(newErrorDirs);
  for (LocalResourcesTracker tracker : privateRsrc.values()) {
    tracker.checkLocalizedResources(newErrorDirs);
  }
}
if (!newErrorDirs.isEmpty() || !newRepairedDirs.isEmpty()) {
  preLocalErrorDirs = curLocalErrorDirs;
  stateStore.storeLocalErrorDirs(
      StringUtils.arrayToString(curLocalErrorDirs.toArray(new String[0])));
}
checkAndInitializeLocalDirs();
{code}
5. It will be better to move {{verifyDirUsingMkdir(testDir)}} right after {{DiskChecker.checkDir(testDir)}} in {{DirectoryCollection#testDirs}}, so we can detect an error directory before detecting a full directory. Please feel free to change or add more to my proposal.
Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch It happens when a resource is localised on a disk and, after localising, that disk has gone bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
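As a concrete illustration of the proposal above, here is a minimal, self-contained sketch (the class and method names are hypothetical, not the actual NM change): instead of trusting {{file.exists()}}, list the parent directory, which forces a native directory read and fails when the disk is bad even though the inode is still cached.
{code}
import java.io.File;

public class LocalizedResourceCheck {
  // Hypothetical helper: verify a localized resource by listing its parent directory
  // (a native opendir/readdir) rather than relying on file.exists()/stat64.
  static boolean isResourceReadable(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return localizedPath.exists();
    }
    String[] children = parent.list(); // returns null if the directory cannot be read
    if (children == null || children.length < 1) {
      return false; // parent unreadable or empty: treat the disk as suspect
    }
    for (String child : children) {
      if (child.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}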
[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2392: - Attachment: YARN-2392-002.patch Patch 002 * in sync with trunk * uses String.format for a more readable format of the response * includes sliding window details in the message There's no test here, for which I apologise. To test this I'd need a test to trigger failures and look for the final error message, which seems excessive for a log tuning. If there's a test for the sliding-window retry that could be patched, I'll do it there. add more diags about app retry limits on AM failures Key: YARN-2392 URL: https://issues.apache.org/jira/browse/YARN-2392 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2392-001.patch, YARN-2392-002.patch, YARN-2392-002.patch # when an app fails the failure count is shown, but not what the global + local limits are. If the two are different, they should both be printed. # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
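As a rough sketch of the String.format-based diagnostics described above (the message text and parameter names are hypothetical, not taken from the actual YARN-2392 patch), the failure message can carry both the per-app and global attempt limits plus the sliding-window interval:
{code}
public class AmRetryDiagnostics {
  static String buildDiagnostics(int failures, int maxAppAttempts,
      int globalMaxAttempts, long failuresValidityIntervalMs) {
    // Mention the window only when a sliding-window interval is configured.
    String window = failuresValidityIntervalMs > 0
        ? String.format(" in the last %d ms", failuresValidityIntervalMs)
        : "";
    return String.format(
        "Application failed %d times%s (app max attempts: %d, global max attempts: %d).",
        failures, window, maxAppAttempts, globalMaxAttempts);
  }
}
{code}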
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569778#comment-14569778 ] Matthew Jacobs commented on YARN-2194: -- I'm confused, does this mean that you'll re-mount the cpu and cpuacct controllers? Do we know that other components in the RHEL7 world don't expect them to be in the default place? Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the user of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3510) Create an extension of ProportionalCapacityPreemptionPolicy which preempts a number of containers from each application in a way which respects fairness
[ https://issues.apache.org/jira/browse/YARN-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570039#comment-14570039 ] Craig Welch commented on YARN-3510: --- [~leftnoteasy] and I had some offline discussion. The patch currently here is simply meant to keep from unbalancing whatever allocation process is active by, generally, keeping relative usage between applications the same. It doesn't attempt to actively re-allocate in a way which achieves the overall allocation policy, i.e., as if all the applications had started at once (this is a more complex proposition, obviously). There's a desire to have this because, among other things, sometime down the road we may do preemption just among users/applications in a queue, and it will be necessary for the preemption to actively work toward the allocation goals to do that, rather than just maintain current levels. This will add some medium-level complexity to the current patch; the deltas from the current approach are: Since the effect of preemption on ordering for fairness doesn't occur until the container is released, and we want to consider it right away, there will be a need to retain info about pending preemption for comparison on the app resources (it will be a deduction from usage for ordering purposes, as if the preemption had already happened). The preemptEvenly loop will need to reorder the app which was preempted after each preemption and then restart the iteration over apps (not necessarily over all apps, again, just until the first preemption). Create an extension of ProportionalCapacityPreemptionPolicy which preempts a number of containers from each application in a way which respects fairness Key: YARN-3510 URL: https://issues.apache.org/jira/browse/YARN-3510 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3510.2.patch, YARN-3510.3.patch, YARN-3510.5.patch, YARN-3510.6.patch The ProportionalCapacityPreemptionPolicy preempts as many containers from applications as it can during its preemption run. For fifo this makes sense, as it is preempting in reverse order, therefore maintaining the primacy of the oldest. For fair ordering this does not have the desired effect - instead, it should preempt a number of containers from each application which maintains a fair balance / close to a fair balance between them -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570040#comment-14570040 ] Matthew Jacobs commented on YARN-2194: -- Thanks, [sidharta-s]. So the change would be in how the container-executor accepts lists of paths, not attempting to re-mount the controllers, right? If I understand it correctly, that sounds like a good plan to me. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the user of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3762: --- Attachment: yarn-3762-1.patch Here is a patch that protects FSParentQueue members with read-write locks. FairScheduler: CME on FSParentQueue#getQueueUserAclInfo --- Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-3762-1.patch In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
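For readers following along, a minimal sketch of the read-write-lock approach (illustrative only, with made-up names rather than the actual FSParentQueue code): readers iterate or copy the child list under the read lock, writers mutate it under the write lock, so a getQueueUserAclInfo-style traversal can no longer race with a queue being added.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedQueueList {
  private final List<String> childQueues = new ArrayList<>();
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

  public void addChildQueue(String name) {
    rwLock.writeLock().lock();
    try {
      childQueues.add(name); // mutation only under the write lock
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  public List<String> snapshotChildQueues() {
    rwLock.readLock().lock();
    try {
      return new ArrayList<>(childQueues); // iterate/copy only under the read lock
    } finally {
      rwLock.readLock().unlock();
    }
  }
}
{code}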
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570037#comment-14570037 ] Sidharta Seethana commented on YARN-2194: - There are two different issues here : * container-executor binary invocation uses ‘,’ as a separator when supplying a list of paths - which breaks when the path contains ‘,’ * cpu,cpuacct are mounted together by default on RHEL7 Now, for the latter issue : In {{CgroupsLCEResourcesHandler}}, the following steps occur : * If the {{yarn.nodemanager.linux-container-executor.cgroups.mount}} switch is enabled , the ‘cpu’ controller is explicitly mounted at the specified path. * (irrespective of the state of the switch) The {{/proc/mounts}} file (possibly updated by the previous step) is subsequently parsed to determine the mount locations for the various cgroup controllers - this parsing code seems to be correct even if cpu and cpuacct are mounted in one location. So, the thing we need to fix is the separator issue and we should be good. The important thing to remember is that there are *two* cgroups implementation classes ( {{CgroupsLCEResourcesHandler}} and {{CGroupsHandlerImpl}} ). Hopefully, this will be addressed soon ( YARN-3542 ) - or we risk divergence. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the user of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
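As a rough illustration of the /proc/mounts parsing mentioned above (a sketch under assumptions, not the actual {{CgroupsLCEResourcesHandler}} code), the mount options only need to be split on ',' so that both cpu and cpuacct resolve to the same co-mounted hierarchy, e.g. /sys/fs/cgroup/cpu,cpuacct on RHEL7:
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CgroupMountParser {
  // Maps each cgroup mount option (including controller names such as cpu and cpuacct)
  // to the mount point of the hierarchy it belongs to.
  static Map<String, String> findCgroupMounts() throws IOException {
    Map<String, String> optionToPath = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader("/proc/mounts"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // /proc/mounts format: <device> <mount-point> <fs-type> <options> <dump> <pass>
        String[] fields = line.split("\\s+");
        if (fields.length < 4 || !"cgroup".equals(fields[2])) {
          continue;
        }
        for (String option : fields[3].split(",")) {
          optionToPath.put(option, fields[1]);
        }
      }
    }
    return optionToPath;
  }
}
{code}
With this, looking up "cpu" and "cpuacct" returns the same path on a co-mounted hierarchy, and no comma-separated list of controller paths has to be passed to the container-executor at all.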
[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3762: --- Attachment: yarn-3762-1.patch Sorry, I forgot to rebase and included some HDFS change as well. FairScheduler: CME on FSParentQueue#getQueueUserAclInfo --- Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-3762-1.patch, yarn-3762-1.patch In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569992#comment-14569992 ] Karthik Kambatla commented on YARN-3762: Changed it to critical and targeting 2.8.0, as it only fails the application and not the RM. FairScheduler: CME on FSParentQueue#getQueueUserAclInfo --- Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-3762-1.patch, yarn-3762-1.patch In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3762: --- Priority: Critical (was: Blocker) Target Version/s: 2.8.0 (was: 2.7.1) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo --- Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-3762-1.patch, yarn-3762-1.patch In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570007#comment-14570007 ] Zhijie Shen commented on YARN-3725: --- bq. is there a JIRA for the longer term fix? Yeah, I've filed YARN-3761 previously. App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, such that it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3534) Collect memory/cpu usage on the node
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-3534: -- Attachment: YARN-3534-10.patch Addressed some review comments Collect memory/cpu usage on the node Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch, YARN-3534-10.patch, YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, YARN-3534-9.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the collection of memory/cpu usage on the node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
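As a rough, Linux-only illustration of the kind of node-level sampling this JIRA describes (a hedged sketch, not the YARN-3534 patch itself), memory figures can be read directly from /proc/meminfo:
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NodeMemorySampler {
  // Returns the value (in kB) of a /proc/meminfo key such as "MemTotal" or "MemFree",
  // or -1 if the key is not present.
  static long readMeminfoKb(String key) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader("/proc/meminfo"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\\s+"); // e.g. "MemTotal:  16303508 kB"
        if (parts.length >= 2 && parts[0].equals(key + ":")) {
          return Long.parseLong(parts[1]);
        }
      }
    }
    return -1;
  }
}
{code}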
[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
[ https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570082#comment-14570082 ] Hadoop QA commented on YARN-3762: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 28s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 53s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 51s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 32s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 50m 25s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 88m 17s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12737043/yarn-3762-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c1d50a9 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8170/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8170/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8170/console | This message was automatically generated. 
FairScheduler: CME on FSParentQueue#getQueueUserAclInfo --- Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-3762-1.patch, yarn-3762-1.patch In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
Karthik Kambatla created YARN-3762: -- Summary: FairScheduler: CME on FSParentQueue#getQueueUserAclInfo Key: YARN-3762 URL: https://issues.apache.org/jira/browse/YARN-3762 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker In our testing, we ran into the following ConcurrentModificationException: {noformat} halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client java.util.ConcurrentModificationException: java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570025#comment-14570025 ] Wangda Tan commented on YARN-3733: -- Took a look at the patch and discussion. Thanks for working on this [~rohithsharma]. I think the approach [~sunilg] mentioned at https://issues.apache.org/jira/browse/YARN-3733?focusedCommentId=14568880page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14568880 makes sense to me. If the clusterResource is 0, we can compare the individual resource types. It could be:
{code}
Returns >: when l.mem > right.mem || l.cpu > right.cpu
Returns =: when (l.mem >= right.mem && l.cpu <= right.cpu) || (l.mem <= right.mem && l.cpu >= right.cpu)
Returns <: when l.mem < right.mem || l.cpu < right.cpu
{code}
This produces the same result as the INF approach in the patch, but can also compare when both l/r have 0 values. The reason I prefer this is: I'm sure the patch can solve the am-resource-percent problem, but with the suggested approach we can make sure we get a more reasonable result if we need to compare non-zero resources when clusterResource is zero (for example, sorting applications by their requirements when clusterResource is zero). And to avoid future regression, could you add a test to verify the am-resource-limit problem is solved? DominantRC#compare() does not work as expected if cluster resource is empty --- Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3733.patch, YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
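One possible reading of the per-resource-type fallback above, expressed as a hedged sketch (the field names and the tie-breaking order are assumptions, not the actual DominantResourceCalculator change):
{code}
public final class EmptyClusterResourceComparison {
  // Returns 1 if l dominates r, -1 if r dominates l, and 0 when they are equal or
  // each side is larger on a different resource type.
  static int compare(long lMem, long lCpu, long rMem, long rCpu) {
    boolean lGreater = lMem > rMem || lCpu > rCpu;
    boolean rGreater = rMem > lMem || rCpu > lCpu;
    if (lGreater && !rGreater) {
      return 1;  // l is >= on both types and strictly greater on at least one
    }
    if (rGreater && !lGreater) {
      return -1; // r is >= on both types and strictly greater on at least one
    }
    return 0;
  }
}
{code}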
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569899#comment-14569899 ] Philip Langdale commented on YARN-2194: --- You can remount controllers if you retain the same combination as the existing mount point, so I guess you could replace the ',' with something your parsing code can handle (or you could fix the parsing code). In general, life is a lot easier if you can avoid remounting as you then don't have to worry about managing their lifecycle. I'd argue the most robust thing to do is discover the existing mount point from /proc/mounts and then use it (assuming the comma parsing can be fixed) if it's present (and don't forget to respect the NodeManager's cgroup paths from /proc/self/mounts) Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the user of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569869#comment-14569869 ] Hadoop QA commented on YARN-2392: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 23s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 9m 25s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 23s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 56s | The applied patch generated 2 new checkstyle issues (total was 244, now 245). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 46s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 42s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 52m 1s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 94m 38s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12737003/YARN-2392-002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 03fb5c6 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8169/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8169/console | This message was automatically generated. add more diags about app retry limits on AM failures Key: YARN-2392 URL: https://issues.apache.org/jira/browse/YARN-2392 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Attachments: YARN-2392-001.patch, YARN-2392-002.patch, YARN-2392-002.patch # when an app fails the failure count is shown, but not what the global + local limits are. If the two are different, they should both be printed. # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty
[ https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569918#comment-14569918 ] Vinod Kumar Vavilapalli commented on YARN-3725: --- [~zjshen], is there a JIRA for the longer term fix? App submission via REST API is broken in secure mode due to Timeline DT service address is empty Key: YARN-3725 URL: https://issues.apache.org/jira/browse/YARN-3725 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3725.1.patch YARN-2971 changes TimelineClient to use the service address from the Timeline DT to renew the DT instead of the configured address. This breaks the procedure of submitting a YARN app via the REST API in secure mode. The problem is that the service address is set by the client instead of the server in Java code. The REST API response is an encoded token String, such that it is inconvenient to deserialize it, set the service address, and serialize it again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570202#comment-14570202 ] Chun Chen commented on YARN-3749: - Thanks for reviewing the patch, [~zxu] ! We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 when RM failover. But I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032, yarn.resourcemanager.address.rm2=0.0.0.0:28032 After digging, I found it is in ClientRMService where the value of yarn.resourcemanager.address.rm2 changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same instance of configuration in rm1 and rm2 and init both RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during starting of rm1. So I think it is safe to make a copy of configuration when init both of the rm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
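A minimal sketch of the fix described above, with hypothetical helper names (the actual MiniYARNCluster patch may differ): give each RM its own copy of the configuration, so that {{updateConnectAddr()}} running in one RM cannot rewrite the address configured for the other.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PerRmConfig {
  // Clone the base configuration once per RM; mutations made by one RM's services
  // (e.g. updateConnectAddr) then stay local to that RM's own copy.
  static Configuration[] configsForRms(Configuration base, int numRMs) {
    Configuration[] perRm = new Configuration[numRMs];
    for (int i = 0; i < numRMs; i++) {
      perRm[i] = new YarnConfiguration(new Configuration(base));
      perRm[i].set(YarnConfiguration.RM_HA_ID, "rm" + (i + 1));
    }
    return perRm;
  }
}
{code}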
[jira] [Assigned] (YARN-3558) Additional containers getting reserved from RM in case of Fair scheduler
[ https://issues.apache.org/jira/browse/YARN-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-3558: - Assignee: Sunil G Additional containers getting reserved from RM in case of Fair scheduler Key: YARN-3558 URL: https://issues.apache.org/jira/browse/YARN-3558 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.7.0 Environment: OS :Suse 11 Sp3 Setup : 2 RM 2 NM Scheduler : Fair scheduler Reporter: Bibin A Chundatt Assignee: Sunil G Attachments: Amlog.txt, rm.log Submit PI job with 16 maps Total container expected : 16 MAPS + 1 Reduce + 1 AM Total containers reserved by RM is 21 Below set of containers are not being used for execution container_1430213948957_0001_01_20 container_1430213948957_0001_01_19 RM Containers reservation and states {code} Processing container_1430213948957_0001_01_01 of type START Processing container_1430213948957_0001_01_01 of type ACQUIRED Processing container_1430213948957_0001_01_01 of type LAUNCHED Processing container_1430213948957_0001_01_02 of type START Processing container_1430213948957_0001_01_03 of type START Processing container_1430213948957_0001_01_02 of type ACQUIRED Processing container_1430213948957_0001_01_03 of type ACQUIRED Processing container_1430213948957_0001_01_04 of type START Processing container_1430213948957_0001_01_05 of type START Processing container_1430213948957_0001_01_04 of type ACQUIRED Processing container_1430213948957_0001_01_05 of type ACQUIRED Processing container_1430213948957_0001_01_02 of type LAUNCHED Processing container_1430213948957_0001_01_04 of type LAUNCHED Processing container_1430213948957_0001_01_06 of type RESERVED Processing container_1430213948957_0001_01_03 of type LAUNCHED Processing container_1430213948957_0001_01_05 of type LAUNCHED Processing container_1430213948957_0001_01_07 of type START Processing container_1430213948957_0001_01_07 of type ACQUIRED Processing container_1430213948957_0001_01_07 of type LAUNCHED Processing container_1430213948957_0001_01_08 of type RESERVED Processing container_1430213948957_0001_01_02 of type FINISHED Processing container_1430213948957_0001_01_06 of type START Processing container_1430213948957_0001_01_06 of type ACQUIRED Processing container_1430213948957_0001_01_06 of type LAUNCHED Processing container_1430213948957_0001_01_04 of type FINISHED Processing container_1430213948957_0001_01_09 of type START Processing container_1430213948957_0001_01_09 of type ACQUIRED Processing container_1430213948957_0001_01_09 of type LAUNCHED Processing container_1430213948957_0001_01_10 of type RESERVED Processing container_1430213948957_0001_01_03 of type FINISHED Processing container_1430213948957_0001_01_08 of type START Processing container_1430213948957_0001_01_08 of type ACQUIRED Processing container_1430213948957_0001_01_08 of type LAUNCHED Processing container_1430213948957_0001_01_05 of type FINISHED Processing container_1430213948957_0001_01_11 of type START Processing container_1430213948957_0001_01_11 of type ACQUIRED Processing container_1430213948957_0001_01_11 of type LAUNCHED Processing container_1430213948957_0001_01_07 of type FINISHED Processing container_1430213948957_0001_01_12 of type START Processing container_1430213948957_0001_01_12 of type ACQUIRED Processing container_1430213948957_0001_01_12 of type LAUNCHED Processing container_1430213948957_0001_01_13 of type RESERVED Processing container_1430213948957_0001_01_06 of type FINISHED Processing 
container_1430213948957_0001_01_10 of type START Processing container_1430213948957_0001_01_10 of type ACQUIRED Processing container_1430213948957_0001_01_10 of type LAUNCHED Processing container_1430213948957_0001_01_09 of type FINISHED Processing container_1430213948957_0001_01_14 of type START Processing container_1430213948957_0001_01_14 of type ACQUIRED Processing container_1430213948957_0001_01_14 of type LAUNCHED Processing container_1430213948957_0001_01_15 of type RESERVED Processing container_1430213948957_0001_01_08 of type FINISHED Processing container_1430213948957_0001_01_13 of type START Processing container_1430213948957_0001_01_16 of type RESERVED Processing container_1430213948957_0001_01_13 of type ACQUIRED Processing container_1430213948957_0001_01_13 of type LAUNCHED Processing container_1430213948957_0001_01_11 of type FINISHED
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570171#comment-14570171 ] Zhijie Shen commented on YARN-3044: --- [~Naganarasimha], I'm fine with the last patch. Will do some local test. However, the patch doesn't apply because of YARN-1462. I think we need to add tag info for v2 publisher too. Would you mind taking care of it? [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044-YARN-2928.004.patch, YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, YARN-3044-YARN-2928.009.patch, YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570275#comment-14570275 ] Jeff Zhang commented on YARN-3755: -- bq. How about we let individual frameworks like MapReduce/Tez log them as needed? That seems like the right place for debugging too - app developers don't always get access to the daemon logs. Makes sense. Log the command of launching containers --- Key: YARN-3755 URL: https://issues.apache.org/jira/browse/YARN-3755 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: YARN-3755-1.patch, YARN-3755-2.patch In the resource manager log, YARN logs the command for launching the AM, which is very useful. But there's no such log in the NM log for launching containers. It would be difficult to diagnose when containers fail to launch due to some issue in the commands. Although users can look at the commands in the container launch script file, that is an internal detail of YARN that users usually don't know about. From the user's perspective, they only know what commands they specified when building the YARN application. {code} 2015-06-01 16:06:42,245 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1433145984561_0001_01_01 : $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=LOG_DIR -Dtez.root.logger=info,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 1LOG_DIR/stdout 2LOG_DIR/stderr {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570276#comment-14570276 ] Jeff Zhang commented on YARN-3755: -- Closing it as won't fix. Log the command of launching containers --- Key: YARN-3755 URL: https://issues.apache.org/jira/browse/YARN-3755 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: YARN-3755-1.patch, YARN-3755-2.patch In the resource manager log, YARN logs the command for launching the AM, which is very useful. But there's no such log in the NM log for launching containers. It would be difficult to diagnose when containers fail to launch due to some issue in the commands. Although users can look at the commands in the container launch script file, that is an internal detail of YARN that users usually don't know about. From the user's perspective, they only know what commands they specified when building the YARN application. {code} 2015-06-01 16:06:42,245 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1433145984561_0001_01_01 : $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=LOG_DIR -Dtez.root.logger=info,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 1LOG_DIR/stdout 2LOG_DIR/stderr {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570284#comment-14570284 ] Sidharta Seethana commented on YARN-2194: - [~mjacobs] , Yes, that is what I am proposing. If we handle the path separation correctly, we should be able to continue using the current (deprecated, but still workable) mechanism for using cgroups. Cgroups cease to work in RHEL7 -- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Wei Yan Assignee: Wei Yan Priority: Critical Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the controller name leads to container launch failure. RHEL7 deprecates libcgroup and recommends the user of systemd. However, systemd has certain shortcomings as identified in this JIRA (see comments). This JIRA only fixes the failure, and doesn't try to use systemd. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3763) Support for fuzzy search in ATS
Jeff Zhang created YARN-3763: Summary: Support for fuzzy search in ATS Key: YARN-3763 URL: https://issues.apache.org/jira/browse/YARN-3763 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.7.0 Reporter: Jeff Zhang Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have some common prefix or suffix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3763) Support fuzzy search in ATS
[ https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated YARN-3763: - Summary: Support fuzzy search in ATS (was: Support for fuzzy search in ATS) Support fuzzy search in ATS --- Key: YARN-3763 URL: https://issues.apache.org/jira/browse/YARN-3763 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.7.0 Reporter: Jeff Zhang Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have some common prefix or suffix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3763) Support fuzzy search in ATS
[ https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated YARN-3763: - Description: Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have some common prefix or suffix. Link with TEZ-2531 (was: Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have some common prefix or suffix. ) Support fuzzy search in ATS --- Key: YARN-3763 URL: https://issues.apache.org/jira/browse/YARN-3763 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.7.0 Reporter: Jeff Zhang Currently ATS only supports exact match. Sometimes fuzzy match may be helpful when the entities in the ATS have some common prefix or suffix. Link with TEZ-2531 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568913#comment-14568913 ] Hadoop QA commented on YARN-3733: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 6s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 54s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 33s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | | | 40m 10s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736802/0001-YARN-3733.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8166/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8166/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8166/console | This message was automatically generated. DominantRC#compare() does not work as expected if cluster resource is empty --- Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3733.patch, YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568926#comment-14568926 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Yarn-trunk #946 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/946/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3759) Include command line, localization info and env vars on AM launch failure
Steve Loughran created YARN-3759: Summary: Include command line, localization info and env vars on AM launch failure Key: YARN-3759 URL: https://issues.apache.org/jira/browse/YARN-3759 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Steve Loughran Priority: Minor While trying to diagnose AM launch failures, it's important to be able to get at the final, expanded {{CLASSPATH}} and other env variables. We don't get that today: you can log the unexpanded values on the client, and tweak NM ContainerExecutor log levels to DEBUG to get some of this, but you don't get it in the task logs, and tuning NM log level isn't viable on a large, busy cluster. Launch failures should include some env specifics: # list of env vars (ideally, full getenv values), with some stripping of sensitive options (I'm thinking AWS env vars here) # command line # path localisations These can go in the task logs, we don't need to include them in the application report. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
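As a sketch of item 1 in the list above (the redaction pattern and helper name are illustrative assumptions, not part of this issue): dump the launch environment while masking values whose names look sensitive, such as AWS credentials.
{code}
import java.util.Map;

// Illustrative only: log each env var, redacting values whose names suggest
// secrets. The regex below is an assumed example, not an agreed-upon list.
static void logEnv(Map<String, String> env) {
  for (Map.Entry<String, String> e : env.entrySet()) {
    boolean sensitive = e.getKey()
        .matches("(?i).*(SECRET|TOKEN|PASSWORD|AWS_ACCESS|AWS_SECRET).*");
    System.out.println(e.getKey() + "=" + (sensitive ? "<redacted>" : e.getValue()));
  }
}
{code}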
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568947#comment-14568947 ] Junping Du commented on YARN-41: bq. Junping Du I have updated the patch with review comments. Can you have a look into this? Sorry for being late on this; I was traveling last week. I will review your latest patch today. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, YARN-41-8.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568920#comment-14568920 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #216 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/216/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569171#comment-14569171 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2144 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2144/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: 0001-YARN-3733.patch The updated patch fixes the 2nd and 3rd scenarios in the above table (the scenarios of this issue) and refactors the test code. For an overall solution that also handles input combinations like the 4th and 5th from the above table, we need to explore further how to define the fraction and how to decide which resource is dominant. Any suggestions on this? DominantRC#compare() does not work as expected if cluster resource is empty --- Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3733.patch, YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568880#comment-14568880 ] Sunil G commented on YARN-3733: --- Hi [~rohithsharma] Thanks for the detailed scenario. Scenario 4 can be possible, correct? clusterResource (0,0) : lhs (2,2) and rhs (3,2). Currently getResourceAsValue gives back the max ratio of mem/vcores if it is dominant, else it gives the min ratio. If clusterResource is 0, could we directly send the max of mem/vcores if dominant, and the min in the other case? This has to be made a better algorithm when more resource types come in; it is not completely perfect as we treat memory and vcores leniently. Please share your thoughts. DominantRC#compare() does not work as expected if cluster resource is empty --- Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-3733.patch, YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
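To make the comment above concrete, here is a rough sketch of a fallback comparison for the case where the cluster resource is empty (purely illustrative; this is not the actual DominantResourceCalculator code): compare the dominant (max) components first, then the non-dominant (min) ones.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

// Sketch of a fallback when clusterResource is (0,0): no usage ratios can be
// computed, so compare the larger (dominant) components first and break ties
// on the smaller components. Names and structure are illustrative.
static int compareWhenClusterEmpty(Resource lhs, Resource rhs) {
  int lhsMax = Math.max(lhs.getMemory(), lhs.getVirtualCores());
  int rhsMax = Math.max(rhs.getMemory(), rhs.getVirtualCores());
  if (lhsMax != rhsMax) {
    return Integer.compare(lhsMax, rhsMax);   // dominant component decides
  }
  int lhsMin = Math.min(lhs.getMemory(), lhs.getVirtualCores());
  int rhsMin = Math.min(rhs.getMemory(), rhs.getVirtualCores());
  return Integer.compare(lhsMin, rhsMin);     // tie-break on the other component
}
{code}
With clusterResource (0,0), lhs (2,2) and rhs (3,2) would then compare as 2 versus 3 on the dominant component, so rhs is treated as larger.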
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569182#comment-14569182 ] Junping Du commented on YARN-41: Thanks [~devaraj.k] for updating the patch and addressing the previous comments! Latest patch LGTM, +1. Will commit it tomorrow if there are no further comments on the code from other reviewers. In addition, the patch introduces a new SHUTDOWN category in NodeState, the UI and Cluster Metrics. Although it doesn't break any public APIs, we should mark this JIRA as incompatible because its behavior differs from previous releases in UI, CLI and Metrics (to notify users or third-party management/monitoring software). In general, I think it should be fine to keep the plan to include this patch in 2.x releases. However, please comment here to let us know if you have any concerns. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, YARN-41-8.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3603: -- Attachment: 0002-YARN-3603.patch Attaching an updated version of the patch, along with screenshots of the UI. [~tgraves] Could you please take a look at this? Thank you. Application Attempts page confusing --- Key: YARN-3603 URL: https://issues.apache.org/jira/browse/YARN-3603 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Thomas Graves Assignee: Sunil G Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png The application attempts page (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) is a bit confusing on what is going on. I think the table of containers there is for only Running containers and when the app is completed or killed its empty. The table should have a label on it stating so. Also the AM Container field is a link when running but not when its killed. That might be confusing. There is no link to the logs in this page but there is in the app attempt table when looking at http:// rm:8088/cluster/app/application_1431101480046_0003 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3603: -- Attachment: ahs1.png Application Attempts page confusing --- Key: YARN-3603 URL: https://issues.apache.org/jira/browse/YARN-3603 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Thomas Graves Assignee: Sunil G Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png The application attempts page (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) is a bit confusing on what is going on. I think the table of containers there is for only Running containers and when the app is completed or killed its empty. The table should have a label on it stating so. Also the AM Container field is a link when running but not when its killed. That might be confusing. There is no link to the logs in this page but there is in the app attempt table when looking at http:// rm:8088/cluster/app/application_1431101480046_0003 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3754: --- Priority: Critical (was: Major) Target Version/s: 2.7.1 Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3754: --- Target Version/s: 2.8.0 (was: 2.7.1) Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
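A minimal sketch of the guard proposed in the description of YARN-3754 above, assuming a hypothetical isClosed() accessor on the NM state store (the real NMLeveldbStateStoreService API may differ): skip the diagnostics write once the store has been closed during shutdown instead of letting it fail with DBException: Closed.
{code}
// Hypothetical guard, illustrative only; isClosed() is an assumed helper.
if (!stateStore.isClosed()) {
  stateStore.storeContainerDiagnostics(containerId, diagnostics);
} else {
  LOG.warn("Skipping diagnostics update for " + containerId
      + " because the NM state store is already closed");
}
{code}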
[jira] [Commented] (YARN-3760) Log aggregation failures
[ https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569289#comment-14569289 ] Daryn Sharp commented on YARN-3760: --- Cancelled tokens trigger the retry proxy bug. Log aggregation failures - Key: YARN-3760 URL: https://issues.apache.org/jira/browse/YARN-3760 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Daryn Sharp Priority: Critical The aggregated log file does not appear to be properly closed when writes fail. This leaves a lease renewer active in the NM that spams the NN with lease renewals. If the token is marked not to be cancelled, the renewals appear to continue until the token expires. If the token is cancelled, the periodic renew spam turns into a flood of failed connections until the lease renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569262#comment-14569262 ] Karthik Kambatla commented on YARN-2962: YARN-3643 should help alleviate most of the issues users face. This JIRA could be targeted only at trunk, without worrying about rolling upgrades. ZKRMStateStore: Limit the number of znodes under a znode Key: YARN-2962 URL: https://issues.apache.org/jira/browse/YARN-2962 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch We ran into this issue where we were hitting the default ZK server message size configs, primarily because the message had too many znodes even though individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3760) Log aggregation failures
Daryn Sharp created YARN-3760: - Summary: Log aggregation failures Key: YARN-3760 URL: https://issues.apache.org/jira/browse/YARN-3760 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Daryn Sharp Priority: Critical The aggregated log file does not appear to be properly closed when writes fail. This leaves a lease renewer active in the NM that spams the NN with lease renewals. If the token is marked not to be cancelled, the renewals appear to continue until the token expires. If the token is cancelled, the periodic renew spam turns into a flood of failed connections until the lease renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569237#comment-14569237 ] Jason Lowe commented on YARN-3758: -- First off, one should never set the heap size and the container size to the same value. The container size needs to be big enough to hold the entire process, not just the heap, so it needs to also consider the overhead of the JVM itself and any off-heap usage (e.g.: JVM code, data, thread stacks, shared libs, off-heap allocations, etc.). If you set the heap size to the same size as the container then when the heap fills up the process overall will be bigger than the heap size and YARN will kill the container. Couple of things to check: - Does the job configuration show that it is indeed asking for only 256 MB containers for tasks? Check the job configuration link for the job on the job history server or the configuration link for the AM's UI while the job is running. - Check the RM logs to verify what minimum allocation size it is loading from the configs and what request size it is allocating per task The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container Key: YARN-3758 URL: https://issues.apache.org/jira/browse/YARN-3758 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: skrho Hello there~~ I have 2 clusters First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G Physical memory each node Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G Physical memory each node Wherever a mapreduce job is running, I want resourcemanager is to set the minimum memory 256m to container So I was changing configuration in yarn-site.xml mapred-site.xml yarn.scheduler.minimum-allocation-mb : 256 mapreduce.map.java.opts : -Xms256m mapreduce.reduce.java.opts : -Xms256m mapreduce.map.memory.mb : 256 mapreduce.reduce.memory.mb : 256 In First cluster whenever a mapreduce job is running , I can see used memory 256m in web console( http://installedIP:8088/cluster/nodes ) But In Second cluster whenever a mapreduce job is running , I can see used memory 1024m in web console( http://installedIP:8088/cluster/nodes ) I know default memory value is 1024m, so if there is not changing memory setting, the default value is working. I have been testing for two weeks, but I don't know why mimimum memory setting is not working in second cluster Why this difference is happened? Am I wrong setting configuration? or Is there bug? Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
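To illustrate the sizing advice in the comment above, a job could request 512 MB containers while capping the task heap well below that; the 512 MB value and the 0.8 ratio are illustrative assumptions, not recommendations from this issue.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: keep the task heap (-Xmx) noticeably smaller than the container size
// so JVM overhead and off-heap usage still fit inside the container limit.
Configuration conf = new Configuration();
int containerMb = 512;                    // requested container size per task
int heapMb = (int) (containerMb * 0.8);   // leave roughly 20% headroom for non-heap memory
conf.setInt("mapreduce.map.memory.mb", containerMb);
conf.setInt("mapreduce.reduce.memory.mb", containerMb);
conf.set("mapreduce.map.java.opts", "-Xmx" + heapMb + "m");
conf.set("mapreduce.reduce.java.opts", "-Xmx" + heapMb + "m");
{code}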
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569267#comment-14569267 ] Karthik Kambatla commented on YARN-3753: Fix looks reasonable to me. RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569278#comment-14569278 ] Devaraj K commented on YARN-41: --- Thanks a lot [~djp] for your review and comments, I really appreciate your help on reviewing the patch. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, YARN-41-8.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569342#comment-14569342 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2162 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2162/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569322#comment-14569322 ] Hudson commented on YARN-1462: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #214 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/214/]) YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 0b5cfacde638bc25cc010cd9236369237b4e51a8) * hadoop-yarn-project/CHANGES.txt AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-1462-branch-2.7-1.2.patch, YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, YARN-1462.3.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569381#comment-14569381 ] Sunil G commented on YARN-3754: --- [~bibinchundatt] Could u also please attach NM logs here. Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569347#comment-14569347 ] Vinod Kumar Vavilapalli commented on YARN-3755: --- We had this long ago in YARN, but removed it as the log files were getting inundated in large/high throughput clusters. If you combine the command line with the environment (classpath etc), this can get very long. How about we let individual frameworks like MapReduce/Tez log them as needed? That seems like the right place for debugging too - app developers don't always get access to the daemon logs. Log the command of launching containers --- Key: YARN-3755 URL: https://issues.apache.org/jira/browse/YARN-3755 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: YARN-3755-1.patch, YARN-3755-2.patch In the resource manager log, yarn would log the command for launching AM, this is very useful. But there's no such log in the NN log for launching containers. It would be difficult to diagnose when containers fails to launch due to some issue in the commands. Although user can look at the commands in the container launch script file, this is an internal things of yarn, usually user don't know that. In user's perspective, they only know what commands they specify when building yarn application. {code} 2015-06-01 16:06:42,245 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1433145984561_0001_01_01 : $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=LOG_DIR -Dtez.root.logger=info,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 1LOG_DIR/stdout 2LOG_DIR/stderr {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
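A sketch of the framework-side alternative suggested in the comment above (illustrative; the logger is assumed to be the framework's own): the submitting framework already holds the ContainerLaunchContext, so it can log the commands it is about to ask YARN to run.
{code}
import org.apache.commons.logging.Log;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Illustrative: log the launch commands from the framework side before
// submitting the container, instead of relying on NM daemon logs.
static void logLaunchCommands(ContainerLaunchContext ctx, Log log) {
  log.info("Container launch commands: " + ctx.getCommands());
}
{code}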
[jira] [Created] (YARN-3761) Set delegation token service address at the server side
Zhijie Shen created YARN-3761: - Summary: Set delegation token service address at the server side Key: YARN-3761 URL: https://issues.apache.org/jira/browse/YARN-3761 Project: Hadoop YARN Issue Type: Improvement Components: security Reporter: Zhijie Shen Nowadays, YARN components generate the delegation token without the service address set, and leave it to the client to set. With our java client library, it is usually fine. However, if users are using REST API, it's going to be a problem: The delegation token is returned as a url string. It's so unfriendly for the thin client to deserialize the url string, set the token service address and serialize it again for further usage. If we move the task of setting the service address to the server side, the client can get rid of this trouble. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
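For context, this is roughly what a Java client has to do today with the url-string form of a delegation token (a sketch; everything other than the Hadoop classes shown is a placeholder): deserialize it and fill in the service address before the token can be used, which is exactly the step a thin REST client struggles with.
{code}
import java.io.IOException;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch: turn the url-string returned over REST back into a Token and set the
// service address on the client side. serverAddress is e.g. "rm-host:8032".
static Token<TokenIdentifier> tokenFromUrlString(String tokenUrlString,
    String serverAddress) throws IOException {
  Token<TokenIdentifier> token = new Token<TokenIdentifier>();
  token.decodeFromUrlString(tokenUrlString);
  SecurityUtil.setTokenService(token, NetUtils.createSocketAddr(serverAddress));
  return token;
}
{code}
Moving that last step to the server side, as proposed, would let REST clients use the token exactly as returned.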
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569437#comment-14569437 ] Varun Vasudev commented on YARN-2618: - [~kasha] - should we commit this to the YARN-2139 branch? Should we get the branch up to date with trunk first? Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Labels: BB2015-05-TBR Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569453#comment-14569453 ] Hadoop QA commented on YARN-2618: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a2bd621 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8167/console | This message was automatically generated. Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Labels: BB2015-05-TBR Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include - Add API support for introducing disk I/O as the 3rd type resource. - NM should report this information to the RM - RM should consider this to avoid over-allocation -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569456#comment-14569456 ] Xuan Gong commented on YARN-3753: - +1, LGTM. Check this in RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at
[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out
[ https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569465#comment-14569465 ] Xuan Gong commented on YARN-3753: - Committed into branch-2.7. Thanks, Jian RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out - Key: YARN-3753 URL: https://issues.apache.org/jira/browse/YARN-3753 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Fix For: 2.7.1 Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch RM failed to come up with the following error while submitting an mapreduce job. {code:title=RM log} 015-05-30 03:40:12,190 ERROR recovery.RMStateStore (RMStateStore.java:transition(179)) - Error storing app: application_1432956515242_0006 java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(750)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: java.io.IOException: Wait for ZKClient creation timed out at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569480#comment-14569480 ] Sunil G commented on YARN-3591: --- If we have a new API which returns the present set of error dirs alone (w/o full dirs) {code} synchronized List<String> getErrorDirs() {code} then could we modify LocalResourcesTrackerImpl#checkLocalizedResources in such a way that we call *removeResource* on those localized resources whose parent is present in ErrorDirs? Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch It happens when a resource is localised on the disk, after localising that disk has gone bad. NM keeps paths for localised resources in memory. At the time of resource request isResourcePresent(rsrc) will be called which calls file.exists() on the localised path. In some cases when disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
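A rough sketch of the presence check proposed in the description above (method and variable names are illustrative, not the actual LocalResourcesTrackerImpl code): list the parent directory, which forces a native open() and fails on a bad disk, rather than trusting file.exists() alone.
{code}
import java.io.File;

// Illustrative only: file.exists() can return true from cached inode data even
// when the disk has gone bad; listing the parent directory forces an open()
// and is a stronger check that the localized resource is actually readable.
private static boolean isResourceUsable(File localizedPath) {
  File parent = localizedPath.getParentFile();
  if (parent == null) {
    return localizedPath.exists();
  }
  String[] children = parent.list();   // null if the directory cannot be opened
  return children != null && children.length >= 1 && localizedPath.exists();
}
{code}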
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Summary: DominantRC#compare() does not work as expected if cluster resource is empty (was: On RM restart AM getting more than maximum possible memory when many tasks in queue) DominantRC#compare() does not work as expected if cluster resource is empty --- Key: YARN-3733 URL: https://issues.apache.org/jira/browse/YARN-3733 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 , 2 NM , 2 RM one NM - 3 GB 6 v core Reporter: Bibin A Chundatt Assignee: Rohith Priority: Blocker Attachments: YARN-3733.patch Steps to reproduce = 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) 2. Configure map and reduce size to 512 MB after changing scheduler minimum size to 512 MB 3. Configure capacity scheduler and AM limit to .5 (DominantResourceCalculator is configured) 4. Submit 30 concurrent task 5. Switch RM Actual = For 12 Jobs AM gets allocated and all 12 starts running No other Yarn child is initiated , *all 12 Jobs in Running state for ever* Expected === Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568586#comment-14568586 ] Chun Chen commented on YARN-3749: - bq. It looks like we need keep conf.set(YarnConfiguration.RM_HA_ID, RM1_NODE_ID); in TestRMEmbeddedElector to fix this test failure. Sorry, my bad. Uploaded YARN-3749.7.patch to fix that and added a test in {{TestYarnConfiguration}} to make sure {{YarnConfiguration#updateConnectAddr}} won't add a suffix to NM service address configurations. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 when RM failover. But I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032, yarn.resourcemanager.address.rm2=0.0.0.0:28032 After digging, I found it is in ClientRMService where the value of yarn.resourcemanager.address.rm2 changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same instance of configuration in rm1 and rm2 and init both RM before we start both RM, we will change yarn.resourcemanager.ha.id to rm2 during init of rm2 and yarn.resourcemanager.ha.id will become rm2 during starting of rm1. So I think it is safe to make a copy of configuration when init both of the rm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
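A minimal sketch of the fix idea in this JIRA, assuming Configuration's copy constructor (which copies the properties into a new instance): give each RM in the MiniYARNCluster its own Configuration so that a per-RM setting such as yarn.resourcemanager.ha.id set during one RM's init cannot leak into the other.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: copy the incoming conf per RM instead of sharing one mutable instance.
Configuration rm1Conf = new YarnConfiguration(conf);   // copy for RM1
rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
Configuration rm2Conf = new YarnConfiguration(conf);   // copy for RM2
rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");
// Each ResourceManager is then initialized with its own copy, so updateConnectAddr()
// in one RM cannot rewrite the addresses the other RM resolves.
{code}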
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568593#comment-14568593 ] Hadoop QA commented on YARN-3585: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 30s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 14s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 45m 3s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736738/0001-YARN-3585.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8159/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8159/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8159/console | This message was automatically generated. NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3585.patch, YARN-3585.patch With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. 
Non-daemon threads:
{noformat}
DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
{noformat}
and the JNI leveldb thread stack:
{noformat}
Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
#0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
#2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
#3 0x003d830e811d in clone () from /lib64/libc.so.6
{noformat}
-- This message was sent by Atlassian JIRA
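The dump shows why the process hangs: the leveldb helper used by NM recovery runs as a non-daemon native thread, so even after all NM services stop, the JVM will not exit on its own. A hedged sketch (not the attached YARN-3585 patch; the stop logic is a placeholder) of one way to guarantee the process exits once shutdown has completed:
{code}
// Hedged sketch only -- not the actual NodeManager shutdown code. It shows one
// way to guarantee process exit when a lingering non-daemon thread (such as
// leveldb's native background thread) would otherwise keep the JVM alive
// after all services have been stopped.
import org.apache.hadoop.util.ExitUtil;

public class ForceExitSketch {
  public static void main(String[] args) {
    try {
      // Placeholder: stop NM services and close the recovery state store
      // (which owns the leveldb handle) here.
    } finally {
      // Explicitly terminate the JVM; waiting for all non-daemon threads to
      // finish is not enough when a JNI thread stays runnable.
      ExitUtil.terminate(0);
    }
  }
}
{code}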
[jira] [Commented] (YARN-3755) Log the command of launching containers
[ https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568606#comment-14568606 ] Hadoop QA commented on YARN-3755: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 3 new checkstyle issues (total was 58, now 60). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 9s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 32s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12736742/YARN-3755-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 990078b | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8160/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8160/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8160/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8160/console | This message was automatically generated. Log the command of launching containers --- Key: YARN-3755 URL: https://issues.apache.org/jira/browse/YARN-3755 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: YARN-3755-1.patch In the ResourceManager log, YARN logs the command used to launch the AM, which is very useful. But there is no such log in the NodeManager (NM) log for launching containers, so it can be difficult to diagnose why a container fails to launch due to an issue in its commands. Although users can look at the commands in the container launch script file, that is an internal detail of YARN that users usually don't know about. From a user's perspective, they only know the commands they specified when building the YARN application.
{code}
2015-06-01 16:06:42,245 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1433145984561_0001_01_01 : $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
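As a rough illustration of the kind of NodeManager-side logging the issue asks for (the class, method, and hook point below are hypothetical and are not the YARN-3755 patch):
{code}
// Hypothetical helper, for illustration only; not the YARN-3755 patch.
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ContainerCommandLogger {
  private static final Log LOG = LogFactory.getLog(ContainerCommandLogger.class);

  // Intended to be called from the NodeManager's container-launch path so the
  // submitted commands appear in the NM log, mirroring the "Command to launch
  // container" line that AMLauncher already prints on the RM side.
  public static void logLaunchCommands(String containerId, List<String> commands) {
    if (LOG.isInfoEnabled()) {
      StringBuilder sb = new StringBuilder();
      for (String cmd : commands) {
        sb.append(cmd).append(' ');
      }
      LOG.info("Commands to launch container " + containerId + " : " + sb.toString().trim());
    }
  }
}
{code}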