[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199980#comment-14199980 ] Sandy Ryza commented on YARN-2811: -- Thanks for uncovering this [~l201514]. I think that in this case, in addition to not assigning the container, the application should release the reservation so that other apps can get to the node. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch This has been seen on several queues, with the allocated MB going significantly above the max MB, and it appears to have started with the 2.4 upgrade. It could be a regression introduced between 2.0 and 2.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
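A minimal sketch of the behavior suggested in the comment above, using a toy model rather than the FairScheduler API: when assigning a container would push the queue past its max memory share, skip the assignment and also drop any reservation the app holds on that node so other apps can use it. All names here (MaxShareCheck, assignOrRelease, the MB constants) are illustrative assumptions, not the actual scheduler code.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy model, not the FairScheduler API: release a node reservation when
// satisfying it would push the queue past its max-memory share.
public class MaxShareCheck {
  static final long MAX_QUEUE_MB = 8192;            // queue max share (illustrative)
  static long allocatedMb = 7680;                   // current queue usage
  static final Map<String, Long> reservedMbByNode = new HashMap<>();

  // Returns true if the container fits under the cap; otherwise releases
  // any reservation held on the node instead of leaving the node pinned.
  static boolean assignOrRelease(String node, long containerMb) {
    if (allocatedMb + containerMb <= MAX_QUEUE_MB) {
      allocatedMb += containerMb;
      reservedMbByNode.remove(node);                // reservation satisfied
      return true;
    }
    Long released = reservedMbByNode.remove(node);  // unreserve so other apps can get to the node
    if (released != null) {
      System.out.println("Released " + released + " MB reservation on " + node);
    }
    return false;
  }

  public static void main(String[] args) {
    reservedMbByNode.put("node1", 1024L);
    System.out.println(assignOrRelease("node1", 1024));  // false: would exceed the cap
  }
}
{code}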
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199981#comment-14199981 ] Hadoop QA commented on YARN-2818: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679805/YARN-2818.1.patch against trunk revision 80d7d18. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5751//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5751//console This message is automatically generated. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch In 2.5, we inject owner info as a primary filter to support entity-level ACLs. Since 2.6, we have a different ACLs solution (YARN-2102). Therefore, there's no need to inject owner info. There are two motivations: 1. For the leveldb timeline store, a primary filter is expensive: for each primary filter, we need to make a complete copy of the entity in the logical index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't already exist, the leveldb timeline store will create an empty E2 without owner info (from the db's point of view, it doesn't know that owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
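A toy illustration of motivation 1 above: in a leveldb-style store, each primary filter is an index under which the entity is written again in full, so every extra primary filter (such as the injected owner) multiplies the write volume. The key layout below is invented for illustration and is not the LeveldbTimelineStore schema.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Invented key/value layout, not LeveldbTimelineStore's schema: an entity
// stored under N primary filters costs N+1 full copies.
public class PrimaryFilterCost {
  public static void main(String[] args) {
    String entityId = "E1";
    String entityBlob = "{...full serialized entity...}";
    List<String[]> writes = new ArrayList<>();

    // One copy in the main entity table.
    writes.add(new String[] {"entity/" + entityId, entityBlob});

    // One complete copy per primary filter in the index table; this is why
    // injecting "owner" as a primary filter is expensive.
    String[][] primaryFilters = {{"owner", "tester"}, {"appId", "app_1"}};
    for (String[] pf : primaryFilters) {
      writes.add(new String[] {
          "index/" + pf[0] + "/" + pf[1] + "/" + entityId, entityBlob});
    }
    System.out.println("full copies written: " + writes.size());  // 3
  }
}
{code}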
[jira] [Updated] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2753: Attachment: YARN-2753.006.patch Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager: ** when a Node is created, Node.labels can be null; ** in this case, nm.labels may be null, so we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
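A compact sketch of two of the fixes listed above, with illustrative names rather than the actual CommonNodeLabelsManager/RMNodeLabelsManager code: guard the possibly-null label set before calling containsAll, and take the write lock around mutations of the shared label map without overwriting an existing entry (which would reset its resource).
{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative shape of the fixes, not the actual patch.
public class NodeLabelsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> labelCollections = new HashMap<>();

  // Node.labels can be null right after a node is created, so check for
  // null before calling containsAll instead of risking an NPE.
  boolean canRemove(Set<String> originalLabels, Set<String> toRemove) {
    return originalLabels != null && originalLabels.containsAll(toRemove);
  }

  // Mutations of the shared label map happen under the write lock, and an
  // existing entry is never overwritten.
  void addLabel(String label) {
    lock.writeLock().lock();
    try {
      labelCollections.putIfAbsent(label, "RESOURCE");
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    NodeLabelsSketch s = new NodeLabelsSketch();
    System.out.println(s.canRemove(null, Collections.singleton("x")));  // false, no NPE
  }
}
{code}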
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1427#comment-1427 ] zhihai xu commented on YARN-2753: - Hi [~leftnoteasy], thanks for the thorough review. item 1). fixed item 2). fixed item 3). fixed item 4). fixed. I agree with you, and also the ForwardingEventHandler is only active (registered) once CommonNodeLabelsManager#serviceStart is called, and serviceStart will only be called in STATE.STARTED. I attached a new patch, YARN-2753.006.patch, which addresses all your comments. thanks zhihai Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager: ** when a Node is created, Node.labels can be null; ** in this case, nm.labels may be null, so we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2808) yarn client tool can not list app_attempt's container info correctly
[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1429#comment-1429 ] George Wong commented on YARN-2808: --- [~Naganarasimha], could you link this jira to YARN-2301? YARN-2808 could be one piece of improvement for the yarn container command. As you are working on YARN-2301, can you fix this in YARN-2301? yarn client tool can not list app_attempt's container info correctly Key: YARN-2808 URL: https://issues.apache.org/jira/browse/YARN-2808 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Gordon Wang Assignee: Naganarasimha G R When the timeline server is enabled, the yarn client cannot list the container info for an application attempt correctly. Here are the reproduction steps. # enable the yarn timeline server # submit a MR job # after the job is finished, use the yarn client to list the container info of the app attempt. Then, since the RM has cached the application's attempt info, the output shows {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :0 Container-Id Start Time Finish Time StateHost LOG-URL {noformat} But if the RM is restarted, the client can fetch the container info from the timeline server correctly. {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :4 Container-Id Start Time Finish Time StateHost LOG-URL container_1415168250217_0001_01_01 1415168318376 1415168349896COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_01/container_1415168250217_0001_01_01/hadoop container_1415168250217_0001_01_02 1415168326399 1415168334858COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_02/container_1415168250217_0001_01_02/hadoop container_1415168250217_0001_01_03 1415168326400 1415168335277COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_03/container_1415168250217_0001_01_03/hadoop container_1415168250217_0001_01_04 1415168335825 1415168343873COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_04/container_1415168250217_0001_01_04/hadoop {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200048#comment-14200048 ] Hadoop QA commented on YARN-2753: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679811/YARN-2753.006.patch against trunk revision 80d7d18. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5752//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5752//console This message is automatically generated. Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager: ** when a Node is created, Node.labels can be null; ** in this case, nm.labels may be null, so we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2808) yarn client tool can not list app_attempt's container info correctly
[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200081#comment-14200081 ] Naganarasimha G R commented on YARN-2808: - Hi [~GWong] The earlier idea was the same, but I feel there might be a lot of differences in supporting the yarn container command for both applicationID and application attemptID with the -list option itself, so as suggested by JianHe in [YARN-2301|https://issues.apache.org/jira/browse/YARN-2301?focusedCommentId=14070512&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070512], I thought of splitting as follows. # YARN-2301: for the first 3 small issues # a new jira for supporting the yarn container command for both applicationID and application attemptID # listing of all containers even for running and completed apps as part of YARN-1794 (similar to the current issue; will confirm with Mayank and finalize it) I have already been working on this but was waiting for the leveldb-based Timeline Server to be committed, to get all the containers from the Timeline Server itself, which will resolve most of the issues with the yarn container command. yarn client tool can not list app_attempt's container info correctly Key: YARN-2808 URL: https://issues.apache.org/jira/browse/YARN-2808 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Gordon Wang Assignee: Naganarasimha G R When the timeline server is enabled, the yarn client cannot list the container info for an application attempt correctly. Here are the reproduction steps. # enable the yarn timeline server # submit a MR job # after the job is finished, use the yarn client to list the container info of the app attempt. Then, since the RM has cached the application's attempt info, the output shows {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :0 Container-Id Start Time Finish Time StateHost LOG-URL {noformat} But if the RM is restarted, the client can fetch the container info from the timeline server correctly. {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :4 Container-Id Start Time Finish Time StateHost LOG-URL container_1415168250217_0001_01_01 1415168318376 1415168349896COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_01/container_1415168250217_0001_01_01/hadoop container_1415168250217_0001_01_02 1415168326399 1415168334858COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_02/container_1415168250217_0001_01_02/hadoop container_1415168250217_0001_01_03 1415168326400 1415168335277COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_03/container_1415168250217_0001_01_03/hadoop container_1415168250217_0001_01_04 1415168335825 1415168343873COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_04/container_1415168250217_0001_01_04/hadoop {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2805) RM2 in HA setup tries to login using the RM1's kerberos principal
[ https://issues.apache.org/jira/browse/YARN-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200116#comment-14200116 ] Hudson commented on YARN-2805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2805. Fixed ResourceManager to load HA configs correctly before kerberos login. Contributed by Wangda Tan. (vinodkv: rev 834e931d8efe4d806347b266e7e62929ce05389b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java RM2 in HA setup tries to login using the RM1's kerberos principal - Key: YARN-2805 URL: https://issues.apache.org/jira/browse/YARN-2805 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2805.1.patch {code} 2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT] 2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229) Caused by: java.io.IOException: Login failure for rm/i...@example.com from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
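Per the commit message above, the essence of the fix is ordering: resolve the rm-id-suffixed HA keys before the kerberos login, so RM2 picks up its own principal rather than RM1's. A minimal sketch of that ordering, assuming the standard HAUtil/SecurityUtil helpers; this is not the committed diff.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch of the required ordering (not the committed patch): HA config
// validation, which resolves keys like yarn.resourcemanager.principal.<rm-id>
// into the base keys, must run before the kerberos login that reads them.
public class HaLoginOrdering {
  static void initSecurity(Configuration conf) throws java.io.IOException {
    if (HAUtil.isHAEnabled(conf)) {
      // Resolves the rm-id-suffixed keys into the base keys first.
      HAUtil.verifyAndSetConfiguration(conf);
    }
    // Only now log in, so this RM uses its own principal, not RM1's.
    SecurityUtil.login(conf, YarnConfiguration.RM_KEYTAB,
        YarnConfiguration.RM_PRINCIPAL);
  }
}
{code}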
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200118#comment-14200118 ] Hudson commented on YARN-2812: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2812. TestApplicationHistoryServer is likely to fail on less powerful machine. Contributed by Zhijie Shen (xgong: rev b0b52c4e11336ca2ad6a02d64c0b5d5a8f1339ae) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/CHANGES.txt TestApplicationHistoryServer is likely to fail on less powerful machine --- Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2812.1.patch {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} {code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
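The failures quoted above are a hard JUnit timeout plus a leveldb LOCK collision between tests. A trivial, self-contained sketch of the two generic remedies — a more generous per-test timeout and a per-test store directory; the timeout value and path are illustrative, not the patch.
{code:java}
import java.io.File;
import java.util.UUID;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class TimeoutAndIsolationSketch {
  @Test(timeout = 240000)  // generous bound so slower machines don't trip it
  public void startsWithinTimeout() {
    // A unique per-run directory avoids "LOCK: already held by process"
    // when two tests would otherwise open the same leveldb path.
    File storeDir = new File(System.getProperty("java.io.tmpdir"),
        "timeline-" + UUID.randomUUID());
    assertTrue(storeDir.mkdirs());
  }
}
{code}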
[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200117#comment-14200117 ] Hudson commented on YARN-2579: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2579. Fixed a deadlock issue when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time. Contributed by Rohith Sharmaks (jianhe: rev 395275af8622c780b9071c243422b0780e096202) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one of the RMs' ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
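A generic illustration of one deadlock-avoidance pattern relevant here, offered as an assumption about the fix's spirit rather than the committed YARN-2579 change: instead of two threads each calling the standby transition inline while holding their own locks, transition requests are serialized onto a single dedicated thread, so the elector and the fatal-event dispatcher never wait on each other's monitors.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Pattern sketch only: funnel "transition to standby" requests through one
// thread instead of performing them inline under different locks.
public class StandbyTransitionSketch {
  private final ExecutorService transitionThread =
      Executors.newSingleThreadExecutor();

  // Both callers (elector callback, fatal-event handler) enqueue the
  // transition instead of executing it while holding their own locks.
  void requestStandby(String reason) {
    transitionThread.submit(() ->
        System.out.println("transitioning to standby: " + reason));
  }

  public static void main(String[] args) throws InterruptedException {
    StandbyTransitionSketch s = new StandbyTransitionSketch();
    s.requestStandby("elector lost leadership");
    s.requestStandby("fatal event from state store");
    s.transitionThread.shutdown();
    s.transitionThread.awaitTermination(5, TimeUnit.SECONDS);
  }
}
{code}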
[jira] [Commented] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200119#comment-14200119 ] Hudson commented on YARN-2813: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2813. Fixed NPE from MemoryTimelineStore.getDomains. Contributed by Zhijie Shen (xgong: rev e4b4901d36875faa98ec8628e22e75499e0741ab) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java NPE from MemoryTimelineStore.getDomains --- Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2813.1.patch {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200125#comment-14200125 ] Hudson commented on YARN-2767: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2767. Added a test case to verify that http static user cannot kill or submit apps in the secure mode. Contributed by Varun Vasudev. (zjshen: rev b4c951ab832f85189d815fb6df57eda4121c0199) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesHttpStaticUserPermissions.java RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode Key: YARN-2767 URL: https://issues.apache.org/jira/browse/YARN-2767 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch, apache-yarn-2767.2.patch, apache-yarn-2767.3.patch We should add a test to ensure that the http static user used to access the RM web interface can't submit or kill apps if the cluster is running in secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200195#comment-14200195 ] Hudson commented on YARN-2579: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2579. Fixed a deadlock issue when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time. Contributed by Rohith Sharmaks (jianhe: rev 395275af8622c780b9071c243422b0780e096202) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one of the RMs' ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200199#comment-14200199 ] Hudson commented on YARN-2813: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2813. Fixed NPE from MemoryTimelineStore.getDomains. Contributed by Zhijie Shen (xgong: rev e4b4901d36875faa98ec8628e22e75499e0741ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/CHANGES.txt NPE from MemoryTimelineStore.getDomains --- Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2813.1.patch {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at
[jira] [Commented] (YARN-2805) RM2 in HA setup tries to login using the RM1's kerberos principal
[ https://issues.apache.org/jira/browse/YARN-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200196#comment-14200196 ] Hudson commented on YARN-2805: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2805. Fixed ResourceManager to load HA configs correctly before kerberos login. Contributed by Wangda Tan. (vinodkv: rev 834e931d8efe4d806347b266e7e62929ce05389b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java RM2 in HA setup tries to login using the RM1's kerberos principal - Key: YARN-2805 URL: https://issues.apache.org/jira/browse/YARN-2805 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2805.1.patch {code} 2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT] 2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229) Caused by: java.io.IOException: Login failure for rm/i...@example.com from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200198#comment-14200198 ] Hudson commented on YARN-2812: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2812. TestApplicationHistoryServer is likely to fail on less powerful machine. Contributed by Zhijie Shen (xgong: rev b0b52c4e11336ca2ad6a02d64c0b5d5a8f1339ae) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java TestApplicationHistoryServer is likely to fail on less powerful machine --- Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2812.1.patch {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} {code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200241#comment-14200241 ] Jason Lowe commented on YARN-2816: -- This seems like a dubious use case. If something comes along and deletes (i.e.: corrupts) the leveldb database then in general the NM will not be able to recover properly. Trying to patch up one particular scenario won't cover the rest, and containers could leak (i.e.: be forgotten even though they're still running), container start requests could be lost, etc. As for the OS crash scenario, if the OS crashes then there's nothing left for the NM to recover. If we really want to protect against OS crashes then a much better way is to perform synchronous writes to leveldb. However this is _much_ slower than asynchronous writes and could easily impact NM performance. Given that there's nothing to recover from the OS crash scenario, it doesn't seem worth worrying about that case. The real issue for the reported scenario is that the leveldb database location is a poor one for the way that system is configured, since something is coming along and corrupting the database. Either the leveldb database needs to be moved somewhere else or the file cleanup procedure needs to exclude the leveldb database. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2816.000.patch The NM fails to start with an NPE during container recovery. We saw the following crash: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is that some DB files used by NMLeveldbStateStoreService were accidentally deleted, to save disk space, at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When a container is recovered in ContainerManagerImpl#recoverContainer, the NullPointerException at the following code causes the NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
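Whatever direction the jira takes, the NPE itself comes from dereferencing a null start request in the quoted snippet. A hedged sketch of one defensive variant, assuming the surrounding ContainerManagerImpl#recoverContainer context (rcs and LOG in scope); the skip-and-warn policy is an assumption, not the committed change.
{code:java}
StartContainerRequest req = rcs.getStartRequest();
if (req == null) {
  // Incomplete record: the CONTAINER_REQUEST_KEY_SUFFIX entry was lost
  // (e.g. the leveldb files were deleted). Skipping the record is one
  // possible policy; aborting recovery with a clear message is another.
  LOG.warn("Skipping recovery of container with missing start request"
      + " in state store: " + rcs);
  return;
}
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}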
[jira] [Updated] (YARN-2647) Add yarn queue CLI to get queue infos
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2647: -- Attachment: 0008-YARN-2647.patch Hi [~gp.leftnoteasy], thank you. I have updated the patch as per the comments. However, I feel we can avoid having '-' between the field labels, because we already have fields like Maximum Capacity, Current Capacity, etc. Changing it only for node-labels did not look good, so I kept it as is and also removed it for queue. Kindly share your thoughts. Add yarn queue CLI to get queue infos - Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-2647.patch, 0002-YARN-2647.patch, 0003-YARN-2647.patch, 0004-YARN-2647.patch, 0005-YARN-2647.patch, 0006-YARN-2647.patch, 0007-YARN-2647.patch, 0008-YARN-2647.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200294#comment-14200294 ] Hudson commented on YARN-2813: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2813. Fixed NPE from MemoryTimelineStore.getDomains. Contributed by Zhijie Shen (xgong: rev e4b4901d36875faa98ec8628e22e75499e0741ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java NPE from MemoryTimelineStore.getDomains --- Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2813.1.patch {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200300#comment-14200300 ] Hudson commented on YARN-2767: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2767. Added a test case to verify that http static user cannot kill or submit apps in the secure mode. Contributed by Varun Vasudev. (zjshen: rev b4c951ab832f85189d815fb6df57eda4121c0199) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesHttpStaticUserPermissions.java RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode Key: YARN-2767 URL: https://issues.apache.org/jira/browse/YARN-2767 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch, apache-yarn-2767.2.patch, apache-yarn-2767.3.patch We should add a test to ensure that the http static user used to access the RM web interface can't submit or kill apps if the cluster is running in secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200292#comment-14200292 ] Hudson commented on YARN-2579: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2579. Fixed a deadlock issue when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time. Contributed by Rohith Sharmaks (jianhe: rev 395275af8622c780b9071c243422b0780e096202) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one of the RMs' ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2805) RM2 in HA setup tries to login using the RM1's kerberos principal
[ https://issues.apache.org/jira/browse/YARN-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200291#comment-14200291 ] Hudson commented on YARN-2805: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2805. Fixed ResourceManager to load HA configs correctly before kerberos login. Contributed by Wangda Tan. (vinodkv: rev 834e931d8efe4d806347b266e7e62929ce05389b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java RM2 in HA setup tries to login using the RM1's kerberos principal - Key: YARN-2805 URL: https://issues.apache.org/jira/browse/YARN-2805 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2805.1.patch {code} 2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT] 2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229) Caused by: java.io.IOException: Login failure for rm/i...@example.com from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
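The commit message points at ordering: the HA configuration, which determines this RM's own service address, has to be resolved before the keytab login, so that a principal pattern like rm/_HOST@REALM expands to this host rather than RM1's. A minimal sketch of that ordering using the standard YarnConfiguration keys (an illustration of the idea only, not the committed patch):
{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmLoginSketch {
  // Resolve this RM's own bind address first (in an HA setup this depends
  // on which RM id the local configs select), then log in from the keytab
  // so that _HOST in the principal expands to the right hostname.
  static void doSecureLogin(Configuration conf) throws Exception {
    InetSocketAddress addr = conf.getSocketAddr(
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_PORT);
    SecurityUtil.login(conf, YarnConfiguration.RM_KEYTAB,
        YarnConfiguration.RM_PRINCIPAL, addr.getHostName());
  }
}
{code}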
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200293#comment-14200293 ] Hudson commented on YARN-2812: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2812. TestApplicationHistoryServer is likely to fail on less powerful machine. Contributed by Zhijie Shen (xgong: rev b0b52c4e11336ca2ad6a02d64c0b5d5a8f1339ae) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/CHANGES.txt TestApplicationHistoryServer is likely to fail on less powerful machine --- Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2812.1.patch {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} {code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
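The first failure is simply a JUnit timeout that is far too tight: 5 milliseconds cannot cover even one reverse-DNS lookup on a slow machine, and the usual remedy is a generous timeout, as in the sketch below (the concrete value and test body are illustrative, not the ones in the patch). The second failure, a leveldb LOCK already held by another process, instead calls for giving each test run its own store directory.
{code}
import static org.junit.Assert.assertNotNull;
import org.junit.Test;

public class TimeoutSketch {
  // Tens of seconds instead of 5 ms: slow DNS or a loaded Jenkins box
  // should not be able to fail the test spuriously.
  @Test(timeout = 50000)
  public void testServerStartsWithinTimeout() throws Exception {
    Object server = startServer(); // placeholder for the real AHS setup
    assertNotNull(server);
  }

  private Object startServer() {
    return new Object(); // stand-in; the real test starts ApplicationHistoryServer
  }
}
{code}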
[jira] [Updated] (YARN-2780) Log aggregated resource allocation in rm-appsummary.log
[ https://issues.apache.org/jira/browse/YARN-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2780: - Attachment: YARN-2780.v2.201411061601.txt [~knoguchi], I am attaching a new patch that applies cleanly to trunk. {quote} Output looks mostly good. We may want to have different format for preemptedResources so that it doesn't use the same delimiter(comma). But since application name can also include comma, maybe it's a non-issue. {quote} The code handles the comma by putting a backslash (\) in front of it. We could change the comma to something else before printing it out, but I think it is not a problem. Log aggregated resource allocation in rm-appsummary.log --- Key: YARN-2780 URL: https://issues.apache.org/jira/browse/YARN-2780 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.1 Reporter: Koji Noguchi Assignee: Eric Payne Priority: Minor Attachments: YARN-2780.v1.201411031728.txt, YARN-2780.v2.201411061601.txt YARN-415 added useful information about resource usage by applications. Asking to log that info inside rm-appsummary.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
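For readers following the escaping discussion: Hadoop ships a helper that backslash-escapes commas, which illustrates the behavior Eric describes (whether the patch calls exactly this helper is not shown in this thread):
{code}
import org.apache.hadoop.util.StringUtils;

public class EscapeDemo {
  public static void main(String[] args) {
    // The comma delimiter inside a field value gets a backslash in front
    // of it, so a field like an application name containing commas stays
    // parseable in a comma-delimited summary line.
    String name = "my app, with a comma";
    String escaped = StringUtils.escapeString(name);
    System.out.println(escaped); // prints: my app\, with a comma
    System.out.println(StringUtils.unEscapeString(escaped)); // round-trips
  }
}
{code}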
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200385#comment-14200385 ] Karthik Kambatla commented on YARN-2139: Thanks for chiming in, Arun. This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically. bq. We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN. The Linux aspects are only for isolation, and this needs to be pluggable. Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later). bq. We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out my container *needs* 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. To *guarantee* bandwidth, I believe the scheduler has to be super pessimistic with its allocations. Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items. bq. Spindle locality or I/O parallelism is a real concern Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
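To make the vdisk knob concrete, the following is a hypothetical sketch of the NM-side configuration read; the property name and default are invented here, modeled on how CPU vcores are configured, and the real names would come from the YARN-2139 patches:
{code}
import org.apache.hadoop.conf.Configuration;

public class VdiskConfigSketch {
  // Hypothetical property, by analogy with yarn.nodemanager.resource.cpu-vcores.
  static final String NM_VDISKS = "yarn.nodemanager.resource.vdisks";
  static final int DEFAULT_NM_VDISKS = 4;

  // Each node advertises an integral number of vdisks, which avoids the
  // floating-point arithmetic that per-disk bandwidth shares would need
  // and tolerates nodes with different disk counts.
  static int getNodeVdisks(Configuration conf) {
    return conf.getInt(NM_VDISKS, DEFAULT_NM_VDISKS);
  }
}
{code}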
[jira] [Comment Edited] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200385#comment-14200385 ] Karthik Kambatla edited comment on YARN-2139 at 11/6/14 4:27 PM: - Thanks for chiming in, Arun. This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically. bq. We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN. The Linux aspects are only for isolation, and this needs to be pluggable. Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later). bq. We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out my container *needs* 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. To *guarantee* bandwidth, I believe the scheduler has to be super conservative with its allocations. Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items. bq. Spindle locality or I/O parallelism is a real concern Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused. was (Author: kasha): Thanks for chiming in, Arun. This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically. bq. We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN. The Linux aspects are only for isolation, and this needs to be pluggable. Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later). bq. We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out my container *needs* 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. 
To *guarantee* bandwidth, I believe the scheduler has to be super pessimistic with its allocations. Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items. bq. Spindle locality or I/O parallelism is a real concern Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2818: -- Attachment: YARN-2818.2.patch Remove one more unnecessary method. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200511#comment-14200511 ] Hadoop QA commented on YARN-2818: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679878/YARN-2818.2.patch against trunk revision 10f9f51. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5754//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5754//console This message is automatically generated. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2780) Log aggregated resource allocation in rm-appsummary.log
[ https://issues.apache.org/jira/browse/YARN-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200517#comment-14200517 ] Hadoop QA commented on YARN-2780: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679871/YARN-2780.v2.201411061601.txt against trunk revision 10f9f51. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5753//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5753//console This message is automatically generated. Log aggregated resource allocation in rm-appsummary.log --- Key: YARN-2780 URL: https://issues.apache.org/jira/browse/YARN-2780 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.1 Reporter: Koji Noguchi Assignee: Eric Payne Priority: Minor Attachments: YARN-2780.v1.201411031728.txt, YARN-2780.v2.201411061601.txt YARN-415 added useful information about resource usage by applications. Asking to log that info inside rm-appsummary.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue infos
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200560#comment-14200560 ] Hadoop QA commented on YARN-2647: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679850/0008-YARN-2647.patch against trunk revision 10f9f51. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5755//console This message is automatically generated. Add yarn queue CLI to get queue infos - Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-2647.patch, 0002-YARN-2647.patch, 0003-YARN-2647.patch, 0004-YARN-2647.patch, 0005-YARN-2647.patch, 0006-YARN-2647.patch, 0007-YARN-2647.patch, 0008-YARN-2647.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
Gopal V created YARN-2819: - Summary: NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 Key: YARN-2819 URL: https://issues.apache.org/jira/browse/YARN-2819 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Gopal V {code} Caused by: java.lang.NullPointerException at java.lang.String.<init>(String.java:554) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) {code} triggered by
{code}
entity.getRelatedEntities();
...
} else {
  byte[] domainIdBytes = db.get(createDomainIdKey(
      relatedEntityId, relatedEntityType, relatedEntityStartTime));
  // This is the existing entity
  String domainId = new String(domainIdBytes);
  if (!domainId.equals(entity.getDomainId())) {
{code}
The {{new String(domainIdBytes)}} call throws an NPE when {{domainIdBytes}} is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
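The natural shape of a fix is a null guard around the lookup: an entity written by a 2.4 (pre-domain) timeline store has no domain entry, so {{db.get(...)}} returns null and the String constructor throws. A sketch under that assumption; the fallback value here is hypothetical, and the actual patch may instead backfill a default domain:
{code}
public class DomainIdGuard {
  static final String DEFAULT_DOMAIN_ID = "DEFAULT"; // hypothetical fallback

  // Null-safe version of the failing read: when the related entity predates
  // domains, substitute a default instead of dereferencing null.
  static String readDomainId(byte[] domainIdBytes) {
    return domainIdBytes == null ? DEFAULT_DOMAIN_ID
                                 : new String(domainIdBytes);
  }
}
{code}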
[jira] [Assigned] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2819: - Assignee: Zhijie Shen NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 -- Key: YARN-2819 URL: https://issues.apache.org/jira/browse/YARN-2819 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Gopal V Assignee: Zhijie Shen Labels: Upgrade {code} Caused by: java.lang.NullPointerException at java.lang.String.<init>(String.java:554) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) {code} triggered by
{code}
entity.getRelatedEntities();
...
} else {
  byte[] domainIdBytes = db.get(createDomainIdKey(
      relatedEntityId, relatedEntityType, relatedEntityStartTime));
  // This is the existing entity
  String domainId = new String(domainIdBytes);
  if (!domainId.equals(entity.getDomainId())) {
{code}
The {{new String(domainIdBytes)}} call throws an NPE when {{domainIdBytes}} is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200648#comment-14200648 ] Siqi Li commented on YARN-2811: --- [~sandyr] Thanks for your review. I was thinking about the same thing. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2819: -- Priority: Critical (was: Major) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 -- Key: YARN-2819 URL: https://issues.apache.org/jira/browse/YARN-2819 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Gopal V Assignee: Zhijie Shen Priority: Critical Labels: Upgrade {code} Caused by: java.lang.NullPointerException at java.lang.String.<init>(String.java:554) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) {code} triggered by
{code}
entity.getRelatedEntities();
...
} else {
  byte[] domainIdBytes = db.get(createDomainIdKey(
      relatedEntityId, relatedEntityType, relatedEntityStartTime));
  // This is the existing entity
  String domainId = new String(domainIdBytes);
  if (!domainId.equals(entity.getDomainId())) {
{code}
The {{new String(domainIdBytes)}} call throws an NPE when {{domainIdBytes}} is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2820) Improve FileSystemRMStateStore update failure exception handling to not shutdown RM.
zhihai xu created YARN-2820: --- Summary: Improve FileSystemRMStateStore update failure exception handling to not shutdown RM. Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down. {code} FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} It would be better to improve FileSystemRMStateStore's update failure exception handling so that it does not shut down the RM, and a single state write-out failure cannot stop all jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
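One plausible shape for the improvement is to retry the store write a few times and only escalate to a fatal event after repeated failures; transient HDFS errors like the one above would then be absorbed. The sketch below illustrates that idea only and is not the posted patch:
{code}
import java.io.IOException;

public class RetryingStoreWrite {
  interface Write { void run() throws IOException; }

  // Retry the write up to maxAttempts times (assumed >= 1); only a write
  // that keeps failing propagates, at which point the caller could still
  // choose to dispatch an RMFatalEvent as today.
  static void updateWithRetries(Write write, int maxAttempts)
      throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        write.run();
        return;
      } catch (IOException e) {
        last = e; // a real implementation would log and back off here
      }
    }
    throw last;
  }
}
{code}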
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200721#comment-14200721 ] Sanjay Radia commented on YARN-2678: Don't have time to review the code, but the proposed change is fine. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, YARN-2678-003.patch, YARN-2678-006.patch, YARN-2678-007.patch, YARN-2678-008.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
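The gain from recommendation 2 is easiest to see from the client side: with a list of dictionaries, a consumer reads named fields instead of relying on positional conventions inside a list of lists. A small illustrative parser for the proposed shape (Jackson is used here purely as an example JSON library):
{code}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class AddressParseSketch {
  public static void main(String[] args) throws Exception {
    // Proposed shape: each address is an object with named fields.
    String json = "{\"addresses\":[{"
        + "\"uri\":\"https://c6408.ambari.apache.org:46958/ws/v1/slider/agents\","
        + "\"host\":\"c6408.ambari.apache.org\",\"port\":46958}]}";
    JsonNode addr = new ObjectMapper().readTree(json).get("addresses").get(0);
    // No positional parsing: host and port come out by name, and the full
    // uri is still available without the client reassembling it.
    System.out.println(addr.get("host").asText() + ":" + addr.get("port").asInt());
  }
}
{code}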
[jira] [Reopened] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman reopened YARN-2791: -- Assignee: Yuliya Feldman I think this JIRA should be reopened, since https://issues.apache.org/jira/browse/YARN-2817, which was submitted later, is talking about absolutely the same thing. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200728#comment-14200728 ] Gour Saha commented on YARN-2678: - + 1 (non binding) We already consumed the changes in Apache Slider agents and everything looks good. All tests passed. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, YARN-2678-003.patch, YARN-2678-006.patch, YARN-2678-007.patch, YARN-2678-008.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200739#comment-14200739 ] Wei Yan commented on YARN-2791: --- Hi, [~yufeldman], this jira is the same as YARN-2139. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200770#comment-14200770 ] Vinod Kumar Vavilapalli commented on YARN-2818: --- Makes sense given - We now do all authz based on domains - User filter was always a hidden filter anyways +1, checking this in. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200787#comment-14200787 ] Hudson commented on YARN-2818: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6468 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6468/]) YARN-2818. Removed the now unnecessary user entity injection from Timeline service given we now have domains. Contributed by Zhijie Shen. (vinodkv: rev f5b19bed7d71979dc8685b03152188902b6e45e9) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Fix For: 2.6.0 Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2817) Disk drive as a resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200818#comment-14200818 ] Swapnil Daingade commented on YARN-2817: Hi Arun, this is what we were proposing in https://issues.apache.org/jira/browse/YARN-2791. We have already been shipping this to customers for the last 2 months and would be happy to contribute it back. Disk drive as a resource in YARN Key: YARN-2817 URL: https://issues.apache.org/jira/browse/YARN-2817 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Arun C Murthy Assignee: Arun C Murthy As YARN continues to cover new ground in terms of new workloads, disk is becoming a very important resource to govern. It might be prudent to start with something very simple - allow applications to request entire drives (e.g. 2 drives out of the 12 available on a node), we can then also add support for specific iops, bandwidth etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200823#comment-14200823 ] Aditya Kishore commented on YARN-2791: -- I think the summary of this JIRA may make it seem like a duplicate of YARN-2139, but they are not. YARN-2139 aims to address throttling/isolation of disk IO on an individual container basis. However, from the description it seems that the purpose of this JIRA is to include the node's disks as a parameter in the capacity calculation of the node, alongside its memory and CPU cores. Maybe the summary should be reworded to reflect this. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200829#comment-14200829 ] Karthik Kambatla commented on YARN-2791: [~adityakishore] - from reading the description, I believe there is at least a significant overlap between the two JIRAs. I think we would benefit from consolidating them and working together, rather than taking multiple paths. [~sdaingade] - nice to know you have something working. Could you look through the design doc on YARN-2139 so we can refine it? If it is significantly different from what is posted there, can you also post your design so we can evaluate which one is better and move forward? Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200832#comment-14200832 ] Hadoop QA commented on YARN-2811: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679905/YARN-2811.v3.patch against trunk revision 10f9f51. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5756//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5756//console This message is automatically generated. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200849#comment-14200849 ] Yuliya Feldman commented on YARN-2791: -- [~kasha] I agree that consolidating both and working together is a way to go here. Let's initiate this. What I don't quite understand is how https://issues.apache.org/jira/browse/YARN-2817 is different from this one or from https://issues.apache.org/jira/browse/YARN-2139 Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200861#comment-14200861 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6469 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6469/]) YARN-2768 Improved Yarn Registry service record structure (stevel) (stevel: rev 1670578018b3210d518408530858a869e37b23cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/impl/zk/RegistryOperationsService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/RegistryTestHelper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/cli/RegistryCli.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/AddressTypes.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/registry/yarn-registry.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/binding/RegistryTypeUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/ProtocolTypes.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/operations/TestRegistryOperations.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/ServiceRecordHeader.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/exceptions/NoRecordException.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/tla/yarnregistry.tla * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/ServiceRecord.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/binding/JsonSerDeser.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/binding/RegistryUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/Endpoint.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/client/binding/TestMarshalling.java optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). 
The code of FSAppAttempt.updateDemand:
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
        Resources.addTo(demand, total);
      }
    }
  }
}
{code}
The code of Resources.multiply:
{code}
public static Resource multiply(Resource lhs, double by) {
  return multiplyTo(clone(lhs), by);
}
{code}
The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
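A sketch of the clone-free variant being proposed, in the same shape as the snippet above: multiply into plain integers and write the totals into the demand Resource once, so no per-request Resource is allocated (illustrative only; the committed change may differ in details):
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  synchronized (app) {
    int memory = demand.getMemory();
    int vcores = demand.getVirtualCores();
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        // Accumulate in primitives instead of cloning a Resource per request.
        Resource cap = r.getCapability();
        memory += cap.getMemory() * r.getNumContainers();
        vcores += cap.getVirtualCores() * r.getNumContainers();
      }
    }
    demand.setMemory(memory);
    demand.setVirtualCores(vcores);
  }
}
{code}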
[jira] [Created] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
Varun Vasudev created YARN-2821: --- Summary: Distributed shell app master becomes unresponsive sometimes Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs - {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_05, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO 
distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_04 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_04 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType:
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200900#comment-14200900 ] Karthik Kambatla commented on YARN-2791: bq. how https://issues.apache.org/jira/browse/YARN-2817 is different from this one or from https://issues.apache.org/jira/browse/YARN-2139 I personally don't think it is. Even if we want to address a simpler use case, I would suggest we do that as part of YARN-2139. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2821: Attachment: apache-yarn-2821.0.patch Uploaded patch with fix. Distributed shell app master becomes unresponsive sometimes --- Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2821.0.patch We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs - {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., 
containerId=container_1415123350094_0017_01_05, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_04 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200961#comment-14200961 ] Swapnil Daingade commented on YARN-2791: Karthik Kambatla - I'll go through the design doc for YARN-2139 and post our design document here as well. We can then decide if we should have two JIRAs or combine them, as Aditya Kishore suggested. However, I fully support combining our efforts on this. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200973#comment-14200973 ] Wangda Tan commented on YARN-2753: -- +1 for latest patch, thanks for update! [~zxu] Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists otherwise the Label.resource will be changed(reset). * potential NPE(NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. ** because when a Node is created, Node.labels can be null. ** In this case, nm.labels; may be null. So we need check originalLabels not null before use it(originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java. because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonsNodeLabelsManager, after serviceStop(...) is invoked, some event may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
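As an illustration of the null guard called out in the issue description above, here is a minimal, self-contained sketch (the types and method shape are simplified assumptions, not the actual CommonNodeLabelsManager code): a node created without labels can carry a null label set, so it must be checked before calling containsAll.
{code}
import java.io.IOException;
import java.util.Collections;
import java.util.Set;

// Minimal sketch of the guard discussed in this issue; simplified stand-ins,
// not the real CommonNodeLabelsManager internals.
public class RemoveLabelsGuardSketch {
  static void checkRemoveLabelsFromNode(Set<String> originalLabels, Set<String> toRemove)
      throws IOException {
    // originalLabels can be null when the node was created without labels,
    // so guard it before containsAll() to avoid the NPE.
    if (originalLabels == null || !originalLabels.containsAll(toRemove)) {
      throw new IOException("Cannot remove labels " + toRemove + ": not all present on the node");
    }
  }

  public static void main(String[] args) throws IOException {
    checkRemoveLabelsFromNode(Set.of("gpu", "ssd"), Collections.singleton("gpu")); // passes
    try {
      checkRemoveLabelsFromNode(null, Collections.singleton("gpu"));
    } catch (IOException expected) {
      // rejected with a clear error instead of a NullPointerException
      System.out.println("Rejected cleanly: " + expected.getMessage());
    }
  }
}
{code}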
[jira] [Created] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
Jian He created YARN-2822: - Summary: NPE when RM tries to transfer state from previous attempt on recovery Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
[ https://issues.apache.org/jira/browse/YARN-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200977#comment-14200977 ] Jian He commented on YARN-2822: --- The problem is that on recovery, if the previous attempt has already finished, we are not adding it to the scheduler. When the scheduler then tries to transferStateFromPreviousAttempt for work-preserving AM restart, it throws an NPE. NPE when RM tries to transfer state from previous attempt on recovery - Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
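For readers unfamiliar with the recovery path, the following is a minimal sketch of the failure mode described in the comment above (class and field names are simplified assumptions, not the actual RM code): because the finished attempt was never added to the scheduler, the previous-attempt lookup returns null and the state transfer dereferences it.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of the NPE above; not the real SchedulerApplicationAttempt.
public class TransferStateSketch {
  static class Attempt {
    final List<String> liveContainers = new ArrayList<>();
    void transferStateFromPreviousAttempt(Attempt previous) {
      // Dereferencing 'previous' without a null guard is where the NPE occurs.
      liveContainers.addAll(previous.liveContainers);
    }
  }

  public static void main(String[] args) {
    // On recovery the already-finished attempt was never registered here.
    Map<String, Attempt> schedulerAttempts = new HashMap<>();
    Attempt current = new Attempt();
    Attempt previous = schedulerAttempts.get("appattempt_0001_000001"); // -> null
    current.transferStateFromPreviousAttempt(previous); // throws NullPointerException
  }
}
{code}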
[jira] [Created] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
Gour Saha created YARN-2823: --- Summary: NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Reporter: Gour Saha Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-2823: Affects Version/s: 2.6.0 NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Gour Saha Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-2823: Component/s: resourcemanager NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-2823: Attachment: logs_with_NPE_in_RM.zip NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Attachments: logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200997#comment-14200997 ] Hadoop QA commented on YARN-2821: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679948/apache-yarn-2821.0.patch against trunk revision 1670578. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5757//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5757//console This message is automatically generated. Distributed shell app master becomes unresponsive sometimes --- Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2821.0.patch We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs - {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 
14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04,
[jira] [Created] (YARN-2824) Capacity of labels should be zero by default
Wangda Tan created YARN-2824: Summary: Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be set explicitly if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
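To make the proposed default concrete, here is a small sketch of the lookup behavior described above (the property-key format and helper are illustrative assumptions, not the exact CapacitySchedulerConfiguration API): a label with no configured capacity defaults to 0 and is treated as unused rather than failing queue initialization.
{code}
import java.util.Map;

// Illustrative sketch only; the property key and helper are assumptions,
// not the exact capacity-scheduler.xml keys or CapacitySchedulerConfiguration API.
public class LabelCapacitySketch {
  static float labelCapacity(Map<String, String> conf, String queuePath, String label) {
    String key = "yarn.scheduler.capacity." + queuePath
        + ".accessible-node-labels." + label + ".capacity";
    String value = conf.get(key);
    // Default to 0 instead of failing: a zero-capacity label is simply unused by the queue.
    return value == null ? 0f : Float.parseFloat(value);
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(
        "yarn.scheduler.capacity.root.a.accessible-node-labels.gpu.capacity", "50");
    System.out.println(labelCapacity(conf, "root.a", "gpu"));      // 50.0
    System.out.println(labelCapacity(conf, "root.a", "largemem")); // 0.0 -> treated as unused
  }
}
{code}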
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201020#comment-14201020 ] zhihai xu commented on YARN-2816: - Hi [~jlowe], thanks for the review. Based on your comment, it looks like the issue is not as serious as I thought previously, so I lowered the issue from Critical to Major. Moving the levelDB database location from the /tmp directory to a safer directory is a very good suggestion. But I still think the patch is good for better error handling and NM recovery. 1. If an OS crash causes a few partially written log records in the levelDB, we can't restart the NM due to the NPE. To restart the NM, we would need to manually delete all these stateStore files, which is bad for the user. Also, with the patch we can still recover most of the containers in the NM instead of losing all their information, which would cause all these containers to be reallocated by the RM. 2. The patch deletes all containers which don't have container start requests. It won't cause container leaks, because the container start request is always the first entry stored (startContainerInternal) in the levelDB for each container record and always the first entry removed (removeContainer). Also, when I debugged this problem, the levelDB kept the most recent records in an LRU cache (memory); when the levelDB is closed, it writes these latest records from the cache to disk, and our use case for levelDB is always to write key-value records with no reads after NM initialization. The records lost on disk will always be the older ones, so the container start request record is more likely to be lost than other container records. This makes the patch meaningful for our use case. I attached a LevelDB container key record which has this issue. 3. A missing container record is not a problem if we can recover most containers in the NM. When the AM calls ContainerManagementProtocol#getContainerStatuses, the NM will put the missing containers in the failed requests of GetContainerStatusesResponse, and the AM will then notify the RM to release these missing containers via ApplicationMasterProtocol#allocate (AllocateRequest.getReleaseList). So the missing containers will be recovered quickly by the AM. With the patch, we can still recover all containers that have complete records in the levelDB. The fewer containers we fail to recover, the better. 4. The patch is small and safe; it won't cause any side effects. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2816.000.patch NM fail to start with NPE during container recovery.
We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by
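As a rough illustration of the error handling argued for in the comment above (the names are simplified assumptions, not the actual ContainerManagerImpl or state-store classes), recovery can skip and drop a record whose start request is missing instead of letting the NPE abort NM startup:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: simplified stand-ins for the recovered state, not the real NM classes.
public class RecoverySketch {
  static class RecoveredContainerState {
    String containerId;
    Object startRequest; // null when the start-request entry was lost from the store
  }

  static List<String> recoverContainers(List<RecoveredContainerState> states) {
    List<String> recovered = new ArrayList<>();
    for (RecoveredContainerState rcs : states) {
      if (rcs.startRequest == null) {
        // Incomplete record: log it and drop the container instead of crashing recovery.
        System.err.println("Skipping incomplete record for " + rcs.containerId);
        continue;
      }
      recovered.add(rcs.containerId);
    }
    return recovered;
  }

  public static void main(String[] args) {
    RecoveredContainerState ok = new RecoveredContainerState();
    ok.containerId = "container_1";
    ok.startRequest = new Object();
    RecoveredContainerState broken = new RecoveredContainerState();
    broken.containerId = "container_2"; // start request missing
    System.out.println("Recovered: " + recoverContainers(Arrays.asList(ok, broken)));
  }
}
{code}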
[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2816: Attachment: leveldb_records.txt NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2816: Priority: Major (was: Critical) NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
[ https://issues.apache.org/jira/browse/YARN-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2822: -- Attachment: YARN-2822.1.patch Upload a patch to add the previously finished attempt to scheduler NPE when RM tries to transfer state from previous attempt on recovery - Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2822.1.patch {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201025#comment-14201025 ] Hadoop QA commented on YARN-2816: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679963/leveldb_records.txt against trunk revision 75b820c. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5758//console This message is automatically generated. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2823: - Assignee: Jian He NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201038#comment-14201038 ] zhihai xu commented on YARN-2753: - Hi [~xgong], Could you help review and commit the patch? since Wangda already reviewed the patch. thanks zhihai Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists otherwise the Label.resource will be changed(reset). * potential NPE(NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. ** because when a Node is created, Node.labels can be null. ** In this case, nm.labels; may be null. So we need check originalLabels not null before use it(originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java. because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonsNodeLabelsManager, after serviceStop(...) is invoked, some event may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201040#comment-14201040 ] Jian He commented on YARN-2823: --- The problem is that on recovery, if the previous attempt has already finished, we are not adding it to the scheduler. When the scheduler then tries to transferStateFromPreviousAttempt for work-preserving AM restart, it throws an NPE. NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201041#comment-14201041 ] Jian He commented on YARN-2823: --- Upload a patch to add the previously finished attempt to scheduler NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2823: -- Attachment: YARN-2823.1.patch NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201101#comment-14201101 ] Karthik Kambatla commented on YARN-2791: [~adityakishore] - unfortunately, YARN-2139 didn't have a description until now. I just added a very succinct one. If you look at the design doc itself and the sub-tasks created, you would see that YARN-2139 includes adding disk as a resource to the resource-request-vector, the node-capacities, and the scheduler. [~sdaingade] - looking forward to hearing your thoughts and seeing a design doc. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201113#comment-14201113 ] Aditya Kishore commented on YARN-2791: -- Great! I think this JIRA should be added as a sub-task, as none of the other sub-tasks cover this aspect. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2139: -- Attachment: YARN-2139-prototype.patch I am submitting a prototype of the code implementation to illustrate the basic design. The code changes fall into three major parts: (1) API: add vdisks as a 3rd type of resource, besides CPU/memory. The NM will specify its own vdisks resource, and the AM includes vdisks in the resource request. (2) Scheduler: the scheduler will consider vdisks availability when scheduling. Additionally, the DRF policy also considers vdisks when choosing the dominant resource. (3) I/O isolation: this is implemented on the NM side, using cgroup's blkio subsystem for container I/O isolation. Will separate the patch into several sub-task patches once more comments on the design and implementation are collected. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype.patch YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
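To show how a third dimension changes the DRF ranking mentioned in part (2) above, here is a small sketch of dominant-share computation (the numbers and the vdisks unit are illustrative assumptions, not taken from the prototype patch):
{code}
// Dominant Resource Fairness over three dimensions; illustrative only.
public class DominantResourceSketch {
  // Dominant share = max over all resources of (usage / cluster total).
  static double dominantShare(double[] used, double[] clusterTotal) {
    double max = 0.0;
    for (int i = 0; i < used.length; i++) {
      max = Math.max(max, used[i] / clusterTotal[i]);
    }
    return max;
  }

  public static void main(String[] args) {
    double[] clusterTotal = {102400, 100, 40}; // memory MB, vcores, vdisks
    double[] appUsage     = {10240, 4, 20};
    // With memory/vcores only, the dominant share is 0.10; counting vdisks it is 0.50,
    // which changes how the scheduler orders this application against others.
    System.out.println(dominantShare(appUsage, clusterTotal));
  }
}
{code}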
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201160#comment-14201160 ] Wei Yan commented on YARN-2139: --- This prototype patch is posted to garner feedback. The spindle locality work has not been finished; I will post an update once it is ready. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype.patch YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201166#comment-14201166 ] Eric Payne commented on YARN-2056: -- [~leftnoteasy], I'm sorry, but there is one more thing that needs to be modified in the current design. The current patch only allows the disable queue preemption flag to be set on leaf queues. However, after discussing this internally, we need to be able to have leaf queues inherit this property from their parent. Only setting the disable queue preemption property on leaf queues was an intentional design decision to begin with, because inheriting this property from a parent would impose a new set of requirements. Consider this use case: - root queue has children A and B - A has children A1 and A2 - B has children B1 and B2 - A should not be preemptable - A1 and A2 should be able to preempt each other In this use case, if A is over capacity, B should not be able to preempt it. However, if A1 is over capacity, A2 should be able to preempt A1. I believe I can make the leaf queues inherit this property from their parents and still solve the above use case. I will be putting up a new patch (hopefully) tomorrow. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, YARN-2056.201410232244.txt, YARN-2056.201410311746.txt, YARN-2056.201411041635.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
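One simple way to express the inheritance rule in that use case is a nearest-explicit-setting walk up the queue path; the sketch below is only an illustration with made-up keys and a plain map standing in for configuration, not the actual CapacityScheduler properties or the pending patch, and it deliberately ignores the level-dependent subtleties of cross-queue preemption.
{code}
import java.util.HashMap;
import java.util.Map;

// Illustration of "a leaf queue inherits disable-preemption from its ancestors";
// the queue paths and flag map are made up for the use case described above.
public class PreemptionInheritanceSketch {
  static boolean isPreemptionDisabled(String queuePath, Map<String, Boolean> explicitFlags) {
    for (String path = queuePath; path != null; ) {
      Boolean flag = explicitFlags.get(path);
      if (flag != null) {
        return flag; // nearest explicit setting wins
      }
      int dot = path.lastIndexOf('.');
      path = (dot < 0) ? null : path.substring(0, dot); // walk up to the parent queue
    }
    return false; // default: preemptable
  }

  public static void main(String[] args) {
    Map<String, Boolean> flags = new HashMap<>();
    flags.put("root.A", true);     // A's resources should not be preempted from outside
    flags.put("root.A.A1", false); // but A1 and A2 may still preempt each other
    flags.put("root.A.A2", false);
    System.out.println(isPreemptionDisabled("root.A.A1", flags)); // false
    System.out.println(isPreemptionDisabled("root.B.B1", flags)); // false (nothing set on the B branch)
  }
}
{code}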
[jira] [Commented] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
[ https://issues.apache.org/jira/browse/YARN-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201175#comment-14201175 ] Hadoop QA commented on YARN-2822: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679964/YARN-2822.1.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5759//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5759//console This message is automatically generated. NPE when RM tries to transfer state from previous attempt on recovery - Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2822.1.patch {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201183#comment-14201183 ] Jian He commented on YARN-2821: --- Thanks Varun! The patch should solve the problem of releasing the extra allocated containers, but it seems that in some cases we may still run into the infinite loop. In finish(), it checks {{numCompletedContainers.get() != numTotalContainers}}, but numCompletedContainers could be incremented elsewhere; e.g. {{onStartContainerError}} also increments the numCompletedContainers count. Could numCompletedContainers go beyond numTotalContainers in such a scenario as well? Distributed shell app master becomes unresponsive sometimes --- Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2821.0.patch We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. A snippet of the logs: {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454,
containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_05, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_04 14/11/04
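The hang described in the comment on this issue can be reproduced with a small self-contained sketch. The field names mirror the distributed shell AM, but this is not the actual ApplicationMaster code; it only demonstrates why a strict {{!=}} comparison never exits once the completed-container counter overshoots the total.
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the finish() hang: if completed containers are counted both on
// normal completion and in error callbacks such as onStartContainerError,
// the counter can overshoot numTotalContainers and a "!=" wait spins forever.
public final class FinishLoopSketch {
  private static final int NUM_TOTAL_CONTAINERS = 3;
  private static final AtomicInteger numCompletedContainers = new AtomicInteger();

  public static void main(String[] args) throws InterruptedException {
    // Simulate 3 normal completions plus 1 error callback that also increments.
    for (int i = 0; i < 4; i++) {
      numCompletedContainers.incrementAndGet();
    }

    // Fragile form (commented out): spins forever because 4 != 3 stays true.
    // while (numCompletedContainers.get() != NUM_TOTAL_CONTAINERS) { Thread.sleep(200); }

    // Safer form: exits once at least numTotalContainers are accounted for.
    while (numCompletedContainers.get() < NUM_TOTAL_CONTAINERS) {
      Thread.sleep(200);
    }
    System.out.println("finish() can proceed; completed=" + numCompletedContainers.get());
  }
}
{code}
De-duplicating completions by container id would be another way to keep the counter from overshooting.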
[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201193#comment-14201193 ] Hadoop QA commented on YARN-2823: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679967/YARN-2823.1.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5760//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5760//console This message is automatically generated. NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smoothly (using Ambari), and then HBase was installed using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster have been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2824: - Attachment: YARN-2824-1.patch Attached a patch and added/updated tests. Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
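As a rough sketch of the intended default, the snippet below reads a label's capacity with a default of zero, so an unset label is treated as unused instead of failing queue initialization. The property-name pattern is an assumption for illustration and may not match the exact keys the patch touches.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: unset label capacity defaults to 0, meaning "unused label",
// instead of failing queue initialization. Property pattern is illustrative.
public final class LabelCapacitySketch {
  static float labelCapacity(Configuration conf, String queuePath, String label) {
    String key = "yarn.scheduler.capacity." + queuePath
        + ".accessible-node-labels." + label + ".capacity";
    return conf.getFloat(key, 0f); // default 0 -> label unused, no init failure
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setFloat("yarn.scheduler.capacity.root.a.accessible-node-labels.x.capacity", 50f);
    System.out.println(labelCapacity(conf, "root.a", "x")); // 50.0
    System.out.println(labelCapacity(conf, "root.a", "y")); // 0.0 -> label y unused
  }
}
{code}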
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue infos
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201226#comment-14201226 ] Wangda Tan commented on YARN-2647: -- Hi [~sunilg], bq. However I feel we can avoid having '-' in between the field labels. Because we already have fields like Maximum Capacity, Current Capacity etc. So if we change for node-labels, it was not much looking good. So I kept as it is and also removed for queue. Kindly share your thoughts. I'm fine with that :), and I think it is better to make it like Default Node Label Expression. Do you like it? :) And the patch failure is caused by the changes on yarn.cmd. A simple workaround is remove yarn.cmd from you patch. Maintain two patch, one is full patch, another one is without yarn.cmd. Add yarn queue CLI to get queue infos - Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-2647.patch, 0002-YARN-2647.patch, 0003-YARN-2647.patch, 0004-YARN-2647.patch, 0005-YARN-2647.patch, 0006-YARN-2647.patch, 0007-YARN-2647.patch, 0008-YARN-2647.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2791: - Assignee: Yuliya Feldman (was: Wangda Tan) Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization, as containers starved for I/O bandwidth hold on to other resources like CPU and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.
[ https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2803: -- Attachment: YARN-2803.0.patch Patch to revert the behavior for the jar creation location, which resolves this issue for non-secure Windows. As it is, it probably breaks secure Windows; I may look at making the behavior conditional (however, I suspect secure Windows will fail this test / is broken as well, so I'm not sure that making this conditional really matters at the moment...). MR distributed cache not working correctly on Windows after NodeManager privileged account changes. --- Key: YARN-2803 URL: https://issues.apache.org/jira/browse/YARN-2803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Priority: Critical Attachments: YARN-2803.0.patch This problem is visible by running {{TestMRJobs#testDistributedCache}} or {{TestUberAM#testDistributedCache}} on Windows. Both tests fail. Running git bisect, I traced it to the YARN-2198 patch to remove the need to run NodeManager as a privileged account. The tests started failing when that patch was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2825) Container leak on NM
Jian He created YARN-2825: - Summary: Container leak on NM Key: YARN-2825 URL: https://issues.apache.org/jira/browse/YARN-2825 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Priority: Critical Caused by YARN-1372. Thanks [~vinodkv] for pointing this out. The problem is that in YARN-1372 we changed the behavior to remove containers from NMContext only after the containers are acknowledged by the AM. But in the {{NodeStatusUpdaterImpl#removeCompletedContainersFromContext}} call, we didn't check whether the container is really completed or not. If the container is still running, we shouldn't remove the container from the context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
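A minimal sketch of the fix direction described above, checking the container's state before dropping it from the NM context; the types here are simplified stand-ins, not the real NodeStatusUpdaterImpl code.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: only remove a container from the NM context if it actually completed.
// A stale AM acknowledgement for a still-running container must not drop it.
public final class RemoveCompletedContainersSketch {
  enum ContainerState { RUNNING, COMPLETED }

  private final Map<String, ContainerState> containers = new ConcurrentHashMap<>();

  void removeCompletedContainersFromContext(Set<String> ackedByAM) {
    for (String containerId : ackedByAM) {
      // Guard: skip containers that have not reached a completed state yet.
      if (containers.get(containerId) == ContainerState.COMPLETED) {
        containers.remove(containerId);
      }
    }
  }

  public static void main(String[] args) {
    RemoveCompletedContainersSketch nm = new RemoveCompletedContainersSketch();
    nm.containers.put("c1", ContainerState.RUNNING);
    nm.containers.put("c2", ContainerState.COMPLETED);
    nm.removeCompletedContainersFromContext(Set.of("c1", "c2"));
    System.out.println(nm.containers.keySet()); // [c1] -> running container is kept
  }
}
{code}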
[jira] [Commented] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201351#comment-14201351 ] Hadoop QA commented on YARN-2824: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679998/YARN-2824-1.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5762//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5762//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5762//console This message is automatically generated. Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2744) Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist
[ https://issues.apache.org/jira/browse/YARN-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201370#comment-14201370 ] Vinod Kumar Vavilapalli commented on YARN-2744: --- Makes sense, looks good. +1. Checking this in. Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist - Key: YARN-2744 URL: https://issues.apache.org/jira/browse/YARN-2744 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Wangda Tan Priority: Critical Attachments: YARN-2744-20141025-1.patch, YARN-2744-20141025-2.patch Use the following steps: * Ensure default in-memory storage is configured for labels * Define some labels and assign nodes to labels (e.g. define two labels and assign both labels to the host on a one host cluster) * Invoke refreshQueues * Modify capacity scheduler to create two top level queues and allow access to the labels from both the queues * Assign appropriate label + queue specific capacities * Restart resource manager Noticed that RM starts without any issues. The labels are not preserved across restart and thus the capacity-scheduler ends up using labels that are no longer present. At this point submitting an application to YARN will not succeed as there are no resources available with the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.
[ https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201373#comment-14201373 ] Hadoop QA commented on YARN-2803: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680024/YARN-2803.0.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5763//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5763//console This message is automatically generated. MR distributed cache not working correctly on Windows after NodeManager privileged account changes. --- Key: YARN-2803 URL: https://issues.apache.org/jira/browse/YARN-2803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Priority: Critical Attachments: YARN-2803.0.patch This problem is visible by running {{TestMRJobs#testDistributedCache}} or {{TestUberAM#testDistributedCache}} on Windows. Both tests fail. Running git bisect, I traced it to the YARN-2198 patch to remove the need to run NodeManager as a privileged account. The tests started failing when that patch was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.
[ https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201377#comment-14201377 ] Craig Welch commented on YARN-2803: --- The fix makes the existing unit tests which fail on Windows pass, so there is no need for additional tests; they're already there. MR distributed cache not working correctly on Windows after NodeManager privileged account changes. --- Key: YARN-2803 URL: https://issues.apache.org/jira/browse/YARN-2803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Priority: Critical Attachments: YARN-2803.0.patch This problem is visible by running {{TestMRJobs#testDistributedCache}} or {{TestUberAM#testDistributedCache}} on Windows. Both tests fail. Running git bisect, I traced it to the YARN-2198 patch to remove the need to run NodeManager as a privileged account. The tests started failing when that patch was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2744) Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist
[ https://issues.apache.org/jira/browse/YARN-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201382#comment-14201382 ] Hudson commented on YARN-2744: -- FAILURE: Integrated in Hadoop-trunk-Commit #6472 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6472/]) YARN-2744. Fixed CapacityScheduler to validate node-labels correctly against queues. Contributed by Wangda Tan. (vinodkv: rev a3839a9fbfb8eec396b9bf85472d25e0ffc3aab2) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueParsing.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java * hadoop-yarn-project/CHANGES.txt Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist - Key: YARN-2744 URL: https://issues.apache.org/jira/browse/YARN-2744 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Wangda Tan Priority: Critical Fix For: 2.6.0 Attachments: YARN-2744-20141025-1.patch, YARN-2744-20141025-2.patch Use the following steps: * Ensure default in-memory storage is configured for labels * Define some labels and assign nodes to labels (e.g. define two labels and assign both labels to the host on a one host cluster) * Invoke refreshQueues * Modify capacity scheduler to create two top level queues and allow access to the labels from both the queues * Assign appropriate label + queue specific capacities * Restart resource manager Noticed that RM starts without any issues. The labels are not preserved across restart and thus the capacity-scheduler ends up using labels that are no longer present. At this point submitting an application to YARN will not succeed as there are no resources available with the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201384#comment-14201384 ] Vinod Kumar Vavilapalli commented on YARN-2753: --- Let me look.. Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. ** because when a Node is created, Node.labels can be null. ** In this case, nm.labels may be null. So we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed; see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
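For the NPE bullet above, a minimal sketch of the null guard could look like the following; the method shape is illustrative and does not reproduce the actual CommonNodeLabelsManager signature.
{code}
import java.util.Collections;
import java.util.Set;

// Sketch: a node created without labels can have a null label set, so guard
// before calling containsAll instead of dereferencing a possibly-null set.
public final class RemoveLabelsCheckSketch {
  static void checkRemoveLabelsFromNode(Set<String> originalLabels,
                                        Set<String> labelsToRemove) {
    Set<String> existing =
        (originalLabels == null) ? Collections.<String>emptySet() : originalLabels;
    if (!existing.containsAll(labelsToRemove)) {
      throw new IllegalArgumentException(
          "Trying to remove labels not assigned to the node: " + labelsToRemove);
    }
  }

  public static void main(String[] args) {
    try {
      checkRemoveLabelsFromNode(null, Collections.singleton("x"));
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected cleanly instead of an NPE: " + e.getMessage());
    }
  }
}
{code}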
[jira] [Commented] (YARN-2744) Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist
[ https://issues.apache.org/jira/browse/YARN-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201385#comment-14201385 ] Wangda Tan commented on YARN-2744: -- Thanks for Vinod's review and commit! Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist - Key: YARN-2744 URL: https://issues.apache.org/jira/browse/YARN-2744 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Wangda Tan Priority: Critical Fix For: 2.6.0 Attachments: YARN-2744-20141025-1.patch, YARN-2744-20141025-2.patch Use the following steps: * Ensure default in-memory storage is configured for labels * Define some labels and assign nodes to labels (e.g. define two labels and assign both labels to the host on a one host cluster) * Invoke refreshQueues * Modify capacity scheduler to create two top level queues and allow access to the labels from both the queues * Assign appropriate label + queue specific capacities * Restart resource manager Noticed that RM starts without any issues. The labels are not preserved across restart and thus the capacity-scheduler ends up using labels that are no longer present. At this point submitting an application to YARN will not succeed as there are no resources available with the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.18.patch And, uploaded the wrong one. This should be the right one. Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.11.patch, YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201402#comment-14201402 ] Wangda Tan commented on YARN-2505: -- Hi [~cwelch], Latest patch LGTM, +1. Thanks! Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.11.patch, YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2808) yarn client tool can not list app_attempt's container info correctly
[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201409#comment-14201409 ] George Wong commented on YARN-2808: --- That's great. Looking forward to the new container command. yarn client tool can not list app_attempt's container info correctly Key: YARN-2808 URL: https://issues.apache.org/jira/browse/YARN-2808 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Gordon Wang Assignee: Naganarasimha G R When the timeline server is enabled, the yarn client cannot list the container info for an application attempt correctly. Here are the reproduction steps. # enable the yarn timeline server # submit an MR job # after the job is finished, use the yarn client to list the container info of the app attempt. Then, since the RM has cached the application's attempt info, the output shows {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :0 Container-Id Start Time Finish Time StateHost LOG-URL {noformat} But if the RM is restarted, the client can fetch the container info from the timeline server correctly. {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :4 Container-Id Start Time Finish Time StateHost LOG-URL container_1415168250217_0001_01_01 1415168318376 1415168349896COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_01/container_1415168250217_0001_01_01/hadoop container_1415168250217_0001_01_02 1415168326399 1415168334858COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_02/container_1415168250217_0001_01_02/hadoop container_1415168250217_0001_01_03 1415168326400 1415168335277COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_03/container_1415168250217_0001_01_03/hadoop container_1415168250217_0001_01_04 1415168335825 1415168343873COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_04/container_1415168250217_0001_01_04/hadoop {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201413#comment-14201413 ] Wangda Tan commented on YARN-2824: -- Attached a new patch suppressing the findbugs warnings. Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch, YARN-2824-2.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2824: - Attachment: YARN-2824-2.patch Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch, YARN-2824-2.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2632) Document NM Restart feature
[ https://issues.apache.org/jira/browse/YARN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2632: - Attachment: YARN-2632-v3.patch Document NM Restart feature --- Key: YARN-2632 URL: https://issues.apache.org/jira/browse/YARN-2632 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2632-v2.patch, YARN-2632-v3.patch, YARN-2632.patch As a new feature in YARN, we should document this feature's behavior, configuration, and things to pay attention to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2632) Document NM Restart feature
[ https://issues.apache.org/jira/browse/YARN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201415#comment-14201415 ] Junping Du commented on YARN-2632: -- Thanks [~vvasudev] for the review and comments! bq. I might be wrong but NM work preserving restart doesn't work with ephemeral ports, right? We should specify that in the documentation (as well as how to set the NM port in yarn-site.xml). Nice catch! Addressed this in the v3 patch. Document NM Restart feature --- Key: YARN-2632 URL: https://issues.apache.org/jira/browse/YARN-2632 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2632-v2.patch, YARN-2632-v3.patch, YARN-2632.patch As a new feature in YARN, we should document this feature's behavior, configuration, and things to pay attention to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
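For reference, here is a sketch of the yarn-site.xml settings the documentation change covers, expressed through the Configuration API for brevity. yarn.nodemanager.recovery.enabled, yarn.nodemanager.recovery.dir, and yarn.nodemanager.address are the standard property names, while the example port and recovery path are placeholders.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of the settings work-preserving NM restart relies on. In practice
// these belong in yarn-site.xml; the port and path below are placeholders.
public final class NmRestartConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
    conf.set("yarn.nodemanager.recovery.dir", "/var/lib/hadoop-yarn/nm-recovery");
    // A fixed port is required: the default port 0 picks an ephemeral port,
    // which breaks reconnection to the NM after a restart.
    conf.set("yarn.nodemanager.address", "0.0.0.0:45454");
    System.out.println(conf.get("yarn.nodemanager.address"));
  }
}
{code}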
[jira] [Updated] (YARN-2825) Container leak on NM
[ https://issues.apache.org/jira/browse/YARN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2825: -- Attachment: YARN-2825.1.patch Uploaded a patch that checks whether the container is in the completed state before removing it from the NM context. Container leak on NM Key: YARN-2825 URL: https://issues.apache.org/jira/browse/YARN-2825 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Priority: Critical Attachments: YARN-2825.1.patch Caused by YARN-1372. Thanks [~vinodkv] for pointing this out. The problem is that in YARN-1372 we changed the behavior to remove containers from NMContext only after the containers are acknowledged by the AM. But in the {{NodeStatusUpdaterImpl#removeCompletedContainersFromContext}} call, we didn't check whether the container is really completed or not. If the container is still running, we shouldn't remove the container from the context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2826) User-Group mappings not updated by RM when a user is removed from a group.
[ https://issues.apache.org/jira/browse/YARN-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2826: Assignee: Wangda Tan User-Group mappings not updated by RM when a user is removed from a group. -- Key: YARN-2826 URL: https://issues.apache.org/jira/browse/YARN-2826 Project: Hadoop YARN Issue Type: Bug Reporter: sidharta seethana Assignee: Wangda Tan Priority: Critical Removing a user from a group isn't reflected in getGroups even after a refresh. The following sequence fails in step 7. 1) add test_user to a machine with group1 2) add test_user to group2 on the machine 3) yarn rmadmin -refreshUserToGroupsMappings (expected to refresh user to group mappings) 4) yarn rmadmin -getGroups test_user (and ensure that user is in group2) 5) remove user from group2 on the machine 6) yarn rmadmin -refreshUserToGroupsMappings (expected to refresh user to group mappings) 7) yarn rmadmin -getGroups test_user (and ensure that user NOT in group2) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201488#comment-14201488 ] Hadoop QA commented on YARN-2505: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680035/YARN-2505.16.patch against trunk revision a3839a9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5764//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5764//console This message is automatically generated. Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.11.patch, YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)