[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199980#comment-14199980 ] Sandy Ryza commented on YARN-2811: -- Thanks for uncovering this [~l201514]. I think that in this case, in addition to not assigning the container, the application should release the reservation so that other apps can get to the node. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch This has been seen on several queues, with the allocated MB going significantly above the max MB, and it appears to have started with the 2.4 upgrade. It could be a regression introduced between 2.0 and 2.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
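A minimal sketch of the behavior suggested in the comment above, using a toy model rather than the FairScheduler API: when assigning a container would push the queue past its max memory share, skip the assignment and also drop any reservation the app holds on that node so other apps can use it. All names here (MaxShareCheck, assignOrRelease, the MB constants) are illustrative assumptions, not the actual scheduler code.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy model, not the FairScheduler API: release a node reservation when
// satisfying it would push the queue past its max-memory share.
public class MaxShareCheck {
  static final long MAX_QUEUE_MB = 8192;            // queue max share (illustrative)
  static long allocatedMb = 7680;                   // current queue usage
  static final Map<String, Long> reservedMbByNode = new HashMap<>();

  // Returns true if the container fits under the cap; otherwise releases
  // any reservation held on the node instead of leaving the node pinned.
  static boolean assignOrRelease(String node, long containerMb) {
    if (allocatedMb + containerMb <= MAX_QUEUE_MB) {
      allocatedMb += containerMb;
      reservedMbByNode.remove(node);                // reservation satisfied
      return true;
    }
    Long released = reservedMbByNode.remove(node);  // unreserve so other apps can get to the node
    if (released != null) {
      System.out.println("Released " + released + " MB reservation on " + node);
    }
    return false;
  }

  public static void main(String[] args) {
    reservedMbByNode.put("node1", 1024L);
    System.out.println(assignOrRelease("node1", 1024));  // false: would exceed the cap
  }
}
{code}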
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199981#comment-14199981 ] Hadoop QA commented on YARN-2818: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679805/YARN-2818.1.patch against trunk revision 80d7d18. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5751//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5751//console This message is automatically generated. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch In 2.5, we inject owner info as a primary filter to support entity-level ACLs. Since 2.6, we have a different ACLs solution (YARN-2102). Therefore, there's no need to inject owner info. There are two motivations: 1. For the leveldb timeline store, a primary filter is expensive: for each primary filter, we need to make a complete copy of the entity in the logical index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't already exist, the leveldb timeline store will create an empty E2 without owner info (from the db's point of view, it doesn't know that owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
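A toy illustration of motivation 1 above: in a leveldb-style store, each primary filter is an index under which the entity is written again in full, so every extra primary filter (such as the injected owner) multiplies the write volume. The key layout below is invented for illustration and is not the LeveldbTimelineStore schema.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Invented key/value layout, not LeveldbTimelineStore's schema: an entity
// stored under N primary filters costs N+1 full copies.
public class PrimaryFilterCost {
  public static void main(String[] args) {
    String entityId = "E1";
    String entityBlob = "{...full serialized entity...}";
    List<String[]> writes = new ArrayList<>();

    // One copy in the main entity table.
    writes.add(new String[] {"entity/" + entityId, entityBlob});

    // One complete copy per primary filter in the index table; this is why
    // injecting "owner" as a primary filter is expensive.
    String[][] primaryFilters = {{"owner", "tester"}, {"appId", "app_1"}};
    for (String[] pf : primaryFilters) {
      writes.add(new String[] {
          "index/" + pf[0] + "/" + pf[1] + "/" + entityId, entityBlob});
    }
    System.out.println("full copies written: " + writes.size());  // 3
  }
}
{code}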
[jira] [Updated] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2753: Attachment: YARN-2753.006.patch Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager: ** when a Node is created, Node.labels can be null; ** in this case, nm.labels may be null, so we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
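A compact sketch of two of the fixes listed above, with illustrative names rather than the actual CommonNodeLabelsManager/RMNodeLabelsManager code: guard the possibly-null label set before calling containsAll, and take the write lock around mutations of the shared label map without overwriting an existing entry (which would reset its resource).
{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative shape of the fixes, not the actual patch.
public class NodeLabelsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> labelCollections = new HashMap<>();

  // Node.labels can be null right after a node is created, so check for
  // null before calling containsAll instead of risking an NPE.
  boolean canRemove(Set<String> originalLabels, Set<String> toRemove) {
    return originalLabels != null && originalLabels.containsAll(toRemove);
  }

  // Mutations of the shared label map happen under the write lock, and an
  // existing entry is never overwritten.
  void addLabel(String label) {
    lock.writeLock().lock();
    try {
      labelCollections.putIfAbsent(label, "RESOURCE");
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    NodeLabelsSketch s = new NodeLabelsSketch();
    System.out.println(s.canRemove(null, Collections.singleton("x")));  // false, no NPE
  }
}
{code}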
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1427#comment-1427 ] zhihai xu commented on YARN-2753: - Hi [~leftnoteasy], thanks for the thorough review. item 1). fixed item 2). fixed item 3). fixed item 4). fixed. I agree with you, and also the ForwardingEventHandler is only active (registered) once CommonNodeLabelsManager#serviceStart is called, and serviceStart will only be called in STATE.STARTED. I attached a new patch, YARN-2753.006.patch, which addresses all your comments. thanks zhihai Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager: ** when a Node is created, Node.labels can be null; ** in this case, nm.labels may be null, so we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2808) yarn client tool can not list app_attempt's container info correctly
[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1429#comment-1429 ] George Wong commented on YARN-2808: --- [~Naganarasimha], could you link this jira to YARN-2301? YARN-2808 could be one piece of improvement for the yarn container command. As you are working on YARN-2301, can you fix this in YARN-2301? yarn client tool can not list app_attempt's container info correctly Key: YARN-2808 URL: https://issues.apache.org/jira/browse/YARN-2808 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Gordon Wang Assignee: Naganarasimha G R When the timeline server is enabled, the yarn client cannot list the container info for an application attempt correctly. Here are the reproduction steps. # enable the yarn timeline server # submit a MR job # after the job is finished, use the yarn client to list the container info of the app attempt. Then, since the RM has cached the application's attempt info, the output shows {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :0 Container-Id Start Time Finish Time StateHost LOG-URL {noformat} But if the RM is restarted, the client can fetch the container info from the timeline server correctly. {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :4 Container-Id Start Time Finish Time StateHost LOG-URL container_1415168250217_0001_01_01 1415168318376 1415168349896COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_01/container_1415168250217_0001_01_01/hadoop container_1415168250217_0001_01_02 1415168326399 1415168334858COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_02/container_1415168250217_0001_01_02/hadoop container_1415168250217_0001_01_03 1415168326400 1415168335277COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_03/container_1415168250217_0001_01_03/hadoop container_1415168250217_0001_01_04 1415168335825 1415168343873COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_04/container_1415168250217_0001_01_04/hadoop {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200048#comment-14200048 ] Hadoop QA commented on YARN-2753: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679811/YARN-2753.006.patch against trunk revision 80d7d18. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5752//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5752//console This message is automatically generated. Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager: ** when a Node is created, Node.labels can be null; ** in this case, nm.labels may be null, so we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2808) yarn client tool can not list app_attempt's container info correctly
[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200081#comment-14200081 ] Naganarasimha G R commented on YARN-2808: - Hi [~GWong] The earlier idea was the same, but I feel there might be a lot of differences in supporting the yarn container command for both applicationID and application attemptID with the -list option itself, so as suggested by JianHe in [YARN-2301|https://issues.apache.org/jira/browse/YARN-2301?focusedCommentId=14070512&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070512], I thought of splitting as follows. # YARN-2301: for the first 3 small issues # a new jira for supporting the yarn container command for both applicationID and application attemptID # listing of all containers even for running and completed apps as part of YARN-1794 (similar to the current issue; will confirm with Mayank and finalize it) I have already been working on this but was waiting for the leveldb-based Timeline Server to be committed, to get all the containers from the Timeline Server itself, which will resolve most of the issues with the yarn container command. yarn client tool can not list app_attempt's container info correctly Key: YARN-2808 URL: https://issues.apache.org/jira/browse/YARN-2808 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Gordon Wang Assignee: Naganarasimha G R When the timeline server is enabled, the yarn client cannot list the container info for an application attempt correctly. Here are the reproduction steps. # enable the yarn timeline server # submit a MR job # after the job is finished, use the yarn client to list the container info of the app attempt. Then, since the RM has cached the application's attempt info, the output shows {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :0 Container-Id Start Time Finish Time StateHost LOG-URL {noformat} But if the RM is restarted, the client can fetch the container info from the timeline server correctly. {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :4 Container-Id Start Time Finish Time StateHost LOG-URL container_1415168250217_0001_01_01 1415168318376 1415168349896COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_01/container_1415168250217_0001_01_01/hadoop container_1415168250217_0001_01_02 1415168326399 1415168334858COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_02/container_1415168250217_0001_01_02/hadoop container_1415168250217_0001_01_03 1415168326400 1415168335277COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_03/container_1415168250217_0001_01_03/hadoop container_1415168250217_0001_01_04 1415168335825 1415168343873COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_04/container_1415168250217_0001_01_04/hadoop {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2805) RM2 in HA setup tries to login using the RM1's kerberos principal
[ https://issues.apache.org/jira/browse/YARN-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200116#comment-14200116 ] Hudson commented on YARN-2805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2805. Fixed ResourceManager to load HA configs correctly before kerberos login. Contributed by Wangda Tan. (vinodkv: rev 834e931d8efe4d806347b266e7e62929ce05389b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java RM2 in HA setup tries to login using the RM1's kerberos principal - Key: YARN-2805 URL: https://issues.apache.org/jira/browse/YARN-2805 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2805.1.patch {code} 2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT] 2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229) Caused by: java.io.IOException: Login failure for rm/i...@example.com from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
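Per the commit message above, the essence of the fix is ordering: resolve the rm-id-suffixed HA keys before the kerberos login, so RM2 picks up its own principal rather than RM1's. A minimal sketch of that ordering, assuming the standard HAUtil/SecurityUtil helpers; this is not the committed diff.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch of the required ordering (not the committed patch): HA config
// validation, which resolves keys like yarn.resourcemanager.principal.<rm-id>
// into the base keys, must run before the kerberos login that reads them.
public class HaLoginOrdering {
  static void initSecurity(Configuration conf) throws java.io.IOException {
    if (HAUtil.isHAEnabled(conf)) {
      // Resolves the rm-id-suffixed keys into the base keys first.
      HAUtil.verifyAndSetConfiguration(conf);
    }
    // Only now log in, so this RM uses its own principal, not RM1's.
    SecurityUtil.login(conf, YarnConfiguration.RM_KEYTAB,
        YarnConfiguration.RM_PRINCIPAL);
  }
}
{code}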
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200118#comment-14200118 ] Hudson commented on YARN-2812: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2812. TestApplicationHistoryServer is likely to fail on less powerful machine. Contributed by Zhijie Shen (xgong: rev b0b52c4e11336ca2ad6a02d64c0b5d5a8f1339ae) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/CHANGES.txt TestApplicationHistoryServer is likely to fail on less powerful machine --- Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2812.1.patch {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} {code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
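The failures quoted above are a hard JUnit timeout plus a leveldb LOCK collision between tests. A trivial, self-contained sketch of the two generic remedies — a more generous per-test timeout and a per-test store directory; the timeout value and path are illustrative, not the patch.
{code:java}
import java.io.File;
import java.util.UUID;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class TimeoutAndIsolationSketch {
  @Test(timeout = 240000)  // generous bound so slower machines don't trip it
  public void startsWithinTimeout() {
    // A unique per-run directory avoids "LOCK: already held by process"
    // when two tests would otherwise open the same leveldb path.
    File storeDir = new File(System.getProperty("java.io.tmpdir"),
        "timeline-" + UUID.randomUUID());
    assertTrue(storeDir.mkdirs());
  }
}
{code}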
[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200117#comment-14200117 ] Hudson commented on YARN-2579: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2579. Fixed a deadlock issue when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time. Contributed by Rohith Sharmaks (jianhe: rev 395275af8622c780b9071c243422b0780e096202) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one of the RMs' ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
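A generic illustration of one deadlock-avoidance pattern relevant here, offered as an assumption about the fix's spirit rather than the committed YARN-2579 change: instead of two threads each calling the standby transition inline while holding their own locks, transition requests are serialized onto a single dedicated thread, so the elector and the fatal-event dispatcher never wait on each other's monitors.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Pattern sketch only: funnel "transition to standby" requests through one
// thread instead of performing them inline under different locks.
public class StandbyTransitionSketch {
  private final ExecutorService transitionThread =
      Executors.newSingleThreadExecutor();

  // Both callers (elector callback, fatal-event handler) enqueue the
  // transition instead of executing it while holding their own locks.
  void requestStandby(String reason) {
    transitionThread.submit(() ->
        System.out.println("transitioning to standby: " + reason));
  }

  public static void main(String[] args) throws InterruptedException {
    StandbyTransitionSketch s = new StandbyTransitionSketch();
    s.requestStandby("elector lost leadership");
    s.requestStandby("fatal event from state store");
    s.transitionThread.shutdown();
    s.transitionThread.awaitTermination(5, TimeUnit.SECONDS);
  }
}
{code}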
[jira] [Commented] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200119#comment-14200119 ] Hudson commented on YARN-2813: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2813. Fixed NPE from MemoryTimelineStore.getDomains. Contributed by Zhijie Shen (xgong: rev e4b4901d36875faa98ec8628e22e75499e0741ab) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java NPE from MemoryTimelineStore.getDomains --- Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2813.1.patch {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200125#comment-14200125 ] Hudson commented on YARN-2767: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #735 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/735/]) YARN-2767. Added a test case to verify that http static user cannot kill or submit apps in the secure mode. Contributed by Varun Vasudev. (zjshen: rev b4c951ab832f85189d815fb6df57eda4121c0199) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesHttpStaticUserPermissions.java RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode Key: YARN-2767 URL: https://issues.apache.org/jira/browse/YARN-2767 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch, apache-yarn-2767.2.patch, apache-yarn-2767.3.patch We should add a test to ensure that the http static user used to access the RM web interface can't submit or kill apps if the cluster is running in secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200195#comment-14200195 ] Hudson commented on YARN-2579: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2579. Fixed a deadlock issue when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time. Contributed by Rohith Sharmaks (jianhe: rev 395275af8622c780b9071c243422b0780e096202) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one of the RMs' ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200199#comment-14200199 ] Hudson commented on YARN-2813: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2813. Fixed NPE from MemoryTimelineStore.getDomains. Contributed by Zhijie Shen (xgong: rev e4b4901d36875faa98ec8628e22e75499e0741ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/CHANGES.txt NPE from MemoryTimelineStore.getDomains --- Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2813.1.patch {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at
[jira] [Commented] (YARN-2805) RM2 in HA setup tries to login using the RM1's kerberos principal
[ https://issues.apache.org/jira/browse/YARN-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200196#comment-14200196 ] Hudson commented on YARN-2805: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2805. Fixed ResourceManager to load HA configs correctly before kerberos login. Contributed by Wangda Tan. (vinodkv: rev 834e931d8efe4d806347b266e7e62929ce05389b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java RM2 in HA setup tries to login using the RM1's kerberos principal - Key: YARN-2805 URL: https://issues.apache.org/jira/browse/YARN-2805 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2805.1.patch {code} 2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT] 2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229) Caused by: java.io.IOException: Login failure for rm/i...@example.com from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200198#comment-14200198 ] Hudson commented on YARN-2812: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1925/]) YARN-2812. TestApplicationHistoryServer is likely to fail on less powerful machine. Contributed by Zhijie Shen (xgong: rev b0b52c4e11336ca2ad6a02d64c0b5d5a8f1339ae) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java TestApplicationHistoryServer is likely to fail on less powerful machine --- Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2812.1.patch {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} {code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200241#comment-14200241 ] Jason Lowe commented on YARN-2816: -- This seems like a dubious use case. If something comes along and deletes (i.e.: corrupts) the leveldb database then in general the NM will not be able to recover properly. Trying to patch up one particular scenario won't cover the rest, and containers could leak (i.e.: be forgotten even though they're still running), container start requests could be lost, etc. As for the OS crash scenario, if the OS crashes then there's nothing left for the NM to recover. If we really want to protect against OS crashes then a much better way is to perform synchronous writes to leveldb. However this is _much_ slower than asynchronous writes and could easily impact NM performance. Given that there's nothing to recover from the OS crash scenario, it doesn't seem worth worrying about that case. The real issue for the reported scenario is that the leveldb database location is a poor one for the way that system is configured, since something is coming along and corrupting the database. Either the leveldb database needs to be moved somewhere else or the file cleanup procedure needs to exclude the leveldb database. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2816.000.patch The NM fails to start with an NPE during container recovery. We saw the following crash: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is that some DB files used by NMLeveldbStateStoreService were accidentally deleted, to save disk space, at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When a container is recovered in ContainerManagerImpl#recoverContainer, the NullPointerException at the following code causes the NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
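Whatever direction the jira takes, the NPE itself comes from dereferencing a null start request in the quoted snippet. A hedged sketch of one defensive variant, assuming the surrounding ContainerManagerImpl#recoverContainer context (rcs and LOG in scope); the skip-and-warn policy is an assumption, not the committed change.
{code:java}
StartContainerRequest req = rcs.getStartRequest();
if (req == null) {
  // Incomplete record: the CONTAINER_REQUEST_KEY_SUFFIX entry was lost
  // (e.g. the leveldb files were deleted). Skipping the record is one
  // possible policy; aborting recovery with a clear message is another.
  LOG.warn("Skipping recovery of container with missing start request"
      + " in state store: " + rcs);
  return;
}
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}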
[jira] [Updated] (YARN-2647) Add yarn queue CLI to get queue infos
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2647: -- Attachment: 0008-YARN-2647.patch Hi [~gp.leftnoteasy], thank you. I have updated the patch as per the comments. However, I feel we can avoid having '-' between the field labels, because we already have fields like Maximum Capacity, Current Capacity, etc. Changing it only for node-labels did not look good, so I kept it as is and also removed it for queue. Kindly share your thoughts. Add yarn queue CLI to get queue infos - Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-2647.patch, 0002-YARN-2647.patch, 0003-YARN-2647.patch, 0004-YARN-2647.patch, 0005-YARN-2647.patch, 0006-YARN-2647.patch, 0007-YARN-2647.patch, 0008-YARN-2647.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200294#comment-14200294 ] Hudson commented on YARN-2813: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2813. Fixed NPE from MemoryTimelineStore.getDomains. Contributed by Zhijie Shen (xgong: rev e4b4901d36875faa98ec8628e22e75499e0741ab) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java NPE from MemoryTimelineStore.getDomains --- Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2813.1.patch {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200300#comment-14200300 ] Hudson commented on YARN-2767: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2767. Added a test case to verify that http static user cannot kill or submit apps in the secure mode. Contributed by Varun Vasudev. (zjshen: rev b4c951ab832f85189d815fb6df57eda4121c0199) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesHttpStaticUserPermissions.java RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode Key: YARN-2767 URL: https://issues.apache.org/jira/browse/YARN-2767 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch, apache-yarn-2767.2.patch, apache-yarn-2767.3.patch We should add a test to ensure that the http static user used to access the RM web interface can't submit or kill apps if the cluster is running in secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200292#comment-14200292 ] Hudson commented on YARN-2579: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2579. Fixed a deadlock issue when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time. Contributed by Rohith Sharmaks (jianhe: rev 395275af8622c780b9071c243422b0780e096202) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java Deadlock when EmbeddedElectorService and FatalEventDispatcher try to transition RM to StandBy at the same time -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one of the RMs' ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2805) RM2 in HA setup tries to login using the RM1's kerberos principal
[ https://issues.apache.org/jira/browse/YARN-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200291#comment-14200291 ] Hudson commented on YARN-2805: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2805. Fixed ResourceManager to load HA configs correctly before kerberos login. Contributed by Wangda Tan. (vinodkv: rev 834e931d8efe4d806347b266e7e62929ce05389b) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java RM2 in HA setup tries to login using the RM1's kerberos principal - Key: YARN-2805 URL: https://issues.apache.org/jira/browse/YARN-2805 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2805.1.patch {code} 2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT] 2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229) Caused by: java.io.IOException: Login failure for rm/i...@example.com from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
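The commit message points at ordering: the HA configuration, which determines this RM's own service address, has to be resolved before the keytab login, so that a principal pattern like rm/_HOST@REALM expands to this host rather than RM1's. A minimal sketch of that ordering using the standard YarnConfiguration keys (an illustration of the idea only, not the committed patch):
{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmLoginSketch {
  // Resolve this RM's own bind address first (in an HA setup this depends
  // on which RM id the local configs select), then log in from the keytab
  // so that _HOST in the principal expands to the right hostname.
  static void doSecureLogin(Configuration conf) throws Exception {
    InetSocketAddress addr = conf.getSocketAddr(
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_PORT);
    SecurityUtil.login(conf, YarnConfiguration.RM_KEYTAB,
        YarnConfiguration.RM_PRINCIPAL, addr.getHostName());
  }
}
{code}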
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200293#comment-14200293 ] Hudson commented on YARN-2812: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1949 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1949/]) YARN-2812. TestApplicationHistoryServer is likely to fail on less powerful machine. Contributed by Zhijie Shen (xgong: rev b0b52c4e11336ca2ad6a02d64c0b5d5a8f1339ae) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/CHANGES.txt TestApplicationHistoryServer is likely to fail on less powerful machine --- Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2812.1.patch {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} {code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
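The first failure is simply a JUnit timeout that is far too tight: 5 milliseconds cannot cover even one reverse-DNS lookup on a slow machine, and the usual remedy is a generous timeout, as in the sketch below (the concrete value and test body are illustrative, not the ones in the patch). The second failure, a leveldb LOCK already held by another process, instead calls for giving each test run its own store directory.
{code}
import static org.junit.Assert.assertNotNull;
import org.junit.Test;

public class TimeoutSketch {
  // Tens of seconds instead of 5 ms: slow DNS or a loaded Jenkins box
  // should not be able to fail the test spuriously.
  @Test(timeout = 50000)
  public void testServerStartsWithinTimeout() throws Exception {
    Object server = startServer(); // placeholder for the real AHS setup
    assertNotNull(server);
  }

  private Object startServer() {
    return new Object(); // stand-in; the real test starts ApplicationHistoryServer
  }
}
{code}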
[jira] [Updated] (YARN-2780) Log aggregated resource allocation in rm-appsummary.log
[ https://issues.apache.org/jira/browse/YARN-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2780: - Attachment: YARN-2780.v2.201411061601.txt [~knoguchi], I am attaching a new patch that applies cleanly to trunk. {quote} Output looks mostly good. We may want to have different format for preemptedResources so that it doesn't use the same delimiter(comma). But since application name can also include comma, maybe it's a non-issue. {quote} The code handles the comma by putting a backslash (\) in front of it. We could change the comma to something else before printing it out, but I think it is not a problem. Log aggregated resource allocation in rm-appsummary.log --- Key: YARN-2780 URL: https://issues.apache.org/jira/browse/YARN-2780 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.1 Reporter: Koji Noguchi Assignee: Eric Payne Priority: Minor Attachments: YARN-2780.v1.201411031728.txt, YARN-2780.v2.201411061601.txt YARN-415 added useful information about resource usage by applications. Asking to log that info inside rm-appsummary.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
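For readers following the escaping discussion: Hadoop ships a helper that backslash-escapes commas, which illustrates the behavior Eric describes (whether the patch calls exactly this helper is not shown in this thread):
{code}
import org.apache.hadoop.util.StringUtils;

public class EscapeDemo {
  public static void main(String[] args) {
    // The comma delimiter inside a field value gets a backslash in front
    // of it, so a field like an application name containing commas stays
    // parseable in a comma-delimited summary line.
    String name = "my app, with a comma";
    String escaped = StringUtils.escapeString(name);
    System.out.println(escaped); // prints: my app\, with a comma
    System.out.println(StringUtils.unEscapeString(escaped)); // round-trips
  }
}
{code}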
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200385#comment-14200385 ] Karthik Kambatla commented on YARN-2139: Thanks for chiming in, Arun. This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically. bq. We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN. The Linux aspects are only for isolation, and this needs to be pluggable. Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later). bq. We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out my container *needs* 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. To *guarantee* bandwidth, I believe the scheduler has to be super pessimistic with its allocations. Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items. bq. Spindle locality or I/O parallelism is a real concern Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
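To make the vdisk knob concrete, the following is a hypothetical sketch of the NM-side configuration read; the property name and default are invented here, modeled on how CPU vcores are configured, and the real names would come from the YARN-2139 patches:
{code}
import org.apache.hadoop.conf.Configuration;

public class VdiskConfigSketch {
  // Hypothetical property, by analogy with yarn.nodemanager.resource.cpu-vcores.
  static final String NM_VDISKS = "yarn.nodemanager.resource.vdisks";
  static final int DEFAULT_NM_VDISKS = 4;

  // Each node advertises an integral number of vdisks, which avoids the
  // floating-point arithmetic that per-disk bandwidth shares would need
  // and tolerates nodes with different disk counts.
  static int getNodeVdisks(Configuration conf) {
    return conf.getInt(NM_VDISKS, DEFAULT_NM_VDISKS);
  }
}
{code}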
[jira] [Comment Edited] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200385#comment-14200385 ] Karthik Kambatla edited comment on YARN-2139 at 11/6/14 4:27 PM: - Thanks for chiming in, Arun. This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically. bq. We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN. The Linux aspects are only for isolation, and this needs to be pluggable. Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later). bq. We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out my container *needs* 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. To *guarantee* bandwidth, I believe the scheduler has to be super conservative with its allocations. Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items. bq. Spindle locality or I/O parallelism is a real concern Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused. was (Author: kasha): Thanks for chiming in, Arun. This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically. bq. We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN. The Linux aspects are only for isolation, and this needs to be pluggable. Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later). bq. We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out my container *needs* 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. 
To *guarantee* bandwidth, I believe the scheduler has to be super pessimistic with its allocations. Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items. bq. Spindle locality or I/O parallelism is a real concern Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2818: -- Attachment: YARN-2818.2.patch Remove one more unnecessary method. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200511#comment-14200511 ] Hadoop QA commented on YARN-2818: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679878/YARN-2818.2.patch against trunk revision 10f9f51. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5754//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5754//console This message is automatically generated. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2780) Log aggregated resource allocation in rm-appsummary.log
[ https://issues.apache.org/jira/browse/YARN-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200517#comment-14200517 ] Hadoop QA commented on YARN-2780: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679871/YARN-2780.v2.201411061601.txt against trunk revision 10f9f51. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5753//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5753//console This message is automatically generated. Log aggregated resource allocation in rm-appsummary.log --- Key: YARN-2780 URL: https://issues.apache.org/jira/browse/YARN-2780 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.1 Reporter: Koji Noguchi Assignee: Eric Payne Priority: Minor Attachments: YARN-2780.v1.201411031728.txt, YARN-2780.v2.201411061601.txt YARN-415 added useful information about resource usage by applications. Asking to log that info inside rm-appsummary.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue infos
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200560#comment-14200560 ] Hadoop QA commented on YARN-2647: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679850/0008-YARN-2647.patch against trunk revision 10f9f51. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5755//console This message is automatically generated. Add yarn queue CLI to get queue infos - Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-2647.patch, 0002-YARN-2647.patch, 0003-YARN-2647.patch, 0004-YARN-2647.patch, 0005-YARN-2647.patch, 0006-YARN-2647.patch, 0007-YARN-2647.patch, 0008-YARN-2647.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
Gopal V created YARN-2819: - Summary: NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 Key: YARN-2819 URL: https://issues.apache.org/jira/browse/YARN-2819 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Gopal V {code} Caused by: java.lang.NullPointerException at java.lang.String.<init>(String.java:554) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) {code} triggered by
{code}
entity.getRelatedEntities();
...
} else {
  byte[] domainIdBytes = db.get(createDomainIdKey(
      relatedEntityId, relatedEntityType, relatedEntityStartTime));
  // This is the existing entity
  String domainId = new String(domainIdBytes);
  if (!domainId.equals(entity.getDomainId())) {
{code}
The {{new String(domainIdBytes)}} call throws an NPE when {{domainIdBytes}} is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
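The natural shape of a fix is a null guard around the lookup: an entity written by a 2.4 (pre-domain) timeline store has no domain entry, so {{db.get(...)}} returns null and the String constructor throws. A sketch under that assumption; the fallback value here is hypothetical, and the actual patch may instead backfill a default domain:
{code}
public class DomainIdGuard {
  static final String DEFAULT_DOMAIN_ID = "DEFAULT"; // hypothetical fallback

  // Null-safe version of the failing read: when the related entity predates
  // domains, substitute a default instead of dereferencing null.
  static String readDomainId(byte[] domainIdBytes) {
    return domainIdBytes == null ? DEFAULT_DOMAIN_ID
                                 : new String(domainIdBytes);
  }
}
{code}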
[jira] [Assigned] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2819: - Assignee: Zhijie Shen NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 -- Key: YARN-2819 URL: https://issues.apache.org/jira/browse/YARN-2819 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Gopal V Assignee: Zhijie Shen Labels: Upgrade {code} Caused by: java.lang.NullPointerException at java.lang.String.<init>(String.java:554) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) {code} triggered by
{code}
entity.getRelatedEntities();
...
} else {
  byte[] domainIdBytes = db.get(createDomainIdKey(
      relatedEntityId, relatedEntityType, relatedEntityStartTime));
  // This is the existing entity
  String domainId = new String(domainIdBytes);
  if (!domainId.equals(entity.getDomainId())) {
{code}
The {{new String(domainIdBytes)}} call throws an NPE when {{domainIdBytes}} is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200648#comment-14200648 ] Siqi Li commented on YARN-2811: --- [~sandyr] Thanks for your review. I was thinking about the same thing. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2819: -- Priority: Critical (was: Major) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 -- Key: YARN-2819 URL: https://issues.apache.org/jira/browse/YARN-2819 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Gopal V Assignee: Zhijie Shen Priority: Critical Labels: Upgrade {code} Caused by: java.lang.NullPointerException at java.lang.String.<init>(String.java:554) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) {code} triggered by
{code}
entity.getRelatedEntities();
...
} else {
  byte[] domainIdBytes = db.get(createDomainIdKey(
      relatedEntityId, relatedEntityType, relatedEntityStartTime));
  // This is the existing entity
  String domainId = new String(domainIdBytes);
  if (!domainId.equals(entity.getDomainId())) {
{code}
The {{new String(domainIdBytes)}} call throws an NPE when {{domainIdBytes}} is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2820) Improve FileSystemRMStateStore update failure exception handling to not shutdown RM.
zhihai xu created YARN-2820: --- Summary: Improve FileSystemRMStateStore update failure exception handling to not shutdown RM. Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down. {code} FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} It would be better to improve FileSystemRMStateStore's update failure exception handling so that it does not shut down the RM, and a single state write-out failure cannot stop all jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
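One plausible shape for the improvement is to retry the store write a few times and only escalate to a fatal event after repeated failures; transient HDFS errors like the one above would then be absorbed. The sketch below illustrates that idea only and is not the posted patch:
{code}
import java.io.IOException;

public class RetryingStoreWrite {
  interface Write { void run() throws IOException; }

  // Retry the write up to maxAttempts times (assumed >= 1); only a write
  // that keeps failing propagates, at which point the caller could still
  // choose to dispatch an RMFatalEvent as today.
  static void updateWithRetries(Write write, int maxAttempts)
      throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        write.run();
        return;
      } catch (IOException e) {
        last = e; // a real implementation would log and back off here
      }
    }
    throw last;
  }
}
{code}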
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200721#comment-14200721 ] Sanjay Radia commented on YARN-2678: Don't have time to review the code, but the proposed change is fine. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, YARN-2678-003.patch, YARN-2678-006.patch, YARN-2678-007.patch, YARN-2678-008.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
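The gain from recommendation 2 is easiest to see from the client side: with a list of dictionaries, a consumer reads named fields instead of relying on positional conventions inside a list of lists. A small illustrative parser for the proposed shape (Jackson is used here purely as an example JSON library):
{code}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class AddressParseSketch {
  public static void main(String[] args) throws Exception {
    // Proposed shape: each address is an object with named fields.
    String json = "{\"addresses\":[{"
        + "\"uri\":\"https://c6408.ambari.apache.org:46958/ws/v1/slider/agents\","
        + "\"host\":\"c6408.ambari.apache.org\",\"port\":46958}]}";
    JsonNode addr = new ObjectMapper().readTree(json).get("addresses").get(0);
    // No positional parsing: host and port come out by name, and the full
    // uri is still available without the client reassembling it.
    System.out.println(addr.get("host").asText() + ":" + addr.get("port").asInt());
  }
}
{code}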
[jira] [Reopened] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman reopened YARN-2791: -- Assignee: Yuliya Feldman I think this JIRA should be reopened, since https://issues.apache.org/jira/browse/YARN-2817, which was submitted later, is talking about absolutely the same thing. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200728#comment-14200728 ] Gour Saha commented on YARN-2678: - + 1 (non binding) We already consumed the changes in Apache Slider agents and everything looks good. All tests passed. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, YARN-2678-003.patch, YARN-2678-006.patch, YARN-2678-007.patch, YARN-2678-008.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200739#comment-14200739 ] Wei Yan commented on YARN-2791: --- Hi, [~yufeldman], this jira is the same as YARN-2139. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200770#comment-14200770 ] Vinod Kumar Vavilapalli commented on YARN-2818: --- Makes sense given - We now do all authz based on domains - User filter was always a hidden filter anyways +1, checking this in. Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200787#comment-14200787 ] Hudson commented on YARN-2818: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6468 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6468/]) YARN-2818. Removed the now unnecessary user entity injection from Timeline service given we now have domains. Contributed by Zhijie Shen. (vinodkv: rev f5b19bed7d71979dc8685b03152188902b6e45e9) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java Remove the logic to inject entity owner as the primary filter - Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Fix For: 2.6.0 Attachments: YARN-2818.1.patch, YARN-2818.2.patch In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There're two motivations: 1. For leveldb timeline store, the primary filter is expensive. When we have a primary filter, we need to make a complete copy of the entity on the logic index table. 2. Owner info is incomplete. Say we want to put E1 (owner = tester, relatedEntity = E2). If E2 doesn't exist before, leveldb timeline store will create an empty E2 without owner info (at the db point of view, it doesn't know owner is a special primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2817) Disk drive as a resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200818#comment-14200818 ] Swapnil Daingade commented on YARN-2817: Hi Arun, this is what we were proposing in https://issues.apache.org/jira/browse/YARN-2791. We have already been shipping this to customers for the last 2 months and would be happy to contribute it back. Disk drive as a resource in YARN Key: YARN-2817 URL: https://issues.apache.org/jira/browse/YARN-2817 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Arun C Murthy Assignee: Arun C Murthy As YARN continues to cover new ground in terms of new workloads, disk is becoming a very important resource to govern. It might be prudent to start with something very simple - allow applications to request entire drives (e.g. 2 drives out of the 12 available on a node), we can then also add support for specific iops, bandwidth etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200823#comment-14200823 ] Aditya Kishore commented on YARN-2791: -- I think the summary of this JIRA may make it seem like a duplicate of YARN-2139, but they are not. YARN-2139 aims to address throttling/isolation of disk IO on an individual container basis. However, from the description it seems that the purpose of this JIRA is to include the node's disks as a parameter in the capacity calculation of the node, alongside its memory and CPU cores. Maybe the summary should be reworded to reflect this. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200829#comment-14200829 ] Karthik Kambatla commented on YARN-2791: [~adityakishore] - from reading the description, I believe there is at least a significant overlap between the two JIRAs. I think we would benefit from consolidating them and working together, rather than taking multiple paths. [~sdaingade] - nice to know you have something working. Could you look through the design doc on YARN-2139 so we can refine it? If it is significantly different from what is posted there, can you also post your design so we can evaluate which one is better and move forward? Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200832#comment-14200832 ] Hadoop QA commented on YARN-2811: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679905/YARN-2811.v3.patch against trunk revision 10f9f51. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5756//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5756//console This message is automatically generated. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200849#comment-14200849 ] Yuliya Feldman commented on YARN-2791: -- [~kasha] I agree that consolidating both and working together is a way to go here. Let's initiate this. What I don't quite understand is how https://issues.apache.org/jira/browse/YARN-2817 is different from this one or from https://issues.apache.org/jira/browse/YARN-2139 Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200861#comment-14200861 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6469 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6469/]) YARN-2768 Improved Yarn Registry service record structure (stevel) (stevel: rev 1670578018b3210d518408530858a869e37b23cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/impl/zk/RegistryOperationsService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/RegistryTestHelper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/cli/RegistryCli.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/AddressTypes.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/registry/yarn-registry.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/binding/RegistryTypeUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/ProtocolTypes.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/operations/TestRegistryOperations.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/ServiceRecordHeader.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/exceptions/NoRecordException.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/tla/yarnregistry.tla * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/ServiceRecord.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/binding/JsonSerDeser.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/binding/RegistryUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/types/Endpoint.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/client/binding/TestMarshalling.java optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). 
The code of FSAppAttempt.updateDemand:
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
        Resources.addTo(demand, total);
      }
    }
  }
}
{code}
The code of Resources.multiply:
{code}
public static Resource multiply(Resource lhs, double by) {
  return multiplyTo(clone(lhs), by);
}
{code}
The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
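A sketch of the clone-free variant being proposed, in the same shape as the snippet above: multiply into plain integers and write the totals into the demand Resource once, so no per-request Resource is allocated (illustrative only; the committed change may differ in details):
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  synchronized (app) {
    int memory = demand.getMemory();
    int vcores = demand.getVirtualCores();
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        // Accumulate in primitives instead of cloning a Resource per request.
        Resource cap = r.getCapability();
        memory += cap.getMemory() * r.getNumContainers();
        vcores += cap.getVirtualCores() * r.getNumContainers();
      }
    }
    demand.setMemory(memory);
    demand.setVirtualCores(vcores);
  }
}
{code}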
[jira] [Created] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
Varun Vasudev created YARN-2821: --- Summary: Distributed shell app master becomes unresponsive sometimes Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs - {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_05, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO 
distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_04 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_04 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType:
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200900#comment-14200900 ] Karthik Kambatla commented on YARN-2791: bq. how https://issues.apache.org/jira/browse/YARN-2817 is different from this one or from https://issues.apache.org/jira/browse/YARN-2139 I personally don't think it is. Even if we want to address a simpler use case, I would suggest we do that as part of YARN-2139. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2821: Attachment: apache-yarn-2821.0.patch Uploaded patch with fix. Distributed shell app master becomes unresponsive sometimes --- Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2821.0.patch We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs - {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., 
containerId=container_1415123350094_0017_01_05, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_04 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez4:45454 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200961#comment-14200961 ] Swapnil Daingade commented on YARN-2791: Karthik Kambatla - I'll go through the design doc for YARN-2139 and post our design document here as well. We can then decide if we should have two JIRAs or combine them, as Aditya Kishore suggested. However, I fully support combining our efforts on this. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200973#comment-14200973 ] Wangda Tan commented on YARN-2753: -- +1 for latest patch, thanks for update! [~zxu] Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists otherwise the Label.resource will be changed(reset). * potential NPE(NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. ** because when a Node is created, Node.labels can be null. ** In this case, nm.labels; may be null. So we need check originalLabels not null before use it(originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java. because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonsNodeLabelsManager, after serviceStop(...) is invoked, some event may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
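As an illustration of the null guard called out in the issue description above, here is a minimal, self-contained sketch (the types and method shape are simplified assumptions, not the actual CommonNodeLabelsManager code): a node created without labels can carry a null label set, so it must be checked before calling containsAll.
{code}
import java.io.IOException;
import java.util.Collections;
import java.util.Set;

// Minimal sketch of the guard discussed in this issue; simplified stand-ins,
// not the real CommonNodeLabelsManager internals.
public class RemoveLabelsGuardSketch {
  static void checkRemoveLabelsFromNode(Set<String> originalLabels, Set<String> toRemove)
      throws IOException {
    // originalLabels can be null when the node was created without labels,
    // so guard it before containsAll() to avoid the NPE.
    if (originalLabels == null || !originalLabels.containsAll(toRemove)) {
      throw new IOException("Cannot remove labels " + toRemove + ": not all present on the node");
    }
  }

  public static void main(String[] args) throws IOException {
    checkRemoveLabelsFromNode(Set.of("gpu", "ssd"), Collections.singleton("gpu")); // passes
    try {
      checkRemoveLabelsFromNode(null, Collections.singleton("gpu"));
    } catch (IOException expected) {
      // rejected with a clear error instead of a NullPointerException
      System.out.println("Rejected cleanly: " + expected.getMessage());
    }
  }
}
{code}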
[jira] [Created] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
Jian He created YARN-2822: - Summary: NPE when RM tries to transfer state from previous attempt on recovery Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
[ https://issues.apache.org/jira/browse/YARN-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200977#comment-14200977 ] Jian He commented on YARN-2822: --- The problem is that on recovery, if the previous attempt has already finished, we are not adding it to the scheduler. When the scheduler then tries to transferStateFromPreviousAttempt for work-preserving AM restart, it throws an NPE. NPE when RM tries to transfer state from previous attempt on recovery - Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
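For readers unfamiliar with the recovery path, the following is a minimal sketch of the failure mode described in the comment above (class and field names are simplified assumptions, not the actual RM code): because the finished attempt was never added to the scheduler, the previous-attempt lookup returns null and the state transfer dereferences it.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of the NPE above; not the real SchedulerApplicationAttempt.
public class TransferStateSketch {
  static class Attempt {
    final List<String> liveContainers = new ArrayList<>();
    void transferStateFromPreviousAttempt(Attempt previous) {
      // Dereferencing 'previous' without a null guard is where the NPE occurs.
      liveContainers.addAll(previous.liveContainers);
    }
  }

  public static void main(String[] args) {
    // On recovery the already-finished attempt was never registered here.
    Map<String, Attempt> schedulerAttempts = new HashMap<>();
    Attempt current = new Attempt();
    Attempt previous = schedulerAttempts.get("appattempt_0001_000001"); // -> null
    current.transferStateFromPreviousAttempt(previous); // throws NullPointerException
  }
}
{code}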
[jira] [Created] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
Gour Saha created YARN-2823: --- Summary: NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Reporter: Gour Saha Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-2823: Affects Version/s: 2.6.0 NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Gour Saha Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-2823: Component/s: resourcemanager NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gour Saha updated YARN-2823: Attachment: logs_with_NPE_in_RM.zip NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Attachments: logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200997#comment-14200997 ] Hadoop QA commented on YARN-2821: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679948/apache-yarn-2821.0.patch against trunk revision 1670578. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5757//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5757//console This message is automatically generated. Distributed shell app master becomes unresponsive sometimes --- Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2821.0.patch We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. snippet of the logs - {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 
14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04,
[jira] [Created] (YARN-2824) Capacity of labels should be zero by default
Wangda Tan created YARN-2824: Summary: Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be set explicitly if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
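To make the proposed default concrete, here is a small sketch of the lookup behavior described above (the property-key format and helper are illustrative assumptions, not the exact CapacitySchedulerConfiguration API): a label with no configured capacity defaults to 0 and is treated as unused rather than failing queue initialization.
{code}
import java.util.Map;

// Illustrative sketch only; the property key and helper are assumptions,
// not the exact capacity-scheduler.xml keys or CapacitySchedulerConfiguration API.
public class LabelCapacitySketch {
  static float labelCapacity(Map<String, String> conf, String queuePath, String label) {
    String key = "yarn.scheduler.capacity." + queuePath
        + ".accessible-node-labels." + label + ".capacity";
    String value = conf.get(key);
    // Default to 0 instead of failing: a zero-capacity label is simply unused by the queue.
    return value == null ? 0f : Float.parseFloat(value);
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(
        "yarn.scheduler.capacity.root.a.accessible-node-labels.gpu.capacity", "50");
    System.out.println(labelCapacity(conf, "root.a", "gpu"));      // 50.0
    System.out.println(labelCapacity(conf, "root.a", "largemem")); // 0.0 -> treated as unused
  }
}
{code}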
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201020#comment-14201020 ] zhihai xu commented on YARN-2816: - Hi [~jlowe], thanks for the review. Based on your comment, it looks like the issue is not as serious as I thought previously, so I lowered the issue from Critical to Major. Moving the levelDB database location from the /tmp directory to a safer directory is a very good suggestion. But I still think the patch is good for better error handling and NM recovery. 1. If an OS crash causes a few partially written log records in the levelDB, we can't restart the NM due to the NPE. To restart the NM, we would need to manually delete all these stateStore files, which is bad for the user. Also, with the patch we can still recover most of the containers in the NM instead of losing all their information, which would cause all these containers to be reallocated by the RM. 2. The patch deletes all containers which don't have container start requests. It won't cause container leaks, because the container start request is always the first entry stored (startContainerInternal) in the levelDB for each container record and always the first entry removed (removeContainer). Also, when I debugged this problem, the levelDB kept the most recent records in an LRU cache (memory); when the levelDB is closed, it writes these latest records from the cache to disk, and our use case for levelDB is always to write key-value records with no reads after NM initialization. The records lost on disk will always be the older ones, so the container start request record is more likely to be lost than other container records. This makes the patch meaningful for our use case. I attached a LevelDB container key record which has this issue. 3. A missing container record is not a problem if we can recover most containers in the NM. When the AM calls ContainerManagementProtocol#getContainerStatuses, the NM will put the missing containers in the failed requests of GetContainerStatusesResponse, and the AM will then notify the RM to release these missing containers via ApplicationMasterProtocol#allocate (AllocateRequest.getReleaseList). So the missing containers will be recovered quickly by the AM. With the patch, we can still recover all containers that have complete records in the levelDB. The fewer containers we fail to recover, the better. 4. The patch is small and safe; it won't cause any side effects. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2816.000.patch NM fail to start with NPE during container recovery.
We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by
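As a rough illustration of the error handling argued for in the comment above (the names are simplified assumptions, not the actual ContainerManagerImpl or state-store classes), recovery can skip and drop a record whose start request is missing instead of letting the NPE abort NM startup:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: simplified stand-ins for the recovered state, not the real NM classes.
public class RecoverySketch {
  static class RecoveredContainerState {
    String containerId;
    Object startRequest; // null when the start-request entry was lost from the store
  }

  static List<String> recoverContainers(List<RecoveredContainerState> states) {
    List<String> recovered = new ArrayList<>();
    for (RecoveredContainerState rcs : states) {
      if (rcs.startRequest == null) {
        // Incomplete record: log it and drop the container instead of crashing recovery.
        System.err.println("Skipping incomplete record for " + rcs.containerId);
        continue;
      }
      recovered.add(rcs.containerId);
    }
    return recovered;
  }

  public static void main(String[] args) {
    RecoveredContainerState ok = new RecoveredContainerState();
    ok.containerId = "container_1";
    ok.startRequest = new Object();
    RecoveredContainerState broken = new RecoveredContainerState();
    broken.containerId = "container_2"; // start request missing
    System.out.println("Recovered: " + recoverContainers(Arrays.asList(ok, broken)));
  }
}
{code}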
[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2816: Attachment: leveldb_records.txt NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2816: Priority: Major (was: Critical) NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
[ https://issues.apache.org/jira/browse/YARN-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2822: -- Attachment: YARN-2822.1.patch Upload a patch to add the previously finished attempt to scheduler NPE when RM tries to transfer state from previous attempt on recovery - Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2822.1.patch {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201025#comment-14201025 ] Hadoop QA commented on YARN-2816: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679963/leveldb_records.txt against trunk revision 75b820c. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5758//console This message is automatically generated. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, leveldb_records.txt NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. {code} StartContainerRequest req = rcs.getStartRequest(); ContainerLaunchContext launchContext = req.getContainerLaunchContext(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2823: - Assignee: Jian He NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201038#comment-14201038 ] zhihai xu commented on YARN-2753: - Hi [~xgong], Could you help review and commit the patch? since Wangda already reviewed the patch. thanks zhihai Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists otherwise the Label.resource will be changed(reset). * potential NPE(NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. ** because when a Node is created, Node.labels can be null. ** In this case, nm.labels; may be null. So we need check originalLabels not null before use it(originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java. because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonsNodeLabelsManager, after serviceStop(...) is invoked, some event may not be processed, see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201040#comment-14201040 ] Jian He commented on YARN-2823: --- The problem is that on recovery, if the previous attempt has already finished, we are not adding it to the scheduler. When the scheduler then tries to transferStateFromPreviousAttempt for work-preserving AM restart, it throws an NPE. NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201041#comment-14201041 ] Jian He commented on YARN-2823: --- Upload a patch to add the previously finished attempt to scheduler NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2823: -- Attachment: YARN-2823.1.patch NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smooth (used Ambari) and then installed HBase using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both the RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster has been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201101#comment-14201101 ] Karthik Kambatla commented on YARN-2791: [~adityakishore] - unfortunately, YARN-2139 didn't have a description until now. I just added a very succinct one. If you look at the design doc itself and the sub-tasks created, you would see that YARN-2139 includes adding disk as a resource to the resource-request-vector, the node-capacities, and the scheduler. [~sdaingade] - looking forward to hearing your thoughts and seeing a design doc. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201113#comment-14201113 ] Aditya Kishore commented on YARN-2791: -- Great! I think this JIRA should be added as a sub-task, as none of the other sub-tasks cover this aspect. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2139: -- Attachment: YARN-2139-prototype.patch I am submitting a prototype of the code implementation to illustrate the basic design. The code changes fall into three major parts: (1) API: add vdisks as a 3rd type of resource, besides CPU/memory. The NM will specify its own vdisks resource, and the AM includes vdisks in the resource request. (2) Scheduler: the scheduler will consider vdisks availability when scheduling. Additionally, the DRF policy also considers vdisks when choosing the dominant resource. (3) I/O isolation: this is implemented on the NM side, using cgroup's blkio subsystem for container I/O isolation. Will separate the patch into several sub-task patches once more comments on the design and implementation are collected. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype.patch YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
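To show how a third dimension changes the DRF ranking mentioned in part (2) above, here is a small sketch of dominant-share computation (the numbers and the vdisks unit are illustrative assumptions, not taken from the prototype patch):
{code}
// Dominant Resource Fairness over three dimensions; illustrative only.
public class DominantResourceSketch {
  // Dominant share = max over all resources of (usage / cluster total).
  static double dominantShare(double[] used, double[] clusterTotal) {
    double max = 0.0;
    for (int i = 0; i < used.length; i++) {
      max = Math.max(max, used[i] / clusterTotal[i]);
    }
    return max;
  }

  public static void main(String[] args) {
    double[] clusterTotal = {102400, 100, 40}; // memory MB, vcores, vdisks
    double[] appUsage     = {10240, 4, 20};
    // With memory/vcores only, the dominant share is 0.10; counting vdisks it is 0.50,
    // which changes how the scheduler orders this application against others.
    System.out.println(dominantShare(appUsage, clusterTotal));
  }
}
{code}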
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201160#comment-14201160 ] Wei Yan commented on YARN-2139: --- This prototype patch is posted to garner feedback. The spindle locality work has not been finished; I will post an update once it is ready. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype.patch YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201166#comment-14201166 ] Eric Payne commented on YARN-2056: -- [~leftnoteasy], I'm sorry, but there is one more thing that needs to be modified in the current design. The current patch only allows the disable queue preemption flag to be set on leaf queues. However, after discussing this internally, we need to be able to have leaf queues inherit this property from their parent. Only setting the disable queue preemption property on leaf queues was an intentional design decision to begin with, because inheriting this property from a parent would impose a new set of requirements. Consider this use case: - root queue has children A and B - A has children A1 and A2 - B has children B1 and B2 - A should not be preemptable - A1 and A2 should be able to preempt each other In this use case, if A is over capacity, B should not be able to preempt it. However, if A1 is over capacity, A2 should be able to preempt A1. I believe I can make the leaf queues inherit this property from their parents and still solve the above use case. I will be putting up a new patch (hopefully) tomorrow. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, YARN-2056.201410232244.txt, YARN-2056.201410311746.txt, YARN-2056.201411041635.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
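One simple way to express the inheritance rule in that use case is a nearest-explicit-setting walk up the queue path; the sketch below is only an illustration with made-up keys and a plain map standing in for configuration, not the actual CapacityScheduler properties or the pending patch, and it deliberately ignores the level-dependent subtleties of cross-queue preemption.
{code}
import java.util.HashMap;
import java.util.Map;

// Illustration of "a leaf queue inherits disable-preemption from its ancestors";
// the queue paths and flag map are made up for the use case described above.
public class PreemptionInheritanceSketch {
  static boolean isPreemptionDisabled(String queuePath, Map<String, Boolean> explicitFlags) {
    for (String path = queuePath; path != null; ) {
      Boolean flag = explicitFlags.get(path);
      if (flag != null) {
        return flag; // nearest explicit setting wins
      }
      int dot = path.lastIndexOf('.');
      path = (dot < 0) ? null : path.substring(0, dot); // walk up to the parent queue
    }
    return false; // default: preemptable
  }

  public static void main(String[] args) {
    Map<String, Boolean> flags = new HashMap<>();
    flags.put("root.A", true);     // A's resources should not be preempted from outside
    flags.put("root.A.A1", false); // but A1 and A2 may still preempt each other
    flags.put("root.A.A2", false);
    System.out.println(isPreemptionDisabled("root.A.A1", flags)); // false
    System.out.println(isPreemptionDisabled("root.B.B1", flags)); // false (nothing set on the B branch)
  }
}
{code}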
[jira] [Commented] (YARN-2822) NPE when RM tries to transfer state from previous attempt on recovery
[ https://issues.apache.org/jira/browse/YARN-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201175#comment-14201175 ] Hadoop QA commented on YARN-2822: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679964/YARN-2822.1.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5759//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5759//console This message is automatically generated. NPE when RM tries to transfer state from previous attempt on recovery - Key: YARN-2822 URL: https://issues.apache.org/jira/browse/YARN-2822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2822.1.patch {code} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2821) Distributed shell app master becomes unresponsive sometimes
[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201183#comment-14201183 ] Jian He commented on YARN-2821: --- Thanks Varun! The patch should solve the problem of releasing the extra allocated containers, but it seems that in some cases we may still run into the infinite loop. In finish(), it checks {{numCompletedContainers.get() != numTotalContainers}}, but numCompletedContainers could be incremented elsewhere; e.g. {{onStartContainerError}} also increments the numCompletedContainers count. Could numCompletedContainers go beyond numTotalContainers in such a scenario as well? Distributed shell app master becomes unresponsive sometimes --- Key: YARN-2821 URL: https://issues.apache.org/jira/browse/YARN-2821 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2821.0.patch We've noticed that once in a while the distributed shell app master becomes unresponsive and is eventually killed by the RM. A snippet of the logs: {noformat} 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: appattempt_1415123350094_0017_01 received 0 previous attempts' running containers on AM registration. 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[memory:10, vCores:1]Priority[0] 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : onprem-tez2:45454 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_02, containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1415123350094_0017_01_02 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : onprem-tez2:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez3:45454 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : onprem-tez4:45454 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, allocatedCnt=3 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_03, containerNode=onprem-tez2:45454,
containerNodeURI=onprem-tez2:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_04, containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_1415123350094_0017_01_05, containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, containerResourceMemory1024, containerResourceVirtualCores1 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_03 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_05 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_1415123350094_0017_01_04 14/11/04
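The hang described in the comment on this issue can be reproduced with a small self-contained sketch. The field names mirror the distributed shell AM, but this is not the actual ApplicationMaster code; it only demonstrates why a strict {{!=}} comparison never exits once the completed-container counter overshoots the total.
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the finish() hang: if completed containers are counted both on
// normal completion and in error callbacks such as onStartContainerError,
// the counter can overshoot numTotalContainers and a "!=" wait spins forever.
public final class FinishLoopSketch {
  private static final int NUM_TOTAL_CONTAINERS = 3;
  private static final AtomicInteger numCompletedContainers = new AtomicInteger();

  public static void main(String[] args) throws InterruptedException {
    // Simulate 3 normal completions plus 1 error callback that also increments.
    for (int i = 0; i < 4; i++) {
      numCompletedContainers.incrementAndGet();
    }

    // Fragile form (commented out): spins forever because 4 != 3 stays true.
    // while (numCompletedContainers.get() != NUM_TOTAL_CONTAINERS) { Thread.sleep(200); }

    // Safer form: exits once at least numTotalContainers are accounted for.
    while (numCompletedContainers.get() < NUM_TOTAL_CONTAINERS) {
      Thread.sleep(200);
    }
    System.out.println("finish() can proceed; completed=" + numCompletedContainers.get());
  }
}
{code}
De-duplicating completions by container id would be another way to keep the counter from overshooting.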
[jira] [Commented] (YARN-2823) NullPointerException in RM HA enabled 3-node cluster
[ https://issues.apache.org/jira/browse/YARN-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201193#comment-14201193 ] Hadoop QA commented on YARN-2823: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679967/YARN-2823.1.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5760//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5760//console This message is automatically generated. NullPointerException in RM HA enabled 3-node cluster Key: YARN-2823 URL: https://issues.apache.org/jira/browse/YARN-2823 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Gour Saha Assignee: Jian He Attachments: YARN-2823.1.patch, logs_with_NPE_in_RM.zip Branch: 2.6.0 Environment: A 3-node cluster with RM HA enabled. The HA setup went pretty smoothly (using Ambari), and then HBase was installed using Slider. After some time the RMs went down and would not come back up anymore. Following is the NPE we see in both RM logs. {noformat} 2014-09-16 01:36:28,037 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(612)) - Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:530) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:678) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:603) at java.lang.Thread.run(Thread.java:744) 2014-09-16 01:36:28,042 INFO resourcemanager.ResourceManager (ResourceManager.java:run(616)) - Exiting, bbye.. {noformat} All the logs for this 3-node cluster have been uploaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2824: - Attachment: YARN-2824-1.patch Attached a patch and added/updated tests. Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
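As a rough sketch of the intended default, the snippet below reads a label's capacity with a default of zero, so an unset label is treated as unused instead of failing queue initialization. The property-name pattern is an assumption for illustration and may not match the exact keys the patch touches.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: unset label capacity defaults to 0, meaning "unused label",
// instead of failing queue initialization. Property pattern is illustrative.
public final class LabelCapacitySketch {
  static float labelCapacity(Configuration conf, String queuePath, String label) {
    String key = "yarn.scheduler.capacity." + queuePath
        + ".accessible-node-labels." + label + ".capacity";
    return conf.getFloat(key, 0f); // default 0 -> label unused, no init failure
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setFloat("yarn.scheduler.capacity.root.a.accessible-node-labels.x.capacity", 50f);
    System.out.println(labelCapacity(conf, "root.a", "x")); // 50.0
    System.out.println(labelCapacity(conf, "root.a", "y")); // 0.0 -> label y unused
  }
}
{code}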
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue infos
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201226#comment-14201226 ] Wangda Tan commented on YARN-2647: -- Hi [~sunilg], bq. However I feel we can avoid having '-' in between the field labels. Because we already have fields like Maximum Capacity, Current Capacity etc. So if we change for node-labels, it was not much looking good. So I kept as it is and also removed for queue. Kindly share your thoughts. I'm fine with that :), and I think it is better to make it like Default Node Label Expression. Do you like it? :) And the patch failure is caused by the changes on yarn.cmd. A simple workaround is remove yarn.cmd from you patch. Maintain two patch, one is full patch, another one is without yarn.cmd. Add yarn queue CLI to get queue infos - Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-2647.patch, 0002-YARN-2647.patch, 0003-YARN-2647.patch, 0004-YARN-2647.patch, 0005-YARN-2647.patch, 0006-YARN-2647.patch, 0007-YARN-2647.patch, 0008-YARN-2647.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2791: - Assignee: Yuliya Feldman (was: Wangda Tan) Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization, as containers starved for I/O bandwidth hold on to other resources like CPU and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.
[ https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2803: -- Attachment: YARN-2803.0.patch Patch to revert the behavior for the jar creation location, which resolves this issue for non-secure Windows. As it is, it probably breaks secure Windows; I may look at making the behavior conditional (however, I suspect secure Windows will fail this test / is broken as well, so I'm not sure that making this conditional really matters at the moment...). MR distributed cache not working correctly on Windows after NodeManager privileged account changes. --- Key: YARN-2803 URL: https://issues.apache.org/jira/browse/YARN-2803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Priority: Critical Attachments: YARN-2803.0.patch This problem is visible by running {{TestMRJobs#testDistributedCache}} or {{TestUberAM#testDistributedCache}} on Windows. Both tests fail. Running git bisect, I traced it to the YARN-2198 patch to remove the need to run NodeManager as a privileged account. The tests started failing when that patch was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2825) Container leak on NM
Jian He created YARN-2825: - Summary: Container leak on NM Key: YARN-2825 URL: https://issues.apache.org/jira/browse/YARN-2825 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Priority: Critical Caused by YARN-1372. Thanks [~vinodkv] for pointing this out. The problem is that in YARN-1372 we changed the behavior to remove containers from NMContext only after the containers are acknowledged by the AM. But in the {{NodeStatusUpdaterImpl#removeCompletedContainersFromContext}} call, we didn't check whether the container is really completed or not. If the container is still running, we shouldn't remove the container from the context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
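A minimal sketch of the fix direction described above, checking the container's state before dropping it from the NM context; the types here are simplified stand-ins, not the real NodeStatusUpdaterImpl code.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: only remove a container from the NM context if it actually completed.
// A stale AM acknowledgement for a still-running container must not drop it.
public final class RemoveCompletedContainersSketch {
  enum ContainerState { RUNNING, COMPLETED }

  private final Map<String, ContainerState> containers = new ConcurrentHashMap<>();

  void removeCompletedContainersFromContext(Set<String> ackedByAM) {
    for (String containerId : ackedByAM) {
      // Guard: skip containers that have not reached a completed state yet.
      if (containers.get(containerId) == ContainerState.COMPLETED) {
        containers.remove(containerId);
      }
    }
  }

  public static void main(String[] args) {
    RemoveCompletedContainersSketch nm = new RemoveCompletedContainersSketch();
    nm.containers.put("c1", ContainerState.RUNNING);
    nm.containers.put("c2", ContainerState.COMPLETED);
    nm.removeCompletedContainersFromContext(Set.of("c1", "c2"));
    System.out.println(nm.containers.keySet()); // [c1] -> running container is kept
  }
}
{code}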
[jira] [Commented] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201351#comment-14201351 ] Hadoop QA commented on YARN-2824: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679998/YARN-2824-1.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5762//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5762//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5762//console This message is automatically generated. Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2744) Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist
[ https://issues.apache.org/jira/browse/YARN-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201370#comment-14201370 ] Vinod Kumar Vavilapalli commented on YARN-2744: --- Makes sense, looks good. +1. Checking this in. Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist - Key: YARN-2744 URL: https://issues.apache.org/jira/browse/YARN-2744 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Wangda Tan Priority: Critical Attachments: YARN-2744-20141025-1.patch, YARN-2744-20141025-2.patch Use the following steps: * Ensure default in-memory storage is configured for labels * Define some labels and assign nodes to labels (e.g. define two labels and assign both labels to the host on a one host cluster) * Invoke refreshQueues * Modify capacity scheduler to create two top level queues and allow access to the labels from both the queues * Assign appropriate label + queue specific capacities * Restart resource manager Noticed that RM starts without any issues. The labels are not preserved across restart and thus the capacity-scheduler ends up using labels that are no longer present. At this point submitting an application to YARN will not succeed as there are no resources available with the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.
[ https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201373#comment-14201373 ] Hadoop QA commented on YARN-2803: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680024/YARN-2803.0.patch against trunk revision 75b820c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5763//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5763//console This message is automatically generated. MR distributed cache not working correctly on Windows after NodeManager privileged account changes. --- Key: YARN-2803 URL: https://issues.apache.org/jira/browse/YARN-2803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Priority: Critical Attachments: YARN-2803.0.patch This problem is visible by running {{TestMRJobs#testDistributedCache}} or {{TestUberAM#testDistributedCache}} on Windows. Both tests fail. Running git bisect, I traced it to the YARN-2198 patch to remove the need to run NodeManager as a privileged account. The tests started failing when that patch was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.
[ https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201377#comment-14201377 ] Craig Welch commented on YARN-2803: --- The fix makes the existing unit tests which fail on Windows pass, so there is no need for additional tests; they're already there. MR distributed cache not working correctly on Windows after NodeManager privileged account changes. --- Key: YARN-2803 URL: https://issues.apache.org/jira/browse/YARN-2803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Priority: Critical Attachments: YARN-2803.0.patch This problem is visible by running {{TestMRJobs#testDistributedCache}} or {{TestUberAM#testDistributedCache}} on Windows. Both tests fail. Running git bisect, I traced it to the YARN-2198 patch to remove the need to run NodeManager as a privileged account. The tests started failing when that patch was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2744) Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist
[ https://issues.apache.org/jira/browse/YARN-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201382#comment-14201382 ] Hudson commented on YARN-2744: -- FAILURE: Integrated in Hadoop-trunk-Commit #6472 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6472/]) YARN-2744. Fixed CapacityScheduler to validate node-labels correctly against queues. Contributed by Wangda Tan. (vinodkv: rev a3839a9fbfb8eec396b9bf85472d25e0ffc3aab2) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueParsing.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java * hadoop-yarn-project/CHANGES.txt Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist - Key: YARN-2744 URL: https://issues.apache.org/jira/browse/YARN-2744 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Wangda Tan Priority: Critical Fix For: 2.6.0 Attachments: YARN-2744-20141025-1.patch, YARN-2744-20141025-2.patch Use the following steps: * Ensure default in-memory storage is configured for labels * Define some labels and assign nodes to labels (e.g. define two labels and assign both labels to the host on a one host cluster) * Invoke refreshQueues * Modify capacity scheduler to create two top level queues and allow access to the labels from both the queues * Assign appropriate label + queue specific capacities * Restart resource manager Noticed that RM starts without any issues. The labels are not preserved across restart and thus the capacity-scheduler ends up using labels that are no longer present. At this point submitting an application to YARN will not succeed as there are no resources available with the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2753) Fix potential issues and code clean up for *NodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201384#comment-14201384 ] Vinod Kumar Vavilapalli commented on YARN-2753: --- Let me look.. Fix potential issues and code clean up for *NodeLabelsManager - Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Sub-task Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2753.000.patch, YARN-2753.001.patch, YARN-2753.002.patch, YARN-2753.003.patch, YARN-2753.004.patch, YARN-2753.005.patch, YARN-2753.006.patch Issues include: * CommonNodeLabelsManager#addToCluserNodeLabels should not change the value in labelCollections if the key already exists; otherwise the Label.resource will be changed (reset). * potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. ** because when a Node is created, Node.labels can be null. ** In this case, nm.labels may be null. So we need to check that originalLabels is not null before using it (originalLabels.containsAll). * addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java, because we should protect labelCollections in RMNodeLabelsManager. * Fix a potential bug in CommonNodeLabelsManager: after serviceStop(...) is invoked, some events may not be processed; see [comment|https://issues.apache.org/jira/browse/YARN-2753?focusedCommentId=14197206page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197206] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
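For the NPE bullet above, a minimal sketch of the null guard could look like the following; the method shape is illustrative and does not reproduce the actual CommonNodeLabelsManager signature.
{code}
import java.util.Collections;
import java.util.Set;

// Sketch: a node created without labels can have a null label set, so guard
// before calling containsAll instead of dereferencing a possibly-null set.
public final class RemoveLabelsCheckSketch {
  static void checkRemoveLabelsFromNode(Set<String> originalLabels,
                                        Set<String> labelsToRemove) {
    Set<String> existing =
        (originalLabels == null) ? Collections.<String>emptySet() : originalLabels;
    if (!existing.containsAll(labelsToRemove)) {
      throw new IllegalArgumentException(
          "Trying to remove labels not assigned to the node: " + labelsToRemove);
    }
  }

  public static void main(String[] args) {
    try {
      checkRemoveLabelsFromNode(null, Collections.singleton("x"));
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected cleanly instead of an NPE: " + e.getMessage());
    }
  }
}
{code}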
[jira] [Commented] (YARN-2744) Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist
[ https://issues.apache.org/jira/browse/YARN-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201385#comment-14201385 ] Wangda Tan commented on YARN-2744: -- Thanks for Vinod's review and commit! Under some scenario, it is possible to end up with capacity scheduler configuration that uses labels that no longer exist - Key: YARN-2744 URL: https://issues.apache.org/jira/browse/YARN-2744 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Wangda Tan Priority: Critical Fix For: 2.6.0 Attachments: YARN-2744-20141025-1.patch, YARN-2744-20141025-2.patch Use the following steps: * Ensure default in-memory storage is configured for labels * Define some labels and assign nodes to labels (e.g. define two labels and assign both labels to the host on a one host cluster) * Invoke refreshQueues * Modify capacity scheduler to create two top level queues and allow access to the labels from both the queues * Assign appropriate label + queue specific capacities * Restart resource manager Noticed that RM starts without any issues. The labels are not preserved across restart and thus the capacity-scheduler ends up using labels that are no longer present. At this point submitting an application to YARN will not succeed as there are no resources available with the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.18.patch And, uploaded the wrong one. This should be the right one. Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.11.patch, YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201402#comment-14201402 ] Wangda Tan commented on YARN-2505: -- Hi [~cwelch], Latest patch LGTM, +1. Thanks! Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.11.patch, YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2808) yarn client tool can not list app_attempt's container info correctly
[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201409#comment-14201409 ] George Wong commented on YARN-2808: --- That's great. Looking forward to the new container command. yarn client tool can not list app_attempt's container info correctly Key: YARN-2808 URL: https://issues.apache.org/jira/browse/YARN-2808 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Gordon Wang Assignee: Naganarasimha G R When the timeline server is enabled, the yarn client cannot list the container info for an application attempt correctly. Here are the reproduction steps. # enable the yarn timeline server # submit an MR job # after the job is finished, use the yarn client to list the container info of the app attempt. Then, since the RM has cached the application's attempt info, the output shows {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :0 Container-Id Start Time Finish Time StateHost LOG-URL {noformat} But if the RM is restarted, the client can fetch the container info from the timeline server correctly. {noformat} [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_01 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 Total number of containers :4 Container-Id Start Time Finish Time StateHost LOG-URL container_1415168250217_0001_01_01 1415168318376 1415168349896COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_01/container_1415168250217_0001_01_01/hadoop container_1415168250217_0001_01_02 1415168326399 1415168334858COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_02/container_1415168250217_0001_01_02/hadoop container_1415168250217_0001_01_03 1415168326400 1415168335277COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_03/container_1415168250217_0001_01_03/hadoop container_1415168250217_0001_01_04 1415168335825 1415168343873COMPLETElocalhost.localdomain:47024 http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_04/container_1415168250217_0001_01_04/hadoop {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201413#comment-14201413 ] Wangda Tan commented on YARN-2824: -- Attached a new patch suppressing the findbugs warnings. Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch, YARN-2824-2.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2824) Capacity of labels should be zero by default
[ https://issues.apache.org/jira/browse/YARN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2824: - Attachment: YARN-2824-2.patch Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical Attachments: YARN-2824-1.patch, YARN-2824-2.patch In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization will fail. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2632) Document NM Restart feature
[ https://issues.apache.org/jira/browse/YARN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2632: - Attachment: YARN-2632-v3.patch Document NM Restart feature --- Key: YARN-2632 URL: https://issues.apache.org/jira/browse/YARN-2632 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2632-v2.patch, YARN-2632-v3.patch, YARN-2632.patch As a new feature in YARN, we should document this feature's behavior, configuration, and things to pay attention to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2632) Document NM Restart feature
[ https://issues.apache.org/jira/browse/YARN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201415#comment-14201415 ] Junping Du commented on YARN-2632: -- Thanks [~vvasudev] for the review and comments! bq. I might be wrong but NM work preserving restart doesn't work with ephemeral ports, right? We should specify that in the documentation (as well as how to set the NM port in yarn-site.xml). Nice catch! Addressed this in the v3 patch. Document NM Restart feature --- Key: YARN-2632 URL: https://issues.apache.org/jira/browse/YARN-2632 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2632-v2.patch, YARN-2632-v3.patch, YARN-2632.patch As a new feature in YARN, we should document this feature's behavior, configuration, and things to pay attention to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
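For reference, here is a sketch of the yarn-site.xml settings the documentation change covers, expressed through the Configuration API for brevity. yarn.nodemanager.recovery.enabled, yarn.nodemanager.recovery.dir, and yarn.nodemanager.address are the standard property names, while the example port and recovery path are placeholders.
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of the settings work-preserving NM restart relies on. In practice
// these belong in yarn-site.xml; the port and path below are placeholders.
public final class NmRestartConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
    conf.set("yarn.nodemanager.recovery.dir", "/var/lib/hadoop-yarn/nm-recovery");
    // A fixed port is required: the default port 0 picks an ephemeral port,
    // which breaks reconnection to the NM after a restart.
    conf.set("yarn.nodemanager.address", "0.0.0.0:45454");
    System.out.println(conf.get("yarn.nodemanager.address"));
  }
}
{code}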
[jira] [Updated] (YARN-2825) Container leak on NM
[ https://issues.apache.org/jira/browse/YARN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2825: -- Attachment: YARN-2825.1.patch Uploaded a patch that checks whether the container is in the completed state before removing it from the NM context. Container leak on NM Key: YARN-2825 URL: https://issues.apache.org/jira/browse/YARN-2825 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Priority: Critical Attachments: YARN-2825.1.patch Caused by YARN-1372. Thanks [~vinodkv] for pointing this out. The problem is that in YARN-1372 we changed the behavior to remove containers from NMContext only after the containers are acknowledged by the AM. But in the {{NodeStatusUpdaterImpl#removeCompletedContainersFromContext}} call, we didn't check whether the container is really completed or not. If the container is still running, we shouldn't remove the container from the context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2826) User-Group mappings not updated by RM when a user is removed from a group.
[ https://issues.apache.org/jira/browse/YARN-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2826: Assignee: Wangda Tan User-Group mappings not updated by RM when a user is removed from a group. -- Key: YARN-2826 URL: https://issues.apache.org/jira/browse/YARN-2826 Project: Hadoop YARN Issue Type: Bug Reporter: sidharta seethana Assignee: Wangda Tan Priority: Critical Removing a user from a group isn't reflected in getGroups even after a refresh. The following sequence fails in step 7. 1) add test_user to a machine with group1 2) add test_user to group2 on the machine 3) yarn rmadmin -refreshUserToGroupsMappings (expected to refresh user to group mappings) 4) yarn rmadmin -getGroups test_user (and ensure that user is in group2) 5) remove user from group2 on the machine 6) yarn rmadmin -refreshUserToGroupsMappings (expected to refresh user to group mappings) 7) yarn rmadmin -getGroups test_user (and ensure that user NOT in group2) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201488#comment-14201488 ] Hadoop QA commented on YARN-2505: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680035/YARN-2505.16.patch against trunk revision a3839a9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5764//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5764//console This message is automatically generated. Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.11.patch, YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)