[jira] [Created] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)
skrho created YARN-3758:
---

 Summary: The mininum memory 
setting(yarn.scheduler.minimum-allocation-mb) is not working in container
 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho


Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue and 8G of 
physical memory per node.
The second cluster is 10 nodes with 2 application queues and 230G of physical 
memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 
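
A minimal sketch for double-checking which minimum allocation the ResourceManager 
actually ends up with, added here only as an illustration (it is not part of the 
original report); it assumes yarn-site.xml is on the classpath and uses the 
standard Hadoop constant names:

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CheckMinAllocation {
  public static void main(String[] args) {
    // YarnConfiguration picks up yarn-default.xml and yarn-site.xml from the classpath.
    YarnConfiguration conf = new YarnConfiguration();

    int minMb = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_MB); // default is 1024
    System.out.println("yarn.scheduler.minimum-allocation-mb = " + minMb);

    // The FairScheduler additionally rounds requests up to this increment
    // (default 1024 MB), which is worth checking on a cluster that uses it.
    int incMb = conf.getInt("yarn.scheduler.increment-allocation-mb", 1024);
    System.out.println("yarn.scheduler.increment-allocation-mb = " + incMb);
  }
}
{code}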


In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568667#comment-14568667
 ] 

Hadoop QA commented on YARN-3749:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 29s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 7 new or modified test files. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 33s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 49s | The applied patch generated  1 
new checkstyle issues (total was 212, now 213). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 34s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   7m  1s | Tests passed in 
hadoop-yarn-client. |
| {color:red}-1{color} | yarn tests |  60m 25s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 52s | Tests passed in 
hadoop-yarn-server-tests. |
| | | 115m  5s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector |
|   | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736732/YARN-3749.6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-tests test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8158/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8158/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8158/console |


This message was automatically generated.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when the RM fails over, even though I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that 
 it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 
 gets changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same instance of configuration in rm1 and rm2 and init both 
 RM before 
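
 For context, a minimal sketch of the general idea being proposed, i.e. giving 
 each RM its own copy of the configuration before it is initialized. This is an 
 illustration under that assumption, not the actual YARN-3749 patch:

 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 public class RmConfigCopy {
   // Return a private Configuration for one RM so that updateConnectAddr()
   // inside that RM cannot overwrite the other RM's address keys.
   public static Configuration confForRm(Configuration base, String rmId) {
     Configuration copy = new Configuration(base); // copy constructor clones all key/value pairs
     copy.set(YarnConfiguration.RM_HA_ID, rmId);   // e.g. "rm1" or "rm2"
     return copy;
   }
 }
 {code}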

[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho updated YARN-3758:

Description: 
Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue, the Capacity 
Scheduler, and 8G of physical memory per node.
The second cluster is 10 nodes with 2 application queues, the Fair Scheduler, 
and 230G of physical memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 


In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~

  was:
Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue and 8G of 
physical memory per node.
The second cluster is 10 nodes with 2 application queues and 230G of physical 
memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 


In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~


 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes with 1 default application queue, the Capacity 
 Scheduler, and 8G of physical memory per node.
 The second cluster is 10 nodes with 2 application queues, the Fair Scheduler, 
 and 230G of physical memory per node.
 Whenever a mapreduce job runs, I want the resourcemanager to give containers 
 the minimum memory of 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see 256m of 
 used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 
 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not 
 changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory 
 setting is not working in the second cluster.
 Why does this difference happen?
 Is my configuration wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-02 Thread Joep Rottinghuis (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joep Rottinghuis updated YARN-3706:
---
Attachment: YARN-3726-YARN-2928.004.patch

YARN-3726-YARN-2928.004.patch :
- fixed bug in cleanse (found thanks to unit test)
- fixed value separator (was ! instead of ?).
- Added readResult and readResults to EntityColumnPrefix (still need to add 
signature in interface).
- Added initial unit test for TimeLineWriterUtils
- Added relationship checking to TestTimelineWriterImpl

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, 
 YARN-3726-YARN-2928.004.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode

2015-06-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568789#comment-14568789
 ] 

Varun Saxena commented on YARN-2962:


I was waiting for input from [~vinodkv] and [~asuresh] so that we reach a 
common understanding on what we will do about backward compatibility.

Anyway, in the coming week I plan to upload a patch implementing one of the 
approaches discussed.

 ZKRMStateStore: Limit the number of znodes under a znode
 

 Key: YARN-2962
 URL: https://issues.apache.org/jira/browse/YARN-2962
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch


 We ran into this issue where we were hitting the default ZK server message 
 size configs, primarily because the message had too many znodes even though 
 individually they were all small.
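
 A minimal sketch of one possible shape of a fix, purely as an illustration of 
 the idea (the actual design discussed on this issue may differ): spread 
 application znodes across a fixed number of bucket znodes so that no single 
 parent accumulates enough children to exceed the ZooKeeper response size limit.

 {code}
 public class ZnodeBucketing {
   private static final int NUM_BUCKETS = 100;

   // e.g. application_1432956515242_0006 -> <root>/apps/<bucket>/application_1432956515242_0006
   public static String bucketedPath(String rootPath, String appId) {
     // Hash the app id into a bucket so siblings are spread roughly evenly.
     int bucket = (appId.hashCode() & 0x7fffffff) % NUM_BUCKETS;
     return String.format("%s/apps/%02d/%s", rootPath, bucket, appId);
   }
 }
 {code}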



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568670#comment-14568670
 ] 

Hadoop QA commented on YARN-3753:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  14m 53s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 31s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 29s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 25s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 27s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 16s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  86m 33s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736741/YARN-3753.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8161/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8161/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8161/console |


This message was automatically generated.

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.1.patch, YARN-3753.patch


 RM failed to come up with the following error while submitting a mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 

[jira] [Created] (YARN-3756) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)
skrho created YARN-3756:
---

 Summary: The mininum memory 
setting(yarn.scheduler.minimum-allocation-mb) is not working in container
 Key: YARN-3756
 URL: https://issues.apache.org/jira/browse/YARN-3756
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: hadoop 2.4.0
Reporter: skrho


Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue and 8G of 
physical memory per node.
The second cluster is 10 nodes with 2 application queues and 230G of physical 
memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-02 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3753:
--
Attachment: YARN-3753.2.patch

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch


 RM failed to come up with the following error while submitting a mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 

[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568797#comment-14568797
 ] 

Hadoop QA commented on YARN-3753:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 53s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 48s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 27s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m  6s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m  7s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736776/YARN-3753.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8164/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8164/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8164/console |


This message was automatically generated.

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch


 RM failed to come up with the following error while submitting a mapreduce 
 job.
 {code:title=RM log}
 015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 

[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-06-02 Thread Lavkesh Lahngir (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568834#comment-14568834
 ] 

Lavkesh Lahngir commented on YARN-3591:
---

[~zxu]: Can we get away without storing this in the NM state store? The other 
changes seem to be okay.
It's not a big change in terms of code, but adding it to the NM state store 
could be debatable.
[~vvasudev]: Thoughts?

 Resource Localisation on a bad disk causes subsequent containers failure 
 -

 Key: YARN-3591
 URL: https://issues.apache.org/jira/browse/YARN-3591
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
 YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch


 It happens when a resource has been localised on a disk and, after 
 localisation, that disk goes bad. The NM keeps paths for localised resources 
 in memory. At the time of a resource request, isResourcePresent(rsrc) is 
 called, which calls file.exists() on the localised path.
 In some cases when the disk has gone bad, inodes are still cached and 
 file.exists() returns true, but at the time of reading the file will not open.
 Note: file.exists() actually calls stat64 natively, which returns true because 
 it was able to find inode information from the OS.
 A proposal is to call file.list() on the parent path of the resource, which 
 will call open() natively. If the disk is good it should return an array of 
 paths with length at least 1.
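
 A minimal sketch of that proposed check, with illustrative names rather than 
 the actual NodeManager code:

 {code}
 import java.io.File;

 public class LocalResourceCheck {
   // Instead of trusting file.exists() alone, list the parent directory,
   // which forces a native open() and fails on a bad disk.
   public static boolean isResourcePresent(File localizedPath) {
     File parent = localizedPath.getParentFile();
     if (parent == null) {
       return false;
     }
     // On a healthy disk this returns the directory entries (length >= 1,
     // since the localized file itself lives there); on a bad disk it
     // typically returns null because the underlying open() fails.
     String[] entries = parent.list();
     return entries != null && entries.length >= 1 && localizedPath.exists();
   }
 }
 {code}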



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3757) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)
skrho created YARN-3757:
---

 Summary: The mininum memory 
setting(yarn.scheduler.minimum-allocation-mb) is not working in container
 Key: YARN-3757
 URL: https://issues.apache.org/jira/browse/YARN-3757
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: Hadoop 2.4.0
Reporter: skrho


Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue and 8G of 
physical memory per node.
The second cluster is 10 nodes with 2 application queues and 230G of physical 
memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 

In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568682#comment-14568682
 ] 

Rohith commented on YARN-3733:
--

Updated the summary as per the defect.

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM and 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB after changing the scheduler 
 minimum size to 512 MB
 3. Configure the capacity scheduler and set the AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 AMs get allocated for 12 jobs and all 12 start running.
 No other YARN child is initiated; *all 12 jobs stay in the Running state 
 forever*
 Expected
 ===
 Only 6 should be running at a time, since the max AM share allocated is .5 
 (3072 MB)
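
 As a paraphrased illustration of why the comparison breaks down when the 
 cluster resource is empty (this assumes the standard dominant-share formula; 
 the actual DominantResourceCalculator code may differ in detail):

 {code}
 public class DominantShareSketch {
   // dominant share = max(memory share, vcore share)
   static float dominantShare(long mem, long vcores, long clusterMem, long clusterVcores) {
     float memShare = (float) mem / clusterMem;       // 0/0 -> NaN, x/0 -> Infinity
     float vcoreShare = (float) vcores / clusterVcores;
     return Math.max(memShare, vcoreShare);
   }

   public static void main(String[] args) {
     // With an empty cluster resource every share degenerates, so comparisons
     // and limits based on them stop being meaningful.
     System.out.println(dominantShare(512, 1, 0, 0)); // Infinity
     System.out.println(dominantShare(0, 0, 0, 0));   // NaN
   }
 }
 {code}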



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-06-02 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568853#comment-14568853
 ] 

Brahma Reddy Battula commented on YARN-3170:


Updated the patch. Kindly review!

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, 
 YARN-3170-007.patch, YARN-3170-008.patch, YARN-3170-009.patch, 
 YARN-3170-010.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc. are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, especially given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568735#comment-14568735
 ] 

Hadoop QA commented on YARN-3749:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  20m  3s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 8 new or modified test files. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 17s | The applied patch generated  1 
new checkstyle issues (total was 212, now 213). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   6m  5s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 22s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   6m 58s | Tests passed in 
hadoop-yarn-client. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |  60m 34s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests |   1m 51s | Tests passed in 
hadoop-yarn-server-tests. |
| | | 121m  2s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736753/YARN-3749.7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-tests test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8163/console |


This message was automatically generated.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when the RM fails over, even though I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that 
 it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 
 gets changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we 

[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho updated YARN-3758:

Description: 
Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue and 8G of 
physical memory per node.
The second cluster is 10 nodes with 2 application queues and 230G of physical 
memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml and mapred-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 


In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~

  was:
Hello there~~

I have 2 clusters.

The first cluster is 5 nodes with 1 default application queue and 8G of 
physical memory per node.
The second cluster is 10 nodes with 2 application queues and 230G of physical 
memory per node.

Whenever a mapreduce job runs, I want the resourcemanager to give containers 
the minimum memory of 256m.

So I changed the configuration in yarn-site.xml:

yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.java.opts : -Xms256m 
mapreduce.reduce.java.opts : -Xms256m 
mapreduce.map.memory.mb : 256 
mapreduce.reduce.memory.mb : 256 


In the first cluster, whenever a mapreduce job is running, I can see 256m of 
used memory in the web console ( http://installedIP:8088/cluster/nodes ).
But in the second cluster, whenever a mapreduce job is running, I can see 
1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).

I know the default memory value is 1024m, so if the memory setting is not 
changed, the default value is used.

I have been testing for two weeks, but I don't know why the minimum memory 
setting is not working in the second cluster.

Why does this difference happen?

Is my configuration wrong, or is there a bug?

Thank you for reading~~


 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes with 1 default application queue and 8G of 
 physical memory per node.
 The second cluster is 10 nodes with 2 application queues and 230G of 
 physical memory per node.
 Whenever a mapreduce job runs, I want the resourcemanager to give containers 
 the minimum memory of 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see 256m of 
 used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 
 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not 
 changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory 
 setting is not working in the second cluster.
 Why does this difference happen?
 Is my configuration wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated YARN-3755:
-
Attachment: YARN-3755-2.patch

Uploaded a new patch to address the checkstyle issue.

 Log the command of launching containers
 ---

 Key: YARN-3755
 URL: https://issues.apache.org/jira/browse/YARN-3755
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: YARN-3755-1.patch, YARN-3755-2.patch


 In the resource manager log, YARN logs the command for launching the AM, 
 which is very useful. But there is no such log in the NM log for launching 
 containers, so it is difficult to diagnose when containers fail to launch 
 due to some issue in the commands. Although a user can look at the commands 
 in the container launch script file, that is an internal detail of YARN that 
 users usually don't know about. From the user's perspective, they only know 
 the commands they specify when building a YARN application.
 {code}
 2015-06-01 16:06:42,245 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
 to launch container container_1433145984561_0001_01_01 : 
 $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
 -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
 -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
 -Dlog4j.configuration=tez-container-log4j.properties 
 -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA 
 -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
 {code}
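
 A minimal sketch of the kind of NodeManager-side logging being asked for, 
 using illustrative class and method names rather than the actual patch:

 {code}
 import java.util.List;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.yarn.api.records.ContainerId;
 import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

 public class ContainerCommandLogger {
   private static final Log LOG = LogFactory.getLog(ContainerCommandLogger.class);

   // Log the launch commands carried in the ContainerLaunchContext, mirroring
   // what the RM already logs for the AM container.
   public static void logLaunchCommand(ContainerId containerId, ContainerLaunchContext ctx) {
     List<String> commands = ctx.getCommands();
     StringBuilder cmd = new StringBuilder();
     for (String part : commands) {
       cmd.append(part).append(' ');
     }
     LOG.info("Command to launch container " + containerId + " : " + cmd.toString().trim());
   }
 }
 {code}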



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3757) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho resolved YARN-3757.
-
Resolution: Duplicate

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3757
 URL: https://issues.apache.org/jira/browse/YARN-3757
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: Hadoop 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes with 1 default application queue and 8G of 
 physical memory per node.
 The second cluster is 10 nodes with 2 application queues and 230G of 
 physical memory per node.
 Whenever a mapreduce job runs, I want the resourcemanager to give containers 
 the minimum memory of 256m.
 So I changed the configuration in yarn-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see 256m of 
 used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 
 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not 
 changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory 
 setting is not working in the second cluster.
 Why does this difference happen?
 Is my configuration wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568790#comment-14568790
 ] 

Naganarasimha G R commented on YARN-3758:
-

YARN-3756 and YARN-3757 are the same as this issue! Can you close them?

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes with 1 default application queue, the Capacity 
 Scheduler, and 8G of physical memory per node.
 The second cluster is 10 nodes with 2 application queues, the Fair Scheduler, 
 and 230G of physical memory per node.
 Whenever a mapreduce job runs, I want the resourcemanager to give containers 
 the minimum memory of 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see 256m of 
 used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 
 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not 
 changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory 
 setting is not working in the second cluster.
 Why does this difference happen?
 Is my configuration wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3756) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread skrho (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

skrho resolved YARN-3756.
-
Resolution: Duplicate

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3756
 URL: https://issues.apache.org/jira/browse/YARN-3756
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: hadoop 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes with 1 default application queue and 8G of 
 physical memory per node.
 The second cluster is 10 nodes with 2 application queues and 230G of 
 physical memory per node.
 Whenever a mapreduce job runs, I want the resourcemanager to give containers 
 the minimum memory of 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see 256m of 
 used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 
 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not 
 changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory 
 setting is not working in the second cluster.
 Why does this difference happen?
 Is my configuration wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3761) Set delegation token service address at the server side

2015-06-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena reassigned YARN-3761:
--

Assignee: Varun Saxena

 Set delegation token service address at the server side
 ---

 Key: YARN-3761
 URL: https://issues.apache.org/jira/browse/YARN-3761
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: security
Reporter: Zhijie Shen
Assignee: Varun Saxena

 Nowadays, YARN components generate the delegation token without the service 
 address set, and leave it to the client to set it. With our Java client 
 library, that is usually fine. However, if users are using the REST API, it's 
 going to be a problem: the delegation token is returned as a URL string, and 
 it is unfriendly for a thin client to have to deserialize the URL string, set 
 the token service address, and serialize it again for further use. If we move 
 the task of setting the service address to the server side, the client can 
 avoid this trouble.
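
 A minimal sketch of the idea, as an assumption about its shape rather than 
 the actual implementation: have the server stamp its own bind address into 
 the token's service field before returning it, so thin REST clients don't 
 have to.

 {code}
 import java.net.InetSocketAddress;
 import org.apache.hadoop.security.SecurityUtil;
 import org.apache.hadoop.security.token.Token;

 public class TokenServiceStamper {
   // Encode host:port into the token's service field on the server side.
   public static void stampService(Token<?> token, InetSocketAddress rpcAddress) {
     SecurityUtil.setTokenService(token, rpcAddress);
   }
 }
 {code}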



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-02 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-3069:
-
Attachment: YARN-3069.011.patch

Thanks Akira!  New patch with the following changes:

- Fix description for yarn.node-labels.fs-store.retry-policy-spec
- Remove YARN registry entries from yarn-default.xml
- Remove one outdated entry yarn.application.classpath.prepend.distcache
- Add entry for yarn.intermediate-data-encryption.enable

I'll also go through the yarn-default.xml file once more to make sure no 
default values will change.

 Document missing properties in yarn-default.xml
 ---

 Key: YARN-3069
 URL: https://issues.apache.org/jira/browse/YARN-3069
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: BB2015-05-TBR, supportability
 Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
 YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
 YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
 YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch


 The following properties are currently not defined in yarn-default.xml.  
 These properties should either be
   A) documented in yarn-default.xml OR
   B)  listed as an exception (with comments, e.g. for internal use) in the 
 TestYarnConfigurationFields unit test
 Any comments for any of the properties below are welcome.
   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
   security.applicationhistory.protocol.acl
   yarn.app.container.log.backups
   yarn.app.container.log.dir
   yarn.app.container.log.filesize
   yarn.client.app-submission.poll-interval
   yarn.client.application-client-protocol.poll-timeout-ms
   yarn.is.minicluster
   yarn.log.server.url
   yarn.minicluster.control-resource-monitoring
   yarn.minicluster.fixed.ports
   yarn.minicluster.use-rpc
   yarn.node-labels.fs-store.retry-policy-spec
   yarn.node-labels.fs-store.root-dir
   yarn.node-labels.manager-class
   yarn.nodemanager.container-executor.os.sched.priority.adjustment
   yarn.nodemanager.container-monitor.process-tree.class
   yarn.nodemanager.disk-health-checker.enable
   yarn.nodemanager.docker-container-executor.image-name
   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
   yarn.nodemanager.linux-container-executor.group
   yarn.nodemanager.log.deletion-threads-count
   yarn.nodemanager.user-home-dir
   yarn.nodemanager.webapp.https.address
   yarn.nodemanager.webapp.spnego-keytab-file
   yarn.nodemanager.webapp.spnego-principal
   yarn.nodemanager.windows-secure-container-executor.group
   yarn.resourcemanager.configuration.file-system-based-store
   yarn.resourcemanager.delegation-token-renewer.thread-count
   yarn.resourcemanager.delegation.key.update-interval
   yarn.resourcemanager.delegation.token.max-lifetime
   yarn.resourcemanager.delegation.token.renew-interval
   yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
   yarn.resourcemanager.metrics.runtime.buckets
   yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
   yarn.resourcemanager.reservation-system.class
   yarn.resourcemanager.reservation-system.enable
   yarn.resourcemanager.reservation-system.plan.follower
   yarn.resourcemanager.reservation-system.planfollower.time-step
   yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
   yarn.resourcemanager.webapp.spnego-keytab-file
   yarn.resourcemanager.webapp.spnego-principal
   yarn.scheduler.include-port-in-node-name
   yarn.timeline-service.delegation.key.update-interval
   yarn.timeline-service.delegation.token.max-lifetime
   yarn.timeline-service.delegation.token.renew-interval
   yarn.timeline-service.generic-application-history.enabled
   
 yarn.timeline-service.generic-application-history.fs-history-store.compression-type
   yarn.timeline-service.generic-application-history.fs-history-store.uri
   yarn.timeline-service.generic-application-history.store-class
   yarn.timeline-service.http-cross-origin.enabled
   yarn.tracking.url.generator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569548#comment-14569548
 ] 

Sergey Shelukhin commented on YARN-1462:


[~sseth] can you please comment on the above (use of Private API)?

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569646#comment-14569646
 ] 

Hadoop QA commented on YARN-3069:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  19m 46s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 58s | Site still builds. |
| {color:green}+1{color} | checkstyle |   1m 36s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 22s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  23m 34s | Tests passed in 
hadoop-common. |
| {color:green}+1{color} | yarn tests |   1m 55s | Tests passed in 
hadoop-yarn-common. |
| | |  72m 56s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736976/YARN-3069.011.patch |
| Optional Tests | site javadoc javac unit findbugs checkstyle |
| git revision | trunk / a2bd621 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/whitespace.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8168/console |


This message was automatically generated.

 Document missing properties in yarn-default.xml
 ---

 Key: YARN-3069
 URL: https://issues.apache.org/jira/browse/YARN-3069
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: BB2015-05-TBR, supportability
 Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
 YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
 YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
 YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch


 The following properties are currently not defined in yarn-default.xml.  
 These properties should either be
   A) documented in yarn-default.xml OR
   B)  listed as an exception (with comments, e.g. for internal use) in the 
 TestYarnConfigurationFields unit test
 Any comments for any of the properties below are welcome.
   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
   security.applicationhistory.protocol.acl
   yarn.app.container.log.backups
   yarn.app.container.log.dir
   yarn.app.container.log.filesize
   yarn.client.app-submission.poll-interval
   yarn.client.application-client-protocol.poll-timeout-ms
   yarn.is.minicluster
   yarn.log.server.url
   yarn.minicluster.control-resource-monitoring
   yarn.minicluster.fixed.ports
   yarn.minicluster.use-rpc
   yarn.node-labels.fs-store.retry-policy-spec
   yarn.node-labels.fs-store.root-dir
   yarn.node-labels.manager-class
   yarn.nodemanager.container-executor.os.sched.priority.adjustment
   yarn.nodemanager.container-monitor.process-tree.class
   yarn.nodemanager.disk-health-checker.enable
   yarn.nodemanager.docker-container-executor.image-name
   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
   yarn.nodemanager.linux-container-executor.group
   yarn.nodemanager.log.deletion-threads-count
   yarn.nodemanager.user-home-dir
   yarn.nodemanager.webapp.https.address
   yarn.nodemanager.webapp.spnego-keytab-file
   yarn.nodemanager.webapp.spnego-principal
   

[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569648#comment-14569648
 ] 

Siddharth Seth commented on YARN-1462:
--

ApplicationReport.newInstance is used by MapReduce and Tez, and potentially by 
other applications whose AMs are modeled along the same lines. It would be 
useful to make the API change here compatible. This is along the lines of the 
newInstance methods being used for various constructs like ContainerId, AppId, 
etc. With this change, I don't believe MR 2.6 will work against a 2.8 cluster, 
depending on how the classpath is set up.
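Purely to illustrate the compatibility pattern being asked for (the class and 
fields below are hypothetical, not the real ApplicationReport API): keep the 
old factory signature and have it delegate to the new one with a default for 
the added argument, so existing callers keep compiling and linking.
{code}
// Hypothetical illustration: an old newInstance overload kept alongside a new
// one that takes an extra "applicationTags" argument.
import java.util.Collections;
import java.util.Set;

public class Report {
  private final String name;
  private final Set<String> applicationTags;

  private Report(String name, Set<String> applicationTags) {
    this.name = name;
    this.applicationTags = applicationTags;
  }

  // Old signature, kept for source/binary compatibility with existing callers.
  public static Report newInstance(String name) {
    return newInstance(name, Collections.<String>emptySet());
  }

  // New signature that carries the extra information.
  public static Report newInstance(String name, Set<String> applicationTags) {
    return new Report(name, applicationTags);
  }

  public Set<String> getApplicationTags() {
    return applicationTags;
  }
}
{code}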

 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569773#comment-14569773
 ] 

Jason Lowe commented on YARN-3585:
--

+1 latest patch lgtm.  Will commit this tomorrow if there are no objections.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: 0001-YARN-3585.patch, YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it 
 stopping, but the process cannot exit. 
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures

2015-06-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2392:
-
Priority: Minor  (was: Major)

 add more diags about app retry limits on AM failures
 

 Key: YARN-2392
 URL: https://issues.apache.org/jira/browse/YARN-2392
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Minor
 Attachments: YARN-2392-001.patch, YARN-2392-002.patch, 
 YARN-2392-002.patch


 # when an app fails the failure count is shown, but not what the global + 
 local limits are. If the two are different, they should both be printed. 
 # the YARN-2242 strings don't have enough whitespace between text and the URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-06-02 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569727#comment-14569727
 ] 

zhihai xu commented on YARN-3591:
-

Hi [~lavkesh], I think we can create a separate JIRA for storing local error 
directories in the NM state store, which would be a good enhancement.
Thanks [~sunilg]! Adding a new API to get local error directories is also a 
good suggestion. But I think it will be enough to just check newErrorDirs 
instead of all errorDirs.

To better support NM recovery and keep the DirsChangeListener interface simple, 
I propose the following changes:

1. In DirectoryCollection, notify the listener when any set of dirs (localDirs, 
errorDirs and fullDirs) changes.
The code change at {{DirectoryCollection#checkDirs}} looks like the following:
{code}
boolean needNotifyListener = setChanged;
for (String dir : preCheckFullDirs) {
  if (postCheckOtherDirs.contains(dir)) {
    needNotifyListener = true;
    LOG.warn("Directory " + dir + " error "
        + dirsFailedCheck.get(dir).message);
  }
}
for (String dir : preCheckOtherErrorDirs) {
  if (postCheckFullDirs.contains(dir)) {
    needNotifyListener = true;
    LOG.warn("Directory " + dir + " error "
        + dirsFailedCheck.get(dir).message);
  }
}
if (needNotifyListener) {
  for (DirsChangeListener listener : dirsChangeListeners) {
    listener.onDirsChanged();
  }
}
{code}

2. Add an API to get local error directories.
As [~sunilg] suggested, we can add an API {{synchronized List<String> 
getErrorDirs()}} in DirectoryCollection.java.
We also need to add an API {{public List<String> getLocalErrorDirs()}} in 
LocalDirsHandlerService.java, which will call 
{{DirectoryCollection#getErrorDirs}}.

3. Add a field {{Set<String> preLocalErrorDirs}} in 
ResourceLocalizationService.java to store the previous local error directories.
{{ResourceLocalizationService#preLocalErrorDirs}} should be loaded from the 
state store at startup if we support storing local error directories in the NM 
state store.

4. The following is pseudo code for {{localDirsChangeListener#onDirsChanged}}:
{code}
Set<String> curLocalErrorDirs =
    new HashSet<String>(dirsHandler.getLocalErrorDirs());
List<String> newErrorDirs = new ArrayList<String>();
List<String> newRepairedDirs = new ArrayList<String>();
for (String dir : curLocalErrorDirs) {
  if (!preLocalErrorDirs.contains(dir)) {
    newErrorDirs.add(dir);
  }
}
for (String dir : preLocalErrorDirs) {
  if (!curLocalErrorDirs.contains(dir)) {
    newRepairedDirs.add(dir);
  }
}
for (String localDir : newRepairedDirs) {
  cleanUpLocalDir(lfs, delService, localDir);
}
if (!newErrorDirs.isEmpty()) {
  // As Sunil suggested, checkLocalizedResources will call removeResource on
  // those localized resources whose parent is present in newErrorDirs.
  publicRsrc.checkLocalizedResources(newErrorDirs);
  for (LocalResourcesTracker tracker : privateRsrc.values()) {
    tracker.checkLocalizedResources(newErrorDirs);
  }
}
if (!newErrorDirs.isEmpty() || !newRepairedDirs.isEmpty()) {
  preLocalErrorDirs = curLocalErrorDirs;
  stateStore.storeLocalErrorDirs(
      StringUtils.arrayToString(curLocalErrorDirs.toArray(new String[0])));
}
checkAndInitializeLocalDirs();
{code}

5. It would be better to move {{verifyDirUsingMkdir(testDir)}} right after 
{{DiskChecker.checkDir(testDir)}} in {{DirectoryCollection#testDirs}}, so we 
can detect an error directory before detecting a full directory.

Please feel free to change or add more to my proposal.

 Resource Localisation on a bad disk causes subsequent containers failure 
 -

 Key: YARN-3591
 URL: https://issues.apache.org/jira/browse/YARN-3591
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
 YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch


 It happens when a resource is localised on a disk and, after localisation, 
 that disk has gone bad. The NM keeps the paths of localised resources in 
 memory. At resource-request time isResourcePresent(rsrc) is called, which 
 calls file.exists() on the localised path.
 In some cases when the disk has gone bad, inodes are still cached and 
 file.exists() returns true, but at read time the file will not open.
 Note: file.exists() actually calls stat64 natively, which returns true because 
 it was able to find the inode information from the OS.
 A proposal is to call file.list() on the parent path of the resource, which 
 calls open() natively. If the disk is good it should return an array of 
 paths with length at least 1.
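 A minimal sketch of that proposal (using plain java.io.File; the real NM code 
 path differs, so treat this as an illustration only):
 {code}
 // Hedged sketch, not the actual isResourcePresent() implementation: rely on
 // listing the parent directory (which opens it) rather than file.exists(),
 // which can be answered from cached inode data even when the disk is bad.
 import java.io.File;

 public class DiskProbe {
   public static boolean looksPresent(File localizedPath) {
     File parent = localizedPath.getParentFile();
     if (parent == null) {
       return false;
     }
     String[] entries = parent.list();   // forces an open() on the directory
     if (entries == null || entries.length < 1) {
       return false;                     // bad disk, or unreadable/empty parent
     }
     return localizedPath.exists();
   }
 }
 {code}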



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2392) add more diags about app retry limits on AM failures

2015-06-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2392:
-
Attachment: YARN-2392-002.patch

Patch 002
* in sync with trunk
* uses String.format for a more readable format of the response
* includes sliding window details in the message

There's no test here, for which I apologise. To test this I'd need a test to 
trigger failures and look for the final error message, which seems excessive 
for a log tuning. If there's a test for the sliding-window retry that could be 
patched, I'll do it there.
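For illustration only, a message of the shape described might be built like 
this (the wording and variable names are assumptions, not the patch text):
{code}
// Hedged sketch of a String.format-based diagnostics message that includes
// both the global AM retry limit and the sliding-window limit.
public class AmDiagnostics {
  public static String build(int failures, int globalLimit,
      int windowLimit, long windowMs) {
    return String.format(
        "Application failed %d times; global AM retry limit is %d, "
            + "limit within the %d ms sliding window is %d.",
        failures, globalLimit, windowMs, windowLimit);
  }

  public static void main(String[] args) {
    System.out.println(build(3, 2, 2, 10000L));
  }
}
{code}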

 add more diags about app retry limits on AM failures
 

 Key: YARN-2392
 URL: https://issues.apache.org/jira/browse/YARN-2392
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2392-001.patch, YARN-2392-002.patch, 
 YARN-2392-002.patch


 # when an app fails the failure count is shown, but not what the global + 
 local limits are. If the two are different, they should both be printed. 
 # the YARN-2242 strings don't have enough whitespace between text and the URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569778#comment-14569778
 ] 

Matthew Jacobs commented on YARN-2194:
--

I'm confused, does this mean that you'll re-mount the cpu and cpuacct 
controllers? Do we know that other components in the RHEL7 world don't expect 
them to be in the default place?

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3510) Create an extension of ProportionalCapacityPreemptionPolicy which preempts a number of containers from each application in a way which respects fairness

2015-06-02 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570039#comment-14570039
 ] 

Craig Welch commented on YARN-3510:
---

[~leftnoteasy] and I had some offline discussion.  The patch currently here is 
simply meant to avoid unbalancing whatever allocation process is active by, 
generally, keeping relative usage between applications the same.  It doesn't 
attempt to actively re-allocate in a way that achieves the overall allocation 
policy, i.e., as if all the applications had started at once (a more complex 
proposition, obviously).  There's a desire to have that because, among other 
things, sometime down the road we may do preemption just among 
users/applications in a queue, and the preemption will then need to actively 
work toward the allocation goals rather than just maintain current levels.  
This will add some medium-level complexity to the current patch; the deltas 
from the current approach are:
Since the effect of preemption on ordering for fairness doesn't occur until the 
container is released, and we want to consider it right away, we will need to 
retain info about pending preemption for comparison on the app resources (it 
will be a deduction from usage for ordering purposes, as if the preemption had 
already happened).
The preemptEvenly loop will need to reorder the app that was preempted after 
each preemption and then restart the iteration over apps (not necessarily over 
all apps, again, just until the first preemption).
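To make those deltas concrete, here is a rough, hypothetical sketch of such a 
loop (the App interface, the pending-preemption bookkeeping and the method 
names are assumptions, not code from the attached patch): usage is compared net 
of pending preemption, and the candidate list is re-sorted after every 
preemption.
{code}
// Hypothetical sketch of a preempt-evenly loop: pick the app with the highest
// usage minus already-pending preemption, preempt one container from it,
// record the pending amount, re-sort, and repeat until enough is reclaimed.
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PreemptEvenlySketch {
  interface App {
    String id();
    long usedMB();
    long preemptOneContainerMB();   // marks one container, returns its size
  }

  public static void preemptEvenly(List<App> apps, long toReclaimMB) {
    final Map<String, Long> pending = new HashMap<String, Long>();
    Comparator<App> byEffectiveUsage = new Comparator<App>() {
      private long effective(App a) {
        Long p = pending.get(a.id());
        return a.usedMB() - (p == null ? 0L : p.longValue());
      }
      @Override
      public int compare(App a, App b) {
        return Long.compare(effective(b), effective(a)); // descending order
      }
    };
    long reclaimed = 0;
    while (reclaimed < toReclaimMB && !apps.isEmpty()) {
      Collections.sort(apps, byEffectiveUsage);  // reorder after each round
      App target = apps.get(0);
      long mb = target.preemptOneContainerMB();
      if (mb <= 0) {
        apps.remove(target);                     // nothing left to take
        continue;
      }
      Long prev = pending.get(target.id());
      pending.put(target.id(), (prev == null ? 0L : prev.longValue()) + mb);
      reclaimed += mb;
    }
  }
}
{code}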


 Create an extension of ProportionalCapacityPreemptionPolicy which preempts a 
 number of containers from each application in a way which respects fairness
 

 Key: YARN-3510
 URL: https://issues.apache.org/jira/browse/YARN-3510
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: yarn
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-3510.2.patch, YARN-3510.3.patch, YARN-3510.5.patch, 
 YARN-3510.6.patch


 The ProportionalCapacityPreemptionPolicy preempts as many containers from 
 applications as it can during its preemption run.  For FIFO this makes 
 sense, as it is preempting in reverse order and therefore maintaining the 
 primacy of the oldest.  For fair ordering this does not have the desired 
 effect; instead, it should preempt a number of containers from each 
 application that maintains a fair balance, or close to a fair balance, between 
 them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570040#comment-14570040
 ] 

Matthew Jacobs commented on YARN-2194:
--

Thanks, [sidharta-s]. So the change would be in how the container-executor 
accepts lists of paths, not attempting to re-mount the controllers, right? If I 
understand it correctly, that sounds like a good plan to me.

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3762:
---
Attachment: yarn-3762-1.patch

Here is a patch that protects FSParentQueue members with read-write locks. 
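The general shape of such a fix, as a sketch only (assuming a mutable child 
list; this is not the exact contents of yarn-3762-1.patch): take the read lock 
while iterating and the write lock while mutating.
{code}
// Hedged sketch of guarding a mutable child-queue list with a
// ReentrantReadWriteLock so readers never hit a ConcurrentModificationException.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ParentQueueSketch {
  private final List<String> childQueues = new ArrayList<String>();
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

  public void addChildQueue(String name) {
    rwLock.writeLock().lock();
    try {
      childQueues.add(name);           // mutation under the write lock
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  public List<String> getQueueUserAclInfo() {
    rwLock.readLock().lock();
    try {
      // Iterate under the read lock and copy out, so callers never touch the
      // internal list without holding the lock.
      return new ArrayList<String>(childQueues);
    } finally {
      rwLock.readLock().unlock();
    }
  }
}
{code}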

 FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 ---

 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-3762-1.patch


 In our testing, we ran into the following ConcurrentModificationException:
 {noformat}
 halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
 queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
 queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
 java.util.ConcurrentModificationException: 
 java.util.ConcurrentModificationException
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
   at java.util.ArrayList$Itr.next(ArrayList.java:851)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570037#comment-14570037
 ] 

Sidharta Seethana commented on YARN-2194:
-

There are two different issues here : 

* container-executor binary invocation uses ‘,’ as a separator when supplying a 
list of paths - which breaks when the path contains ‘,’
* cpu,cpuacct are mounted together by default on RHEL7 

Now, for the latter issue : In {{CgroupsLCEResourcesHandler}}, the following 
steps occur : 

* If the {{yarn.nodemanager.linux-container-executor.cgroups.mount}} switch is 
enabled , the ‘cpu’ controller is explicitly mounted at the specified path. 
* (irrespective of the state of the switch) The {{/proc/mounts}} file (possibly 
updated by the previous step) is subsequently parsed to determine the mount 
locations for the various cgroup controllers - this parsing code seems to be 
correct even if cpu and cpuacct are mounted in one location.

So, the thing we need to fix is the separator issue and we should be good.  The 
important thing to remember is that there are *two* cgroups implementation 
classes ( {{CgroupsLCEResourcesHandler}} and {{CGroupsHandlerImpl}} ). 
Hopefully, this will be addressed soon ( YARN-3542 ) - or we risk divergence. 
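For the record, the kind of parsing being described could look roughly like 
this (a standalone sketch, not the actual CgroupsLCEResourcesHandler code): 
each /proc/mounts line carries the mount point and a comma-separated option 
list, so cpu and cpuacct mounted together resolve to the same path.
{code}
// Hedged sketch: map each cgroup controller named in the mount options to its
// mount point, so "cpu,cpuacct" mounted together still resolves correctly.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CgroupMounts {
  public static Map<String, String> controllerPaths() throws IOException {
    Map<String, String> paths = new HashMap<String, String>();
    BufferedReader in = new BufferedReader(new FileReader("/proc/mounts"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        // Format: <device> <mount point> <fs type> <options> <dump> <pass>
        String[] fields = line.split(" ");
        if (fields.length < 4 || !"cgroup".equals(fields[2])) {
          continue;
        }
        for (String option : fields[3].split(",")) {
          if (option.equals("cpu") || option.equals("cpuacct")
              || option.equals("memory") || option.equals("blkio")) {
            paths.put(option, fields[1]);
          }
        }
      }
    } finally {
      in.close();
    }
    return paths;
  }
}
{code}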


 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3762:
---
Attachment: yarn-3762-1.patch

Sorry, I forgot to rebase and included some HDFS change as well.

 FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 ---

 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-3762-1.patch, yarn-3762-1.patch


 In our testing, we ran into the following ConcurrentModificationException:
 {noformat}
 halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
 queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
 queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
 java.util.ConcurrentModificationException: 
 java.util.ConcurrentModificationException
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
   at java.util.ArrayList$Itr.next(ArrayList.java:851)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569992#comment-14569992
 ] 

Karthik Kambatla commented on YARN-3762:


Changed it to critical and targeting 2.8.0, as it only fails the application 
and not the RM.

 FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 ---

 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: yarn-3762-1.patch, yarn-3762-1.patch


 In our testing, we ran into the following ConcurrentModificationException:
 {noformat}
 halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
 queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
 queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
 java.util.ConcurrentModificationException: 
 java.util.ConcurrentModificationException
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
   at java.util.ArrayList$Itr.next(ArrayList.java:851)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3762:
---
Priority: Critical  (was: Blocker)
Target Version/s: 2.8.0  (was: 2.7.1)

 FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 ---

 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: yarn-3762-1.patch, yarn-3762-1.patch


 In our testing, we ran into the following ConcurrentModificationException:
 {noformat}
 halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
 queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
 queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
 java.util.ConcurrentModificationException: 
 java.util.ConcurrentModificationException
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
   at java.util.ArrayList$Itr.next(ArrayList.java:851)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570007#comment-14570007
 ] 

Zhijie Shen commented on YARN-3725:
---

bq. is there a JIRA for the longer term fix?

Yeah, I've filed YARN-3761 previously.

 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline 
 DT to renew the DT instead of the configured address. This breaks the 
 procedure of submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in the Java code. The REST API response is an encoded token String, so 
 it is inconvenient to deserialize it, set the service address, and serialize 
 it again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3534) Collect memory/cpu usage on the node

2015-06-02 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3534:
--
Attachment: YARN-3534-10.patch

Addressed some of the review comments.

 Collect memory/cpu usage on the node
 

 Key: YARN-3534
 URL: https://issues.apache.org/jira/browse/YARN-3534
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.7.0
Reporter: Inigo Goiri
Assignee: Inigo Goiri
 Attachments: YARN-3534-1.patch, YARN-3534-10.patch, 
 YARN-3534-2.patch, YARN-3534-3.patch, YARN-3534-3.patch, YARN-3534-4.patch, 
 YARN-3534-5.patch, YARN-3534-6.patch, YARN-3534-7.patch, YARN-3534-8.patch, 
 YARN-3534-9.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 YARN should be aware of the resource utilization of the nodes when scheduling 
 containers. For this, this task will implement the collection of memory/cpu 
 usage on the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570082#comment-14570082
 ] 

Hadoop QA commented on YARN-3762:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 28s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 53s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 51s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 32s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 28s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 25s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 17s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12737043/yarn-3762-1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c1d50a9 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8170/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8170/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8170/console |


This message was automatically generated.

 FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 ---

 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: yarn-3762-1.patch, yarn-3762-1.patch


 In our testing, we ran into the following ConcurrentModificationException:
 {noformat}
 halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
 15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
 queueName=root.testyarnpool3, queueCurrentCapacity=0.0, 
 queueMaxCapacity=-1.0, queueApplicationCount=0, queueChildQueueCount=0
 15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
 java.util.ConcurrentModificationException: 
 java.util.ConcurrentModificationException
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
   at java.util.ArrayList$Itr.next(ArrayList.java:851)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3762) FairScheduler: CME on FSParentQueue#getQueueUserAclInfo

2015-06-02 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-3762:
--

 Summary: FairScheduler: CME on FSParentQueue#getQueueUserAclInfo
 Key: YARN-3762
 URL: https://issues.apache.org/jira/browse/YARN-3762
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker


In our testing, we ran into the following ConcurrentModificationException:

{noformat}
halxg.cloudera.com:8042, nodeRackName/rackvb07, nodeNumContainers0
15/05/22 13:02:22 INFO distributedshell.Client: Queue info, 
queueName=root.testyarnpool3, queueCurrentCapacity=0.0, queueMaxCapacity=-1.0, 
queueApplicationCount=0, queueChildQueueCount=0
15/05/22 13:02:22 FATAL distributedshell.Client: Error running Client
java.util.ConcurrentModificationException: 
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.getQueueUserAclInfo(FSParentQueue.java:155)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getQueueUserAclInfo(FairScheduler.java:1395)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueUserAcls(ClientRMService.java:880)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570025#comment-14570025
 ] 

Wangda Tan commented on YARN-3733:
--

Took a look at the patch and discussion. Thanks for working on this 
[~rohithsharma].

I think [~sunilg] mentioned 
https://issues.apache.org/jira/browse/YARN-3733?focusedCommentId=14568880page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14568880
 makes sense to me. If the clusterResource is 0, we can compare the individual 
resource types. It could be:

{code}
Returns <: when l.mem < right.mem || l.cpu < right.cpu
Returns =: when (l.mem <= right.mem && l.cpu >= right.cpu) || (l.mem >= 
right.mem && l.cpu <= right.cpu)
Returns >: when l.mem > right.mem || l.cpu > right.cpu
{code}

This produces the same result as the INF approach in the patch, but it can 
also compare when both l/r have > 0 values. The reason I prefer this is: I'm 
sure the patch can solve the am-resource-percent problem, but with the 
suggested approach we can also be sure of getting a more reasonable result if 
we need to compare non-zero resources when clusterResource is zero (for 
example, sorting applications by their requirements when clusterResource is 
zero).

And to avoid future regression, could you add a test to verify the 
am-resource-limit problem is solved?
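As a standalone illustration of that fallback (a hedged sketch, not the patch 
itself; it only considers memory and vcores and makes the three cases above 
disjoint by treating mixed dominance as equal):
{code}
// Hedged sketch of a per-resource-type comparison used only when the cluster
// resource is zero: strictly smaller in a dimension (and not larger in the
// other) sorts first, the mirror case sorts last, everything else is equal.
public class ZeroClusterCompare {
  public static int compare(long lMem, long lCpu, long rMem, long rCpu) {
    if (lMem < rMem || lCpu < rCpu) {
      if (lMem > rMem || lCpu > rCpu) {
        return 0;   // mixed dominance: treat as equal
      }
      return -1;
    }
    if (lMem > rMem || lCpu > rCpu) {
      return 1;
    }
    return 0;       // exactly equal in both dimensions
  }
}
{code}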

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent task 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated; *all 12 jobs stay in Running state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Philip Langdale (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569899#comment-14569899
 ] 

Philip Langdale commented on YARN-2194:
---

You can remount controllers if you retain the same combination as the existing 
mount point, so I guess you could replace the ',' with something your parsing 
code can handle (or you could fix the parsing code). In general, life is a lot 
easier if you can avoid remounting as you then don't have to worry about 
managing their lifecycle.

I'd argue the most robust thing to do is discover the existing mount point from 
/proc/mounts and then use it (assuming the comma parsing can be fixed) if it's 
present (and don't forget to respect the NodeManager's cgroup paths from 
/proc/self/mounts)

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569869#comment-14569869
 ] 

Hadoop QA commented on YARN-2392:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 23s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   9m 25s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 23s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 56s | The applied patch generated  2 
new checkstyle issues (total was 244, now 245). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 46s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 42s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  52m  1s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  94m 38s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12737003/YARN-2392-002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 03fb5c6 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8169/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8169/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8169/console |


This message was automatically generated.

 add more diags about app retry limits on AM failures
 

 Key: YARN-2392
 URL: https://issues.apache.org/jira/browse/YARN-2392
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Minor
 Attachments: YARN-2392-001.patch, YARN-2392-002.patch, 
 YARN-2392-002.patch


 # when an app fails the failure count is shown, but not what the global + 
 local limits are. If the two are different, they should both be printed. 
 # the YARN-2242 strings don't have enough whitespace between text and the URL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3725) App submission via REST API is broken in secure mode due to Timeline DT service address is empty

2015-06-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569918#comment-14569918
 ] 

Vinod Kumar Vavilapalli commented on YARN-3725:
---

[~zjshen], is there a JIRA for the longer term fix?

 App submission via REST API is broken in secure mode due to Timeline DT 
 service address is empty
 

 Key: YARN-3725
 URL: https://issues.apache.org/jira/browse/YARN-3725
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3725.1.patch


 YARN-2971 changes TimelineClient to use the service address from the Timeline 
 DT to renew the DT instead of the configured address. This breaks the 
 procedure of submitting a YARN app via the REST API in secure mode.
 The problem is that the service address is set by the client instead of the 
 server in the Java code. The REST API response is an encoded token String, so 
 it is inconvenient to deserialize it, set the service address, and serialize 
 it again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570202#comment-14570202
 ] 

Chun Chen commented on YARN-3749:
-

Thanks for reviewing the patch, [~zxu] ! 

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 on 
 RM failover, even though I had initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is 
 in ClientRMService where the value of yarn.resourcemanager.address.rm2 is 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
                                            YarnConfiguration.RM_ADDRESS,
                                            YarnConfiguration.DEFAULT_RM_ADDRESS,
                                            server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2, and we init both 
 RMs before starting either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is therefore already rm2 when rm1 starts.
 So I think it is safer to make a copy of the configuration when initializing 
 each RM.
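 A minimal sketch of that fix (assuming each RM is handed its own 
 YarnConfiguration copy; the real change lives in MiniYARNCluster and may look 
 different):
 {code}
 // Hedged sketch: give each RM in a mini cluster its own copy of the
 // configuration so that per-RM settings such as yarn.resourcemanager.ha.id,
 // written while one RM is initialized, cannot leak into the other RM.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 public class PerRmConfExample {
   public static YarnConfiguration confForRm(Configuration base, String rmId) {
     YarnConfiguration copy = new YarnConfiguration(base); // defensive copy
     copy.set(YarnConfiguration.RM_HA_ID, rmId);
     return copy;
   }
 }
 {code}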



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3558) Additional containers getting reserved from RM in case of Fair scheduler

2015-06-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-3558:
-

Assignee: Sunil G

 Additional containers getting reserved from RM in case of Fair scheduler
 

 Key: YARN-3558
 URL: https://issues.apache.org/jira/browse/YARN-3558
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, resourcemanager
Affects Versions: 2.7.0
 Environment: OS :Suse 11 Sp3
 Setup : 2 RM 2 NM
 Scheduler : Fair scheduler
Reporter: Bibin A Chundatt
Assignee: Sunil G
 Attachments: Amlog.txt, rm.log


 Submit a PI job with 16 maps
 Total containers expected: 16 maps + 1 reduce + 1 AM
 Total containers reserved by the RM: 21
 The below set of containers is never used for execution:
 container_1430213948957_0001_01_20
 container_1430213948957_0001_01_19
 RM container reservations and states
 {code}
  Processing container_1430213948957_0001_01_01 of type START
  Processing container_1430213948957_0001_01_01 of type ACQUIRED
  Processing container_1430213948957_0001_01_01 of type LAUNCHED
  Processing container_1430213948957_0001_01_02 of type START
  Processing container_1430213948957_0001_01_03 of type START
  Processing container_1430213948957_0001_01_02 of type ACQUIRED
  Processing container_1430213948957_0001_01_03 of type ACQUIRED
  Processing container_1430213948957_0001_01_04 of type START
  Processing container_1430213948957_0001_01_05 of type START
  Processing container_1430213948957_0001_01_04 of type ACQUIRED
  Processing container_1430213948957_0001_01_05 of type ACQUIRED
  Processing container_1430213948957_0001_01_02 of type LAUNCHED
  Processing container_1430213948957_0001_01_04 of type LAUNCHED
  Processing container_1430213948957_0001_01_06 of type RESERVED
  Processing container_1430213948957_0001_01_03 of type LAUNCHED
  Processing container_1430213948957_0001_01_05 of type LAUNCHED
  Processing container_1430213948957_0001_01_07 of type START
  Processing container_1430213948957_0001_01_07 of type ACQUIRED
  Processing container_1430213948957_0001_01_07 of type LAUNCHED
  Processing container_1430213948957_0001_01_08 of type RESERVED
  Processing container_1430213948957_0001_01_02 of type FINISHED
  Processing container_1430213948957_0001_01_06 of type START
  Processing container_1430213948957_0001_01_06 of type ACQUIRED
  Processing container_1430213948957_0001_01_06 of type LAUNCHED
  Processing container_1430213948957_0001_01_04 of type FINISHED
  Processing container_1430213948957_0001_01_09 of type START
  Processing container_1430213948957_0001_01_09 of type ACQUIRED
  Processing container_1430213948957_0001_01_09 of type LAUNCHED
  Processing container_1430213948957_0001_01_10 of type RESERVED
  Processing container_1430213948957_0001_01_03 of type FINISHED
  Processing container_1430213948957_0001_01_08 of type START
  Processing container_1430213948957_0001_01_08 of type ACQUIRED
  Processing container_1430213948957_0001_01_08 of type LAUNCHED
  Processing container_1430213948957_0001_01_05 of type FINISHED
  Processing container_1430213948957_0001_01_11 of type START
  Processing container_1430213948957_0001_01_11 of type ACQUIRED
  Processing container_1430213948957_0001_01_11 of type LAUNCHED
  Processing container_1430213948957_0001_01_07 of type FINISHED
  Processing container_1430213948957_0001_01_12 of type START
  Processing container_1430213948957_0001_01_12 of type ACQUIRED
  Processing container_1430213948957_0001_01_12 of type LAUNCHED
  Processing container_1430213948957_0001_01_13 of type RESERVED
  Processing container_1430213948957_0001_01_06 of type FINISHED
  Processing container_1430213948957_0001_01_10 of type START
  Processing container_1430213948957_0001_01_10 of type ACQUIRED
  Processing container_1430213948957_0001_01_10 of type LAUNCHED
  Processing container_1430213948957_0001_01_09 of type FINISHED
  Processing container_1430213948957_0001_01_14 of type START
  Processing container_1430213948957_0001_01_14 of type ACQUIRED
  Processing container_1430213948957_0001_01_14 of type LAUNCHED
  Processing container_1430213948957_0001_01_15 of type RESERVED
  Processing container_1430213948957_0001_01_08 of type FINISHED
  Processing container_1430213948957_0001_01_13 of type START
  Processing container_1430213948957_0001_01_16 of type RESERVED
  Processing container_1430213948957_0001_01_13 of type ACQUIRED
  Processing container_1430213948957_0001_01_13 of type LAUNCHED
  Processing container_1430213948957_0001_01_11 of type FINISHED
  

[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-06-02 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570171#comment-14570171
 ] 

Zhijie Shen commented on YARN-3044:
---

[~Naganarasimha], I'm fine with the last patch. Will do some local test. 
However, the patch doesn't apply because of YARN-1462. I think we need to add 
tag info for v2 publisher too. Would you mind taking care of it?

 [Event producers] Implement RM writing app lifecycle events to ATS
 --

 Key: YARN-3044
 URL: https://issues.apache.org/jira/browse/YARN-3044
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3044-YARN-2928.004.patch, 
 YARN-3044-YARN-2928.005.patch, YARN-3044-YARN-2928.006.patch, 
 YARN-3044-YARN-2928.007.patch, YARN-3044-YARN-2928.008.patch, 
 YARN-3044-YARN-2928.009.patch, YARN-3044.20150325-1.patch, 
 YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch


 Per design in YARN-2928, implement RM writing app lifecycle events to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570275#comment-14570275
 ] 

Jeff Zhang commented on YARN-3755:
--

bq. How about we let individual frameworks like MapReduce/Tez log them as 
needed? That seems like the right place for debugging too - app developers 
don't always get access to the daemon logs.
Makes sense. 

 Log the command of launching containers
 ---

 Key: YARN-3755
 URL: https://issues.apache.org/jira/browse/YARN-3755
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: YARN-3755-1.patch, YARN-3755-2.patch


 In the resource manager log, YARN logs the command for launching the AM, 
 which is very useful. But there is no such log in the NM log for launching 
 containers, so it is difficult to diagnose when containers fail to launch 
 due to some issue in the commands. Although users can look at the commands in 
 the container launch script file, that is an internal detail of YARN that 
 users usually don't know about. From the user's perspective, they only know 
 the commands they specify when building a YARN application. 
 {code}
 2015-06-01 16:06:42,245 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
 to launch container container_1433145984561_0001_01_01 : 
 $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
 -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
 -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
 -Dlog4j.configuration=tez-container-log4j.properties 
 -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA 
 -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570276#comment-14570276
 ] 

Jeff Zhang commented on YARN-3755:
--

Closing this as Won't Fix.

 Log the command of launching containers
 ---

 Key: YARN-3755
 URL: https://issues.apache.org/jira/browse/YARN-3755
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: YARN-3755-1.patch, YARN-3755-2.patch


 In the resource manager log, YARN logs the command for launching the AM, 
 which is very useful. But there is no such log in the NM log for launching 
 containers, so it is difficult to diagnose when containers fail to launch 
 due to some issue in the commands. Although users can look at the commands in 
 the container launch script file, that is an internal detail of YARN that 
 users usually don't know about. From the user's perspective, they only know 
 the commands they specify when building a YARN application. 
 {code}
 2015-06-01 16:06:42,245 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
 to launch container container_1433145984561_0001_01_01 : 
 $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
 -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
 -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
 -Dlog4j.configuration=tez-container-log4j.properties 
 -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA 
 -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2194) Cgroups cease to work in RHEL7

2015-06-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570284#comment-14570284
 ] 

Sidharta Seethana commented on YARN-2194:
-

[~mjacobs], yes, that is what I am proposing. If we handle the path 
separation correctly, we should be able to continue using the current 
(deprecated, but still workable) mechanism for cgroups.
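
A minimal sketch of what "handling the path separation correctly" could look like; this is an illustration only, not the actual CgroupsLCEResourcesHandler code, and the helper class below is hypothetical:

{code}
// Resolve the mount point for a controller even when the kernel reports it as part of
// a comma-separated list such as "cpu,cpuacct" (as /proc/mounts does on RHEL7).
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class CgroupMountResolver {
  public static String findMountForController(String procMountsPath, String controller)
      throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(procMountsPath))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // /proc/mounts format: <device> <mount-point> <fs-type> <options> ...
        String[] fields = line.split("\\s+");
        if (fields.length < 4 || !"cgroup".equals(fields[2])) {
          continue;
        }
        // Controller names show up in the mount options, e.g. "rw,...,cpu,cpuacct".
        if (Arrays.asList(fields[3].split(",")).contains(controller)) {
          return fields[1];
        }
      }
    }
    return null;
  }

  public static void main(String[] args) throws IOException {
    // Prints e.g. /sys/fs/cgroup/cpu,cpuacct on RHEL7, or the plain cpu mount elsewhere.
    System.out.println(findMountForController("/proc/mounts", "cpu"));
  }
}
{code}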

 Cgroups cease to work in RHEL7
 --

 Key: YARN-2194
 URL: https://issues.apache.org/jira/browse/YARN-2194
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Wei Yan
Assignee: Wei Yan
Priority: Critical
 Attachments: YARN-2194-1.patch, YARN-2194-2.patch, YARN-2194-3.patch


 In RHEL7, the CPU controller is named cpu,cpuacct. The comma in the 
 controller name leads to container launch failure. 
 RHEL7 deprecates libcgroup and recommends the use of systemd. However, 
 systemd has certain shortcomings, as identified in this JIRA (see comments). 
 This JIRA only fixes the failure, and doesn't try to use systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3763) Support for fuzzy search in ATS

2015-06-02 Thread Jeff Zhang (JIRA)
Jeff Zhang created YARN-3763:


 Summary: Support for fuzzy search in ATS
 Key: YARN-3763
 URL: https://issues.apache.org/jira/browse/YARN-3763
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.7.0
Reporter: Jeff Zhang


Currently ATS only supports exact match. Sometimes fuzzy match may be helpful 
when the entities in ATS have some common prefix or suffix.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3763) Support fuzzy search in ATS

2015-06-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated YARN-3763:
-
Summary: Support fuzzy search in ATS  (was: Support for fuzzy search in ATS)

 Support fuzzy search in ATS
 ---

 Key: YARN-3763
 URL: https://issues.apache.org/jira/browse/YARN-3763
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.7.0
Reporter: Jeff Zhang

 Currently ATS only supports exact match. Sometimes fuzzy match may be helpful 
 when the entities in ATS have some common prefix or suffix.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3763) Support fuzzy search in ATS

2015-06-02 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated YARN-3763:
-
Description: Currently ATS only supports exact match. Sometimes fuzzy match 
may be helpful when the entities in ATS have some common prefix or suffix. 
Link with TEZ-2531  (was: Currently ATS only supports exact match. Sometimes 
fuzzy match may be helpful when the entities in ATS have some common prefix 
or suffix.)

 Support fuzzy search in ATS
 ---

 Key: YARN-3763
 URL: https://issues.apache.org/jira/browse/YARN-3763
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.7.0
Reporter: Jeff Zhang

 Currently ATS only supports exact match. Sometimes fuzzy match may be helpful 
 when the entities in ATS have some common prefix or suffix. Link with 
 TEZ-2531



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568913#comment-14568913
 ] 

Hadoop QA commented on YARN-3733:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  6s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 54s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 33s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| | |  40m 10s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736802/0001-YARN-3733.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8166/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8166/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8166/console |


This message was automatically generated.

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated; *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568926#comment-14568926
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #946 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/946/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3759) Include command line, localization info and env vars on AM launch failure

2015-06-02 Thread Steve Loughran (JIRA)
Steve Loughran created YARN-3759:


 Summary: Include command line, localization info and env vars on 
AM launch failure
 Key: YARN-3759
 URL: https://issues.apache.org/jira/browse/YARN-3759
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Steve Loughran
Priority: Minor


While trying to diagnose AM launch failures, it's important to be able to get at 
the final, expanded {{CLASSPATH}} and other env variables. We don't get that 
today: you can log the unexpanded values on the client, and tweak NM 
ContainerExecutor log levels to DEBUG to get some of this, but you don't get it 
in the task logs, and tuning the NM log level isn't viable on a large, busy cluster.

Launch failures should include some env specifics:
# list of env vars (ideally, full getenv values), with some stripping of 
sensitive options (I'm thinking of AWS env vars here)
# command line
# path localisations

These can go in the task logs, we don't need to include them in the application 
report.
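
As an illustration of the first item above, a rough sketch of a redacted env dump; the class, method, and the AWS_ prefix check are assumptions for illustration, not existing YARN code:

{code}
import java.util.Map;
import java.util.TreeMap;

public class LaunchContextDump {
  // Render the container environment, blanking out values that look sensitive.
  public static String dumpEnv(Map<String, String> env) {
    StringBuilder sb = new StringBuilder("Container environment:\n");
    for (Map.Entry<String, String> e : new TreeMap<>(env).entrySet()) {
      String value = e.getKey().startsWith("AWS_") ? "<redacted>" : e.getValue();
      sb.append("  ").append(e.getKey()).append('=').append(value).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(dumpEnv(System.getenv()));
  }
}
{code}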



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-06-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568947#comment-14568947
 ] 

Junping Du commented on YARN-41:


bq. Junping Du I have updated the patch with review comments. Can you have a 
look into this?
Sorry for being late on this, as I was traveling last week. I will review your 
latest patch today.

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, 
 YARN-41-8.patch, YARN-41.patch


 Instead of waiting for the NM expiry, the RM should remove and handle an NM 
 which is shut down gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568920#comment-14568920
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #216 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/216/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569171#comment-14569171
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2144 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2144/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0001-YARN-3733.patch

The updated patch fixes the 2nd and 3rd scenarios in the table above (the 
scenarios of this issue) and refactors the test code.

For an overall solution that also handles input combinations like the 4th and 5th 
from the table above, we need to explore more on how to define the fraction and how 
to decide which resource is dominant. Any suggestions on this?



 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated; *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568880#comment-14568880
 ] 

Sunil G commented on YARN-3733:
---

Hi [~rohithsharma]
Thanks for the detailed scenario.

Scenario 4 can be possible, correct? clusterResource <0,0>: lhs <2,2> and rhs 
<3,2>.

Currently getResourceAsValue gives back the max ratio of mem/vcores if it is 
dominant, else it gives the min ratio.
If clusterResource is 0, could we directly use the max of mem/vcores in the 
dominant case, and the min otherwise? This would need a better algorithm when 
more resource types come in.
This is not completely perfect, as it treats memory and vcores leniently. Please 
share your thoughts.
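
A rough sketch of the fallback being proposed here, assuming an empty cluster resource; the class and method names below are illustrative, not the actual DominantResourceCalculator API:

{code}
public final class EmptyClusterFallback {
  // When the cluster resource is <0,0>, fall back to raw values instead of ratios.
  static float resourceAsValue(long memory, long vcores, boolean dominant) {
    return dominant ? Math.max(memory, vcores) : Math.min(memory, vcores);
  }

  // Compare by the dominant value first, then by the non-dominant one as a tie-breaker.
  static int compare(long lhsMem, long lhsVcores, long rhsMem, long rhsVcores) {
    int byDominant = Float.compare(
        resourceAsValue(lhsMem, lhsVcores, true),
        resourceAsValue(rhsMem, rhsVcores, true));
    if (byDominant != 0) {
      return byDominant;
    }
    return Float.compare(
        resourceAsValue(lhsMem, lhsVcores, false),
        resourceAsValue(rhsMem, rhsVcores, false));
  }

  public static void main(String[] args) {
    // Scenario 4 above: lhs <2,2> vs rhs <3,2> with an empty cluster resource.
    System.out.println(compare(2, 2, 3, 2)); // negative: lhs is considered smaller
  }
}
{code}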

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated; *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-06-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569182#comment-14569182
 ] 

Junping Du commented on YARN-41:


Thanks [~devaraj.k] for updating the patch and addressing the previous comments! 
The latest patch LGTM. +1. Will commit it tomorrow if there are no further comments 
on the code from other reviewers.
In addition, the patch introduces a new SHUTDOWN category in NodeState, the UI, 
and Cluster Metrics. Although it doesn't break any public APIs, we should mark 
this JIRA as incompatible because its behavior differs from previous releases in 
the UI, CLI, and Metrics (to notify users or third-party management and monitoring 
software). In general, I think it should be fine to keep the plan to include 
this patch in 2.x releases. However, please comment here to let us know if you 
have any concerns.

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, 
 YARN-41-8.patch, YARN-41.patch


 Instead of waiting for the NM expiry, the RM should remove and handle an NM 
 which is shut down gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3603) Application Attempts page confusing

2015-06-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3603:
--
Attachment: 0002-YARN-3603.patch

Attaching an updated version of the patch, along with screenshots of the UI. 
[~tgraves], could you please take a look at this? Thank you.

 Application Attempts page confusing
 ---

 Key: YARN-3603
 URL: https://issues.apache.org/jira/browse/YARN-3603
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.8.0
Reporter: Thomas Graves
Assignee: Sunil G
 Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png


 The application attempts page 
 (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
 is a bit confusing about what is going on. I think the table of containers 
 there is only for running containers, and when the app is completed or killed 
 it's empty. The table should have a label stating so. 
 Also, the AM Container field is a link when running but not when it's killed. 
 That might be confusing.
 There is no link to the logs on this page, but there is one in the app attempt 
 table when looking at 
 http://rm:8088/cluster/app/application_1431101480046_0003



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3603) Application Attempts page confusing

2015-06-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3603:
--
Attachment: ahs1.png

 Application Attempts page confusing
 ---

 Key: YARN-3603
 URL: https://issues.apache.org/jira/browse/YARN-3603
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.8.0
Reporter: Thomas Graves
Assignee: Sunil G
 Attachments: 0001-YARN-3603.patch, 0002-YARN-3603.patch, ahs1.png


 The application attempts page 
 (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
 is a bit confusing about what is going on. I think the table of containers 
 there is only for running containers, and when the app is completed or killed 
 it's empty. The table should have a label stating so. 
 Also, the AM Container field is a link when running but not when it's killed. 
 That might be confusing.
 There is no link to the logs on this page, but there is one in the app attempt 
 table when looking at 
 http://rm:8088/cluster/app/application_1431101480046_0003



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3754:
---
Priority: Critical  (was: Major)
Target Version/s: 2.7.1

 Race condition when the NodeManager is shutting down and container is launched
 --

 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Sunil G
Priority: Critical

 The container is launched and returned to ContainerImpl, but the 
 NodeManager has closed the DB connection, which results in 
 {{org.iq80.leveldb.DBException: Closed}}. 
 *Attaching the exception trace*
 {code}
 2015-05-30 02:11:49,122 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
  Unable to update state store diagnostics for 
 container_e310_1432817693365_3338_01_02
 java.io.IOException: org.iq80.leveldb.DBException: Closed
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.iq80.leveldb.DBException: Closed
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
 ... 15 more
 {code}
 We can add a check for whether the DB is closed while we move the container from 
 the ACQUIRED state.
 As per the discussion in YARN-3585, have added the same.
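
A hypothetical sketch of such a "DB is closed" guard; the StateStore interface below is an assumption for illustration, not the actual NMLeveldbStateStoreService API:

{code}
import java.io.IOException;

interface StateStore {
  boolean isClosed();
  void storeContainerDiagnostics(String containerId, String diagnostics) throws IOException;
}

class DiagnosticsUpdater {
  private final StateStore store;

  DiagnosticsUpdater(StateStore store) {
    this.store = store;
  }

  // Skip the write instead of failing when the NM is shutting down and the DB is closed.
  void updateDiagnostics(String containerId, String diagnostics) {
    if (store.isClosed()) {
      return;
    }
    try {
      store.storeContainerDiagnostics(containerId, diagnostics);
    } catch (IOException e) {
      // Best effort: diagnostics updates are advisory, so log-and-continue is enough here.
    }
  }
}
{code}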



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3754:
---
Target Version/s: 2.8.0  (was: 2.7.1)

 Race condition when the NodeManager is shutting down and container is launched
 --

 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Sunil G
Priority: Critical

 The container is launched and returned to ContainerImpl, but the 
 NodeManager has closed the DB connection, which results in 
 {{org.iq80.leveldb.DBException: Closed}}. 
 *Attaching the exception trace*
 {code}
 2015-05-30 02:11:49,122 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
  Unable to update state store diagnostics for 
 container_e310_1432817693365_3338_01_02
 java.io.IOException: org.iq80.leveldb.DBException: Closed
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.iq80.leveldb.DBException: Closed
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
 ... 15 more
 {code}
 We can add a check for whether the DB is closed while we move the container from 
 the ACQUIRED state.
 As per the discussion in YARN-3585, have added the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3760) Log aggregation failures

2015-06-02 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569289#comment-14569289
 ] 

Daryn Sharp commented on YARN-3760:
---

Cancelled tokens trigger the retry proxy bug.

 Log aggregation failures 
 -

 Key: YARN-3760
 URL: https://issues.apache.org/jira/browse/YARN-3760
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Daryn Sharp
Priority: Critical

 The aggregated log file does not appear to be properly closed when writes 
 fail.  This leaves a lease renewer active in the NM that spams the NN with 
 lease renewals.  If the token is marked not to be cancelled, the renewals 
 appear to continue until the token expires.  If the token is cancelled, the 
 periodic renew spam turns into a flood of failed connections until the lease 
 renewer gives up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569262#comment-14569262
 ] 

Karthik Kambatla commented on YARN-2962:


YARN-3643 should help alleviate most of the issues users face. This JIRA could 
be targeted only at trunk, without worrying about rolling upgrades.

 ZKRMStateStore: Limit the number of znodes under a znode
 

 Key: YARN-2962
 URL: https://issues.apache.org/jira/browse/YARN-2962
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-2962.01.patch, YARN-2962.2.patch, YARN-2962.3.patch


 We ran into this issue where we were hitting the default ZK server message 
 size configs, primarily because the message had too many znodes even though 
 individually they were all small.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3760) Log aggregation failures

2015-06-02 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-3760:
-

 Summary: Log aggregation failures 
 Key: YARN-3760
 URL: https://issues.apache.org/jira/browse/YARN-3760
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Daryn Sharp
Priority: Critical


The aggregated log file does not appear to be properly closed when writes fail. 
 This leaves a lease renewer active in the NM that spams the NN with lease 
renewals.  If the token is marked not to be cancelled, the renewals appear to 
continue until the token expires.  If the token is cancelled, the periodic 
renew spam turns into a flood of failed connections until the lease renewer 
gives up.
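
A minimal sketch of the close-even-on-failure pattern this report suggests is missing; the stream type below is a stand-in for illustration, not the NM's actual log aggregation classes:

{code}
import java.io.IOException;
import java.io.OutputStream;

class AggregatedLogUpload {
  static void upload(OutputStream out, byte[] logData) throws IOException {
    try {
      out.write(logData);   // a failure here must not leave the stream (and its HDFS lease) open
    } finally {
      try {
        out.close();        // closing releases the lease so the NM stops renewing it
      } catch (IOException closeError) {
        // Swallow: the original write failure is the more useful exception to surface.
      }
    }
  }
}
{code}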



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569237#comment-14569237
 ] 

Jason Lowe commented on YARN-3758:
--

First off, one should never set the heap size and the container size to the 
same value.  The container size needs to be big enough to hold the entire 
process, not just the heap, so it needs to also consider the overhead of the 
JVM itself and any off-heap usage (e.g.: JVM code, data, thread stacks, shared 
libs, off-heap allocations, etc.).  If you set the heap size to the same size 
as the container then when the heap fills up the process overall will be bigger 
than the heap size and YARN will kill the container.
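
For illustration only (these numbers are assumptions, not values taken from this report), a sizing that leaves headroom for JVM overhead, with the heap at roughly 80% of the container size, would look something like:

{code}
yarn.scheduler.minimum-allocation-mb : 256
mapreduce.map.memory.mb : 512
mapreduce.reduce.memory.mb : 512
mapreduce.map.java.opts : -Xmx410m
mapreduce.reduce.java.opts : -Xmx410m
{code}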

Couple of things to check:
- Does the job configuration show that it is indeed asking for only 256 MB 
containers for tasks?  Check the job configuration link for the job on the job 
history server or the configuration link for the AM's UI while the job is 
running.
- Check the RM logs to verify what minimum allocation size it is loading from 
the configs and what request size it is allocating per task

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters
 First cluster is 5 nodes, 1 default application queue, Capacity Scheduler, 8G 
  physical memory each node
 Second cluster is 10 nodes, 2 application queues, Fair Scheduler, 230G 
  physical memory each node
 Whenever a mapreduce job is running, I want the resourcemanager to set the 
  minimum memory of 256m for the container
 So I was changing the configuration in yarn-site.xml and mapred-site.xml
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see used memory 
  of 256m in the web console ( http://installedIP:8088/cluster/nodes )
 But in the second cluster, whenever a mapreduce job is running, I can see used 
  memory of 1024m in the web console ( http://installedIP:8088/cluster/nodes ) 
 I know the default memory value is 1024m, so if the memory setting is not 
  changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory 
  setting is not working in the second cluster.
 Why does this difference happen? 
 Am I setting the configuration wrong,
 or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569267#comment-14569267
 ] 

Karthik Kambatla commented on YARN-3753:


Fix looks reasonable to me.

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch


 RM failed to come up with the following error while submitting a mapreduce 
 job.
 {code:title=RM log}
 2015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 

[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-06-02 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569278#comment-14569278
 ] 

Devaraj K commented on YARN-41:
---

Thanks a lot [~djp] for your review and comments, I really appreciate your help 
on reviewing the patch.

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41-5.patch, YARN-41-6.patch, YARN-41-7.patch, 
 YARN-41-8.patch, YARN-41.patch


 Instead of waiting for the NM expiry, the RM should remove and handle an NM 
 which is shut down gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569342#comment-14569342
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2162 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2162/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs

2015-06-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569322#comment-14569322
 ] 

Hudson commented on YARN-1462:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #214 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/214/])
YARN-1462. Correct fix version from branch-2.7.1 to branch-2.8 in (xgong: rev 
0b5cfacde638bc25cc010cd9236369237b4e51a8)
* hadoop-yarn-project/CHANGES.txt


 AHS API and other AHS changes to handle tags for completed MR jobs
 --

 Key: YARN-1462
 URL: https://issues.apache.org/jira/browse/YARN-1462
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Karthik Kambatla
Assignee: Xuan Gong
 Fix For: 2.8.0

 Attachments: YARN-1462-branch-2.7-1.2.patch, 
 YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch, 
 YARN-1462.3.patch


 AHS related work for tags. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569381#comment-14569381
 ] 

Sunil G commented on YARN-3754:
---

[~bibinchundatt] Could you also please attach the NM logs here?

 Race condition when the NodeManager is shutting down and container is launched
 --

 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Sunil G
Priority: Critical

 The container is launched and returned to ContainerImpl, but the 
 NodeManager has closed the DB connection, which results in 
 {{org.iq80.leveldb.DBException: Closed}}. 
 *Attaching the exception trace*
 {code}
 2015-05-30 02:11:49,122 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
  Unable to update state store diagnostics for 
 container_e310_1432817693365_3338_01_02
 java.io.IOException: org.iq80.leveldb.DBException: Closed
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.iq80.leveldb.DBException: Closed
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
 ... 15 more
 {code}
 We can add a check for whether the DB is closed while we move the container from 
 the ACQUIRED state.
 As per the discussion in YARN-3585, have added the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569347#comment-14569347
 ] 

Vinod Kumar Vavilapalli commented on YARN-3755:
---

We had this long ago in YARN, but removed it as the log files were getting 
inundated in large/high throughput clusters. If you combine the command line 
with the environment (classpath etc), this can get very long.

How about we let individual frameworks like MapReduce/Tez log them as needed? 
That seems like the right place for debugging too - app developers don't always 
get access to the daemon logs.

 Log the command of launching containers
 ---

 Key: YARN-3755
 URL: https://issues.apache.org/jira/browse/YARN-3755
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: YARN-3755-1.patch, YARN-3755-2.patch


 In the resource manager log, YARN logs the command for launching the AM, 
 which is very useful. But there is no such log in the NM log for launching 
 containers, so it is difficult to diagnose when containers fail to launch 
 due to some issue in the commands. Although users can look at the commands in 
 the container launch script file, that is an internal detail of YARN that 
 users usually don't know about. From the user's perspective, they only know 
 the commands they specify when building a YARN application. 
 {code}
 2015-06-01 16:06:42,245 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
 to launch container container_1433145984561_0001_01_01 : 
 $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
 -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
 -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
 -Dlog4j.configuration=tez-container-log4j.properties 
 -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA 
 -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3761) Set delegation token service address at the server side

2015-06-02 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-3761:
-

 Summary: Set delegation token service address at the server side
 Key: YARN-3761
 URL: https://issues.apache.org/jira/browse/YARN-3761
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: security
Reporter: Zhijie Shen


Nowadays, YARN components generate the delegation token without the service 
address set, and leave it to the client to set. With our Java client library, 
that is usually fine. However, if users are using the REST API, it's going to be a 
problem: the delegation token is returned as a URL string. It's quite unfriendly 
for a thin client to have to deserialize the URL string, set the token service 
address, and serialize it again for further usage. If we move the task of 
setting the service address to the server side, the client can avoid this 
trouble.
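
For context, a sketch of the round trip a thin client currently has to do; this assumes the Hadoop Token#decodeFromUrlString/encodeToUrlString/setService APIs, so double-check the signatures against your Hadoop version:

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class RestTokenClient {
  // Deserialize the URL-encoded token, set the RM address as its service, re-serialize it.
  public static String addServiceAddress(String urlEncodedToken, String rmHost, int rmPort)
      throws IOException {
    Token<TokenIdentifier> token = new Token<>();
    token.decodeFromUrlString(urlEncodedToken);
    InetSocketAddress rmAddr = new InetSocketAddress(rmHost, rmPort);
    token.setService(new Text(rmAddr.getHostString() + ":" + rmAddr.getPort()));
    return token.encodeToUrlString();
  }
}
{code}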



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources

2015-06-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569437#comment-14569437
 ] 

Varun Vasudev commented on YARN-2618:
-

[~kasha] - should we commit this to the YARN-2139 branch? Should we get the 
branch up to date with trunk first?

 Avoid over-allocation of disk resources
 ---

 Key: YARN-2618
 URL: https://issues.apache.org/jira/browse/YARN-2618
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan
  Labels: BB2015-05-TBR
 Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, 
 YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch


 Subtask of YARN-2139. 
 This should include
 - Add API support for introducing disk I/O as the 3rd resource type.
 - NM should report this information to the RM
 - RM should consider this to avoid over-allocation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2618) Avoid over-allocation of disk resources

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569453#comment-14569453
 ] 

Hadoop QA commented on YARN-2618:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12723515/YARN-2618-7.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a2bd621 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8167/console |


This message was automatically generated.

 Avoid over-allocation of disk resources
 ---

 Key: YARN-2618
 URL: https://issues.apache.org/jira/browse/YARN-2618
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan
  Labels: BB2015-05-TBR
 Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, 
 YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch


 Subtask of YARN-2139. 
 This should include
 - Add API support for introducing disk I/O as the 3rd resource type.
 - NM should report this information to the RM
 - RM should consider this to avoid over-allocation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569456#comment-14569456
 ] 

Xuan Gong commented on YARN-3753:
-

+1, LGTM. Checking this in.

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch


 RM failed to come up with the following error while submitting a mapreduce 
 job.
 {code:title=RM log}
 2015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 

[jira] [Commented] (YARN-3753) RM failed to come up with java.io.IOException: Wait for ZKClient creation timed out

2015-06-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569465#comment-14569465
 ] 

Xuan Gong commented on YARN-3753:
-

Committed into branch-2.7. Thanks, Jian

 RM failed to come up with java.io.IOException: Wait for ZKClient creation 
 timed out
 -

 Key: YARN-3753
 URL: https://issues.apache.org/jira/browse/YARN-3753
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sumana Sathish
Assignee: Jian He
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3753.1.patch, YARN-3753.2.patch, YARN-3753.patch


 RM failed to come up with the following error while submitting a MapReduce job.
 {code:title=RM log}
 2015-05-30 03:40:12,190 ERROR recovery.RMStateStore 
 (RMStateStore.java:transition(179)) - Error storing app: 
 application_1432956515242_0006
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-05-30 03:40:12,194 FATAL resourcemanager.ResourceManager 
 (ResourceManager.java:handle(750)) - Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.io.IOException: Wait for ZKClient creation timed out
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1098)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:609)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:175)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:160)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 

[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-06-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569480#comment-14569480
 ] 

Sunil G commented on YARN-3591:
---

If we have a new API which returns only the current set of error dirs (without the full dirs)
{code}
synchronized List<String> getErrorDirs()
{code}
then could we modify LocalResourcesTrackerImpl#checkLocalizedResources so that we call *removeResource* on those localized resources whose parent directory is present in the error dirs?
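
A minimal sketch of that idea, using simplified stand-in classes (DirectoryHandler, SimpleResourceTracker) rather than the real NM types; getErrorDirs and removeResource below only mirror the names proposed above and are not the actual implementation.
{code}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DirectoryHandler {
  private final List<String> errorDirs = new ArrayList<>();

  // Proposed API: return only the directories currently marked as errored,
  // without mixing in the "full" directories.
  synchronized List<String> getErrorDirs() {
    return new ArrayList<>(errorDirs);
  }

  synchronized void markError(String dir) {
    errorDirs.add(dir);
  }
}

class SimpleResourceTracker {
  // localized resource id -> local path of the resource
  private final Map<String, Path> localizedResources = new HashMap<>();

  void track(String resourceId, String localPath) {
    localizedResources.put(resourceId, Paths.get(localPath));
  }

  void removeResource(String resourceId) {
    localizedResources.remove(resourceId);
    System.out.println("removed resource " + resourceId);
  }

  // Analogue of checkLocalizedResources: drop every resource whose path lives
  // under one of the errored directories.
  void checkLocalizedResources(DirectoryHandler dirs) {
    List<String> toRemove = new ArrayList<>();
    for (Map.Entry<String, Path> e : localizedResources.entrySet()) {
      for (String errorDir : dirs.getErrorDirs()) {
        if (e.getValue().startsWith(Paths.get(errorDir))) {
          toRemove.add(e.getKey());
          break;
        }
      }
    }
    toRemove.forEach(this::removeResource);
  }

  public static void main(String[] args) {
    DirectoryHandler dirs = new DirectoryHandler();
    SimpleResourceTracker tracker = new SimpleResourceTracker();
    tracker.track("rsrc_1", "/data1/nm-local-dir/filecache/10/job.jar");
    tracker.track("rsrc_2", "/data2/nm-local-dir/filecache/11/lib.jar");
    dirs.markError("/data1/nm-local-dir");
    tracker.checkLocalizedResources(dirs); // removes rsrc_1 only
  }
}
{code}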



 Resource Localisation on a bad disk causes subsequent containers failure 
 -

 Key: YARN-3591
 URL: https://issues.apache.org/jira/browse/YARN-3591
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
 YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch


 It happens when a resource is localised on a disk and, after localisation, that disk goes bad. The NM keeps the paths of localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) is called, which calls file.exists() on the localised path.
 In some cases when a disk has gone bad, inodes are still cached and file.exists() returns true, but at the time of reading, the file will not open.
 Note: file.exists() actually calls stat64 natively, which returns true because it was able to find the inode information from the OS.
 A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good, it should return an array of paths with length at least 1.
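 
 A minimal sketch of the proposed check under the assumption above (standalone code, not the actual NM isResourcePresent): list the parent directory, which issues a real open() on it, instead of relying on file.exists().
 {code}
 import java.io.File;
 
 public class ResourcePresenceCheck {
   // Returns true only if the parent directory can actually be opened and the
   // resource appears among its entries; a bad disk with cached inodes fails here.
   static boolean isResourcePresent(File rsrc) {
     File parent = rsrc.getParentFile();
     if (parent == null) {
       return rsrc.exists();
     }
     String[] entries = parent.list();   // triggers a real open() on the directory
     if (entries == null) {
       return false;                     // parent could not be opened: treat as missing
     }
     for (String name : entries) {
       if (name.equals(rsrc.getName())) {
         return true;
       }
     }
     return false;
   }
 
   public static void main(String[] args) {
     System.out.println(isResourcePresent(new File("/tmp")));
   }
 }
 {code}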



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Summary: DominantRC#compare() does not work as expected if cluster resource 
is empty  (was:  On RM restart AM getting more than maximum possible memory 
when many  tasks in queue)

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 AMs get allocated for 12 jobs and all 12 start running.
 No other YARN child is initiated; *all 12 jobs stay in the Running state forever*.
 Expected
 ===
 Only 6 should be running at a time since the AM limit is .5 (3072 MB).
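 
 A simplified sketch (not the actual DominantResourceCalculator code) of a dominant-share comparison, illustrating the corner case behind this issue: with an empty cluster resource the shares degenerate and every comparison returns 0, so requests look equal and limits such as the AM limit no longer behave as expected.
 {code}
 public class DominantShareSketch {
   static double dominantShare(int memMb, int vcores, int clusterMemMb, int clusterVcores) {
     // Dominant share = the larger of the per-dimension shares of the cluster total.
     return Math.max((double) memMb / clusterMemMb, (double) vcores / clusterVcores);
   }
 
   static int compare(int lMem, int lVcores, int rMem, int rVcores,
                      int clusterMemMb, int clusterVcores) {
     return Double.compare(dominantShare(lMem, lVcores, clusterMemMb, clusterVcores),
                           dominantShare(rMem, rVcores, clusterMemMb, clusterVcores));
   }
 
   public static void main(String[] args) {
     // Normal cluster: 1024 MB / 1 vcore is clearly smaller than 2048 MB / 2 vcores.
     System.out.println(compare(1024, 1, 2048, 2, 6144, 12)); // negative
     // Empty cluster resource (e.g. right after an RM switch, before NMs register):
     // both shares evaluate to Infinity, so the comparison returns 0.
     System.out.println(compare(1024, 1, 2048, 2, 0, 0));     // 0
   }
 }
 {code}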



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568586#comment-14568586
 ] 

Chun Chen commented on YARN-3749:
-

bq. It looks like we need to keep conf.set(YarnConfiguration.RM_HA_ID, RM1_NODE_ID); in TestRMEmbeddedElector to fix this test failure.
Sorry, my bad. Uploaded YARN-3749.7.patch to fix that and added a test in {{TestYarnConfiguration}} to make sure {{YarnConfiguration#updateConnectAddr}} won't add a suffix to NM service address configurations.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 when the RM fails over, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it is ClientRMService that changes the value of yarn.resourcemanager.address.rm2 to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
                                            YarnConfiguration.RM_ADDRESS,
                                            YarnConfiguration.DEFAULT_RM_ADDRESS,
                                            server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2, and we init both RMs before starting either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 when rm1 starts. So I think it is safer to make a copy of the configuration when initializing each RM.
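 
 A minimal sketch of that fix, assuming the standard Configuration copy constructor and the YarnConfiguration constants: give each RM its own copy of the configuration so that setting yarn.resourcemanager.ha.id for rm2 cannot leak into rm1.
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 
 public class PerRmConfExample {
   public static void main(String[] args) {
     Configuration base = new YarnConfiguration();
     base.set(YarnConfiguration.RM_HA_IDS, "rm1,rm2");
 
     // One independent copy per RM; mutations stay local to that RM.
     Configuration rm1Conf = new Configuration(base);
     rm1Conf.set(YarnConfiguration.RM_HA_ID, "rm1");
 
     Configuration rm2Conf = new Configuration(base);
     rm2Conf.set(YarnConfiguration.RM_HA_ID, "rm2");
 
     System.out.println(rm1Conf.get(YarnConfiguration.RM_HA_ID)); // rm1
     System.out.println(rm2Conf.get(YarnConfiguration.RM_HA_ID)); // rm2
   }
 }
 {code}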



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568593#comment-14568593
 ] 

Hadoop QA commented on YARN-3585:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 30s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 51s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 56s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 40s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 14s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m 14s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  45m  3s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736738/0001-YARN-3585.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8159/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8159/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8159/console |


This message was automatically generated.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: 0001-YARN-3585.patch, YARN-3585.patch


 With NM recovery enabled, after decommission the NodeManager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and the JNI leveldb thread stack:
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}
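 
 A generic illustration (not NM code) of why the process above cannot exit: the JVM only terminates once all non-daemon threads have finished, so a lingering non-daemon thread, like the leveldb JNI thread in the dump, keeps the process alive after shutdown.
 {code}
 public class NonDaemonThreadExample {
   public static void main(String[] args) {
     Thread worker = new Thread(() -> {
       try {
         Thread.sleep(60_000);      // simulates a thread that never winds down
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
       }
     }, "lingering-worker");
     // worker.setDaemon(true);     // uncommenting this would let the JVM exit immediately
     worker.start();
     System.out.println("main() returns, but the JVM stays up for ~60s");
   }
 }
 {code}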



--
This message was sent by Atlassian JIRA

[jira] [Commented] (YARN-3755) Log the command of launching containers

2015-06-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568606#comment-14568606
 ] 

Hadoop QA commented on YARN-3755:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 43s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 41s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 37s | The applied patch generated  3 
new checkstyle issues (total was 58, now 60). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 13s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m  9s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  43m 32s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12736742/YARN-3755-1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 990078b |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8160/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8160/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8160/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8160/console |


This message was automatically generated.

 Log the command of launching containers
 ---

 Key: YARN-3755
 URL: https://issues.apache.org/jira/browse/YARN-3755
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: YARN-3755-1.patch


 In the ResourceManager log, YARN logs the command for launching the AM, which is very useful. But there is no such log in the NM log for launching containers, which makes it difficult to diagnose containers that fail to launch due to some issue in the commands. Although a user can look at the commands in the container launch script file, that is an internal detail of YARN that users usually don't know about. From the user's perspective, they only know the commands they specify when building a YARN application.
 {code}
 2015-06-01 16:06:42,245 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
 to launch container container_1433145984561_0001_01_01 : 
 $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
 -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
 -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
 -Dlog4j.configuration=tez-container-log4j.properties 
 -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=info,CLA 
 -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
 {code}
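 
 A hedged sketch of the kind of NM-side log line this issue asks for; the class, logger, and container id below are illustrative stand-ins, not the actual ContainerLaunch code.
 {code}
 import java.util.Arrays;
 import java.util.List;
 import java.util.logging.Logger;
 
 public class LaunchCommandLogging {
   private static final Logger LOG =
       Logger.getLogger(LaunchCommandLogging.class.getName());
 
   // Log the full command line before the container launch script is written,
   // mirroring what the RM already logs for the AM launch command.
   static void logLaunchCommand(String containerId, List<String> commands) {
     LOG.info("Command to launch container " + containerId + " : "
         + String.join(" ", commands));
   }
 
   public static void main(String[] args) {
     logLaunchCommand("container_1433145984561_0001_01_000002",
         Arrays.asList("$JAVA_HOME/bin/java", "-Xmx1024m", "com.example.MyAppMaster"));
   }
 }
 {code}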



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)