[jira] [Created] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-1060:


 Summary: Two tests in TestFairScheduler are missing @Test 
annotation
 Key: YARN-1060
 URL: https://issues.apache.org/jira/browse/YARN-1060
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza


Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Niranjan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranjan Singh reassigned YARN-1060:


Assignee: Niranjan Singh

 Two tests in TestFairScheduler are missing @Test annotation
 ---

 Key: YARN-1060
 URL: https://issues.apache.org/jira/browse/YARN-1060
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza
Assignee: Niranjan Singh
  Labels: newbie

 Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Niranjan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranjan Singh updated YARN-1060:
-

Attachment: YARN-1060.patch

Added @Test annotations
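
For context, a minimal JUnit 4 sketch of the kind of change this patch makes; the 
class and method names are illustrative, not the actual TestFairScheduler code:

{noformat}
// Illustration only: a JUnit 4 test method without @Test is silently skipped
// by the runner, which is how such tests can go stale unnoticed.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ExampleSchedulerTest {

  // Missing annotation: JUnit 4 never executes this method.
  public void neverRuns() {
    assertEquals(2, 1 + 1);
  }

  // With @Test the runner picks the method up and reports pass/fail.
  @Test
  public void actuallyRuns() {
    assertEquals(2, 1 + 1);
  }
}
{noformat}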

 Two tests in TestFairScheduler are missing @Test annotation
 ---

 Key: YARN-1060
 URL: https://issues.apache.org/jira/browse/YARN-1060
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza
Assignee: Niranjan Singh
  Labels: newbie
 Attachments: YARN-1060.patch


 Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-1061:
---

 Summary: NodeManager is indefinitely waiting for nodeHeartBeat() 
response from ResourceManager.
 Key: YARN-1061
 URL: https://issues.apache.org/jira/browse/YARN-1061
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith Sharma K S


It is observed that in one scenario, the NodeManager waits indefinitely for a 
nodeHeartbeat response from the ResourceManager when the ResourceManager is in a 
hung state.

The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737990#comment-13737990
 ] 

Rohith Sharma K S commented on YARN-1061:
-

Extracted thread dump from the NodeManager:

{noformat}
Node Status Updater prio=10 tid=0x414dc000 nid=0x1d754 in 
Object.wait() [0x7fefa2dec000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client.call(Client.java:1231)
- locked 0xdef4f158 (a org.apache.hadoop.ipc.Client$Call)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at $Proxy28.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at $Proxy30.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:348)
{noformat}
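
The dump shows the updater thread parked in Object.wait() inside the IPC client. 
One possible direction, sketched below purely as an illustration (this is not the 
actual Hadoop IPC timeout mechanism), is to bound the blocking call so the caller 
gets an exception instead of waiting forever:

{noformat}
// Hedged sketch: run the blocking call on an executor and give up after a
// deadline. heartbeat() here is a stand-in for the real nodeHeartbeat() RPC.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedHeartbeat {
  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Callable<String> heartbeat = () -> {
      Thread.sleep(60_000L);          // simulates an RM that does not answer
      return "heartbeat response";
    };
    Future<String> pending = executor.submit(heartbeat);
    try {
      System.out.println(pending.get(10, TimeUnit.SECONDS));
    } catch (TimeoutException e) {
      pending.cancel(true);           // stop waiting and let retry logic take over
      System.out.println("nodeHeartbeat timed out; retry or re-register");
    } finally {
      executor.shutdownNow();
    }
  }
}
{noformat}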

 NodeManager is indefinitely waiting for nodeHeartBeat() response from 
 ResourceManager.
 -

 Key: YARN-1061
 URL: https://issues.apache.org/jira/browse/YARN-1061
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith Sharma K S

 It is observed that in one scenario, the NodeManager waits indefinitely for a 
 nodeHeartbeat response from the ResourceManager when the ResourceManager is in 
 a hung state.
 The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1059) IllegalArgumentException while starting YARN

2013-08-13 Thread rvller (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738002#comment-13738002
 ] 

rvller commented on YARN-1059:
--

It was my fault; another parameter in the XML was still spread over multiple 
lines, which is why the RM was not able to start.

When I changed all of the parameters to a single line 
(<value>10.245.1.30:9030</value>, etc.), the RM was able to start.

I suppose that this is a bug, because it's confusing.
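
For reference, a small sketch of the failure mode (my own illustration, assuming 
Hadoop's NetUtils is on the classpath; it is not part of this JIRA): the 
whitespace and newlines that survive from a multi-line <value> element make the 
address fail the host:port check, while the single-line value parses.

{noformat}
// Illustration of the parse behavior behind the stack trace quoted below.
import java.net.InetSocketAddress;
import org.apache.hadoop.net.NetUtils;

public class HostPortCheck {
  public static void main(String[] args) {
    // Value as it comes out of a multi-line <value> element: the newline and
    // indentation are part of the string.
    String multiLine = "\n  10.245.1.30:9030\n  ";
    try {
      NetUtils.createSocketAddr(multiLine);
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
    // Single-line value parses as expected.
    InetSocketAddress ok = NetUtils.createSocketAddr("10.245.1.30:9030");
    System.out.println("Parsed: " + ok);
  }
}
{noformat}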

 IllegalArgumentException while starting YARN
 

 Key: YARN-1059
 URL: https://issues.apache.org/jira/browse/YARN-1059
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
 Environment: Ubuntu 12.04, hadoop 2.0.5
Reporter: rvller

 Here is the traceback while starting the YARN resource manager:
 2013-08-12 12:53:29,319 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
 ResourceManager
 java.lang.IllegalArgumentException: Does not contain a valid host:port 
 authority: 
 10.245.1.30:9030
  (configuration property 'yarn.resourcemanager.resource-tracker.address')
   at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:193)
   at 
 org.apache.hadoop.conf.Configuration.getSocketAddr(Configuration.java:1450)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.init(ResourceTrackerService.java:105)
   at 
 org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.init(ResourceManager.java:255)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:710)
 And here is the yarn-site.xml:
 <configuration>
 <property>
 <name>
 yarn.resourcemanager.address
 </name>
 <value>
 10.245.1.30:9010
 </value>
 <description>
 </description>
 </property>
 <property>
 <name>
 yarn.resourcemanager.scheduler.address
 </name>
 <value>
 10.245.1.30:9020
 </value>
 <description>
 </description>
 </property>
 <property>
 <name>
 yarn.resourcemanager.resource-tracker.address
 </name>
 <value>
 10.245.1.30:9030
 </value>
 <description>
 </description>
 </property>
 <property>
 <name>
 yarn.resourcemanager.admin.address
 </name>
 <value>
 10.245.1.30:9040
 </value>
 <description>
 </description>
 </property>
 <property>
 <name>
 yarn.resourcemanager.webapp.address
 </name>
 <value>
 10.245.1.30:9050
 </value>
 <description>
 </description>
 </property>
 <!-- Site specific YARN configuration properties -->
 </configuration>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1008) MiniYARNCluster with multiple nodemanagers, all nodes have same key for allocations

2013-08-13 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738175#comment-13738175
 ] 

Alejandro Abdelnur commented on YARN-1008:
--

What the patch does:

Introduces a new configuration property in the RM, 
{{RM_SCHEDULER_USE_PORT_FOR_NODE_NAME}} (defaulting to {{false}}). 

This is an RM property, but it is to be used by the scheduler implementations 
when matching a ResourceRequest to a node.

If the property is set to {{false}}, things work as they do today: the matching 
is done using only the HOSTNAME obtained from the NodeId.

If the property is set to {{true}}, the matching is done using the 
HOSTNAME:PORT obtained from the NodeId.

There are no changes on the NM or AM side.

If the property is set to {{true}}, the AM must be aware of the setting and, 
when creating resource requests, it must set the location to the HOSTNAME:PORT 
of the node instead of just the HOSTNAME.

The renaming of {{SchedulerNode#getHostName()}} to 
{{SchedulerNode#getNodeName()}} is to make it obvious to developers that the 
value may not be the HOSTNAME. Added javadocs explaining this clearly.

This works with all 3 schedulers.

The main motivation for this change is to be able to use the YARN minicluster 
with multiple NMs and to be able to target a specific NM instance.

We could expose this for production use if there is a need. For that we would 
need to:

* Expose via ApplicationReport the node matching mode: HOSTNAME or HOSTNAME:PORT
* Provide a mechanism for AMs only aware of HOSTNAME matching mode to work with 
HOSTNAME:PORT mode

If there is a use case for this in a real deployment, we should follow this up 
with another JIRA.
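
A minimal sketch of the matching-key idea described above (the constant name 
comes from this comment; the key string, class, and method are my own 
illustrative assumptions, not the actual patch):

{noformat}
// When the flag is off, nodes are matched by hostname only (today's behavior);
// when it is on, requests must target "host:port", which lets a MiniYARNCluster
// with several NMs on one host address a specific NM instance.
public final class NodeNameSketch {

  // Hypothetical key string for RM_SCHEDULER_USE_PORT_FOR_NODE_NAME.
  static final String USE_PORT_FOR_NODE_NAME =
      "yarn.scheduler.use-port-for-node-name";

  static String nodeName(String host, int port, boolean usePortForNodeName) {
    return usePortForNodeName ? host + ":" + port : host;
  }

  public static void main(String[] args) {
    System.out.println(nodeName("nm1.example.com", 45454, false)); // nm1.example.com
    System.out.println(nodeName("nm1.example.com", 45454, true));  // nm1.example.com:45454
  }
}
{noformat}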


 MiniYARNCluster with multiple nodemanagers, all nodes have same key for 
 allocations
 ---

 Key: YARN-1008
 URL: https://issues.apache.org/jira/browse/YARN-1008
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
 Attachments: YARN-1008.patch, YARN-1008.patch, YARN-1008.patch


 While the NMs are keyed using the NodeId, the allocation is done based on the 
 hostname. 
 This makes the different nodes indistinguishable to the scheduler.
 There should be an option to enable the host:port instead of just the host for 
 allocations. The nodes reported to the AM should report the 'key' (host or 
 host:port). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738337#comment-13738337
 ] 

Jason Lowe commented on YARN-1036:
--

Agree with Ravi that we should focus on porting the change to 0.23 and fix any 
issues that also apply to trunk/branch-2 in a separate JIRA.  Therefore I agree 
with Omkar that we should simply break or omit the LOCALIZED case from the 
switch statement since 0.23 doesn't have localCacheDirectoryManager to match 
the trunk behavior.  Otherwise patch looks good to me.

 Distributed Cache gives inconsistent result if cache files get deleted from 
 task tracker 
 -

 Key: YARN-1036
 URL: https://issues.apache.org/jira/browse/YARN-1036
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.9
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch


 This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
 that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1062) MRAppMaster takes a long time to init taskAttempts

2013-08-13 Thread shenhong (JIRA)
shenhong created YARN-1062:
--

 Summary: MRAppMaster takes a long time to init taskAttempts
 Key: YARN-1062
 URL: https://issues.apache.org/jira/browse/YARN-1062
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 0.23.6
Reporter: shenhong


In our cluster, the MRAppMaster takes a long time to init TaskAttempts; the 
following log spans one minute:

2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
/r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED

The reason is that resolving one host to a rack takes almost 25 ms; our HDFS 
cluster has more than 4000 datanodes, so a job with a large input will take a 
long time to init TaskAttempts (at ~25 ms per lookup, a few thousand distinct 
hosts already add up to over a minute).

Is there any good idea to solve this problem?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1062) MRAppMaster takes a long time to init taskAttempts

2013-08-13 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated YARN-1062:
---

Description: 
In our cluster, the MRAppMaster takes a long time to init TaskAttempts; the 
following log spans one minute:

2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
/r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED
2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
/r03b02
2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
/r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
/r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED

The reason is that resolving one host to a rack takes almost 25 ms (we resolve 
the host-to-rack mapping with a Python script). Our HDFS cluster has more than 
4000 datanodes, so a job with a large input will take a long time to init 
TaskAttempts (at ~25 ms per lookup, a few thousand distinct hosts already add up 
to over a minute).

Is there any good idea to solve this problem?

  was:
In our cluster, MRAppMaster take a long time to init taskAttempt, the following 
log last one minute,

2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
/r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
/r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED

The reason is: resolved one host to rack almost take 25ms, our hdfs cluster is 
more than 4000 datanodes, then a large input job will take a long time to init 
TaskAttempt.

Is there any good idea to solve this problem.


 MRAppMaster takes a long time to init taskAttempts
 

 Key: YARN-1062
 URL: https://issues.apache.org/jira/browse/YARN-1062
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 0.23.6
Reporter: shenhong

 In our cluster, MRAppMaster take a long time to init taskAttempt, the 
 following log last one minute,
 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
 /r03b05
 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
 /r03b02
 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 The reason is: resolved one host to rack almost take 25ms (We resolve the 
 host to rack by a python script). Our hdfs cluster is more than 4000 
 datanodes, then a large input job will take a long time to init TaskAttempt.
 Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738413#comment-13738413
 ] 

Vinod Kumar Vavilapalli commented on YARN-1056:
---

Config changes ARE API changes. If you wish to rename it right now, mark this as 
a blocker and let the release manager know. Otherwise, you should deprecate this 
config, add a new one, and wait for the next release. I'm okay either way.
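
For the deprecation route, a sketch of the usual Hadoop pattern; the corrected 
key name below is a placeholder I made up, not necessarily what the patch uses:

{noformat}
// Hedged sketch: map the old, doubled "resourcemanager" key to a corrected key
// so existing configs keep working for one release. Key names are illustrative.
import org.apache.hadoop.conf.Configuration;

public class DeprecateConfigSketch {
  public static void main(String[] args) {
    Configuration.addDeprecation(
        "yarn.resourcemanager.resourcemanager.connect.max.wait.secs",
        new String[] {"yarn.resourcemanager.connect.max-wait.secs"});

    Configuration conf = new Configuration();
    // A value set under the deprecated key is visible under the new key.
    conf.set("yarn.resourcemanager.resourcemanager.connect.max.wait.secs", "900");
    System.out.println(conf.get("yarn.resourcemanager.connect.max-wait.secs"));
  }
}
{noformat}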

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: conf
 Attachments: yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1062) MRAppMaster takes a long time to init taskAttempts

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738417#comment-13738417
 ] 

Vinod Kumar Vavilapalli commented on YARN-1062:
---

You should definitely see if you can improve your Python script by looking at a 
static resolution file instead of dynamically pinging DNS at runtime. That'll 
clearly improve your performance.

Overall, we wish to expose this information to AMs from the RM itself so that 
each AM doesn't need to do this itself. That is tracked via YARN-435. If that's 
okay with you, please close this as a duplicate.
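
A sketch of the "static resolution file" idea above (my own illustration of the 
approach; the one-"host rack"-pair-per-line format is an assumption): load the 
host-to-rack table once and answer lookups from memory instead of forking a 
script per host.

{noformat}
// Constant-time, in-memory rack lookups built from a static mapping.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaticRackResolver {
  private final Map<String, String> hostToRack = new HashMap<>();

  StaticRackResolver(List<String> mappingLines) {
    for (String line : mappingLines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length == 2) {
        hostToRack.put(parts[0], parts[1]);
      }
    }
  }

  String resolve(String host) {
    // Unknown hosts fall back to a default rack.
    return hostToRack.getOrDefault(host, "/default-rack");
  }

  public static void main(String[] args) {
    StaticRackResolver resolver = new StaticRackResolver(Arrays.asList(
        "r01f11012.yh.aliyun.com /r01f11",
        "r03b05042.yh.aliyun.com /r03b05"));
    System.out.println(resolver.resolve("r01f11012.yh.aliyun.com")); // /r01f11
    System.out.println(resolver.resolve("unknown.host"));            // /default-rack
  }
}
{noformat}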

 MRAppMaster takes a long time to init taskAttempts
 

 Key: YARN-1062
 URL: https://issues.apache.org/jira/browse/YARN-1062
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 0.23.6
Reporter: shenhong

 In our cluster, MRAppMaster take a long time to init taskAttempt, the 
 following log last one minute,
 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
 /r03b05
 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
 /r03b02
 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 The reason is: resolved one host to rack almost take 25ms (We resolve the 
 host to rack by a python script). Our hdfs cluster is more than 4000 
 datanodes, then a large input job will take a long time to init TaskAttempt.
 Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1023) [YARN-321] Webservices REST API's support for Application History

2013-08-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1023:
--

Summary: [YARN-321] Webservices REST API's support for Application History  
(was: [YARN-321] Weservices REST API's support for Application History)

 [YARN-321] Webservices REST API's support for Application History
 -

 Key: YARN-1023
 URL: https://issues.apache.org/jira/browse/YARN-1023
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: YARN-321
Reporter: Devaraj K
Assignee: Devaraj K
 Attachments: YARN-1023-v0.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1036:
---

Attachment: YARN-1036.branch-0.23.patch

Thanks Jason and Omkar for your comments. OK, here is the updated patch, which 
has the src/main code exactly as Omkar suggested.
I've tested it by using a pen drive to simulate a drive failure, and the file is 
indeed localized again.

 Distributed Cache gives inconsistent result if cache files get deleted from 
 task tracker 
 -

 Key: YARN-1036
 URL: https://issues.apache.org/jira/browse/YARN-1036
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.9
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: YARN-1036.branch-0.23.patch, 
 YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch


 This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
 that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-979) [YARN-321] Adding application attempt and container to ApplicationHistoryProtocol

2013-08-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738519#comment-13738519
 ] 

Zhijie Shen commented on YARN-979:
--

Here are some high-level comments on the patch:

1. To make the protocol work, it is required to define the corresponding protos 
in yarn_service.proto and to update application_history_service.proto.

2. The setters of the request/response APIs should be @Public, shouldn't they?

3. ApplicationHistoryProtocol needs to be marked as well.

 [YARN-321] Adding application attempt and container to 
 ApplicationHistoryProtocol
 -

 Key: YARN-979
 URL: https://issues.apache.org/jira/browse/YARN-979
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-979-1.patch


  Adding application attempt and container to ApplicationHistoryProtocol
 Thanks,
 Mayank

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Attachment: yarn-1056-1.patch

Reuploading patch to kick Jenkins.

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Target Version/s: 2.1.0-beta  (was: 2.1.1-beta)

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738541#comment-13738541
 ] 

Omkar Vinit Joshi commented on YARN-1036:
-

+1 ... thanks for updating the patch..lgtm.

 Distributed Cache gives inconsistent result if cache files get deleted from 
 task tracker 
 -

 Key: YARN-1036
 URL: https://issues.apache.org/jira/browse/YARN-1036
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.9
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: YARN-1036.branch-0.23.patch, 
 YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch


 This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
 that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738560#comment-13738560
 ] 

Arun C Murthy commented on YARN-1056:
-

Looks fine, I'll commit after jenkins. Thanks [~kkambatl].

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

2013-08-13 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738556#comment-13738556
 ] 

Wei Yan commented on YARN-1021:
---

Updates in the patch: reduce the number of threads needed for the NMSimulators. 
Before, each NMSimulator used one thread (for its AsyncDispatcher). Now the 
AsyncDispatcher is removed, and the total number of threads needed depends only 
on the thread pool size.
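
Roughly, the threading change amounts to something like the sketch below (my 
illustration of the idea, not the patch itself): every simulated NM becomes a 
task on one shared scheduled pool, so the thread count is bounded by the pool 
size rather than by the number of NMs.

{noformat}
// Illustrative only: many simulated NMs share one scheduled pool instead of
// holding one dispatcher thread each.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedPoolHeartbeats {
  public static void main(String[] args) throws InterruptedException {
    int poolSize = 4;          // total threads bounded by this value
    int simulatedNms = 10;     // could be thousands without adding threads
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(poolSize);
    for (int i = 0; i < simulatedNms; i++) {
      final int nodeId = i;
      pool.scheduleAtFixedRate(
          () -> System.out.println("heartbeat from simulated nm-" + nodeId),
          0, 1, TimeUnit.SECONDS);
    }
    TimeUnit.SECONDS.sleep(3);
    pool.shutdownNow();
  }
}
{noformat}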

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workload. Each scheduler algorithm has its own set of 
 features, and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantee, resource availability, etc. It is very important to 
 evaluate a scheduler algorithm very well before we deploy it in a production 
 cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
 algorithm. Evaluating in a real cluster is always time and cost consuming, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 which can predict how well a scheduler algorithm works for some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads in a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with a reasonable amount of confidence, thereby aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AMs heartbeat events from within the same JVM.
 To keep track of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real time metrics while executing, including:
 * Resource usages for whole cluster and each queue, which can be utilized to 
 configure cluster and queue's capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the  scheduler behavior 
 (individual jobs turn around time, throughput, fairness, capacity guarantee, 
 etc).
 * Several key metrics of scheduler algorithm, such as time cost of each 
 scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
 developers to find the code spots and scalability limits.
 The simulator will provide real time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738570#comment-13738570
 ] 

Hadoop QA commented on YARN-1021:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597774/YARN-1021.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-assemblies hadoop-tools/hadoop-sls hadoop-tools/hadoop-tools-dist.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1705//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1705//console

This message is automatically generated.

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workload. Each scheduler algorithm has its own set of 
 features, and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantee, resource availability, etc. It is very important to 
 evaluate a scheduler algorithm very well before we deploy it in a production 
 cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
 algorithm. Evaluating in a real cluster is always time and cost consuming, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 which can predict how well a scheduler algorithm works for some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads in a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with a reasonable amount of confidence, thereby aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AMs heartbeat events from within the same JVM.
 To keep track of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real time metrics while executing, including:
 * Resource usages for whole cluster and each queue, which can be utilized to 
 configure cluster and queue's capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the  scheduler behavior 
 (individual jobs turn around time, throughput, fairness, capacity guarantee, 
 etc).
 * Several key metrics of scheduler algorithm, such as time cost of each 
 scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
 developers to find the code spots and scalability limits.
 The simulator will provide real time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738572#comment-13738572
 ] 

Jian He commented on YARN-1056:
---

Hi [~kkambatl], do you think it's also necessary to change 
'yarn.resourcemanager.fs.rm-state-store.uri' to 
'yarn.resourcemanager.fs.state-store.uri' for consistency with 'zk.state-store' 
?

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738582#comment-13738582
 ] 

Hadoop QA commented on YARN-1056:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597773/yarn-1056-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1706//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1706//console

This message is automatically generated.

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738595#comment-13738595
 ] 

Karthik Kambatla commented on YARN-1056:


[~jianhe], good point. Let me upload a patch including that change shortly.

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1062) MRAppMaster takes a long time to init taskAttempts

2013-08-13 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738596#comment-13738596
 ] 

shenhong commented on YARN-1062:


Thanks Vinod Kumar Vavilapalli, I think YARN-435 is okay with me.


 MRAppMaster takes a long time to init taskAttempts
 

 Key: YARN-1062
 URL: https://issues.apache.org/jira/browse/YARN-1062
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 0.23.6
Reporter: shenhong

 In our cluster, MRAppMaster take a long time to init taskAttempt, the 
 following log last one minute,
 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
 /r03b05
 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
 /r03b02
 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 The reason is: resolved one host to rack almost take 25ms (We resolve the 
 host to rack by a python script). Our hdfs cluster is more than 4000 
 datanodes, then a large input job will take a long time to init TaskAttempt.
 Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738601#comment-13738601
 ] 

Omkar Vinit Joshi commented on YARN-1061:
-

Are you able to reproduce this scenario? Can you please enable DEBUG logs 
(HADOOP_ROOT_LOGGER and YARN_ROOT_LOGGER) and attach them to this JIRA? How big 
is your cluster? What is the frequency at which the NodeManagers are 
heartbeating? Can you also attach yarn-site.xml? Which version are you using?

 NodeManager is indefinitely waiting for nodeHeartBeat() response from 
 ResourceManager.
 -

 Key: YARN-1061
 URL: https://issues.apache.org/jira/browse/YARN-1061
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith Sharma K S

 It is observed that in one scenario, the NodeManager waits indefinitely for a 
 nodeHeartbeat response from the ResourceManager when the ResourceManager is in 
 a hung state.
 The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-1062) MRAppMaster takes a long time to init taskAttempts

2013-08-13 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong resolved YARN-1062.


Resolution: Duplicate

 MRAppMaster takes a long time to init taskAttempts
 

 Key: YARN-1062
 URL: https://issues.apache.org/jira/browse/YARN-1062
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 0.23.6
Reporter: shenhong

 In our cluster, MRAppMaster take a long time to init taskAttempt, the 
 following log last one minute,
 2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to 
 /r01f11
 2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to 
 /r03b05
 2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to 
 /r03b02
 2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to 
 /r02f02
 2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
 attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to 
 UNASSIGNED
 The reason is: resolved one host to rack almost take 25ms (We resolve the 
 host to rack by a python script). Our hdfs cluster is more than 4000 
 datanodes, then a large input job will take a long time to init TaskAttempt.
 Is there any good idea to solve this problem. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1030) Adding AHS as service of RM

2013-08-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1030:
--

Attachment: YARN-1030.2.patch

Thanks [~devaraj.k] for your review. I've updated the patch according to your 
comments. If YARN-953 is committed first, I'll remove the change in pom.xml from 
this patch. For now I'm keeping the change so as not to break the build.

 Adding AHS as service of RM
 ---

 Key: YARN-1030
 URL: https://issues.apache.org/jira/browse/YARN-1030
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1030.1.patch, YARN-1030.2.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1030) Adding AHS as service of RM

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738622#comment-13738622
 ] 

Hadoop QA commented on YARN-1030:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597781/YARN-1030.2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1707//console

This message is automatically generated.

 Adding AHS as service of RM
 ---

 Key: YARN-1030
 URL: https://issues.apache.org/jira/browse/YARN-1030
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1030.1.patch, YARN-1030.2.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1056:
---

Attachment: yarn-1056-2.patch

Updated fs config to be fs.state-store.uri instead of fs.rm-state-store.uri

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch, yarn-1056-2.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-435) Make it easier to access cluster topology information in an AM

2013-08-13 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738647#comment-13738647
 ] 

shenhong commented on YARN-435:
---

Firstly, if the AM gets all nodes in the cluster, including their rack 
information, by calling the RM, this will increase pressure on the RM's network; 
for example, when the cluster has more than 5000 datanodes.

Secondly, if the YARN cluster only has 100 NodeManagers but the HDFS cluster it 
accesses has more than 5000 datanodes, we can't get all the nodes, including 
their rack information, that way. However, the AM needs the information for all 
the datanodes in its job.splitmetainfo file in order to init TaskAttempts. In 
this case, we can't get all the nodes by calling the RM.

 Make it easier to access cluster topology information in an AM
 --

 Key: YARN-435
 URL: https://issues.apache.org/jira/browse/YARN-435
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Omkar Vinit Joshi

 ClientRMProtocol exposes a getClusterNodes api that provides a report on all 
 nodes in the cluster including their rack information. 
 However, this requires the AM to open and establish a separate connection to 
 the RM in addition to one for the AMRMProtocol. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738651#comment-13738651
 ] 

Jian He commented on YARN-1056:
---

Looks good, +1

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch, yarn-1056-2.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}

2013-08-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738686#comment-13738686
 ] 

Hadoop QA commented on YARN-1056:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12597782/yarn-1056-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1708//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1708//console

This message is automatically generated.

 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
 

 Key: YARN-1056
 URL: https://issues.apache.org/jira/browse/YARN-1056
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
  Labels: conf
 Attachments: yarn-1056-1.patch, yarn-1056-1.patch, yarn-1056-2.patch


 Fix configs 
 yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
  to have a *resourcemanager* only once, make them consistent with other such 
 yarn configs and add entries in yarn-default.xml

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker

2013-08-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738754#comment-13738754
 ] 

Jason Lowe commented on YARN-1036:
--

+1 lgtm as well.  Committing this.

 Distributed Cache gives inconsistent result if cache files get deleted from 
 task tracker 
 -

 Key: YARN-1036
 URL: https://issues.apache.org/jira/browse/YARN-1036
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.9
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: YARN-1036.branch-0.23.patch, 
 YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch


 This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because 
 that one had been closed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738845#comment-13738845
 ] 

Omkar Vinit Joshi commented on YARN-573:


+1, lgtm for branch-0.23.

 Shared data structures in Public Localizer and Private Localizer are not 
 Thread safe.
 -

 Key: YARN-573
 URL: https://issues.apache.org/jira/browse/YARN-573
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Critical
 Fix For: 3.0.0, 2.1.1-beta

 Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch, 
 YARN-573.branch-0.23-08132013.patch


 PublicLocalizer
 1) pending accessed by addResource (part of event handling) and run method 
 (as a part of PublicLocalizer.run() ).
 PrivateLocalizer
 1) pending accessed by addResource (part of event handling) and 
 findNextResource (i.remove()). Also update method should be fixed. It too is 
 sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.

2013-08-13 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-573:


Fix Version/s: 0.23.10

+1 lgtm as well, thanks Mit and Omkar!  I committed this to branch-0.23.

 Shared data structures in Public Localizer and Private Localizer are not 
 Thread safe.
 -

 Key: YARN-573
 URL: https://issues.apache.org/jira/browse/YARN-573
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Critical
 Fix For: 3.0.0, 0.23.10, 2.1.1-beta

 Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch, 
 YARN-573.branch-0.23-08132013.patch


 PublicLocalizer
 1) pending accessed by addResource (part of event handling) and run method 
 (as a part of PublicLocalizer.run() ).
 PrivateLocalizer
 1) pending accessed by addResource (part of event handling) and 
 findNextResource (i.remove()). Also update method should be fixed. It too is 
 sharing pending list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738934#comment-13738934
 ] 

Karthik Kambatla commented on YARN-1058:


I was expecting the first one, and Bikas is right about the second one.

When I kill the job client, the job does finish successfully. However, the AM 
for the recovered attempt fails to write the history. 
{noformat}
2013-08-13 13:57:32,440 ERROR [eventHandlingThread] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[eventHandlingThread,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on 
/tmp/hadoop-yarn/staging/kasha/.staging/job_1376427059607_0002/job_1376427059607_0002_2.jhist:
 File does not exist. Holder DFSClient_NONMAPREDUCE_416024880_1 does not have 
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
...  
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2037)

at 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
at 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$1.run(JobHistoryEventHandler.java:276)
at java.lang.Thread.run(Thread.java:662)
{noformat}

 Recovery issues on RM Restart with FileSystemRMStateStore
 -

 Key: YARN-1058
 URL: https://issues.apache.org/jira/browse/YARN-1058
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 App recovery doesn't work as expected using FileSystemRMStateStore.
 Steps to reproduce:
 - Ran sleep job with a single map and sleep time of 2 mins
 - Restarted RM while the map task is still running
 - The first attempt fails with the following error
 {noformat}
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
  Password not found for ApplicationAttempt 
 appattempt_1376294441253_0001_01
   at org.apache.hadoop.ipc.Client.call(Client.java:1404)
   at org.apache.hadoop.ipc.Client.call(Client.java:1357)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at $Proxy28.finishApplicationMaster(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
 {noformat}
 - The second attempt fails with a different error:
 {noformat}
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
  No lease on 
 /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist:
  File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have 
 any open files.
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-337) RM handles killed application tracking URL poorly

2013-08-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738968#comment-13738968
 ] 

Thomas Graves commented on YARN-337:


+1 looks good. Thanks Jason!  Feel free to commit it.

 RM handles killed application tracking URL poorly
 -

 Key: YARN-337
 URL: https://issues.apache.org/jira/browse/YARN-337
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.5
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: usability
 Attachments: YARN-337.patch


 When the ResourceManager kills an application, it leaves the proxy URL 
 redirecting to the original tracking URL for the application even though the 
 ApplicationMaster is no longer there to service it.  It should redirect it 
 somewhere more useful, like the RM's web page for the application, where the 
 user can find that the application was killed and links to the AM logs.
 In addition, sometimes the AM during teardown from the kill can attempt to 
 unregister and provide an updated tracking URL, but unfortunately the RM has 
 forgotten the AM due to the kill and refuses to process the unregistration. 
  Instead it logs:
 {noformat}
 2013-01-09 17:37:49,671 [IPC Server handler 2 on 8030] ERROR
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 AppAttemptId doesnt exist in cache appattempt_1357575694478_28614_01
 {noformat}
 It should go ahead and process the unregistration to update the tracking URL 
 since the application offered it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1024) Define a virtual core unambigiously

2013-08-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738985#comment-13738985
 ] 

Sandy Ryza commented on YARN-1024:
--

I've been thinking a lot about this, and wanted to propose a modified approach, 
inspired by an offline discussion with Arun and his max-vcores idea 
(https://issues.apache.org/jira/browse/YARN-1024?focusedCommentId=13730074page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13730074).

First, my assumptions about how CPUs work:
* A CPU is essentially a bathtub full of processing power that can be doled out 
to threads, with a limit per thread based on the power of each core within it.
* To give X processing power to a thread means that within a standard unit of 
time, roughly some number of instructions proportional to X can be executed for 
that thread. 
* No more than a certain amount of processing power (the amount of processing 
power per core) can be given to each thread.
* We can use CGroups to say that a task gets some fraction of the system's 
processing power.
* This means that if we have 5 cores with Y processing power each, we can give 
5 threads Y processing power each, or 6 threads 5Y/6 processing power each, but 
we can't give 4 threads 5Y/4 processing power each.
* It never makes sense to use CGroups to assign a higher fraction of the system's 
processing power than (numthreads the task can take advantage of / number of 
cores) to a task.
* Equivalently, if my CPU has X processing power per core, it never makes sense 
to assign more than (numthreads the task can take advantage of) * X processing 
power to a task.

So as long as we account for that last constraint, we can essentially view 
processing power as a fluid resource like memory.  With this in mind, we can:
1. Split virtual cores into cores and yarnComputeUnitsPerCore.  Requests can 
include both and nodes can be configured with both.
2. Have a cluster-defined maxComputeUnitsPerCore, which would be the smallest 
yarnComputeUnitsPerCore on any node.  We min all yarnComputeUnitsPerCore 
requests with this number when they hit the RM.
3. Use YCUs, not cores, for scheduling.  I.e. the scheduler thinks of a node's 
CPU capacity in terms of the number of YCUs it can handle and thinks of a 
resource's CPU request in terms of its (normalized yarnComputeUnitsPerCore * # 
cores).  We use YCUs for DRF.
4. If we make YCUs small enough, no need for fractional anything.

This reduces to a number-of-cores-based approach if all containers are 
requested with yarnComputeUnitsPerCore=infinity, and reduces to a YCU approach 
if maxComputeUnitsPerCore is set to infinity.  Predictability, simplicity, and 
scheduling flexibility can be traded off per cluster without overloading the 
same concept with multiple definitions.

This doesn't take into account heterogeneous hardware within a cluster, but I 
think (2) can be tweaked to handle this by holding a value for each node (I can 
elaborate on how this would work). It also doesn't take into account pinning 
threads to CPUs, but I don't think it's any less extensible for ultimately 
dealing with this than other proposals.

Sorry for the long-windedness. Bobby, would this provide the flexibility you're 
looking for?
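
To make steps 2-3 concrete, here is a hypothetical sketch of the normalization; the 
class and method names are made up for illustration only and are not proposed API.

{code:java}
// Hypothetical sketch of normalizing a request's yarnComputeUnitsPerCore
// against the cluster-wide maxComputeUnitsPerCore and charging the scheduler
// in YCUs. Names are illustrative only.
public final class YcuNormalizer {

  private final int maxComputeUnitsPerCore; // smallest per-core YCUs of any node

  public YcuNormalizer(int maxComputeUnitsPerCore) {
    this.maxComputeUnitsPerCore = maxComputeUnitsPerCore;
  }

  /** Total YCUs the scheduler charges for a request of (cores, ycusPerCore). */
  public int toYcus(int requestedCores, int requestedYcusPerCore) {
    // A task can never use more than one core's worth of processing power per
    // thread, so cap the per-core request at the cluster-wide maximum.
    int normalizedPerCore = Math.min(requestedYcusPerCore, maxComputeUnitsPerCore);
    return requestedCores * normalizedPerCore;
  }

  public static void main(String[] args) {
    YcuNormalizer normalizer = new YcuNormalizer(10); // e.g. 10 YCUs per core
    System.out.println(normalizer.toYcus(4, 25));     // capped: 4 * 10 = 40
    System.out.println(normalizer.toYcus(6, 8));      // under cap: 6 * 8 = 48
  }
}
{code}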

 Define a virtual core unambigiously
 ---

 Key: YARN-1024
 URL: https://issues.apache.org/jira/browse/YARN-1024
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 We need to clearly define the meaning of a virtual core unambiguously so that 
 it's easy to migrate applications between clusters.
 For e.g. here is Amazon EC2 definition of ECU: 
 http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
 Essentially we need to clearly define a YARN Virtual Core (YVC).
 Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the 
 equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-292) ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt

2013-08-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739011#comment-13739011
 ] 

Zhijie Shen commented on YARN-292:
--

Did more investigation on this issue:

{code}
2012-12-26 08:41:15,030 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
Calling allocate on removed or non existant application 
appattempt_1356385141279_49525_01
{code}
This log indicates that ArrayIndexOutOfBoundsException happens because the 
application is not found. There're three possibilities where the application is 
not found:

1. The application hasn't been added into FiFoScheduler#applications. If that is 
the case, FiFoScheduler will not send an APP_ACCEPTED event to the corresponding 
RMAppAttemptImpl. Without the APP_ACCEPTED event, RMAppAttemptImpl will not enter 
the SCHEDULED state, and consequently will not go through 
AMContainerAllocatedTransition to ALLOCATED_SAVING. Therefore, this case is impossible.

2. The application has already been removed from FiFoScheduler#applications. To 
trigger the removal operation, the corresponding RMAppAttemptImpl needs to go 
through BaseFinalTransition. 

It is worth mentioning first that RMAppAttemptImpl's transitions are executed 
on the AsyncDispatcher thread, while YarnScheduler#handle is invoked on the 
SchedulerEventDispatcher thread. The two threads execute in parallel, 
which means that the processing of an RMAppAttemptEvent and that of a 
SchedulerEvent may interleave. However, the processing of two 
RMAppAttemptEvents or of two SchedulerEvents will not.

Therefore, AMContainerAllocatedTransition cannot start until 
RMAppAttemptImpl has already finished BaseFinalTransition. Nevertheless, when 
RMAppAttemptImpl goes through BaseFinalTransition, it also enters a final state, 
such that AMContainerAllocatedTransition will not happen at all. In 
conclusion, this case is impossible as well.

3. The application is in FiFoScheduler#applications, but RMAppAttemptImpl 
doesn't get it. First of all, FiFoScheduler#applications is a TreeMap, which is 
not thread safe (FairScheduler#applications is a HashMap, while 
CapacityScheduler#applications is a ConcurrentHashMap). Second, the methods 
accessing the map are not consistently synchronized, so reads and writes on 
the same map can run simultaneously. RMAppAttemptImpl on the AsyncDispatcher 
thread will eventually call FiFoScheduler#applications#get in 
AMContainerAllocatedTransition, while FiFoScheduler on the 
SchedulerEventDispatcher thread will use FiFoScheduler#applications#add|remove. 
Therefore, getting null when the application actually exists can happen under a 
large number of concurrent operations.

Please feel free to correct me if you think there's something wrong or missing 
in the analysis. I'm going to work on a patch to fix the problem.
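
As one possible direction (not necessarily what the eventual patch will look like), a 
sketch of making the shared applications map safe for the concurrent get on the 
AsyncDispatcher thread and add/remove on the SchedulerEventDispatcher thread, while 
keeping the sorted-map behavior of the existing TreeMap:

{code:java}
// Sketch only: a concurrent sorted map removes the unsynchronized-TreeMap race
// described above; the real fix may instead synchronize the accessor methods.
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

class ApplicationsMapSketch<K extends Comparable<K>, V> {

  private final ConcurrentMap<K, V> applications =
      new ConcurrentSkipListMap<K, V>();

  void addApplication(K appAttemptId, V app) {      // scheduler thread
    applications.put(appAttemptId, app);
  }

  void removeApplication(K appAttemptId) {          // scheduler thread
    applications.remove(appAttemptId);
  }

  V getApplication(K appAttemptId) {                // dispatcher thread
    return applications.get(appAttemptId);          // safe during concurrent writes
  }
}
{code}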

 ResourceManager throws ArrayIndexOutOfBoundsException while handling 
 CONTAINER_ALLOCATED for application attempt
 

 Key: YARN-292
 URL: https://issues.apache.org/jira/browse/YARN-292
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
Reporter: Devaraj K
Assignee: Zhijie Shen

 {code:xml}
 2012-12-26 08:41:15,030 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
 Calling allocate on removed or non existant application 
 appattempt_1356385141279_49525_01
 2012-12-26 08:41:15,031 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type CONTAINER_ALLOCATED for applicationAttempt 
 application_1356385141279_49525
 java.lang.ArrayIndexOutOfBoundsException: 0
   at java.util.Arrays$ArrayList.get(Arrays.java:3381)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
   at 
 

[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739013#comment-13739013
 ] 

Bikas Saha commented on YARN-1058:
--

It could be that the history service was not properly shut down in the first AM. 
Earlier, the AM would receive a proper reboot command from the RM and would 
shut down properly based on the reboot flag being set. Now the AM is getting an 
exception from the RM and so is not shutting down properly. This should get fixed 
when we refresh the AM-RM token from the saved value.

 Recovery issues on RM Restart with FileSystemRMStateStore
 -

 Key: YARN-1058
 URL: https://issues.apache.org/jira/browse/YARN-1058
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 App recovery doesn't work as expected using FileSystemRMStateStore.
 Steps to reproduce:
 - Ran sleep job with a single map and sleep time of 2 mins
 - Restarted RM while the map task is still running
 - The first attempt fails with the following error
 {noformat}
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
  Password not found for ApplicationAttempt 
 appattempt_1376294441253_0001_01
   at org.apache.hadoop.ipc.Client.call(Client.java:1404)
   at org.apache.hadoop.ipc.Client.call(Client.java:1357)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at $Proxy28.finishApplicationMaster(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
 {noformat}
 - The second attempt fails with a different error:
 {noformat}
 Caused by: 
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
  No lease on 
 /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist:
  File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have 
 any open files.
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739027#comment-13739027
 ] 

Alejandro Abdelnur commented on YARN-1055:
--

[~vinodkv], in theory I agree with you. In practice, there are 2 issues Oozie 
cannot address in the short term:

* 1. Oozie is still using a launcher MRAM
* 2. mr/pig/hive/sqoop/distcp/... fat clients which are not aware of Yarn 
restart/recovery.

#1 will be addressed when Oozie implements an OozieLauncherAM instead of 
piggybacking on an MR Map as the driver.
#2 is more complicated and I don't see it being addressed in the 
short/medium term.

By having distinct knobs differentiating recovery after AM failure and after RM 
restart, Oozie can handle/recover jobs in the same set of failure scenarios 
possible with Hadoop 1. In order to get folks onto Yarn we need to provide 
functional parity.

I suggest having the 2 knobs Karthik proposed, {{restart.am.on.rm.restart}} and 
{{restart.am.on.on.failure}}, with 
{{restart.am.on.rm.restart=$restart.am.on.am.failure}}. 

Does this sound reasonable?

 Handle app recovery differently for AM failures and RM restart
 --

 Key: YARN-1055
 URL: https://issues.apache.org/jira/browse/YARN-1055
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla

 Ideally, we would like to tolerate container, AM, RM failures. App recovery 
 for AM and RM currently relies on the max-attempts config; tolerating AM 
 failures requires it to be > 1 and tolerating RM failure/restart requires it 
 to be >= 1.
 We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739040#comment-13739040
 ] 

Vinod Kumar Vavilapalli commented on YARN-1055:
---

This is an issue entirely new with Hadoop 2 - we've added new failure 
conditions. Having all apps handle AM restarts is really the right way forward, 
given that AMs can now run on arbitrary compute nodes that can fail at any time. 
Offline, I started engaging some of the Pig/Hive community folks. For MR, enough 
work is already done. Oozie needs to follow suit too.

Till work-preserving restart is finished, this is a real pain on RM restarts, 
which is why I am proposing that Oozie set max-attempts to 1 for its launcher 
action so that there are no split-brain issues - RM restart or otherwise. Oozie 
has a retry mechanism anyway, which will then submit a new application.

Adding a separate knob just for restart is a hack I don't see any value in. If 
I read your proposal correctly, for launcher jobs, you will set 
restart.am.on.rm.restart to 1 and restart.am.on.on.failure > 1. Right? That 
is not correct, as I repeated - node failures will cause the same split-brain 
issues.

 Handle app recovery differently for AM failures and RM restart
 --

 Key: YARN-1055
 URL: https://issues.apache.org/jira/browse/YARN-1055
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla

 Ideally, we would like to tolerate container, AM, RM failures. App recovery 
 for AM and RM currently relies on the max-attempts config; tolerating AM 
 failures requires it to be > 1 and tolerating RM failure/restart requires it 
 to be >= 1.
 We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739073#comment-13739073
 ] 

Bikas Saha commented on YARN-1055:
--

Restart on AM failure is already determined by the default value of max AM 
retries in the yarn config. Setting that to 1 will prevent the RM from restarting 
AMs on failure, so there is no need for a new config. Restart after RM restart is 
already covered by the app client setting max AM retries to 1 on app submission. If 
an app cannot handle this situation, it should create its own config and set the 
correct value of 1 on submission. YARN should not add a config, IMO. If I 
remember right, this config was imported from Hadoop 1, and the implementation of 
this config in Hadoop 1 is what the RM already does to handle user-defined max AM 
retries.
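
For reference, a hedged sketch of the per-app opt-out described above, assuming the 
2.1-era ApplicationSubmissionContext.setMaxAppAttempts API; the rest of the submission 
setup (AM container spec, resources, queue) is elided.

{code:java}
// Sketch, not Oozie code: a launcher-style client caps its own attempts at 1 so
// the app is neither restarted on AM failure nor recovered after RM restart.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class SingleAttemptSubmitter {

  /** Configure and submit an application that runs exactly one attempt. */
  public static void submitWithSingleAttempt(YarnClient client) throws Exception {
    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    // One attempt only: avoids the split-brain case for launcher-style jobs.
    ctx.setMaxAppAttempts(1);
    // ... set the AM container launch context, resource, and queue as usual ...
    client.submitApplication(ctx);
  }
}
{code}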

 Handle app recovery differently for AM failures and RM restart
 --

 Key: YARN-1055
 URL: https://issues.apache.org/jira/browse/YARN-1055
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla

 Ideally, we would like to tolerate container, AM, RM failures. App recovery 
 for AM and RM currently relies on the max-attempts config; tolerating AM 
 failures requires it to be > 1 and tolerating RM failure/restart requires it 
 to be >= 1.
 We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

2013-08-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739109#comment-13739109
 ] 

Karthik Kambatla commented on YARN-1055:


From a YARN-user POV, I see it differently. I want to control separately whether my 
app should be recovered on AM failures and on RM failures. I might want to recover on 
RM restart but not on AM failures, or vice versa:
# In case of an AM failure, a user might want to check for user errors and hence not 
recover, but still recover in case of RM failures.
# Like Oozie, one might want to recover on AM failures but not on RM failures.

Also, is there a disadvantage to having two knobs for the two failures?

 Handle app recovery differently for AM failures and RM restart
 --

 Key: YARN-1055
 URL: https://issues.apache.org/jira/browse/YARN-1055
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla

 Ideally, we would like to tolerate container, AM, RM failures. App recovery 
 for AM and RM currently relies on the max-attempts config; tolerating AM 
 failures requires it to be > 1 and tolerating RM failure/restart requires it 
 to be >= 1.
 We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739178#comment-13739178
 ] 

Rohith Sharma K S commented on YARN-1061:
-

I hit the actual issue on a 5-node cluster (1 RM and 5 NMs). It is hard to reproduce 
the scenario where the ResourceManager is in a hung state on a real cluster.

The same scenario can be simulated by manually bringing the ResourceManager into a 
hung state with the Linux command KILL -STOP <RM_PID>. All the NM-RM calls then 
wait indefinitely. Another case where we can observe an indefinite wait is adding a 
new NodeManager while the ResourceManager is in a hung state.
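
Purely as an illustration (not the actual NodeStatusUpdater code), a plain-JDK sketch 
of bounding a blocking heartbeat-style call so that a stopped RM surfaces as a 
TimeoutException instead of an indefinite wait:

{code:java}
// Illustration only: wrap a blocking call in a Future and bound the wait.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedHeartbeat {

  public static <T> T callWithTimeout(Callable<T> heartbeat, long timeoutMs)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<T> future = pool.submit(heartbeat);
      try {
        return future.get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        future.cancel(true);   // give up on the hung call
        throw e;               // caller can retry or mark the RM unreachable
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}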



 NodeManager is indefinitely waiting for nodeHeartBeat() response from 
 ResouceManager.
 -

 Key: YARN-1061
 URL: https://issues.apache.org/jira/browse/YARN-1061
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith Sharma K S

 It is observed that in one of the scenarios, the NodeManager is indefinitely 
 waiting for the nodeHeartbeat response from the ResourceManager, where the 
 ResourceManager is in a hung state.
 NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-451) Add more metrics to RM page

2013-08-13 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739213#comment-13739213
 ] 

Sangjin Lee commented on YARN-451:
--

I think showing this information on the app list page is actually more valuable 
than the per-app page. If this information is present in the app list page, one 
can quickly scan the list and get a sense of which job/app is bigger than 
others in terms of resource consumption. Also, it makes sorting possible.

One could in theory visit individual per-app pages one by one to get the same 
information, but it's so much more useful to have it ready at the overview page 
so one can get that information quickly.

In hadoop 1.0, one could get the same information by looking at the number of 
total mappers and reducers. That way, we got a very good idea on which ones are 
big jobs (and thus need to be monitored more closely) without drilling into any 
of the apps.

 Add more metrics to RM page
 ---

 Key: YARN-451
 URL: https://issues.apache.org/jira/browse/YARN-451
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

 ResourceManager webUI shows list of RUNNING applications, but it does not 
 tell which applications are requesting more resource compared to others. With 
 cluster running hundreds of applications at once it would be useful to have 
 some kind of metric to show high-resource usage applications vs low-resource 
 usage ones. At the minimum showing number of containers is good option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739216#comment-13739216
 ] 

Sandy Ryza commented on YARN-1060:
--

Committed to trunk and branch-2.  Thanks Niranjan!

 Two tests in TestFairScheduler are missing @Test annotation
 ---

 Key: YARN-1060
 URL: https://issues.apache.org/jira/browse/YARN-1060
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza
Assignee: Niranjan Singh
  Labels: newbie
 Attachments: YARN-1060.patch


 Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation

2013-08-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739218#comment-13739218
 ] 

Hudson commented on YARN-1060:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4256 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4256/])
YARN-1060. Two tests in TestFairScheduler are missing @Test annotation 
(Niranjan Singh via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1513724)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


 Two tests in TestFairScheduler are missing @Test annotation
 ---

 Key: YARN-1060
 URL: https://issues.apache.org/jira/browse/YARN-1060
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza
Assignee: Niranjan Singh
  Labels: newbie
 Fix For: 2.3.0

 Attachments: YARN-1060.patch


 Amazingly, these tests appear to pass with the annotations added.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager

2013-08-13 Thread prophy Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739219#comment-13739219
 ] 

prophy Yan commented on YARN-993:
-

Jian He, I have tried the patch file in the YARN-513 list, but some errors occur 
when I use the patch. My test version is hadoop-2.0.5-alpha, so can this patch 
work with this version? Thank you.

 job can not recovery after restart resourcemanager
 --

 Key: YARN-993
 URL: https://issues.apache.org/jira/browse/YARN-993
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
 Environment: CentOS5.3 JDK1.7.0_11
Reporter: prophy Yan
Priority: Critical

 Recently, I have tested the job recovery function in the YARN framework, but it 
 failed.
 First, I ran the wordcount example program, and then I killed (kill -9) the 
 resourcemanager process on the server when the wordcount job was at map 100%; 
 the job exits with an error within minutes.
 Second, I restarted the resourcemanager on the server with the 
 'start-yarn.sh' command, but the failed job (wordcount) cannot continue; 
 the YARN log says the file does not exist!
 Here is the YARN log:
 013-07-23 16:05:21,472 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
 launching container Container: [ContainerId: 
 container_1374564764970_0001_02_01, NodeId: mv8.mzhen.cn:52117, 
 NodeHttpAddress: mv8.mzhen.cn:8042, Resource: memory:2048, vCores:1, 
 Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id 
 {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 
 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_02
 2013-07-23 16:05:21,473 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1374564764970_0001_02 State change from ALLOCATED to LAUNCHED
 2013-07-23 16:05:21,925 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1374564764970_0001_02 State change from LAUNCHED to FAILED
 2013-07-23 16:05:21,925 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
 application_1374564764970_0001 failed 1 times due to AM Container for 
 appattempt_1374564764970_0001_02 exited with  exitCode: -1000 due to: 
 RemoteTrace:
 java.io.FileNotFoundException: File does not exist: 
 hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
 at 
 org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
 at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
 at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
  at LocalTrace:
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 File does not exist: 
 hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
 at 
 org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
 at 
 org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
 at 
 

[jira] [Commented] (YARN-451) Add more metrics to RM page

2013-08-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739225#comment-13739225
 ] 

Vinod Kumar Vavilapalli commented on YARN-451:
--

Agreed about having it on the listing page, but that page is already dense. 
We'll have to do some basic UI design.

Again, like I mentioned, Hadoop-1 was different in that the number of maps and 
reduces doesn't change after the job starts, whereas in Hadoop-2 the memory/cores 
allocated slowly increase over time, so it may or may not be of much use. I am 
ambivalent about adding it.

 Add more metrics to RM page
 ---

 Key: YARN-451
 URL: https://issues.apache.org/jira/browse/YARN-451
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

 ResourceManager webUI shows list of RUNNING applications, but it does not 
 tell which applications are requesting more resource compared to others. With 
 cluster running hundreds of applications at once it would be useful to have 
 some kind of metric to show high-resource usage applications vs low-resource 
 usage ones. At the minimum showing number of containers is good option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira