[jira] [Created] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation
Sandy Ryza created YARN-1060: Summary: Two tests in TestFairScheduler are missing @Test annotation Key: YARN-1060 URL: https://issues.apache.org/jira/browse/YARN-1060 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Amazingly, these tests appear to pass with the annotations added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation
[ https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Singh reassigned YARN-1060: Assignee: Niranjan Singh Two tests in TestFairScheduler are missing @Test annotation --- Key: YARN-1060 URL: https://issues.apache.org/jira/browse/YARN-1060 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Niranjan Singh Labels: newbie Amazingly, these tests appear to pass with the annotations added.
[jira] [Updated] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation
[ https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Singh updated YARN-1060: - Attachment: YARN-1060.patch Added @Test annotations Two tests in TestFairScheduler are missing @Test annotation --- Key: YARN-1060 URL: https://issues.apache.org/jira/browse/YARN-1060 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Niranjan Singh Labels: newbie Attachments: YARN-1060.patch Amazingly, these tests appear to pass with the annotations added.
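For readers unfamiliar with this failure mode: a test the runner cannot see "passes" silently by never running at all, which is why a missing annotation goes unnoticed. The actual fix is adding JUnit's @Test in Java; the same effect can be demonstrated with Python's unittest, where discovery is by the test_ name prefix (the class and method names below are illustrative):

```python
# Hedged analogy in Python's unittest (the real fix adds JUnit's @Test in
# Java): a check the runner cannot see is silently skipped, so the suite
# passes while the assertions inside it never execute.
import unittest

class TestFairSchedulerAnalogy(unittest.TestCase):
    def test_visible(self):          # discovered: name starts with "test"
        self.assertTrue(True)

    def forgotten_check(self):       # not discovered: never runs at all
        raise AssertionError("this failure would go unnoticed")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFairSchedulerAnalogy)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)               # 1: the unmarked method was ignored
```

This is also why "amazingly, these tests appear to pass with the annotations added" is worth noting: until the annotation exists, the assertions were never exercised.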
[jira] [Created] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
Rohith Sharma K S created YARN-1061: --- Summary: NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager. Key: YARN-1061 URL: https://issues.apache.org/jira/browse/YARN-1061 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Reporter: Rohith Sharma K S It is observed that, in one scenario, the NodeManager waits indefinitely for a nodeHeartbeat response from the ResourceManager when the ResourceManager is hung. The NodeManager should get a timeout exception instead of waiting indefinitely.
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737990#comment-13737990 ] Rohith Sharma K S commented on YARN-1061: - The thread dump extracted from the NodeManager is
{noformat}
Node Status Updater prio=10 tid=0x414dc000 nid=0x1d754 in Object.wait() [0x7fefa2dec000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:1231)
        - locked <0xdef4f158> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy28.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy30.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:348)
{noformat}
NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager. - Key: YARN-1061 URL: https://issues.apache.org/jira/browse/YARN-1061 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Reporter: Rohith Sharma K S It is observed that, in one scenario, the NodeManager waits indefinitely for a nodeHeartbeat response from the ResourceManager when the ResourceManager is hung. The NodeManager should get a timeout exception instead of waiting indefinitely.
[jira] [Commented] (YARN-1059) IllegalArgumentException while starting YARN
[ https://issues.apache.org/jira/browse/YARN-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738002#comment-13738002 ] rvller commented on YARN-1059: -- It was my fault: I had changed only one parameter to a single line in the XML, which is why the RM was not able to start. When I changed all of the parameters to one line ({{<value>10.245.1.30:9030</value>}} and so on) the RM was able to start. I suppose that this is a bug, because it's confusing. IllegalArgumentException while starting YARN Key: YARN-1059 URL: https://issues.apache.org/jira/browse/YARN-1059 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Environment: Ubuntu 12.04, hadoop 2.0.5 Reporter: rvller Here is the traceback while starting the yarn resource manager:
2013-08-12 12:53:29,319 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: 10.245.1.30:9030 (configuration property 'yarn.resourcemanager.resource-tracker.address')
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:193)
        at org.apache.hadoop.conf.Configuration.getSocketAddr(Configuration.java:1450)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.init(ResourceTrackerService.java:105)
        at org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.init(ResourceManager.java:255)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:710)
And here is the yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>
      10.245.1.30:9010
    </value>
    <description></description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>
      10.245.1.30:9020
    </value>
    <description></description>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>
      10.245.1.30:9030
    </value>
    <description></description>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>
      10.245.1.30:9040
    </value>
    <description></description>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>
      10.245.1.30:9050
    </value>
    <description></description>
  </property>
  <!-- Site specific YARN configuration properties -->
</configuration>
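The failure above can be reproduced without Hadoop: strict host:port validation of the kind NetUtils.createSocketAddr performs rejects a value that arrives with the newlines and indentation of a multi-line <value> element. A minimal Python sketch (the parser below is an illustration of that strictness, not Hadoop's actual implementation):

```python
# Hedged sketch: mimics (does not reproduce) the strict host:port validation
# that Hadoop's NetUtils.createSocketAddr applies to configuration values.
def parse_host_port(raw):
    # The XML text node is read verbatim, so a <value> split across lines
    # arrives with the surrounding newlines and indentation intact.
    if raw != raw.strip() or any(ch.isspace() for ch in raw.strip()):
        raise ValueError("Does not contain a valid host:port authority: %r" % raw)
    host, sep, port = raw.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("Does not contain a valid host:port authority: %r" % raw)
    return host, int(port)

parse_host_port("10.245.1.30:9030")            # fine: ("10.245.1.30", 9030)
try:
    parse_host_port("\n  10.245.1.30:9030\n")  # value split across XML lines
except ValueError as e:
    print("rejected:", e)
```

Trimming the value before parsing (or keeping <value> on one line, as the reporter ended up doing) avoids the error, which is why the commenter calls the behavior confusing rather than wrong.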
[jira] [Commented] (YARN-1008) MiniYARNCluster with multiple nodemanagers, all nodes have same key for allocations
[ https://issues.apache.org/jira/browse/YARN-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738175#comment-13738175 ] Alejandro Abdelnur commented on YARN-1008: -- What the patch does: Introduces a new configuration property in the RM, {{RM_SCHEDULER_USE_PORT_FOR_NODE_NAME}} (defaulting to {{false}}). This is an RM property, but it is to be used by the scheduler implementations when matching a resource request to a node. If the property is set to {{false}}, things work as today: the matching is done using only the HOSTNAME obtained from the NodeId. If the property is set to {{true}}, the matching is done using the HOSTNAME:PORT obtained from the NodeId. There are no changes on the NM or AM side. If the property is set to {{true}}, the AM must be aware of that setting and, when creating resource requests, it must set the location as the HOSTNAME:PORT of the node instead of just the HOSTNAME. The renaming of {{SchedulerNode#getHostName()}} to {{SchedulerNode#getNodeName()}} is to make it obvious to developers that it may not be the HOSTNAME. Added javadocs explaining this clearly. This works with all 3 schedulers. The main motivation for this change is to be able to use the yarn minicluster with multiple NMs and be able to target a specific NM instance. We could expose this for production use if there is a need. For that we would need to: * Expose via ApplicationReport the node matching mode: HOSTNAME or HOSTNAME:PORT * Provide a mechanism for AMs only aware of HOSTNAME matching mode to work with HOSTNAME:PORT mode If there is a use case for this in a real deployment, we should follow up with another JIRA.
MiniYARNCluster with multiple nodemanagers, all nodes have same key for allocations --- Key: YARN-1008 URL: https://issues.apache.org/jira/browse/YARN-1008 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Attachments: YARN-1008.patch, YARN-1008.patch, YARN-1008.patch While the NMs are keyed using the NodeId, the allocation is done based on the hostname. This makes the different nodes indistinguishable to the scheduler. There should be an option to enable host:port instead of just host for allocations. The nodes reported to the AM should report the 'key' (host or host:port).
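The keying problem the patch addresses can be sketched without YARN: when several simulated NodeManagers share one hostname, as in MiniYARNCluster, a map keyed by hostname alone collapses them, while keying by hostname:port keeps each node addressable (the hostnames and ports below are made up for illustration):

```python
# Hedged sketch of the matching problem: several simulated NMs on one host.
nodes = [("localhost", 50001), ("localhost", 50002), ("localhost", 50003)]

by_hostname = {host: port for host, port in nodes}                 # collides
by_node_name = {"%s:%d" % (host, port): port for host, port in nodes}

print(len(by_hostname))    # 1: the scheduler cannot tell the NMs apart
print(len(by_node_name))   # 3: each NM gets a distinct scheduling key
```

This is exactly the distinction behind renaming getHostName() to getNodeName(): the scheduling key may be either form depending on the new property.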
[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker
[ https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738337#comment-13738337 ] Jason Lowe commented on YARN-1036: -- Agree with Ravi that we should focus on porting the change to 0.23 and fix any issues that also apply to trunk/branch-2 in a separate JIRA. Therefore I agree with Omkar that we should simply break or omit the LOCALIZED case from the switch statement since 0.23 doesn't have localCacheDirectoryManager to match the trunk behavior. Otherwise patch looks good to me. Distributed Cache gives inconsistent result if cache files get deleted from task tracker - Key: YARN-1036 URL: https://issues.apache.org/jira/browse/YARN-1036 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.9 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because that one had been closed.
[jira] [Created] (YARN-1062) MRAppMaster takes a long time to init taskAttempt
shenhong created YARN-1062: -- Summary: MRAppMaster takes a long time to init taskAttempt Key: YARN-1062 URL: https://issues.apache.org/jira/browse/YARN-1062 Project: Hadoop YARN Issue Type: Bug Components: applications Affects Versions: 0.23.6 Reporter: shenhong In our cluster, MRAppMaster takes a long time to init taskAttempts; the following log lasted one minute:
2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to /r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to UNASSIGNED
The reason is that resolving one host to a rack takes almost 25 ms; our HDFS cluster has more than 4000 datanodes, so a job with a large input will take a long time to init TaskAttempts. Is there any good idea to solve this problem?
[jira] [Updated] (YARN-1062) MRAppMaster takes a long time to init taskAttempt
[ https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenhong updated YARN-1062: --- Description: In our cluster, MRAppMaster takes a long time to init taskAttempts; the following log lasted one minute:
2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to /r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to /r03b02
2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to UNASSIGNED
The reason is that resolving one host to a rack takes almost 25 ms (we resolve the host to rack with a Python script). Our HDFS cluster has more than 4000 datanodes, so a job with a large input will take a long time to init TaskAttempts. Is there any good idea to solve this problem?
was: In our cluster, MRAppMaster takes a long time to init taskAttempts; the following log lasted one minute:
2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to /r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to UNASSIGNED
The reason is that resolving one host to a rack takes almost 25 ms; our HDFS cluster has more than 4000 datanodes, so a job with a large input will take a long time to init TaskAttempts. Is there any good idea to solve this problem?
MRAppMaster takes a long time to init taskAttempt Key: YARN-1062 URL: https://issues.apache.org/jira/browse/YARN-1062 Project: Hadoop YARN Issue Type: Bug Components: applications Affects Versions: 0.23.6 Reporter: shenhong In our cluster, MRAppMaster takes a long time to init taskAttempts; the following log lasted one minute:
2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to /r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to /r03b02
2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to UNASSIGNED
The reason is that resolving one host to a rack takes almost 25 ms (we resolve the host to rack with a Python script). Our HDFS cluster has more than 4000 datanodes, so a job with a large input will take a long time to init TaskAttempts. Is there any good idea to solve this problem?
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738413#comment-13738413 ] Vinod Kumar Vavilapalli commented on YARN-1056: --- Config changes ARE API changes. If you wish to rename it now, mark it as a blocker and let the release manager know. Otherwise, you should deprecate this config, add a new one, and wait for the next release. I'm okay either way. Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} Key: YARN-1056 URL: https://issues.apache.org/jira/browse/YARN-1056 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Karthik Kambatla Labels: conf Attachments: yarn-1056-1.patch Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} to have *resourcemanager* only once, make them consistent with other such yarn configs, and add entries in yarn-default.xml
[jira] [Commented] (YARN-1062) MRAppMaster takes a long time to init taskAttempt
[ https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738417#comment-13738417 ] Vinod Kumar Vavilapalli commented on YARN-1062: --- You should definitely see if you can improve your Python script by looking up a static resolution file instead of dynamically pinging DNS at run time. That'll clearly improve your performance. Overall, we wish to expose this information to AMs from the RM itself so that each AM doesn't need to do this on its own. That is tracked via YARN-435. If that's okay with you, please close this as a duplicate. MRAppMaster takes a long time to init taskAttempt Key: YARN-1062 URL: https://issues.apache.org/jira/browse/YARN-1062 Project: Hadoop YARN Issue Type: Bug Components: applications Affects Versions: 0.23.6 Reporter: shenhong In our cluster, MRAppMaster takes a long time to init taskAttempts; the following log lasted one minute:
2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to /r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to /r03b02
2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to UNASSIGNED
The reason is that resolving one host to a rack takes almost 25 ms (we resolve the host to rack with a Python script). Our HDFS cluster has more than 4000 datanodes, so a job with a large input will take a long time to init TaskAttempts. Is there any good idea to solve this problem?
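Vinod's suggestion of a static resolution file can be sketched as a topology script that loads a precomputed host-to-rack table once instead of touching DNS for every host. The file name, its "host rack" line format, and the default rack below are illustrative assumptions, not Hadoop requirements:

```python
# Hedged sketch of a Hadoop-style topology script that resolves hosts from a
# precomputed mapping file instead of querying DNS on every call.
import sys

def load_rack_map(path):
    # Assumed file format: one "hostname /rack" pair per line.
    rack_map = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                host, rack = parts
                rack_map[host] = rack
    return rack_map

def resolve(hosts, rack_map, default="/default-rack"):
    # A topology script receives host arguments and must emit one rack per line.
    return [rack_map.get(h, default) for h in hosts]

if __name__ == "__main__" and len(sys.argv) > 1:
    print("\n".join(resolve(sys.argv[1:], load_rack_map("topology.data"))))
```

Since the table is read from local disk, each lookup is a dictionary hit rather than a ~25 ms resolution, which is the improvement the comment suggests.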
[jira] [Updated] (YARN-1023) [YARN-321] Webservices REST API's support for Application History
[ https://issues.apache.org/jira/browse/YARN-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1023: -- Summary: [YARN-321] Webservices REST API's support for Application History (was: [YARN-321] Weservices REST API's support for Application History) [YARN-321] Webservices REST API's support for Application History - Key: YARN-1023 URL: https://issues.apache.org/jira/browse/YARN-1023 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: YARN-321 Reporter: Devaraj K Assignee: Devaraj K Attachments: YARN-1023-v0.patch
[jira] [Updated] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker
[ https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-1036: --- Attachment: YARN-1036.branch-0.23.patch Thanks Jason and Omkar for your comments. Ok. Here is the updated patch which has src/main code exactly like Omkar suggested. I've tested it by using a pendrive to simulate drive failure, and the file is indeed localized again. Distributed Cache gives inconsistent result if cache files get deleted from task tracker - Key: YARN-1036 URL: https://issues.apache.org/jira/browse/YARN-1036 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.9 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because that one had been closed.
[jira] [Commented] (YARN-979) [YARN-321] Adding application attempt and container to ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738519#comment-13738519 ] Zhijie Shen commented on YARN-979: -- Some high-level comments on the patch: 1. To make the protocol work, it is required to define the corresponding protos in yarn_service.proto and update application_history_service.proto 2. The setters of the request/response APIs should be @Public, shouldn't they? 3. It's required to mark ApplicationHistoryProtocol as well. [YARN-321] Adding application attempt and container to ApplicationHistoryProtocol - Key: YARN-979 URL: https://issues.apache.org/jira/browse/YARN-979 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-979-1.patch Adding application attempt and container to ApplicationHistoryProtocol Thanks, Mayank
[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1056: --- Attachment: yarn-1056-1.patch Reuploading patch to kick Jenkins. Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} Key: YARN-1056 URL: https://issues.apache.org/jira/browse/YARN-1056 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Labels: conf Attachments: yarn-1056-1.patch, yarn-1056-1.patch Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} to have a *resourcemanager* only once, make them consistent with other such yarn configs and add entries in yarn-default.xml
[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1056: --- Target Version/s: 2.1.0-beta (was: 2.1.1-beta) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} Key: YARN-1056 URL: https://issues.apache.org/jira/browse/YARN-1056 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Labels: conf Attachments: yarn-1056-1.patch, yarn-1056-1.patch Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} to have a *resourcemanager* only once, make them consistent with other such yarn configs and add entries in yarn-default.xml
[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker
[ https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738541#comment-13738541 ] Omkar Vinit Joshi commented on YARN-1036: - +1 ... thanks for updating the patch..lgtm. Distributed Cache gives inconsistent result if cache files get deleted from task tracker - Key: YARN-1036 URL: https://issues.apache.org/jira/browse/YARN-1036 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.9 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because that one had been closed.
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738560#comment-13738560 ] Arun C Murthy commented on YARN-1056: - Looks fine, I'll commit after jenkins. Thanks [~kkambatl]. Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} Key: YARN-1056 URL: https://issues.apache.org/jira/browse/YARN-1056 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Labels: conf Attachments: yarn-1056-1.patch, yarn-1056-1.patch Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs} to have a *resourcemanager* only once, make them consistent with other such yarn configs and add entries in yarn-default.xml
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738556#comment-13738556 ] Wei Yan commented on YARN-1021: --- Updates in the patch: reduce the number of threads needed for NMSimulators. Before, each NMSimulator used one thread (for its AsyncDispatcher). The AsyncDispatcher is now removed, and the total number of threads needed depends only on the thread pool size. Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm well before we deploy it in a production cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm would work for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine.
This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with reasonable amount of confidence, there-by aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AMs heartbeat events from within the same JVM. To keep tracking of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real time metrics while executing, including: * Resource usages for whole cluster and each queue, which can be utilized to configure cluster and queue's capacity. * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs turn around time, throughput, fairness, capacity guarantee, etc). * Several key metrics of scheduler algorithm, such as time cost of each scheduler operation (allocate, handle, etc), which can be utilized by Hadoop developers to find the code spots and scalability limits. The simulator will provide real time charts showing the behavior of the scheduler and its performance. A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use simulator to simulate Fair Scheduler and Capacity Scheduler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
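Wei Yan's comment above, about removing each NMSimulator's per-node AsyncDispatcher thread, boils down to driving all simulated node heartbeats from one shared event queue so the thread count is bounded by the pool size rather than the cluster size. The sketch below illustrates that idea only; the class and method names are hypothetical and not taken from the YARN-1021 patch:

```python
import heapq

class HeartbeatSimulator:
    """Illustrative sketch: all simulated NodeManagers share one min-heap of
    (next_fire_time_ms, node_id) events instead of one dispatcher thread each,
    so a worker pool of fixed size can drain the heap for any cluster size."""

    def __init__(self, heartbeat_interval_ms=1000):
        self.interval = heartbeat_interval_ms
        self.queue = []  # min-heap of (fire_time_ms, node_id)

    def register(self, node_id, start_ms=0):
        heapq.heappush(self.queue, (start_ms, node_id))

    def run_until(self, end_ms):
        """Fire every heartbeat due before end_ms, rescheduling each node."""
        fired = []
        while self.queue and self.queue[0][0] < end_ms:
            t, node = heapq.heappop(self.queue)
            fired.append((t, node))  # a real simulator would send the NM heartbeat here
            heapq.heappush(self.queue, (t + self.interval, node))
        return fired
```

With 5000 registered nodes this uses one heap and however many pool workers drain it, rather than 5000 dispatcher threads, which matches the scaling behavior described in the comment.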
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738570#comment-13738570 ]

Hadoop QA commented on YARN-1021:
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12597774/YARN-1021.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-assemblies hadoop-tools/hadoop-sls hadoop-tools/hadoop-tools-dist.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1705//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1705//console
This message is automatically generated.
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738572#comment-13738572 ]

Jian He commented on YARN-1056:
Hi [~kkambatl], do you think it's also necessary to change 'yarn.resourcemanager.fs.rm-state-store.uri' to 'yarn.resourcemanager.fs.state-store.uri' for consistency with 'zk.state-store'?
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738582#comment-13738582 ]

Hadoop QA commented on YARN-1056:
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12597773/yarn-1056-1.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1706//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1706//console
This message is automatically generated.
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738595#comment-13738595 ]

Karthik Kambatla commented on YARN-1056:
[~jianhe], good point. Let me upload a patch including that change shortly.
[jira] [Commented] (YARN-1062) MRAppMaster take a long time to init taskAttempt
[ https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738596#comment-13738596 ]

shenhong commented on YARN-1062:
Thanks Vinod Kumar Vavilapalli, I think YARN-435 is okay for me.

MRAppMaster take a long time to init taskAttempt
Key: YARN-1062
URL: https://issues.apache.org/jira/browse/YARN-1062
Project: Hadoop YARN
Issue Type: Bug
Components: applications
Affects Versions: 0.23.6
Reporter: shenhong

In our cluster, the MRAppMaster takes a long time to init taskAttempts; logging like the following lasts about one minute:
2013-07-17 11:28:06,328 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11012.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,357 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r01f11004.yh.aliyun.com to /r01f11
2013-07-17 11:28:06,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b05042.yh.aliyun.com to /r03b05
2013-07-17 11:28:06,384 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_00_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2013-07-17 11:28:06,415 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r03b02006.yh.aliyun.com to /r03b02
2013-07-17 11:28:06,436 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02045.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved r02f02034.yh.aliyun.com to /r02f02
2013-07-17 11:28:06,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1373523419753_4543_m_01_0 TaskAttempt Transitioned from NEW to UNASSIGNED

The reason is that resolving one host to its rack takes almost 25 ms (we resolve hosts to racks with a Python script). Our HDFS cluster has more than 4000 datanodes, so a job with large input will take a long time to init TaskAttempts. Is there any good idea to solve this problem?
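The 25 ms-per-host cost described above comes from forking the topology script once per resolution. A common mitigation (roughly what Hadoop's CachedDNSToSwitchMapping does for topology scripts) is to cache host-to-rack results so the script runs at most once per distinct host. The class below is a hypothetical sketch of that idea, not code from Hadoop; the injected `resolver` callable stands in for the external Python script:

```python
class CachingRackResolver:
    """Caches host -> rack mappings so the underlying resolver (e.g. an
    external topology script costing ~25 ms per fork) is invoked at most
    once per distinct host."""

    def __init__(self, resolver):
        self.resolver = resolver  # callable: host -> rack path
        self.cache = {}
        self.calls = 0            # counts actual resolver invocations

    def resolve(self, host):
        if host not in self.cache:
            self.calls += 1
            self.cache[host] = self.resolver(host)
        return self.cache[host]
```

With 4000 distinct hosts but many task attempts per host, this bounds script invocations at one per host rather than one per attempt; the YARN-435 discussion referenced above instead aims to make topology information available to the AM without separate local resolution.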
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738601#comment-13738601 ]

Omkar Vinit Joshi commented on YARN-1061:
Are you able to reproduce this scenario? Can you please enable DEBUG (HADOOP_ROOT_LOGGER, YARN_ROOT_LOGGER) logs and attach them to this jira? How big is your cluster? What is the frequency at which the nodemanagers are heartbeating? Can you also attach yarn-site.xml? Which version are you using?

NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
Key: YARN-1061
URL: https://issues.apache.org/jira/browse/YARN-1061
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith Sharma K S

It is observed that in one scenario the NodeManager waits indefinitely for the nodeHeartbeat response from the ResourceManager when the ResourceManager is in a hung state. The NodeManager should get a timeout exception instead of waiting indefinitely.
[jira] [Resolved] (YARN-1062) MRAppMaster take a long time to init taskAttempt
[ https://issues.apache.org/jira/browse/YARN-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shenhong resolved YARN-1062.
Resolution: Duplicate

MRAppMaster take a long time to init taskAttempt
Key: YARN-1062
URL: https://issues.apache.org/jira/browse/YARN-1062
[jira] [Updated] (YARN-1030) Adding AHS as service of RM
[ https://issues.apache.org/jira/browse/YARN-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-1030:
Attachment: YARN-1030.2.patch

Thanks [~devaraj.k] for your review. I've updated the patch according to your comments. If YARN-953 is committed first, I'll remove the change in pom.xml in this patch. For now I'm keeping the change so as not to break the build.

Adding AHS as service of RM
Key: YARN-1030
URL: https://issues.apache.org/jira/browse/YARN-1030
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Attachments: YARN-1030.1.patch, YARN-1030.2.patch
[jira] [Commented] (YARN-1030) Adding AHS as service of RM
[ https://issues.apache.org/jira/browse/YARN-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738622#comment-13738622 ]

Hadoop QA commented on YARN-1030:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12597781/YARN-1030.2.patch against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1707//console
This message is automatically generated.
[jira] [Updated] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-1056:
Attachment: yarn-1056-2.patch

Updated the fs config to be fs.state-store.uri instead of fs.rm-state-store.uri.
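For reference, the renamed properties would look roughly like the following yarn-site.xml/yarn-default.xml entries. The exact property names and values below are an assumption based on the rename pattern discussed in this ticket (a single 'resourcemanager' prefix, and 'fs.state-store' instead of 'fs.rm-state-store'); the committed names and defaults may differ:

```xml
<!-- Hypothetical entries illustrating the rename pattern from YARN-1056;
     check yarn-default.xml in the committed release for the final names
     and default values. -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>900000</value>
</property>
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>30000</value>
</property>
<property>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>/tmp/hadoop-yarn/rm-state</value>
</property>
```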
[jira] [Commented] (YARN-435) Make it easier to access cluster topology information in an AM
[ https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738647#comment-13738647 ]

shenhong commented on YARN-435:
Firstly, if the AM gets all nodes in the cluster, including their rack information, by calling the RM, this will increase pressure on the RM's network (for example, when the cluster has more than 5000 datanodes). Secondly, if the yarn cluster only has 100 nodemanagers but the HDFS cluster it accesses has more than 5000 datanodes, we can't get all the nodes, including their rack information, from the RM. However, the AM needs the information for all the datanodes in its job.splitmetainfo file in order to init TaskAttempts. In this case, we can't get all nodes by calling the RM.

Make it easier to access cluster topology information in an AM
Key: YARN-435
URL: https://issues.apache.org/jira/browse/YARN-435
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Omkar Vinit Joshi

ClientRMProtocol exposes a getClusterNodes api that provides a report on all nodes in the cluster including their rack information. However, this requires the AM to open and establish a separate connection to the RM in addition to the one for the AMRMProtocol.
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738651#comment-13738651 ]

Jian He commented on YARN-1056:
Looks good, +1
[jira] [Commented] (YARN-1056) Fix configs yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
[ https://issues.apache.org/jira/browse/YARN-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738686#comment-13738686 ]

Hadoop QA commented on YARN-1056:
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12597782/yarn-1056-2.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1708//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1708//console
This message is automatically generated.
[jira] [Commented] (YARN-1036) Distributed Cache gives inconsistent result if cache files get deleted from task tracker
[ https://issues.apache.org/jira/browse/YARN-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738754#comment-13738754 ]

Jason Lowe commented on YARN-1036:
+1 lgtm as well. Committing this.

Distributed Cache gives inconsistent result if cache files get deleted from task tracker
Key: YARN-1036
URL: https://issues.apache.org/jira/browse/YARN-1036
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 0.23.9
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Attachments: YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch, YARN-1036.branch-0.23.patch

This is a JIRA to backport MAPREDUCE-4342. I had to open a new JIRA because that one had been closed.
[jira] [Commented] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.
[ https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738845#comment-13738845 ]

Omkar Vinit Joshi commented on YARN-573:
+1, lgtm for branch 0.23

Shared data structures in Public Localizer and Private Localizer are not Thread safe.
Key: YARN-573
URL: https://issues.apache.org/jira/browse/YARN-573
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Critical
Fix For: 3.0.0, 2.1.1-beta
Attachments: YARN-573-20130730.1.patch, YARN-573-20130731.1.patch, YARN-573.branch-0.23-08132013.patch

PublicLocalizer: 1) pending is accessed by addResource (part of event handling) and the run method (as part of PublicLocalizer.run()).
PrivateLocalizer: 1) pending is accessed by addResource (part of event handling) and findNextResource (i.remove()). Also, the update method should be fixed; it too shares the pending list.
[jira] [Updated] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.
[ https://issues.apache.org/jira/browse/YARN-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-573:
Fix Version/s: 0.23.10

+1 lgtm as well, thanks Mit and Omkar! I committed this to branch-0.23.
[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore
[ https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738934#comment-13738934 ]

Karthik Kambatla commented on YARN-1058:
I was expecting the first one, and Bikas is right about the second one. When I kill the job client, the job does finish successfully. However, the AM for the recovered attempt fails to write the history.
{noformat}
2013-08-13 13:57:32,440 ERROR [eventHandlingThread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[eventHandlingThread,5,main] threw an Exception. org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hadoop-yarn/staging/kasha/.staging/job_1376427059607_0002/job_1376427059607_0002_2.jhist: File does not exist. Holder DFSClient_NONMAPREDUCE_416024880_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
...
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2037)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$1.run(JobHistoryEventHandler.java:276)
at java.lang.Thread.run(Thread.java:662)
{noformat}

Recovery issues on RM Restart with FileSystemRMStateStore
Key: YARN-1058
URL: https://issues.apache.org/jira/browse/YARN-1058
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

App recovery doesn't work as expected using FileSystemRMStateStore. Steps to reproduce:
- Ran a sleep job with a single map and a sleep time of 2 mins
- Restarted the RM while the map task was still running
- The first attempt fails with the following error
{noformat}
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Password not found for ApplicationAttempt appattempt_1376294441253_0001_01
at org.apache.hadoop.ipc.Client.call(Client.java:1404)
at org.apache.hadoop.ipc.Client.call(Client.java:1357)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy28.finishApplicationMaster(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
{noformat}
- The second attempt fails with a different error:
{noformat}
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist: File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
{noformat}
[jira] [Commented] (YARN-337) RM handles killed application tracking URL poorly
[ https://issues.apache.org/jira/browse/YARN-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738968#comment-13738968 ] Thomas Graves commented on YARN-337: +1 looks good. Thanks Jason! Feel free to commit it. RM handles killed application tracking URL poorly - Key: YARN-337 URL: https://issues.apache.org/jira/browse/YARN-337 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.5 Reporter: Jason Lowe Assignee: Jason Lowe Labels: usability Attachments: YARN-337.patch When the ResourceManager kills an application, it leaves the proxy URL redirecting to the original tracking URL for the application even though the ApplicationMaster is no longer there to service it. It should redirect it somewhere more useful, like the RM's web page for the application, where the user can find that the application was killed and links to the AM logs. In addition, sometimes the AM during teardown from the kill can attempt to unregister and provide an updated tracking URL, but unfortunately the RM has forgotten the AM due to the kill and refuses to process the unregistration. Instead it logs: {noformat} 2013-01-09 17:37:49,671 [IPC Server handler 2 on 8030] ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1357575694478_28614_01 {noformat} It should go ahead and process the unregistration to update the tracking URL since the application offered it.
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738985#comment-13738985 ] Sandy Ryza commented on YARN-1024: -- I've been thinking a lot about this, and wanted to propose a modified approach, inspired by an offline discussion with Arun and his max-vcores idea (https://issues.apache.org/jira/browse/YARN-1024?focusedCommentId=13730074&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13730074). First, my assumptions about how CPUs work:
* A CPU is essentially a bathtub full of processing power that can be doled out to threads, with a limit per thread based on the power of each core within it.
* To give X processing power to a thread means that within a standard unit of time, roughly some number of instructions proportional to X can be executed for that thread.
* No more than a certain amount of processing power (the amount of processing power per core) can be given to each thread.
* We can use CGroups to say that a task gets some fraction of the system's processing power.
* This means that if we have 5 cores with Y processing power each, we can give 5 threads Y processing power each, or 6 threads 5Y/6 processing power each, but we can't give 4 threads 5Y/4 processing power each.
* It never makes sense to use CGroups to assign a higher fraction of the system's processing power than (numthreads the task can take advantage of / number of cores) to a task.
* Equivalently, if my CPU has X processing power per core, it never makes sense to assign more than (numthreads the task can take advantage of) * X processing power to a task.
So as long as we account for that last constraint, we can essentially view processing power as a fluid resource like memory. With this in mind, we can:
1. Split virtual cores into cores and yarnComputeUnitsPerCore. Requests can include both and nodes can be configured with both.
2. Have a cluster-defined maxComputeUnitsPerCore, which would be the smallest yarnComputeUnitsPerCore on any node. We min all yarnComputeUnitsPerCore requests with this number when they hit the RM.
3. Use YCUs, not cores, for scheduling. I.e. the scheduler thinks of a node's CPU capacity in terms of the number of YCUs it can handle and thinks of a resource's CPU request in terms of its (normalized yarnComputeUnitsPerCore * # cores). We use YCUs for DRF.
4. If we make YCUs small enough, there is no need for fractional anything.
This reduces to a number-of-cores-based approach if all containers are requested with yarnComputeUnitsPerCore=infinity, and reduces to a YCU approach if maxComputeUnitsPerCore is set to infinity. Predictability, simplicity, and scheduling flexibility can be traded off per cluster without overloading the same concept with multiple definitions. This doesn't take into account heterogeneous hardware within a cluster, but I think (2) can be tweaked to handle this by holding a value for each node (I can elaborate on how this would work). It also doesn't take into account pinning threads to CPUs, but I don't think it's any less extensible for ultimately dealing with this than other proposals. Sorry for the long-windedness. Bobby, would this provide the flexibility you're looking for? Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC).
Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*
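The arithmetic behind steps 2 and 3 of the proposal above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not YARN code; the class, method, and constant names (`YcuNormalization`, `totalYcus`, `MAX_YCU_PER_CORE`) are hypothetical.

```java
// Illustrative sketch of the proposed YCU normalization; all names are
// hypothetical, not actual YARN APIs.
public class YcuNormalization {
    /** Cluster-wide cap: the smallest yarnComputeUnitsPerCore on any node. */
    static final int MAX_YCU_PER_CORE = 1000;

    /**
     * Normalize a request when it hits the RM: cap the per-core YCUs at the
     * cluster maximum, then schedule in total YCUs (cores * per-core YCUs).
     */
    static int totalYcus(int cores, int requestedYcuPerCore) {
        int normalized = Math.min(requestedYcuPerCore, MAX_YCU_PER_CORE);
        return cores * normalized;
    }

    public static void main(String[] args) {
        // A 4-core request asking for more per-core power than any node has
        // is capped, reducing to a cores-based view of the same request.
        System.out.println(totalYcus(4, 5000)); // capped to 4 * 1000
        System.out.println(totalYcus(2, 500));  // 2 * 500
    }
}
```

With the cap in place, a request of `yarnComputeUnitsPerCore=infinity` degenerates to pure core counting, which is the trade-off the comment describes.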
[jira] [Commented] (YARN-292) ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt
[ https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739011#comment-13739011 ] Zhijie Shen commented on YARN-292: -- Did more investigation on this issue:
{code}
2012-12-26 08:41:15,030 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Calling allocate on removed or non existant application appattempt_1356385141279_49525_01
{code}
This log indicates that the ArrayIndexOutOfBoundsException happens because the application is not found. There are three possibilities where the application is not found:
1. The application hasn't been added into FifoScheduler#applications. If that is the case, FifoScheduler will not send the APP_ACCEPTED event to the corresponding RMAppAttemptImpl. Without the APP_ACCEPTED event, RMAppAttemptImpl will not enter the SCHEDULED state, and consequently will not go through AMContainerAllocatedTransition to ALLOCATED_SAVING. Therefore, this case is impossible.
2. The application has already been removed from FifoScheduler#applications. To trigger the removal operation, the corresponding RMAppAttemptImpl needs to go through BaseFinalTransition. It is worth mentioning first that RMAppAttemptImpl's transitions are executed on the thread of AsyncDispatcher, while YarnScheduler#handle is invoked on the thread of SchedulerEventDispatcher. The two threads execute in parallel, meaning that the processing of an RMAppAttemptEvent and that of a SchedulerEvent may interleave. However, the processing of two RMAppAttemptEvents, or of two SchedulerEvents, will not. Therefore, AMContainerAllocatedTransition could only start after RMAppAttemptImpl has already finished BaseFinalTransition; but when RMAppAttemptImpl goes through BaseFinalTransition, it also enters a final state, such that AMContainerAllocatedTransition will not happen at all. In conclusion, this case is impossible as well.
3. The application is in FifoScheduler#applications, but RMAppAttemptImpl doesn't get it. First of all, FifoScheduler#applications is a TreeMap, which is not thread safe (FairScheduler#applications is a HashMap, while CapacityScheduler#applications is a ConcurrentHashMap). Second, the methods accessing the map are not consistently synchronized; thus, a read and a write on the same map can operate simultaneously. RMAppAttemptImpl on the thread of AsyncDispatcher will eventually call FifoScheduler#applications#get in AMContainerAllocatedTransition, while FifoScheduler on the thread of SchedulerEventDispatcher will use FifoScheduler#applications#add|remove. Therefore, getting null when the application actually exists can happen under a large number of concurrent operations.
Please feel free to correct me if you think there's something wrong or missing with the analysis. I'm going to work on a patch to fix the problem. ResourceManager throws ArrayIndexOutOfBoundsException while handling CONTAINER_ALLOCATED for application attempt Key: YARN-292 URL: https://issues.apache.org/jira/browse/YARN-292 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.0.1-alpha Reporter: Devaraj K Assignee: Zhijie Shen {code:xml} 2012-12-26 08:41:15,030 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Calling allocate on removed or non existant application appattempt_1356385141279_49525_01 2012-12-26 08:41:15,031 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_ALLOCATED for applicationAttempt application_1356385141279_49525 java.lang.ArrayIndexOutOfBoundsException: 0 at java.util.Arrays$ArrayList.get(Arrays.java:3381) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655) at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490) at
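Possibility 3 above comes down to unsynchronized reads and writes on a TreeMap from two dispatcher threads. A minimal sketch of one possible direction for a fix (not the actual patch): swap the map for a concurrent implementation such as ConcurrentSkipListMap, which keeps TreeMap's sorted-key behavior while making get/put/remove safe without external locking. The class below is a hypothetical stand-in, not FifoScheduler itself.

```java
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Stand-in for FifoScheduler#applications. A plain TreeMap read by
// AsyncDispatcher and written by SchedulerEventDispatcher can return null
// for a key that is actually present; a ConcurrentSkipListMap preserves
// key ordering but tolerates concurrent access.
public class SchedulerApplications {
    private final ConcurrentMap<String, String> applications =
            new ConcurrentSkipListMap<>();

    void addApplication(String attemptId, String app) {
        applications.put(attemptId, app);
    }

    /** Safe to call from another thread without external locking. */
    String getApplication(String attemptId) {
        return applications.get(attemptId);
    }

    void removeApplication(String attemptId) {
        applications.remove(attemptId);
    }
}
```

An alternative, matching what CapacityScheduler already does per the comment, would be a ConcurrentHashMap if sorted iteration is not required.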
[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore
[ https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739013#comment-13739013 ] Bikas Saha commented on YARN-1058: -- It could be that the history service was not properly shut down in the first AM. Earlier, the AM would receive a proper reboot command from the RM and would shut down properly based on the reboot flag being set. Now the AM is getting an exception from the RM and so is not shutting down properly. This should get fixed when we refresh the AM RM token from the saved value.
[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739027#comment-13739027 ] Alejandro Abdelnur commented on YARN-1055: -- [~vinodkv], in theory I agree with you. In practice, there are 2 issues Oozie cannot address in the short term:
* 1. Oozie is still using a launcher MRAM
* 2. mr/pig/hive/sqoop/distcp/... fat clients which are not aware of Yarn restart/recovery.
#1 will be addressed when Oozie implements an OozieLauncherAM instead of piggybacking on an MR Map as driver. #2 is more complicated, and I don't see it being addressed in the short/medium term. By having distinct knobs differentiating recovery after AM failure and after RM restart, Oozie can handle/recover jobs in the same set of failure scenarios possible with Hadoop 1. In order to get folks onto Yarn we need to provide functional parity. I suggest having the 2 knobs Karthik proposed, {{restart.am.on.rm.restart}} and {{restart.am.on.am.failure}}, with {{restart.am.on.rm.restart=$restart.am.on.am.failure}}. Does this sound reasonable? Handle app recovery differently for AM failures and RM restart -- Key: YARN-1055 URL: https://issues.apache.org/jira/browse/YARN-1055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Karthik Kambatla Ideally, we would like to tolerate container, AM, RM failures. App recovery for AM and RM currently relies on the max-attempts config; tolerating AM failures requires it to be > 1 and tolerating RM failure/restart requires it to be >= 1. We should handle these two differently, with two separate configs.
[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739040#comment-13739040 ] Vinod Kumar Vavilapalli commented on YARN-1055: --- This is a completely new issue with Hadoop 2 - we've added new failure conditions. All apps handling AM restarts is really the right way forward, given AMs can now run on random compute nodes that can just fail at any time. Offline I started engaging some of the Pig/Hive community folks. For MR, enough work is already done. Oozie needs to follow suit too. Till work-preserving restart is finished, this is a real pain on RM restarts. Which is why I am proposing that Oozie set max-attempts to 1 for its launcher action so that there are no split-brain issues - RM restart or otherwise. Oozie has a retry mechanism anyway, which will then be submitted as a new application. Adding a separate knob just for restart is a hack I don't see any value in. If I read your proposal correctly, for launcher jobs, you will set restart.am.on.rm.restart to 1 and restart.am.on.am.failure to 1. Right? Which is not correct, as I repeated - node failures will cause the same split-brain issues.
[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739073#comment-13739073 ] Bikas Saha commented on YARN-1055: -- Restart on AM failure is already determined by the default value of max AM retries in the yarn config. Setting that to 1 will prevent the RM from restarting AMs on failure. Thus no need for a new config. Restart after RM restart is already covered by the app client setting max AM retries to 1 on app submission. If an app cannot handle this situation, it should create its own config and set the correct value of 1 on submission. YARN should not add a config, IMO. If I remember right, this config is being imported from Hadoop 1, and the impl of this config in Hadoop 1 is what the RM already does to handle user-defined max AM retries.
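Bikas's argument above hangs entirely off the existing attempt-limit setting rather than any new knob. As a hedged illustration only (the property name below matches recent Hadoop 2.x releases, but verify it against your version's yarn-default.xml), the cluster-wide ceiling he refers to would look like this:

```xml
<!-- yarn-site.xml sketch: cluster-wide ceiling on AM attempts.
     Setting it to 1 disables AM restart after both AM failure and RM
     restart; individual apps can request fewer attempts (but not more)
     at submission time. -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>1</value>
</property>
```

A client like Oozie could instead leave the cluster default alone and cap only its own launcher job at submission, which is the per-app path Bikas describes.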
[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart
[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739109#comment-13739109 ] Karthik Kambatla commented on YARN-1055: From a YARN-user POV, I see it differently. I want to control separately whether my app should be recovered on AM and on RM failures. I might want to recover on RM restart but not on AM failures, or vice versa:
# In case of AM failure, the user might want to check for user errors and hence not recover, but recover in case of RM failures.
# Like Oozie, one might want to recover on AM failures but not on RM failures.
Also, is there a disadvantage to having two knobs for the two failures?
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739178#comment-13739178 ] Rohith Sharma K S commented on YARN-1061: - The actual issue occurred on a 5-node cluster (1 RM and 5 NM). It is hard to reproduce the scenario of the ResourceManager being hung in a real cluster, but the same scenario can be simulated by manually bringing the ResourceManager into a hung state with the help of the Linux command KILL -STOP RM_PID. All the NM-RM calls then wait indefinitely. Another case where an indefinite wait can be observed is adding a new NodeManager while the ResourceManager is in a hung state. NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager. - Key: YARN-1061 URL: https://issues.apache.org/jira/browse/YARN-1061 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Reporter: Rohith Sharma K S It is observed that in one scenario, the NodeManager waits indefinitely for the nodeHeartbeat response from the ResourceManager when the ResourceManager is in a hung state. The NodeManager should get a timeout exception instead of waiting indefinitely.
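The timeout behavior the report asks for can be illustrated with plain java.util.concurrent primitives. This is a generic sketch only, not the NM/RM IPC layer (a real fix would involve the Hadoop IPC client's timeout configuration); the class and method names are hypothetical.

```java
import java.util.concurrent.*;

// Sketch: bound a blocking call with a timeout instead of waiting forever,
// as the NM heartbeat should. Names here are illustrative, not YARN code.
public class BoundedHeartbeat {
    static String heartbeatWithTimeout(Callable<String> heartbeat, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> response = pool.submit(heartbeat);
            try {
                // Either the RM answers in time, or we surface a timeout
                // rather than blocking the heartbeat thread indefinitely.
                return response.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                response.cancel(true);
                return "TIMED_OUT";
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A "hung RM" (as produced by kill -STOP) that never responds in time.
        System.out.println(heartbeatWithTimeout(() -> {
            Thread.sleep(10_000);
            return "OK";
        }, 100)); // prints TIMED_OUT
    }
}
```

On timeout, the NM could then back off and retry, or mark the RM unreachable, instead of hanging.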
[jira] [Commented] (YARN-451) Add more metrics to RM page
[ https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739213#comment-13739213 ] Sangjin Lee commented on YARN-451: -- I think showing this information on the app list page is actually more valuable than the per-app page. If this information is present in the app list page, one can quickly scan the list and get a sense of which job/app is bigger than others in terms of resource consumption. Also, it makes sorting possible. One could in theory visit individual per-app pages one by one to get the same information, but it's so much more useful to have it ready at the overview page so one can get that information quickly. In hadoop 1.0, one could get the same information by looking at the number of total mappers and reducers. That way, we got a very good idea on which ones are big jobs (and thus need to be monitored more closely) without drilling into any of the apps. Add more metrics to RM page --- Key: YARN-451 URL: https://issues.apache.org/jira/browse/YARN-451 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Lohit Vijayarenu Priority: Minor ResourceManager webUI shows list of RUNNING applications, but it does not tell which applications are requesting more resource compared to others. With cluster running hundreds of applications at once it would be useful to have some kind of metric to show high-resource usage applications vs low-resource usage ones. At the minimum showing number of containers is good option.
[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation
[ https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739216#comment-13739216 ] Sandy Ryza commented on YARN-1060: -- Committed to trunk and branch-2. Thanks Niranjan! Two tests in TestFairScheduler are missing @Test annotation --- Key: YARN-1060 URL: https://issues.apache.org/jira/browse/YARN-1060 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Niranjan Singh Labels: newbie Attachments: YARN-1060.patch Amazingly, these tests appear to pass with the annotations added.
[jira] [Commented] (YARN-1060) Two tests in TestFairScheduler are missing @Test annotation
[ https://issues.apache.org/jira/browse/YARN-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739218#comment-13739218 ] Hudson commented on YARN-1060: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4256 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4256/]) YARN-1060. Two tests in TestFairScheduler are missing @Test annotation (Niranjan Singh via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1513724) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java Two tests in TestFairScheduler are missing @Test annotation --- Key: YARN-1060 URL: https://issues.apache.org/jira/browse/YARN-1060 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Niranjan Singh Labels: newbie Fix For: 2.3.0 Attachments: YARN-1060.patch Amazingly, these tests appear to pass with the annotations added.
[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager
[ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739219#comment-13739219 ] prophy Yan commented on YARN-993: - Jian He, I have tried the patch file in the YARN-513 list, but some errors occur when I use the patch. My test version is hadoop-2.0.5-alpha, so can this patch work with this version? Thank you. job can not recovery after restart resourcemanager -- Key: YARN-993 URL: https://issues.apache.org/jira/browse/YARN-993 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Environment: CentOS5.3 JDK1.7.0_11 Reporter: prophy Yan Priority: Critical Recently, I have tested the job recovery function in the YARN framework, but it failed. First, I run the wordcount example program, and then I kill -9 the ResourceManager process on the server when the wordcount process is at map 100%. The job will exit with an error in minutes. Second, I restart the ResourceManager on the server with the 'start-yarn.sh' command. But the failed job (wordcount) cannot continue. The YARN log says the file does not exist!
Here is the YARN log: 2013-07-23 16:05:21,472 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1374564764970_0001_02_01, NodeId: mv8.mzhen.cn:52117, NodeHttpAddress: mv8.mzhen.cn:8042, Resource: memory:2048, vCores:1, Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_02 2013-07-23 16:05:21,473 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_02 State change from ALLOCATED to LAUNCHED 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_02 State change from LAUNCHED to FAILED 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_02 exited with exitCode: -1000 due to: RemoteTrace: java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815) at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51) at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) at LocalTrace: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217) at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819) at
[jira] [Commented] (YARN-451) Add more metrics to RM page
[ https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739225#comment-13739225 ] Vinod Kumar Vavilapalli commented on YARN-451: -- Agreed about having it on the listing page, but that page is already dense. We have to do some basic UI design. Again, like I mentioned, Hadoop-1 was different, as the number of maps and reduces doesn't change after the job starts. Whereas in Hadoop-2, the memory/cores allocated slowly increase over time, so it may or may not be of much use. I am ambivalent about adding it.