[jira] [Commented] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535626#comment-16535626 ] tangshangwen commented on YARN-8496:

I'll update a patch later.

> The capacity scheduler uses label to cause vcore to be incorrect
> ----------------------------------------------------------------
>                 Key: YARN-8496
>                 URL: https://issues.apache.org/jira/browse/YARN-8496
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.7.6
>            Reporter: tangshangwen
>            Assignee: tangshangwen
>            Priority: Major
>        Attachments: yarn-bug.png
>
> In my cluster, I used label scheduling, and I found that it caused the vcore
> count of the cluster to be incorrect.
>
> capacity-scheduler.xml
> {code:xml}
> <configuration>
>   <property>
>     <name>yarn.scheduler.capacity.root.queues</name>
>     <value>support</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.support.capacity</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.support.accessible-node-labels</name>
>     <value>test1</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.support.accessible-node-labels.test1.capacity</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.accessible-node-labels.test1.capacity</name>
>     <value>100</value>
>   </property>
> </configuration>
> {code}
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Component/s: (was: resourcemanager)
             capacity scheduler
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used label scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used label scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: (was: image-2018-07-05-18-29-32-697.png)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used label scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Commented] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533497#comment-16533497 ] tangshangwen commented on YARN-8496:

I think it's important to check that the node's available resources meet the minimum resource allocation:

{code:java}
// ParentQueue.java
@Override
public synchronized CSAssignment assignContainers(Resource clusterResource,
    FiCaSchedulerNode node, ResourceLimits resourceLimits) {
  CSAssignment assignment =
      new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
  Set<String> nodeLabels = node.getLabels();
  // Skip the node when even the minimum allocation no longer fits into its
  // available resources (in both the memory and the vcores dimensions).
  if (!Resources.fitsIn(minimumAllocation, node.getAvailableResource())) {
    return assignment;
  }
  ...
}
{code}
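For context on the guard above: Resources.fitsIn compares every resource dimension, so a node is rejected once either memory or vcores are exhausted. A minimal standalone sketch (not part of the proposed patch; the resource values are made up):

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class FitsInDemo {
  public static void main(String[] args) {
    // 1 GB / 1 vcore minimum allocation.
    Resource minimumAllocation = Resources.createResource(1024, 1);
    // A node with plenty of memory left but all vcores already allocated.
    Resource available = Resources.createResource(8192, 0);
    // Prints false: the vcores dimension does not fit, so the scheduler
    // should not assign another container to this node.
    System.out.println(Resources.fitsIn(minimumAllocation, available));
  }
}
{code}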
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: image-2018-07-05-18-29-32-697.png
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: (was: image-2018-07-05-18-16-10-851.png)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: yarn-bug.png
Description: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: image-2018-07-05-18-16-10-851.png
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-12-45-837.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Created] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
tangshangwen created YARN-8496:
---
Summary: The capacity scheduler uses label to cause vcore to be incorrect
Key: YARN-8496
URL: https://issues.apache.org/jira/browse/YARN-8496
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.6
Reporter: tangshangwen
Assignee: tangshangwen

I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-12-45-837.png!

capacity-scheduler.xml
{code:xml}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>support</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.support.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.support.accessible-node-labels</name>
    <value>test1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.support.accessible-node-labels.test1.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.test1.capacity</name>
    <value>100</value>
  </property>
</configuration>
{code}
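As background for reproducing the setup above: the label a queue references must already be registered in the cluster and mapped onto nodes. With the node-labels feature enabled, that is typically done with rmadmin (the host name below is illustrative, not from the report):

{noformat}
yarn rmadmin -addToClusterNodeLabels "test1"
yarn rmadmin -replaceLabelsOnNode "host1.example.com=test1"
{noformat}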
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657056#comment-15657056 ] tangshangwen commented on YARN-5795:

Hi [~templedf], would you like to review the patch, or give me some pointers on the next step?

> FairScheduler set AppMaster vcores didn't work
> ----------------------------------------------
>                 Key: YARN-5795
>                 URL: https://issues.apache.org/jira/browse/YARN-5795
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: tangshangwen
>            Assignee: tangshangwen
>        Attachments: 0001-YARN-5795.patch
[jira] [Updated] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5795:
---
Attachment: 0001-YARN-5795.patch
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614804#comment-15614804 ] tangshangwen commented on YARN-5795:

Hi [~kasha], I think DefaultResourceCalculator#normalize should preserve the vcores of the original request:

{code:title=DefaultResourceCalculator.java|borderStyle=solid}
@Override
public Resource normalize(Resource r, Resource minimumResource,
    Resource maximumResource, Resource stepFactor) {
  int normalizedMemory = Math.min(
      roundUp(
          Math.max(r.getMemory(), minimumResource.getMemory()),
          stepFactor.getMemory()),
      maximumResource.getMemory());
  // createResource(int) defaults vcores to 1, dropping the requested value.
  return Resources.createResource(normalizedMemory);
}
{code}

change to

{code:title=DefaultResourceCalculator.java|borderStyle=solid}
@Override
public Resource normalize(Resource r, Resource minimumResource,
    Resource maximumResource, Resource stepFactor) {
  int normalizedMemory = Math.min(
      roundUp(
          Math.max(r.getMemory(), minimumResource.getMemory()),
          stepFactor.getMemory()),
      maximumResource.getMemory());
  // Keep the vcores that were actually requested.
  return Resources.createResource(normalizedMemory, r.getVirtualCores());
}
{code}
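To make the effect concrete, here is a standalone sketch (not part of the patch; the resource values are made up) showing the unpatched normalize dropping the requested vcores:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NormalizeDemo {
  public static void main(String[] args) {
    DefaultResourceCalculator calc = new DefaultResourceCalculator();
    Resource request = Resources.createResource(1500, 2); // AM asks for 2 vcores
    Resource min  = Resources.createResource(1024, 1);
    Resource max  = Resources.createResource(8192, 32);
    Resource step = Resources.createResource(1024, 1);
    Resource normalized = calc.normalize(request, min, max, step);
    // Unpatched: memory is rounded up to 2048, but vcores are silently
    // reset to 1, so the AM container never gets its second vcore.
    System.out.println(normalized); // <memory:2048, vCores:1>
  }
}
{code}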
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614196#comment-15614196 ] tangshangwen commented on YARN-5795:

The allocate method normalizes container requests with DOMINANT_RESOURCE_CALCULATOR; the AppMaster request should be normalized the same way.

{code:title=FairScheduler.java|borderStyle=solid}
@Override
public Allocation allocate(ApplicationAttemptId appAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {

  // Make sure this application exists
  FSAppAttempt application = getSchedulerApp(appAttemptId);
  if (application == null) {
    LOG.info("Calling allocate on removed "
        + "or non existent application " + appAttemptId);
    return EMPTY_ALLOCATION;
  }

  // Sanity check
  SchedulerUtils.normalizeRequests(ask, DOMINANT_RESOURCE_CALCULATOR,
      clusterResource, minimumAllocation, getMaximumResourceCapability(),
      incrAllocation);
  ...
}
{code}
[jira] [Issue Comment Deleted] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5795:
---
Comment: was deleted

(was: I think if we replace RESOURCE_CALCULATOR with DOMINANT_RESOURCE_CALCULATOR we should be able to fix the problem
{code:title=FairScheduler.java|borderStyle=solid}
@Override
public ResourceCalculator getResourceCalculator() {
  // return RESOURCE_CALCULATOR;
  return DOMINANT_RESOURCE_CALCULATOR;
}
{code})
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614149#comment-15614149 ] tangshangwen commented on YARN-5795:

I think if we replace RESOURCE_CALCULATOR with DOMINANT_RESOURCE_CALCULATOR, we should be able to fix the problem:

{code:title=FairScheduler.java|borderStyle=solid}
@Override
public ResourceCalculator getResourceCalculator() {
  // return RESOURCE_CALCULATOR;
  return DOMINANT_RESOURCE_CALCULATOR;
}
{code}
[jira] [Created] (YARN-5795) FairScheduler set AppMaster vcores didn't work
tangshangwen created YARN-5795:
---
Summary: FairScheduler set AppMaster vcores didn't work
Key: YARN-5795
URL: https://issues.apache.org/jira/browse/YARN-5795
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, we use the Linux container. I wanted to increase the number of CPUs to get more CPU time slices, but it did not take effect. I set yarn.app.mapreduce.am.resource.cpu-vcores = 2, but I found this in the resourcemanager log:

{noformat}
[2016-10-27T16:36:37.280 08:00] [INFO] resourcemanager.scheduler.SchedulerNode.allocateContainer(SchedulerNode.java 153) [ResourceManager Event Processor] : Assigned container container_1477059529836_336635_01_01 of capacity
{noformat}

This is because scheduler.getResourceCalculator() only computes memory:

{code:title=RMAppManager.java|borderStyle=solid}
private ResourceRequest validateAndCreateResourceRequest(
    ApplicationSubmissionContext submissionContext, boolean isRecovery)
    throws InvalidResourceRequestException {
  ...
  SchedulerUtils.normalizeRequest(amReq,
      scheduler.getResourceCalculator(),
      scheduler.getClusterResource(),
      scheduler.getMinimumResourceCapability(),
      scheduler.getMaximumResourceCapability(),
      scheduler.getMinimumResourceCapability());
  ...
}
{code}
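For reference, the setting that failed to take effect is the per-job MapReduce AM resource request; it would typically be set in mapred-site.xml or the job configuration like this (an illustrative snippet, not taken from the report):

{code:xml}
<property>
  <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
  <value>2</value>
</property>
{code}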
[jira] [Updated] (YARN-5136) Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
[ https://issues.apache.org/jira/browse/YARN-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5136:
---
Assignee: Wilfred Spiegelenburg (was: tangshangwen)

> Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> ------------------------------------------------------------------
>                 Key: YARN-5136
>                 URL: https://issues.apache.org/jira/browse/YARN-5136
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: tangshangwen
>            Assignee: Wilfred Spiegelenburg
>
> Moving an app caused the RM to exit:
> {noformat}
> 2016-05-24 23:20:47,202 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@ea94c3b does not exist in queue [root.bdp_xx.bdp_mart_xx_formal, demand=<...>, running=<..., vCores:13422>, share=<...>, w=<weight=1.0>]
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:119)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:779)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1231)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e04_1464073905025_15410_01_001759 Container Transitioned from ACQUIRED to RELEASED
> 2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}
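From the stack trace, the fatal path is FSLeafQueue.removeApp throwing when the attempt is not found in the queue's app lists, which propagates out of the scheduler event dispatcher and kills the RM. A rough, self-contained paraphrase of that guard (reconstructed from the trace for illustration, not verbatim 2.7.1 source):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: after a queue move, the APP_ATTEMPT_REMOVED event may
// still point at the old queue, where the attempt no longer exists.
class LeafQueueSketch<A> {
  private final List<A> runnableApps = new ArrayList<>();
  private final List<A> nonRunnableApps = new ArrayList<>();

  void removeApp(A app) {
    if (!runnableApps.remove(app) && !nonRunnableApps.remove(app)) {
      // Unguarded throw: escapes the event dispatcher, so the RM exits.
      throw new IllegalStateException(
          "Given app to remove " + app + " does not exist in queue " + this);
    }
  }
}
{code}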
[jira] [Commented] (YARN-5136) Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
[ https://issues.apache.org/jira/browse/YARN-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532199#comment-15532199 ] tangshangwen commented on YARN-5136:

[~wilfreds] OK
[jira] [Commented] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426631#comment-15426631 ] tangshangwen commented on YARN-5535:

I'm sorry, it is after recovery, and I found the event queue grew very large:

{noformat}
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 643000
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 644000
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 645000
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 646000
{noformat}

> Remove RMDelegationToken make resourcemanager recovery very slow
> ----------------------------------------------------------------
>                 Key: YARN-5535
>                 URL: https://issues.apache.org/jira/browse/YARN-5535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: tangshangwen
>            Assignee: tangshangwen
>
> In our cluster, I found that when restarting the RM, recovery is very slow. This is my log:
> {noformat}
> [2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
> [2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
> ... (the same removeStoredToken / Removing RMDelegationToken pair repeats for sequence numbers 737878 down to 737873, one token at a time) ...
> [2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
> [2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackRes
> {noformat}
[jira] [Commented] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426107#comment-15426107 ] tangshangwen commented on YARN-5535:

Thanks [~sunilg] for the comments. I think removing the RMDelegationToken and SequenceNumber entries may take a long time, so the dispatcher can't handle other events:

{code:title=ZKRMStateStore.java|borderStyle=solid}
@Override
protected synchronized void removeRMDelegationTokenState(
    RMDelegationTokenIdentifier rmDTIdentifier) throws Exception {
  String nodeRemovePath =
      getNodePath(delegationTokensRootPath, DELEGATION_TOKEN_PREFIX
          + rmDTIdentifier.getSequenceNumber());
  if (LOG.isDebugEnabled()) {
    LOG.debug("Removing RMDelegationToken_"
        + rmDTIdentifier.getSequenceNumber());
  }
  // One synchronous ZooKeeper exists() plus delete() per token.
  if (existsWithRetries(nodeRemovePath, false) != null) {
    ArrayList<Op> opList = new ArrayList<Op>();
    opList.add(Op.delete(nodeRemovePath, -1));
    doDeleteMultiWithRetries(opList);
  } else {
    LOG.debug("Attempted to delete a non-existing znode " + nodeRemovePath);
  }
}
{code}
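Since each removal above is a separate synchronous ZooKeeper round trip inside the state-store dispatcher, one direction worth exploring is batching many expired-token deletions into a single multi() call. A rough sketch of the idea (a hypothetical helper, not existing ZKRMStateStore code; the znode paths are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

public class BatchedTokenCleanup {
  // Hypothetical helper: delete many token znodes in one request instead of
  // one synchronous round trip per token.
  static void deleteTokens(ZooKeeper zk, List<Integer> sequenceNumbers,
      String tokensRootPath) throws Exception {
    List<Op> ops = new ArrayList<>();
    for (int seq : sequenceNumbers) {
      // -1 means "any version", matching the Op.delete usage above.
      ops.add(Op.delete(tokensRootPath + "/RMDelegationToken_" + seq, -1));
    }
    zk.multi(ops); // a single round trip for the whole batch
  }
}
{code}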
[jira] [Updated] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5535:
---
Description: In our cluster, I found that when restarting the RM, recovery is very slow. This is my log:

{noformat}
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737878
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737877
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737876
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737875
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737874
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737873
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackResolver.coreResolve(RackResolver.java:109) [IPC Server handler 0 on 8031] : Resolved -7056.hadoop.xxx.local to /rack/rack5118
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737872
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.570+08:00] [INFO] server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:343) [IPC Server handler 0 on 8031] : NodeManager from node x-7056.hadoop.xxx.local(cmPort: 50086 httpPort: 8042) registered with capability: <...>, assigned nodeId xx-7056.hadoop.xxx.local:50086
[2016-08-12T19:43:21.572+08:00] [INFO] resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:424) [AsyncDispatcher event handler] : xx-7056.hadoop.xxx.local:50086 Node Transitioned from NEW to RUNNING
[2016-08-12T19:43:21.576+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 1000
[2016-08-12T19:43:21.577+08:00] [INFO] scheduler.
{noformat}
[jira] [Updated] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5535:
---
Description: In our cluster, I found that RM recovery is very slow when the RM is restarted; here is my log:
{noformat}
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737878
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737877
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737876
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737875
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737874
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737873
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackResolver.coreResolve(RackResolver.java:109) [IPC Server handler 0 on 8031] : Resolved -7056.hadoop.jd.local to /rack/rack5118
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737872
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.570+08:00] [INFO] server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:343) [IPC Server handler 0 on 8031] : NodeManager from node x-7056.hadoop.jd.local(cmPort: 50086 httpPort: 8042) registered with capability: , assigned nodeId xx-7056.hadoop.jd.local:50086
[2016-08-12T19:43:21.572+08:00] [INFO] resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:424) [AsyncDispatcher event handler] : xx-7056.hadoop.jd.local:50086 Node Transitioned from NEW to RUNNING
[2016-08-12T19:43:21.576+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 1000
[2016-08-12T19:43:21.577+08:00] [INFO] scheduler.fair
{noformat}
[jira] [Created] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
tangshangwen created YARN-5535:
--
Summary: Remove RMDelegationToken make resourcemanager recovery very slow
Key: YARN-5535
URL: https://issues.apache.org/jira/browse/YARN-5535
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found that RM recovery is very slow when the RM is restarted; here is my log:
{noformat}
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737878
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737877
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737876
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737875
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737874
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737873
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackResolver.coreResolve(RackResolver.java:109) [IPC Server handler 0 on 8031] : Resolved BJHC-Jmartad-7056.hadoop.jd.local to /rack/rack5118
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737872
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.570+08:00] [INFO] server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:343) [IPC Server handler 0 on 8031] : NodeManager from node BJHC-Jmartad-7056.hadoop.jd.local(cmPort: 50086 httpPort: 8042) registered with capability: , assigned nodeId BJHC-Jmartad-7056.hadoop.jd.local:50086
[2016-08-12T19:43:21.572+08:00] [INFO] resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:424) [AsyncDispatcher event handler] : BJHC-Jmartad-7056.hadoop.jd.local:50086 Node Transitioned from NEW to RUNNING
{noformat}
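[Editor's note] The slow recovery above comes from the state store performing one synchronous removal per expired RMDelegationToken while recovery waits. A minimal sketch of the usual remedy, coalescing removals into batches on a background thread; the class and method names here (BatchingTokenRemover, removeTokensBatch) are hypothetical, not the actual RMStateStore API:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: coalesce per-token removals into batches so
// recovery is not blocked on one state-store round trip per token.
public class BatchingTokenRemover implements Runnable {
    private static final int MAX_BATCH = 100;
    private final BlockingQueue<Long> pending = new LinkedBlockingQueue<>();

    // Called from the recovery path instead of a synchronous store delete.
    public void scheduleRemoval(long tokenSequenceNumber) {
        pending.add(tokenSequenceNumber);
    }

    @Override
    public void run() {
        List<Long> batch = new ArrayList<>(MAX_BATCH);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(pending.take());             // block for the first token
                pending.drainTo(batch, MAX_BATCH - 1); // grab whatever else is queued
                removeTokensBatch(batch);              // one store operation per batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder for a batched state-store delete (for a ZooKeeper-backed
    // store this would be a single multi-operation).
    private void removeTokensBatch(List<Long> sequenceNumbers) {
        System.out.println("removing " + sequenceNumbers.size() + " tokens in one store call");
    }
}
{code}
The design point is simply that recovery only enqueues; the expensive round trips are amortized over whole batches off the critical path.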
[jira] [Resolved] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-5482.
Resolution: Duplicate

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen reopened YARN-5482:

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-5482.
Resolution: Fixed

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411603#comment-15411603 ] tangshangwen commented on YARN-5482:
Thanks [~bibinchundatt]

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5482:
---
Attachment: oom2.png
            oom1.png

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5482) ContainerMetric Lead to memory leaks
tangshangwen created YARN-5482:
--
Summary: ContainerMetric Lead to memory leaks
Key: YARN-5482
URL: https://issues.apache.org/jira/browse/YARN-5482
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
{code}
export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
{code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
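[Editor's note] The heap dumps point at per-container metrics objects that stay registered in a process-wide registry after their containers finish. A self-contained sketch of that failure mode and the obvious remedy, unregistering on completion; the registry and method names below are illustrative, not the real ContainerMetrics API:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the leak: one metrics object per container,
// kept in a static map and never removed, so a long-lived NodeManager
// accumulates one entry per container it has ever run.
public class PerContainerMetrics {
    private static final Map<String, PerContainerMetrics> REGISTRY =
        new ConcurrentHashMap<>();

    private final long[] usageSamples = new long[1024]; // per-container state

    public static PerContainerMetrics forContainer(String containerId) {
        // Leak when entries are only ever added.
        return REGISTRY.computeIfAbsent(containerId, id -> new PerContainerMetrics());
    }

    // The remedy: drop the registration once the container is done. In the
    // real NodeManager this corresponds to unregistering the per-container
    // metrics source from the metrics system.
    public static void finished(String containerId) {
        REGISTRY.remove(containerId);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            forContainer("container_" + i);
            finished("container_" + i); // without this line the map grows without bound
        }
        System.out.println("still registered: " + REGISTRY.size());
    }
}
{code}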
[jira] [Created] (YARN-5187) when the preempt reduce happen, map resources priority should be higher
tangshangwen created YARN-5187:
--
Summary: when the preempt reduce happen, map resources priority should be higher
Key: YARN-5187
URL: https://issues.apache.org/jira/browse/YARN-5187
Project: Hadoop YARN
Issue Type: Improvement
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found jobs hanging for a long time. When cluster resources are tight, many reduces are killed while maps have no resources to run. I think that when reduce preemption happens, map resource requests should get higher priority.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
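[Editor's note] For background, and hedged from memory: the MapReduce AM has historically requested reduces at a numerically lower, i.e. stronger, YARN priority than maps (10 for reduces versus 20 for maps), which would explain why re-requested reduces keep landing ahead of starving maps after preemption. Treat the constants as assumptions to verify against RMContainerAllocator in your branch. A small sketch of the ordering:
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;

// In YARN, a lower priority number wins. These values mirror the
// MapReduce AM's historical defaults (an assumption worth verifying).
public class MapReducePriorities {
    static final Priority PRIORITY_REDUCE = Priority.newInstance(10);
    static final Priority PRIORITY_MAP = Priority.newInstance(20);

    public static void main(String[] args) {
        boolean reduceFirst =
            PRIORITY_REDUCE.getPriority() < PRIORITY_MAP.getPriority();
        // true: rescheduled reduces outrank maps, reproducing the hang
        System.out.println("reduces requested ahead of maps: " + reduceFirst);
    }
}
{code}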
[jira] [Created] (YARN-5136) Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
tangshangwen created YARN-5136:
--
Summary: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
Key: YARN-5136
URL: https://issues.apache.org/jira/browse/YARN-5136
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

Moving an app caused the RM to exit:
{noformat}
2016-05-24 23:20:47,202 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@ea94c3b does not exist in queue [root.bdp_xx.bdp_mart_xx_formal, demand=, running=, share=, w=]
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:779)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1231)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
at java.lang.Thread.run(Thread.java:745)
2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e04_1464073905025_15410_01_001759 Container Transitioned from ACQUIRED to RELEASED
2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
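[Editor's note] A plausible reading of the trace: a queue move races with the attempt removal, so removal is attempted against the app's old queue and FSLeafQueue.removeApp throws, killing the event-dispatcher thread. A self-contained sketch of the defensive idea behind the usual fix, resolving the attempt's current queue at removal time and tolerating a miss instead of throwing; the class and method names are simplified assumptions, not the exact FairScheduler internals:
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model: apps can move between queues while a removal event is
// in flight. Looking the queue up at removal time, and ignoring a miss,
// keeps the dispatcher thread alive.
public class QueueRegistry {
    private final Map<String, Set<String>> appsByQueue = new ConcurrentHashMap<>();
    private final Map<String, String> queueOfApp = new ConcurrentHashMap<>();

    public void add(String app, String queue) {
        appsByQueue.computeIfAbsent(queue, q -> ConcurrentHashMap.newKeySet()).add(app);
        queueOfApp.put(app, queue);
    }

    public void move(String app, String toQueue) {
        String from = queueOfApp.put(app, toQueue);
        if (from != null) {
            appsByQueue.get(from).remove(app);
        }
        appsByQueue.computeIfAbsent(toQueue, q -> ConcurrentHashMap.newKeySet()).add(app);
    }

    // Defensive removal: consult the current queue, tolerate absence.
    public void removeAttempt(String app) {
        String queue = queueOfApp.remove(app);
        if (queue == null) {
            return; // already gone or moved: log it, don't throw
        }
        Set<String> apps = appsByQueue.get(queue);
        if (apps != null) {
            apps.remove(app);
        }
    }

    public static void main(String[] args) {
        QueueRegistry reg = new QueueRegistry();
        reg.add("appattempt_1", "root.a");
        reg.move("appattempt_1", "root.b"); // move races with removal
        reg.removeAttempt("appattempt_1");  // resolves the current queue: no throw
    }
}
{code}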
[jira] [Created] (YARN-5134) Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
tangshangwen created YARN-5134:
--
Summary: Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
Key: YARN-5134
URL: https://issues.apache.org/jira/browse/YARN-5134
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found the RM cannot handle this event:
{noformat}
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-138100.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-139122.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
{noformat}
and the event queue is very large:
{noformat}
2016-05-24 14:24:07,302 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,295 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
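[Editor's note] These are stale FINISHED_CONTAINERS_PULLED_BY_AM notifications arriving for nodes the restarted RM only knows as NEW, and each one costs a logged exception while the dispatcher queue backs up. One plausible low-risk fix, in the style of the no-op transitions RMNodeImpl's state machine already uses to swallow out-of-order events; a sketch to check against your branch, not the committed patch:
{code:java}
// Hypothetical addition to RMNodeImpl's StateMachineFactory builder:
// accept the stale AM-pull notification on a NEW node as a no-op instead
// of throwing InvalidStateTransitonException for every queued event.
.addTransition(NodeState.NEW, NodeState.NEW,
    RMNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM)
{code}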
[jira] [Created] (YARN-5133) Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
tangshangwen created YARN-5133:
--
Summary: Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
Key: YARN-5133
URL: https://issues.apache.org/jira/browse/YARN-5133
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found the RM cannot handle this event:
{noformat}
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-138100.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-139122.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
{noformat}
and the event queue is very large:
{noformat}
2016-05-24 14:24:07,302 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,295 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279855#comment-15279855 ] tangshangwen commented on YARN-5051:
Yes, thanks [~kshukla]

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273834#comment-15273834 ] tangshangwen commented on YARN-5051:
I think we should also handle the NEW state in the updateMetricsForRejoinedNode method, like this:
{code:title=RMNodeImpl.java|borderStyle=solid}
private void updateMetricsForRejoinedNode(NodeState previousNodeState) {
  ClusterMetrics metrics = ClusterMetrics.getMetrics();
  metrics.incrNumActiveNodes();

  switch (previousNodeState) {
  case LOST:
    metrics.decrNumLostNMs();
    break;
  case REBOOTED:
    metrics.decrNumRebootedNMs();
    break;
  case NEW:
  case DECOMMISSIONED:
    metrics.decrDecommisionedNMs();
    break;
  case UNHEALTHY:
    metrics.decrNumUnhealthyNMs();
    break;
  default:
    LOG.debug("Unexpected previous node state");
  }
}
{code}

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273825#comment-15273825 ] tangshangwen commented on YARN-5051:
When the NodeManager starts, it triggers the AddNodeTransition; for a node in the NEW state, the DecommisionedNMs value is not decremented:
{code:title=RMNodeImpl.java|borderStyle=solid}
public static class AddNodeTransition implements
    SingleArcTransition<RMNodeImpl, RMNodeEvent> {

  @Override
  public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
    // Inform the scheduler
    RMNodeStartedEvent startEvent = (RMNodeStartedEvent) event;
    List<NMContainerStatus> containers = null;

    String host = rmNode.nodeId.getHost();
    if (rmNode.context.getInactiveRMNodes().containsKey(host)) {
      // Old node rejoining
      RMNode previouRMNode = rmNode.context.getInactiveRMNodes().get(host);
      rmNode.context.getInactiveRMNodes().remove(host);
      rmNode.updateMetricsForRejoinedNode(previouRMNode.getState());
    } else {
      ClusterMetrics.getMetrics().incrNumActiveNodes();
{code}

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273819#comment-15273819 ] tangshangwen commented on YARN-5051:
The same problem also occurs when the include hosts file is not empty.

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273815#comment-15273815 ] tangshangwen commented on YARN-5051:
I think we should put the decommissioned node into InactiveRMNodes when its registration is refused:
{code:title=ResourceTrackerService.java|borderStyle=solid}
RMNode rmNode = new RMNodeImpl(nodeId, rmContext, host, cmPort, httpPort,
    resolve(host), capability, nodeManagerVersion);

// Check if this node is a 'valid' node
if (!this.nodesListManager.isValidNode(host)) {
  String message = "Disallowed NodeManager from " + host
      + ", Sending SHUTDOWN signal to the NodeManager.";
  LOG.info(message);
  response.setDiagnosticsMessage(message);
  response.setNodeAction(NodeAction.SHUTDOWN);
  this.rmContext.getInactiveRMNodes().put(rmNode.getNodeID().getHost(), rmNode);
  return response;
}
{code}

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5051:
---
Description: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, execute the command
{noformat}
yarn rmadmin -refreshNodes
{noformat}
start nodemanager , the decommissioned nodes num can not update

was: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, execute the command
{noformat}
yarn rmadmin -refreshNodes
{noformat}
start nodemanager , the decommissioned nodes can not update

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5051:
---
Description: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, execute the command
{noformat}
yarn rmadmin -refreshNodes
{noformat}
start nodemanager , the decommissioned nodes can not update

was: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, and start nodemanager , the decommissioned nodes can not update

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes do not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5051:
---
Attachment: rm.png

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and started the NodeManager, but the decommissioned nodes do not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
tangshangwen created YARN-5051:
--
Summary: The RM can't update the Decommissioned Nodes Metric
Key: YARN-5051
URL: https://issues.apache.org/jira/browse/YARN-5051
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
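[Editor's note] To make the reproduction concrete, the decommission/recommission sequence described above is roughly the following; the exclude-file path is illustrative (it is whatever yarn.resourcemanager.nodes.exclude-path points at in your configuration):
{noformat}
# decommission: add the NM host to the exclude file, then refresh
echo nm-host.example.com >> /etc/hadoop/conf/exclude_mapred_host
yarn rmadmin -refreshNodes

# recommission: remove the host again, refresh, and restart the NM
sed -i '/nm-host.example.com/d' /etc/hadoop/conf/exclude_mapred_host
yarn rmadmin -refreshNodes
yarn-daemon.sh start nodemanager

# the bug: the Decommissioned Nodes metric on the RM stays stale
{noformat}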
[jira] [Created] (YARN-5021) -1B of 3 GB physical memory used
tangshangwen created YARN-5021:
--
Summary: -1B of 3 GB physical memory used
Key: YARN-5021
URL: https://issues.apache.org/jira/browse/YARN-5021
Project: Hadoop YARN
Issue Type: Bug
Reporter: tangshangwen
Assignee: tangshangwen

In my cluster, I found the following in the NodeManager log:
{noformat}
2016-05-01 21:02:46,512 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18210 for container-id container_1461592647020_15092_01_79: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18405 for container-id container_1461592647020_15092_01_77: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,545 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18893 for container-id container_1461592647020_15090_01_24: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,561 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18555 for container-id container_1461592647020_15092_01_73: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,577 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18510 for container-id container_1461592647020_15090_01_20: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
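[Editor's note] A plausible reading of the -1B lines: the monitor's process-tree probe returns a -1 sentinel when it cannot read the process's usage (for instance, the process exited between monitoring ticks), and that sentinel is then formatted as if it were a real byte count. A self-contained sketch of the formatting problem; the sentinel name and the cause are assumptions consistent with the log, not a confirmed diagnosis:
{code:java}
// Sketch: report the "unavailable" sentinel explicitly instead of
// printing it as a byte count ("-1B of 3 GB physical memory used").
public class MemorySampleDemo {
    static final long UNAVAILABLE = -1; // sentinel, as in the process-tree probes

    static String describe(long rssBytes, long limitBytes) {
        if (rssBytes == UNAVAILABLE) {
            return "physical memory usage unavailable (process gone between ticks?)";
        }
        return rssBytes + "B of " + limitBytes + "B physical memory used";
    }

    public static void main(String[] args) {
        System.out.println(describe(UNAVAILABLE, 3L << 30)); // the "-1" case
        System.out.println(describe(1L << 30, 3L << 30));    // a normal sample
    }
}
{code}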
[jira] [Updated] (YARN-4598) Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4598:
---
Attachment: YARN-4598.1.patch

I submitted a patch.

> Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
>
>
> Key: YARN-4598
> URL: https://issues.apache.org/jira/browse/YARN-4598
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: YARN-4598.1.patch
>
>
> In our cluster, I found that the container has some problems in state transitions; here is my log:
> {noformat}
> 2016-01-12 17:42:50,088 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_87 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2016-01-12 17:42:50,088 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [CONTAINER_CLEANEDUP_AFTER_KILL], eventType: [RESOURCE_FAILED]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1127)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1078)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1071)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:744)
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to null
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1452588902899_0001 CONTAINERID=container_1452588902899_0001_01_94
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4598) Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105369#comment-15105369 ] tangshangwen commented on YARN-4598:
I think we should add a transition. Any suggestions?
{noformat}
.addTransition(ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
    ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
    ContainerEventType.RESOURCE_FAILED)
{noformat}

> Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
>
>
> Key: YARN-4598
> URL: https://issues.apache.org/jira/browse/YARN-4598
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
>
> In our cluster, I found that the container has some problems in state transitions; here is my log:
> {noformat}
> 2016-01-12 17:42:50,088 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_87 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2016-01-12 17:42:50,088 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [CONTAINER_CLEANEDUP_AFTER_KILL], eventType: [RESOURCE_FAILED]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1127)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1078)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1071)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:744)
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to null
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1452588902899_0001 CONTAINERID=container_1452588902899_0001_01_94
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4598) Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
tangshangwen created YARN-4598:
--
Summary: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
Key: YARN-4598
URL: https://issues.apache.org/jira/browse/YARN-4598
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found that the container has some problems in state transitions; here is my log:
{noformat}
2016-01-12 17:42:50,088 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_87 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-01-12 17:42:50,088 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [CONTAINER_CLEANEDUP_AFTER_KILL], eventType: [RESOURCE_FAILED]
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1127)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1078)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1071)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:744)
2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to null
2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1452588902899_0001 CONTAINERID=container_1452588902899_0001_01_94
2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen reopened YARN-4539:

> CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
> -
>
> Key: YARN-4539
> URL: https://issues.apache.org/jira/browse/YARN-4539
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: YARN-4539.1.patch
>
>
> When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
> {noformat}
> 2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
> java.io.IOException: Failed to initialize FairScheduler
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
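[Editor's note] The stack trace is the classic stop-before-init hazard in composite services: serviceStop() runs even though serviceInit() never completed, so fields created during init are still null. A minimal sketch of the usual guard, using a stand-in class; the dispatcher field plays the role of whatever CommonNodeLabelsManager.stopDispatcher dereferences:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.yarn.event.AsyncDispatcher;

// Sketch of the defensive pattern: serviceStop() must tolerate being
// called on a service whose serviceInit() failed part-way, so every
// init-created field gets a null check before use.
public class LabelsManagerLike extends AbstractService {
    private AsyncDispatcher dispatcher; // created in serviceInit, may stay null

    public LabelsManagerLike() {
        super("LabelsManagerLike");
    }

    @Override
    protected void serviceInit(Configuration conf) throws Exception {
        dispatcher = new AsyncDispatcher();
        dispatcher.init(conf);
        super.serviceInit(conf);
    }

    @Override
    protected void serviceStop() throws Exception {
        if (dispatcher != null) { // guard: init may never have run to completion
            dispatcher.stop();
        }
        super.serviceStop();
    }
}
{code}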
[jira] [Resolved] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-4539.
Resolution: Duplicate

> CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
> -
>
> Key: YARN-4539
> URL: https://issues.apache.org/jira/browse/YARN-4539
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: YARN-4539.1.patch
>
>
> When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
> {noformat}
> 2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
> java.io.IOException: Failed to initialize FairScheduler
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15084576#comment-15084576 ] tangshangwen commented on YARN-4539: OK, thanks [~bibinchundatt]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-4539. Resolution: Fixed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081522#comment-15081522 ] tangshangwen commented on YARN-4539: Yes, thank you for your comment! :D -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4539:
---
Description:
When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
{noformat}
2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
java.io.IOException: Failed to initialize FairScheduler
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
        at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
{noformat}
was:
When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
{noformat}
2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
java.io.IOException: Failed to initialize FairScheduler
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
{noformat}
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081301#comment-15081301 ] tangshangwen commented on YARN-4539: I submitted a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4539: --- Attachment: YARN-4539.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081292#comment-15081292 ] tangshangwen commented on YARN-4539: I think stopDispatcher should check whether the dispatcher is null before stopping it.
{code:title=CommonNodeLabelsManager.java|borderStyle=solid}
// for UT purpose
protected void stopDispatcher() {
  AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher;
  asyncDispatcher.stop();
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
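For reference, a minimal sketch of the guarded version this comment proposes (assuming dispatcher can still be null because serviceInit() never ran when a sibling service's init failed) might look like:
{code:title=CommonNodeLabelsManager.java|borderStyle=solid}
// for UT purpose
protected void stopDispatcher() {
  AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher;
  // Guard: if serviceInit() never completed, the dispatcher was never
  // created, so stopping the RM's services would otherwise NPE here.
  if (asyncDispatcher != null) {
    asyncDispatcher.stop();
  }
}
{code}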
[jira] [Created] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
tangshangwen created YARN-4539:
--
Summary: CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
Key: YARN-4539
URL: https://issues.apache.org/jira/browse/YARN-4539
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
{noformat}
2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
java.io.IOException: Failed to initialize FairScheduler
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
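To see why stop can run against uninitialized state, here is a small self-contained illustration (the class and field names are hypothetical, not the actual Hadoop types): a field assigned only in init() is dereferenced by stop(), so stopping an instance whose init never completed raises the same NullPointerException, and a null guard avoids it.
{code:java}
import java.util.Timer;

// Hypothetical stand-in for a service whose stop() assumes init() ran.
public class LifecycleNpeDemo {
    private Timer dispatcher; // assigned only in init()

    void init() {
        dispatcher = new Timer();
    }

    void stop() {
        dispatcher.cancel(); // NPE if init() never ran
    }

    void safeStop() {
        if (dispatcher != null) { // the guard discussed in this issue
            dispatcher.cancel();
        }
    }

    public static void main(String[] args) {
        LifecycleNpeDemo svc = new LifecycleNpeDemo();
        svc.safeStop(); // fine: no-op when init() was skipped
        svc.stop();     // throws NullPointerException
    }
}
{code}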
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080424#comment-15080424 ] tangshangwen commented on YARN-4530: Hi [~rohithsharma], do I need to write a test case? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075855#comment-15075855 ] tangshangwen commented on YARN-4530: Hi [Rohith Sharma K S|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=rohithsharma], in this patch, if assoc is null we return directly; when completed.get() throws an ExecutionException, assoc will not be null, so I think this patch does not need a new test case.
{code:title=ResourceLocalizationService.java|borderStyle=solid}
try {
  if (null == assoc) {
    LOG.error("Localized unknown resource to " + completed);
    // TODO delete
    return;
  }
  Path local = completed.get();
  LocalResourceRequest key = assoc.getResource().getRequest();
  publicRsrc.handle(new ResourceLocalizedEvent(key, local, FileUtil
      .getDU(new File(local.toUri()))));
  assoc.getResource().unlock();
} catch (ExecutionException e) {
  LOG.info("Failed to download resource " + assoc.getResource(),
      e.getCause());
  LocalResourceRequest req = assoc.getResource().getRequest();
  publicRsrc.handle(new ResourceFailedLocalizationEvent(req,
      e.getMessage()));
  assoc.getResource().unlock();
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4530: --- Attachment: YARN-4530.1.patch I found that 2.7.1 has the same problem; I submitted a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4530: --- Affects Version/s: 2.7.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075711#comment-15075711 ] tangshangwen commented on YARN-4530: I think I can fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4506) Application was killed by a resourcemanager, In the JobHistory Can't see the job detail
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075140#comment-15075140 ] tangshangwen commented on YARN-4506: Ok, I'll try to fix it > Application was killed by a resourcemanager, In the JobHistory Can't see the > job detail > --- > > Key: YARN-4506 > URL: https://issues.apache.org/jira/browse/YARN-4506 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am.rar > > > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2015-12-15 03:08:54,074 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2015-12-15 03:08:54,074 INFO [eventHandlingThread] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue > take interrupted. Returning > 2015-12-15 03:08:54,078 WARN [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId > job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075106#comment-15075106 ] tangshangwen commented on YARN-4530: When assoc is null and completed.get() throws an ExecutionException, this will happen, right? Control jumps to the catch block before the null check runs, and the catch block dereferences assoc.getResource(), which is the NPE.
{code:title=ResourceLocalizationService.java|borderStyle=solid}
try {
  Future<Path> completed = queue.take();
  LocalizerResourceRequestEvent assoc = pending.remove(completed);
  try {
    Path local = completed.get();
    if (null == assoc) {
      LOG.error("Localized unkonwn resource to " + completed);
      // TODO delete
      return;
    }
    LocalResourceRequest key = assoc.getResource().getRequest();
    publicRsrc.handle(new ResourceLocalizedEvent(key, local, FileUtil
        .getDU(new File(local.toUri()))));
    assoc.getResource().unlock();
  } catch (ExecutionException e) {
    LOG.info("Failed to download rsrc " + assoc.getResource(),
        e.getCause());
    LocalResourceRequest req = assoc.getResource().getRequest();
    publicRsrc.handle(new ResourceFailedLocalizationEvent(req,
        e.getMessage()));
    assoc.getResource().unlock();
  } catch (CancellationException e) {
    // ignore; shutting down
  }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
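A compact, self-contained sketch of that sequence (hypothetical names; plain JDK types stand in for the localizer internals). The download Future fails, the pending-map lookup returns null, and it is the method call on the null assoc inside the catch block, not the download failure itself, that raises the NPE:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class LocalizerNpeDemo {
    public static void main(String[] args) {
        // A download task that fails, like the "changed on src filesystem" case.
        FutureTask<String> completed = new FutureTask<>((Callable<String>) () -> {
            throw new java.io.IOException("Resource changed on src filesystem");
        });
        completed.run(); // runs the task; the IOException is captured

        // pending.remove(completed) returns null when the request is not tracked.
        Map<FutureTask<String>, Object> pending = new HashMap<>();
        Object assoc = pending.remove(completed);

        try {
            completed.get(); // throws ExecutionException wrapping the IOException
        } catch (ExecutionException e) {
            // Mirrors the old catch block: a method call on the null assoc
            // (assoc.getResource() in the real code) is what throws the NPE.
            System.out.println("Failed to download rsrc " + assoc.toString());
        }
    }
}
{code}
Moving the null check ahead of completed.get(), as the attached patch does, means the catch block can only be reached with a non-null assoc.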
[jira] [Created] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
tangshangwen created YARN-4530:
--
Summary: LocalizedResource trigger a NPE Cause the NodeManager exit
Key: YARN-4530
URL: https://issues.apache.org/jira/browse/YARN-4530
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.2.0
Reporter: tangshangwen

In our cluster, I found that a failed LocalizedResource download triggered an NPE, causing the NodeManager to shut down.
{noformat}
2015-12-29 17:18:33,706 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns3:8020/user/username/projects/user_insight/lookalike/oozie/workflow/conf/hive-site.xml transitioned from DOWNLOADING to FAILED
2015-12-29 17:18:33,708 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/user_insight_pig_udf-0.0.1-SNAPSHOT-jar-with-dependencies.jar, 1451380519635, FILE, null }
2015-12-29 17:18:33,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/unilever_support_udf-0.0.1-SNAPSHOT.jar, 1451380519452, FILE, null },pending,[(container_1451039893865_261670_01_000578)],42332661980495938,DOWNLOADING}
java.io.IOException: Resource hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/unilever_support_udf-0.0.1-SNAPSHOT.jar changed on src filesystem (expected 1451380519452, was 1451380611793)
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:276)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:50)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2015-12-29 17:18:33,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/unilever_support_udf-0.0.1-SNAPSHOT.jar transitioned from DOWNLOADING to FAILED
2015-12-29 17:18:33,710 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting down
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
2015-12-29 17:18:33,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
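The IOException above comes from the localizer's freshness check: the modification time recorded at job submission must still match the file when it is localized. A hedged sketch of that comparison using the public FileSystem API (the path and timestamp are illustrative, and hadoop-common must be on the classpath):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class FreshnessCheckDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://ns3/tmp/example-udf.jar"); // illustrative path
        long expected = 1451380519452L; // timestamp recorded at submission

        FileStatus status = src.getFileSystem(conf).getFileStatus(src);
        if (status.getModificationTime() != expected) {
            // The condition behind "changed on src filesystem": the jar was
            // overwritten between job submission and container localization.
            throw new java.io.IOException("Resource " + src
                + " changed on src filesystem (expected " + expected
                + ", was " + status.getModificationTime() + ")");
        }
    }
}
{code}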
[jira] [Commented] (YARN-4506) Application was killed by a resourcemanager, In the JobHistory Can't see the job detail
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071325#comment-15071325 ] tangshangwen commented on YARN-4506: I found that when the MRAppMaster received a signal, the thread did not copy job_ID.jhist to /user/history/done_intermediate; see my am.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
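One way to confirm this from the history side is to list the user's intermediate-done directory and look for the job's .jhist file. This is a hedged sketch: the directory follows the layout this comment mentions, the actual location is governed by mapreduce.jobhistory.intermediate-done-dir, and "username" is a placeholder:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class JhistCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Layout from this comment; "username" is a placeholder.
        Path dir = new Path("/user/history/done_intermediate/username");
        for (FileStatus st : dir.getFileSystem(conf).listStatus(dir)) {
            if (st.getPath().getName().endsWith(".jhist")) {
                System.out.println("found history file: " + st.getPath());
            }
        }
    }
}
{code}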
[jira] [Commented] (YARN-4506) Application was killed by a resourcemanager, In the JobHistory Can't see the job detail
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071119#comment-15071119 ] tangshangwen commented on YARN-4506: I'm sure it happened in 2.2, because I found the AM was killed by the RM and I can't find the job in the JobHistory.
2015-12-15 02:56:48,916 INFO [main] org.mortbay.log: Extract jar:file:/software/servers/hadoop-2.2.0/share/hadoop/yarn/hadoop-yarn-common-2.2.0.jar!/webapps/mapreduce
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4506) Application killed by the ResourceManager: can't see the job detail in the JobHistory
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4506: --- Attachment: am.rar I uploaded my am.log. > Application was killed by a resourcemanager, In the JobHistory Can't see the > job detail > --- > > Key: YARN-4506 > URL: https://issues.apache.org/jira/browse/YARN-4506 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am.rar > > > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2015-12-15 03:08:54,074 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2015-12-15 03:08:54,074 INFO [eventHandlingThread] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue > take interrupted. Returning > 2015-12-15 03:08:54,078 WARN [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId > job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4507) Application killed by the ResourceManager: can't see the job detail in the JobHistory
[ https://issues.apache.org/jira/browse/YARN-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4507: --- Description: When the AppMaster was killed by the RM, we can't see the job detail in the JobHistory. This is my log. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close was: 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close > Application was killed by a resourcemanager, In the JobHistory Can't see the > job detail > --- > > Key: YARN-4507 > URL: https://issues.apache.org/jira/browse/YARN-4507 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > > when the AppMaster was killed by RM, we can't see the job detail in > jobhistory,this is my log. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal.
Signaling RMCommunicator and JobHistoryEventHandler. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2015-12-15 03:08:54,074 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2015-12-15 03:08:54,074 INFO [eventHandlingThread] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue > take interrupted. Returning > 2015-12-15 03:08:54,078 WARN [Thread-1]
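The sequence in these logs is the MRAppMaster's JVM shutdown hook at work: it flags the last AM retry, tells the JobHistoryEventHandler to force job completion, and drains the outstanding event queue before exit. As a rough illustration of that contract only (the class and method names below are simplified stand-ins, not the real JobHistoryEventHandler API), the shutdown path has to both drain the queue and close and move the history file; skipping the second step is exactly what leaves a job invisible in the JobHistory:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the shutdown-time contract, not MRAppMaster code.
public class HistoryEventDrain {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
  private volatile boolean forceJobCompletion = false;

  // Mirrors "JobHistoryEventHandler notified that forceJobCompletion is true".
  public void notifyIsLastAMRetry(boolean isLastRetry) {
    this.forceJobCompletion = isLastRetry;
  }

  // Called from the shutdown hook ("Stopping JobHistoryEventHandler...").
  public void stop() {
    System.out.println("Stopping. Outstanding queue size is " + queue.size());
    String event;
    while ((event = queue.poll()) != null) {
      write(event); // flush remaining events into the open .jhist file
    }
    if (forceJobCompletion) {
      // Without this step the .jhist file is never closed and moved to
      // done_intermediate, and the JobHistory server never sees the job.
      closeAndMoveHistoryFile();
    }
  }

  private void write(String event) { /* append to the open .jhist file */ }
  private void closeAndMoveHistoryFile() { /* close, then move to done dir */ }
}
{code}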
[jira] [Created] (YARN-4507) Application killed by the ResourceManager: can't see the job detail in the JobHistory
tangshangwen created YARN-4507: -- Summary: Application killed by the ResourceManager: can't see the job detail in the JobHistory Key: YARN-4507 URL: https://issues.apache.org/jira/browse/YARN-4507 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: tangshangwen 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4506) Application killed by the ResourceManager: can't see the job detail in the JobHistory
tangshangwen created YARN-4506: -- Summary: Application killed by the ResourceManager: can't see the job detail in the JobHistory Key: YARN-4506 URL: https://issues.apache.org/jira/browse/YARN-4506 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: tangshangwen 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: (was: am105361log.tar.gz) > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: am105361log.tar.gz I uploaded another AM log. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am105361log.tar.gz, logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066331#comment-15066331 ] tangshangwen commented on YARN-4324: I found this stack in the jstack output. Is this a JDK epollWait bug? "IPC Client (2118999553) connection to RMHost:8030 from UserName" daemon prio=10 tid=0x7f298c664000 nid=0x6e2d runnable [0x7f297d9a8000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked <0x000785182940> (a sun.nio.ch.Util$2) - locked <0x000785182930> (a java.util.Collections$UnmodifiableSet) - locked <0x000785182718> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:457) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked <0x00078023aa40> (a java.io.BufferedInputStream) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:995) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891) > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
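A thread parked in epollWait under Client$Connection.run is also the IPC client's normal idle state while it waits for the next RPC response, so this stack alone does not prove the JDK epoll spin bug; what distinguishes a spin from a legitimate wait is CPU consumption across successive dumps. A small sketch using the standard JMX thread API, meant to be run inside the suspect JVM (the class name and the one-second threshold are arbitrary choices, not anything from Hadoop):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: flags threads that burn CPU while apparently waiting.
public class EpollSpinCheck {
  public static void main(String[] args) throws Exception {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long[] ids = mx.getAllThreadIds();
    long[] before = new long[ids.length];
    for (int i = 0; i < ids.length; i++) {
      before[i] = mx.getThreadCpuTime(ids[i]); // -1 if the thread has died
    }
    Thread.sleep(5000L);
    for (int i = 0; i < ids.length; i++) {
      long after = mx.getThreadCpuTime(ids[i]);
      if (before[i] < 0 || after < 0) {
        continue; // thread exited between samples
      }
      long deltaNs = after - before[i];
      // A selector thread genuinely blocked in epollWait uses almost no CPU;
      // more than 1s of CPU in a 5s window suggests a busy spin instead.
      if (deltaNs > 1000000000L) {
        ThreadInfo info = mx.getThreadInfo(ids[i]);
        if (info != null) {
          System.out.println("busy: " + info.getThreadName() + " cpuNs=" + deltaNs);
        }
      }
    }
  }
}
{code}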
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057690#comment-15057690 ] tangshangwen commented on YARN-4324: I found the RMContainerAllocator's last contact with the RM in the AM logs, and at that point it had not yet scheduled any reducers: 2015-12-15 02:57:39,893 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:732 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:5773 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:8995 ContRel:3222 HostLocal:5310 RackLocal:338 Then the AM received a kill signal: 2015-12-15 03:01:29,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1449835724839_219910_m_001345_1 TaskAttempt Transitioned from NEW to UNASSIGNED 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. I guess the AM did not send a heartbeat to the RM for 10 minutes. The RM logs roll too fast; I will try to get the RM logs and update. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
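For context, the 10-minute window matches the RM-side AM liveness monitor: yarn.am.liveness-monitor.expiry-interval-ms defaults to 600000 ms, and an AM attempt that does not heartbeat within that window is expired and killed. If the allocator thread can legitimately stall that long, raising the interval in yarn-site.xml is only a stopgap while the root cause is found; the value below is just an example:

{code:java}
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <!-- Default is 600000 (10 min); doubled here purely as a mitigation. -->
  <value>1200000</value>
</property>
{code}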
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057601#comment-15057601 ] tangshangwen commented on YARN-4324: Thank you for your attention. I have already uploaded the AM logs. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: logs.rar I uploaded the new jstack and AM logs. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: yarn-nodemanager-dumpam.log > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057380#comment-15057380 ] tangshangwen commented on YARN-4324: Because the job failure is random, I dumped the AM's jstack and pstack at the moment the AM transitioned from RUNNING to KILLING, and I uploaded my log. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
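Capturing that dump automatically is also possible: a shutdown hook can shell out to jstack against the AM's own pid while the SIGTERM is being handled (it cannot help once the RM escalates to SIGKILL). A minimal sketch, assuming a HotSpot JVM where RuntimeMXBean.getName() returns "pid@hostname" and Java 7 for ProcessBuilder redirection; the class name and output path are made up:

{code:java}
import java.io.File;
import java.lang.management.ManagementFactory;

// Hypothetical helper: self-jstack on shutdown. Install early in the AM.
public class DumpOnKill {
  public static void install() {
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      public void run() {
        try {
          // On HotSpot the runtime name is "pid@hostname".
          String pid = ManagementFactory.getRuntimeMXBean()
              .getName().split("@")[0];
          ProcessBuilder pb = new ProcessBuilder("jstack", pid);
          pb.redirectErrorStream(true);
          pb.redirectOutput(new File("/tmp/am-" + pid + ".jstack"));
          pb.start().waitFor(); // best effort; the JVM is already exiting
        } catch (Exception e) {
          // Ignore: a failed dump must not block shutdown.
        }
      }
    }));
  }
}
{code}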
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Description: These are my logs: 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1446203652278_135526_m_001777_1 TaskAttempt Transitioned from UNASSIGNED to KILLED 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2015-11-02 01:24:15,851 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-11-02 01:24:15,851 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-11-02 01:24:15,851 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true The Hive map phase ran to 100%, then dropped back to 0%, and the job failed! > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4324) AM that hung for more than 10 min was killed by the RM
tangshangwen created YARN-4324: -- Summary: AM that hung for more than 10 min was killed by the RM Key: YARN-4324 URL: https://issues.apache.org/jira/browse/YARN-4324 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: tangshangwen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4099) Container whose resource localization took more than 10 min was killed
[ https://issues.apache.org/jira/browse/YARN-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724949#comment-14724949 ] tangshangwen commented on YARN-4099: 2015-08-28 15:10:37,434 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data4/yarn1/local/nmPrivate/container_1440160718082_401272_01_01.tokens. Credentials list: 2015-08-28 15:22:02,578 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for > Container LocalizedResource more than 10min was kill > > > Key: YARN-4099 > URL: https://issues.apache.org/jira/browse/YARN-4099 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 > Environment: centos 6.5 > datanode 1500+ >Reporter: tangshangwen > > Container LocalizedResource more than 10min was kill,this is AM nodemanager > log: > 82_401272/libjars/UDFGetUserAgent.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/IndexChange.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/UserAgentUtils-1.8.jar transitioned from INIT to > DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/UDFGetEndTime.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/HexadecimalGB.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1440160718082 > _401272_01_01 > 2015-08-28 15:10:37,434 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file /da > ta4/yarn1/local/nmPrivate/container_1440160718082_401272_01_01.tokens. > Credentials list: > 2015-08-28 15:22:02,578 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for appattempt_1440160718082_401272_01 (auth:SIMPLE) > 2015-08-28 15:22:02,580 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for appattempt_1440160718082_401272_0 > 1 (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.api.ContainerManagementProtocolPB > 2015-08-28 15:22:02,580 INFO org.apache.hadoop.yarn.s -- This message was sent by Atlassian JIRA (v6.3.4#6332)
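The two lines quoted above bracket the stall: the NodeManager wrote the container's tokens at 15:10:37 and nothing happened until 15:22:02, so the localizer spent over eleven minutes downloading the job's libjars. One way to test whether slow HDFS reads explain the gap is to time a download of one of the same jars from a worker node; a minimal sketch with the FileSystem client API, where the class name is made up and the jar path is whichever one from the log you pass in:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical diagnostic: time a single resource download outside the NM.
public class TimeLocalize {
  public static void main(String[] args) throws Exception {
    // args[0]: an HDFS path from the log, e.g. one of the .staging libjars
    Path src = new Path(args[0]);
    Path dst = new Path("/tmp/" + src.getName());
    FileSystem fs = FileSystem.get(src.toUri(), new Configuration());
    long start = System.currentTimeMillis();
    fs.copyToLocalFile(src, dst);
    System.out.println("downloaded " + src + " in "
        + (System.currentTimeMillis() - start) + " ms");
  }
}
{code}

If the copy is fast, the bottleneck is more likely inside the localizer on that node (for example, disk or lock contention) than in HDFS itself.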
[jira] [Created] (YARN-4099) Container whose resource localization took more than 10 min was killed
tangshangwen created YARN-4099: -- Summary: Container whose resource localization took more than 10 min was killed Key: YARN-4099 URL: https://issues.apache.org/jira/browse/YARN-4099 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: centos 6.5 datanode 1500+ Reporter: tangshangwen A container whose resource localization took more than 10 min was killed. This is the NodeManager log from the AM's node: 82_401272/libjars/UDFGetUserAgent.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/IndexChange.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/UserAgentUtils-1.8.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/UDFGetEndTime.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/HexadecimalGB.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1440160718082_401272_01_01 2015-08-28 15:10:37,434 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data4/yarn1/local/nmPrivate/container_1440160718082_401272_01_01.tokens. Credentials list: 2015-08-28 15:22:02,578 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1440160718082_401272_01 (auth:SIMPLE) 2015-08-28 15:22:02,580 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1440160718082_401272_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB 2015-08-28 15:22:02,580 INFO org.apache.hadoop.yarn.s -- This message was sent by Atlassian JIRA (v6.3.4#6332)