[jira] [Commented] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535626#comment-16535626 ] tangshangwen commented on YARN-8496:

I'll update a patch later.

> The capacity scheduler uses label to cause vcore to be incorrect
> ----------------------------------------------------------------
>                 Key: YARN-8496
>                 URL: https://issues.apache.org/jira/browse/YARN-8496
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.7.6
>            Reporter: tangshangwen
>            Assignee: tangshangwen
>            Priority: Major
>        Attachments: yarn-bug.png
>
> In my cluster, I used label scheduling, and I found that it caused the vcore
> count of the cluster to be incorrect.
>
> capacity-scheduler.xml
> {code:xml}
> <configuration>
>   <property>
>     <name>yarn.scheduler.capacity.root.queues</name>
>     <value>support</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.support.capacity</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.support.accessible-node-labels</name>
>     <value>test1</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.support.accessible-node-labels.test1.capacity</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.root.accessible-node-labels.test1.capacity</name>
>     <value>100</value>
>   </property>
> </configuration>
> {code}
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Component/s: (was: resourcemanager)
             capacity scheduler
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used label scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used label scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: (was: image-2018-07-05-18-29-32-697.png)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used label scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Commented] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533497#comment-16533497 ] tangshangwen commented on YARN-8496:

I think it's important to check that the node's available resources meet the minimum resource allocation:

{code:java}
// ParentQueue.java
@Override
public synchronized CSAssignment assignContainers(Resource clusterResource,
    FiCaSchedulerNode node, ResourceLimits resourceLimits) {
  CSAssignment assignment =
      new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
  Set<String> nodeLabels = node.getLabels();
  // Skip the node when even the minimum allocation no longer fits into its
  // available resources (in both the memory and the vcores dimensions).
  if (!Resources.fitsIn(minimumAllocation, node.getAvailableResource())) {
    return assignment;
  }
  ...
}
{code}
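For context on the guard above: Resources.fitsIn compares every resource dimension, so a node is rejected once either memory or vcores are exhausted. A minimal standalone sketch (not part of the proposed patch; the resource values are made up):

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class FitsInDemo {
  public static void main(String[] args) {
    // 1 GB / 1 vcore minimum allocation.
    Resource minimumAllocation = Resources.createResource(1024, 1);
    // A node with plenty of memory left but all vcores already allocated.
    Resource available = Resources.createResource(8192, 0);
    // Prints false: the vcores dimension does not fit, so the scheduler
    // should not assign another container to this node.
    System.out.println(Resources.fitsIn(minimumAllocation, available));
  }
}
{code}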
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: image-2018-07-05-18-29-32-697.png
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-29-32-697.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: (was: image-2018-07-05-18-16-10-851.png)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: yarn-bug.png
Description: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

capacity-scheduler.xml (unchanged, as quoted above)

was: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: In my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Attachment: image-2018-07-05-18-16-10-851.png
[jira] [Updated] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
[ https://issues.apache.org/jira/browse/YARN-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-8496:
---
Description: I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-16-10-851.png!

capacity-scheduler.xml (unchanged, as quoted above)

was: I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-12-45-837.png!

capacity-scheduler.xml (unchanged, as quoted above)
[jira] [Created] (YARN-8496) The capacity scheduler uses label to cause vcore to be incorrect
tangshangwen created YARN-8496:
---
Summary: The capacity scheduler uses label to cause vcore to be incorrect
Key: YARN-8496
URL: https://issues.apache.org/jira/browse/YARN-8496
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.6
Reporter: tangshangwen
Assignee: tangshangwen

I n my cluster, I used tag scheduling, and I found that it caused the vcore count of the cluster to be incorrect.

!image-2018-07-05-18-12-45-837.png!

capacity-scheduler.xml
{code:xml}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>support</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.support.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.support.accessible-node-labels</name>
    <value>test1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.support.accessible-node-labels.test1.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.test1.capacity</name>
    <value>100</value>
  </property>
</configuration>
{code}
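As background for reproducing the setup above: the label a queue references must already be registered in the cluster and mapped onto nodes. With the node-labels feature enabled, that is typically done with rmadmin (the host name below is illustrative, not from the report):

{noformat}
yarn rmadmin -addToClusterNodeLabels "test1"
yarn rmadmin -replaceLabelsOnNode "host1.example.com=test1"
{noformat}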
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657056#comment-15657056 ] tangshangwen commented on YARN-5795:

Hi [~templedf], would you like to review the patch, or give me some pointers on the next step?

> FairScheduler set AppMaster vcores didn't work
> ----------------------------------------------
>                 Key: YARN-5795
>                 URL: https://issues.apache.org/jira/browse/YARN-5795
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: tangshangwen
>            Assignee: tangshangwen
>        Attachments: 0001-YARN-5795.patch
[jira] [Updated] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5795:
---
Attachment: 0001-YARN-5795.patch
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614804#comment-15614804 ] tangshangwen commented on YARN-5795:

Hi [~kasha], I think DefaultResourceCalculator#normalize should preserve the vcores of the original request:

{code:title=DefaultResourceCalculator.java|borderStyle=solid}
@Override
public Resource normalize(Resource r, Resource minimumResource,
    Resource maximumResource, Resource stepFactor) {
  int normalizedMemory = Math.min(
      roundUp(
          Math.max(r.getMemory(), minimumResource.getMemory()),
          stepFactor.getMemory()),
      maximumResource.getMemory());
  // createResource(int) defaults vcores to 1, dropping the requested value.
  return Resources.createResource(normalizedMemory);
}
{code}

change to

{code:title=DefaultResourceCalculator.java|borderStyle=solid}
@Override
public Resource normalize(Resource r, Resource minimumResource,
    Resource maximumResource, Resource stepFactor) {
  int normalizedMemory = Math.min(
      roundUp(
          Math.max(r.getMemory(), minimumResource.getMemory()),
          stepFactor.getMemory()),
      maximumResource.getMemory());
  // Keep the vcores that were actually requested.
  return Resources.createResource(normalizedMemory, r.getVirtualCores());
}
{code}
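To make the effect concrete, here is a standalone sketch (not part of the patch; the resource values are made up) showing the unpatched normalize dropping the requested vcores:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NormalizeDemo {
  public static void main(String[] args) {
    DefaultResourceCalculator calc = new DefaultResourceCalculator();
    Resource request = Resources.createResource(1500, 2); // AM asks for 2 vcores
    Resource min  = Resources.createResource(1024, 1);
    Resource max  = Resources.createResource(8192, 32);
    Resource step = Resources.createResource(1024, 1);
    Resource normalized = calc.normalize(request, min, max, step);
    // Unpatched: memory is rounded up to 2048, but vcores are silently
    // reset to 1, so the AM container never gets its second vcore.
    System.out.println(normalized); // <memory:2048, vCores:1>
  }
}
{code}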
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614196#comment-15614196 ] tangshangwen commented on YARN-5795:

The allocate method normalizes container requests with DOMINANT_RESOURCE_CALCULATOR; the AppMaster request should be normalized the same way.

{code:title=FairScheduler.java|borderStyle=solid}
@Override
public Allocation allocate(ApplicationAttemptId appAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {

  // Make sure this application exists
  FSAppAttempt application = getSchedulerApp(appAttemptId);
  if (application == null) {
    LOG.info("Calling allocate on removed "
        + "or non existent application " + appAttemptId);
    return EMPTY_ALLOCATION;
  }

  // Sanity check
  SchedulerUtils.normalizeRequests(ask, DOMINANT_RESOURCE_CALCULATOR,
      clusterResource, minimumAllocation, getMaximumResourceCapability(),
      incrAllocation);
  ...
}
{code}
[jira] [Issue Comment Deleted] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5795:
---
Comment: was deleted

(was: I think if we replace RESOURCE_CALCULATOR with DOMINANT_RESOURCE_CALCULATOR we should be able to fix the problem
{code:title=FairScheduler.java|borderStyle=solid}
@Override
public ResourceCalculator getResourceCalculator() {
  // return RESOURCE_CALCULATOR;
  return DOMINANT_RESOURCE_CALCULATOR;
}
{code})
[jira] [Commented] (YARN-5795) FairScheduler set AppMaster vcores didn't work
[ https://issues.apache.org/jira/browse/YARN-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614149#comment-15614149 ] tangshangwen commented on YARN-5795:

I think if we replace RESOURCE_CALCULATOR with DOMINANT_RESOURCE_CALCULATOR, we should be able to fix the problem:

{code:title=FairScheduler.java|borderStyle=solid}
@Override
public ResourceCalculator getResourceCalculator() {
  // return RESOURCE_CALCULATOR;
  return DOMINANT_RESOURCE_CALCULATOR;
}
{code}
[jira] [Created] (YARN-5795) FairScheduler set AppMaster vcores didn't work
tangshangwen created YARN-5795:
---
Summary: FairScheduler set AppMaster vcores didn't work
Key: YARN-5795
URL: https://issues.apache.org/jira/browse/YARN-5795
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, we use the Linux container. I wanted to increase the number of CPUs to get more CPU time slices, but it did not take effect. I set yarn.app.mapreduce.am.resource.cpu-vcores = 2, but I found this in the resourcemanager log:

{noformat}
[2016-10-27T16:36:37.280 08:00] [INFO] resourcemanager.scheduler.SchedulerNode.allocateContainer(SchedulerNode.java 153) [ResourceManager Event Processor] : Assigned container container_1477059529836_336635_01_01 of capacity
{noformat}

This is because scheduler.getResourceCalculator() only computes memory:

{code:title=RMAppManager.java|borderStyle=solid}
private ResourceRequest validateAndCreateResourceRequest(
    ApplicationSubmissionContext submissionContext, boolean isRecovery)
    throws InvalidResourceRequestException {
  ...
  SchedulerUtils.normalizeRequest(amReq,
      scheduler.getResourceCalculator(),
      scheduler.getClusterResource(),
      scheduler.getMinimumResourceCapability(),
      scheduler.getMaximumResourceCapability(),
      scheduler.getMinimumResourceCapability());
  ...
}
{code}
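For reference, the setting that failed to take effect is the per-job MapReduce AM resource request; it would typically be set in mapred-site.xml or the job configuration like this (an illustrative snippet, not taken from the report):

{code:xml}
<property>
  <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
  <value>2</value>
</property>
{code}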
[jira] [Updated] (YARN-5136) Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
[ https://issues.apache.org/jira/browse/YARN-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5136:
---
Assignee: Wilfred Spiegelenburg (was: tangshangwen)

> Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> ------------------------------------------------------------------
>                 Key: YARN-5136
>                 URL: https://issues.apache.org/jira/browse/YARN-5136
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: tangshangwen
>            Assignee: Wilfred Spiegelenburg
>
> Moving an app caused the RM to exit:
> {noformat}
> 2016-05-24 23:20:47,202 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@ea94c3b does not exist in queue [root.bdp_xx.bdp_mart_xx_formal, demand=<...>, running=<..., vCores:13422>, share=<...>, w=<weight=1.0>]
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:119)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:779)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1231)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e04_1464073905025_15410_01_001759 Container Transitioned from ACQUIRED to RELEASED
> 2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}
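From the stack trace, the fatal path is FSLeafQueue.removeApp throwing when the attempt is not found in the queue's app lists, which propagates out of the scheduler event dispatcher and kills the RM. A rough, self-contained paraphrase of that guard (reconstructed from the trace for illustration, not verbatim 2.7.1 source):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: after a queue move, the APP_ATTEMPT_REMOVED event may
// still point at the old queue, where the attempt no longer exists.
class LeafQueueSketch<A> {
  private final List<A> runnableApps = new ArrayList<>();
  private final List<A> nonRunnableApps = new ArrayList<>();

  void removeApp(A app) {
    if (!runnableApps.remove(app) && !nonRunnableApps.remove(app)) {
      // Unguarded throw: escapes the event dispatcher, so the RM exits.
      throw new IllegalStateException(
          "Given app to remove " + app + " does not exist in queue " + this);
    }
  }
}
{code}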
[jira] [Commented] (YARN-5136) Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
[ https://issues.apache.org/jira/browse/YARN-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532199#comment-15532199 ] tangshangwen commented on YARN-5136:

[~wilfreds] OK
[jira] [Commented] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426631#comment-15426631 ] tangshangwen commented on YARN-5535:

I'm sorry, it is after recovery, and I found the event queue grew very large:

{noformat}
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 643000
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 644000
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 645000
[2016-08-12T19:43:25.986+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 646000
{noformat}

> Remove RMDelegationToken make resourcemanager recovery very slow
> ----------------------------------------------------------------
>                 Key: YARN-5535
>                 URL: https://issues.apache.org/jira/browse/YARN-5535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: tangshangwen
>            Assignee: tangshangwen
>
> In our cluster, I found that when restarting the RM, recovery is very slow. This is my log:
> {noformat}
> [2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
> [2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
> ... (the same removeStoredToken / Removing RMDelegationToken pair repeats for sequence numbers 737878 down to 737873, one token at a time) ...
> [2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
> [2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackRes
> {noformat}
[jira] [Commented] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426107#comment-15426107 ] tangshangwen commented on YARN-5535:

Thanks [~sunilg] for the comments. I think removing the RMDelegationToken and SequenceNumber entries may take a long time, so the dispatcher can't handle other events:

{code:title=ZKRMStateStore.java|borderStyle=solid}
@Override
protected synchronized void removeRMDelegationTokenState(
    RMDelegationTokenIdentifier rmDTIdentifier) throws Exception {
  String nodeRemovePath =
      getNodePath(delegationTokensRootPath, DELEGATION_TOKEN_PREFIX
          + rmDTIdentifier.getSequenceNumber());
  if (LOG.isDebugEnabled()) {
    LOG.debug("Removing RMDelegationToken_"
        + rmDTIdentifier.getSequenceNumber());
  }
  // One synchronous ZooKeeper exists() plus delete() per token.
  if (existsWithRetries(nodeRemovePath, false) != null) {
    ArrayList<Op> opList = new ArrayList<Op>();
    opList.add(Op.delete(nodeRemovePath, -1));
    doDeleteMultiWithRetries(opList);
  } else {
    LOG.debug("Attempted to delete a non-existing znode " + nodeRemovePath);
  }
}
{code}
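Since each removal above is a separate synchronous ZooKeeper round trip inside the state-store dispatcher, one direction worth exploring is batching many expired-token deletions into a single multi() call. A rough sketch of the idea (a hypothetical helper, not existing ZKRMStateStore code; the znode paths are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

public class BatchedTokenCleanup {
  // Hypothetical helper: delete many token znodes in one request instead of
  // one synchronous round trip per token.
  static void deleteTokens(ZooKeeper zk, List<Integer> sequenceNumbers,
      String tokensRootPath) throws Exception {
    List<Op> ops = new ArrayList<>();
    for (int seq : sequenceNumbers) {
      // -1 means "any version", matching the Op.delete usage above.
      ops.add(Op.delete(tokensRootPath + "/RMDelegationToken_" + seq, -1));
    }
    zk.multi(ops); // a single round trip for the whole batch
  }
}
{code}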
[jira] [Updated] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5535:
---
Description: In our cluster, I found that when restarting the RM, recovery is very slow. This is my log:

{noformat}
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737878
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737877
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737876
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737875
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737874
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737873
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackResolver.coreResolve(RackResolver.java:109) [IPC Server handler 0 on 8031] : Resolved -7056.hadoop.xxx.local to /rack/rack5118
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737872
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.570+08:00] [INFO] server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:343) [IPC Server handler 0 on 8031] : NodeManager from node x-7056.hadoop.xxx.local(cmPort: 50086 httpPort: 8042) registered with capability: <...>, assigned nodeId xx-7056.hadoop.xxx.local:50086
[2016-08-12T19:43:21.572+08:00] [INFO] resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:424) [AsyncDispatcher event handler] : xx-7056.hadoop.xxx.local:50086 Node Transitioned from NEW to RUNNING
[2016-08-12T19:43:21.576+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 1000
[2016-08-12T19:43:21.577+08:00] [INFO] scheduler.
{noformat}
[jira] [Updated] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
[ https://issues.apache.org/jira/browse/YARN-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5535:
---
Description: In our cluster, I found that RM recovery is very slow when the RM is restarted; here is my log:
{noformat}
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737878
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737877
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737876
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737875
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737874
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737873
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackResolver.coreResolve(RackResolver.java:109) [IPC Server handler 0 on 8031] : Resolved -7056.hadoop.jd.local to /rack/rack5118
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737872
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.570+08:00] [INFO] server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:343) [IPC Server handler 0 on 8031] : NodeManager from node x-7056.hadoop.jd.local(cmPort: 50086 httpPort: 8042) registered with capability: , assigned nodeId xx-7056.hadoop.jd.local:50086
[2016-08-12T19:43:21.572+08:00] [INFO] resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:424) [AsyncDispatcher event handler] : xx-7056.hadoop.jd.local:50086 Node Transitioned from NEW to RUNNING
[2016-08-12T19:43:21.576+08:00] [INFO] yarn.event.AsyncDispatcher.handle(AsyncDispatcher.java:235) [AsyncDispatcher event handler] : Size of event-queue is 1000
[2016-08-12T19:43:21.577+08:00] [INFO] scheduler.fair
{noformat}
[jira] [Created] (YARN-5535) Remove RMDelegationToken make resourcemanager recovery very slow
tangshangwen created YARN-5535:
--
Summary: Remove RMDelegationToken make resourcemanager recovery very slow
Key: YARN-5535
URL: https://issues.apache.org/jira/browse/YARN-5535
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found that RM recovery is very slow when the RM is restarted; here is my log:
{noformat}
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737879
[2016-08-12T19:43:21.478+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737878
[2016-08-12T19:43:21.486+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737877
[2016-08-12T19:43:21.494+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737876
[2016-08-12T19:43:21.503+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737875
[2016-08-12T19:43:21.519+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.533+08:00] [INFO] security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:148) [Socket Reader #1 for port 8031] : Authorization successful for yarn (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737874
[2016-08-12T19:43:21.536+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737873
[2016-08-12T19:43:21.553+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.568+08:00] [INFO] yarn.util.RackResolver.coreResolve(RackResolver.java:109) [IPC Server handler 0 on 8031] : Resolved BJHC-Jmartad-7056.hadoop.jd.local to /rack/rack5118
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.security.RMDelegationTokenSecretManager.removeStoredToken(RMDelegationTokenSecretManager.java:136) [Thread[Thread-26,5,main]] : removing RMDelegation token with sequence number: 737872
[2016-08-12T19:43:21.569+08:00] [INFO] resourcemanager.recovery.RMStateStore.transition(RMStateStore.java:320) [Thread[Thread-26,5,main]] : Removing RMDelegationToken and SequenceNumber
[2016-08-12T19:43:21.570+08:00] [INFO] server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:343) [IPC Server handler 0 on 8031] : NodeManager from node BJHC-Jmartad-7056.hadoop.jd.local(cmPort: 50086 httpPort: 8042) registered with capability: , assigned nodeId BJHC-Jmartad-7056.hadoop.jd.local:50086
[2016-08-12T19:43:21.572+08:00] [INFO] resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:424) [AsyncDispatcher event handler] : BJHC-Jmartad-7056.hadoop.jd.local:50086 Node Transitioned from NEW to RUNNING
{noformat}
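[Editor's note] The slow recovery above comes from the state store performing one synchronous removal per expired RMDelegationToken while recovery waits. A minimal sketch of the usual remedy, coalescing removals into batches on a background thread; the class and method names here (BatchingTokenRemover, removeTokensBatch) are hypothetical, not the actual RMStateStore API:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: coalesce per-token removals into batches so
// recovery is not blocked on one state-store round trip per token.
public class BatchingTokenRemover implements Runnable {
    private static final int MAX_BATCH = 100;
    private final BlockingQueue<Long> pending = new LinkedBlockingQueue<>();

    // Called from the recovery path instead of a synchronous store delete.
    public void scheduleRemoval(long tokenSequenceNumber) {
        pending.add(tokenSequenceNumber);
    }

    @Override
    public void run() {
        List<Long> batch = new ArrayList<>(MAX_BATCH);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(pending.take());             // block for the first token
                pending.drainTo(batch, MAX_BATCH - 1); // grab whatever else is queued
                removeTokensBatch(batch);              // one store operation per batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder for a batched state-store delete (for a ZooKeeper-backed
    // store this would be a single multi-operation).
    private void removeTokensBatch(List<Long> sequenceNumbers) {
        System.out.println("removing " + sequenceNumbers.size() + " tokens in one store call");
    }
}
{code}
The design point is simply that recovery only enqueues; the expensive round trips are amortized over whole batches off the critical path.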
[jira] [Resolved] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-5482.
Resolution: Duplicate

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen reopened YARN-5482:

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-5482.
Resolution: Fixed

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411603#comment-15411603 ] tangshangwen commented on YARN-5482:
Thanks [~bibinchundatt]

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5482:
---
Attachment: oom2.png
            oom1.png

> ContainerMetric Lead to memory leaks
>
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
>
> In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5482) ContainerMetric Lead to memory leaks
tangshangwen created YARN-5482:
--
Summary: ContainerMetric Lead to memory leaks
Key: YARN-5482
URL: https://issues.apache.org/jira/browse/YARN-5482
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I often find the NodeManager going OOM. I dumped the heap file and found that ContainerMetric objects take up a lot of memory.
{code}
export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
{code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
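[Editor's note] The heap dumps point at per-container metrics objects that stay registered in a process-wide registry after their containers finish. A self-contained sketch of that failure mode and the obvious remedy, unregistering on completion; the registry and method names below are illustrative, not the real ContainerMetrics API:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the leak: one metrics object per container,
// kept in a static map and never removed, so a long-lived NodeManager
// accumulates one entry per container it has ever run.
public class PerContainerMetrics {
    private static final Map<String, PerContainerMetrics> REGISTRY =
        new ConcurrentHashMap<>();

    private final long[] usageSamples = new long[1024]; // per-container state

    public static PerContainerMetrics forContainer(String containerId) {
        // Leak when entries are only ever added.
        return REGISTRY.computeIfAbsent(containerId, id -> new PerContainerMetrics());
    }

    // The remedy: drop the registration once the container is done. In the
    // real NodeManager this corresponds to unregistering the per-container
    // metrics source from the metrics system.
    public static void finished(String containerId) {
        REGISTRY.remove(containerId);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            forContainer("container_" + i);
            finished("container_" + i); // without this line the map grows without bound
        }
        System.out.println("still registered: " + REGISTRY.size());
    }
}
{code}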
[jira] [Created] (YARN-5187) when the preempt reduce happen, map resources priority should be higher
tangshangwen created YARN-5187:
--
Summary: when the preempt reduce happen, map resources priority should be higher
Key: YARN-5187
URL: https://issues.apache.org/jira/browse/YARN-5187
Project: Hadoop YARN
Issue Type: Improvement
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found jobs hanging for a long time. When cluster resources are tight, many reduces are killed while maps have no resources to run. I think that when reduce preemption happens, map resource requests should get higher priority.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
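[Editor's note] For background, and hedged from memory: the MapReduce AM has historically requested reduces at a numerically lower, i.e. stronger, YARN priority than maps (10 for reduces versus 20 for maps), which would explain why re-requested reduces keep landing ahead of starving maps after preemption. Treat the constants as assumptions to verify against RMContainerAllocator in your branch. A small sketch of the ordering:
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;

// In YARN, a lower priority number wins. These values mirror the
// MapReduce AM's historical defaults (an assumption worth verifying).
public class MapReducePriorities {
    static final Priority PRIORITY_REDUCE = Priority.newInstance(10);
    static final Priority PRIORITY_MAP = Priority.newInstance(20);

    public static void main(String[] args) {
        boolean reduceFirst =
            PRIORITY_REDUCE.getPriority() < PRIORITY_MAP.getPriority();
        // true: rescheduled reduces outrank maps, reproducing the hang
        System.out.println("reduces requested ahead of maps: " + reduceFirst);
    }
}
{code}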
[jira] [Created] (YARN-5136) Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
tangshangwen created YARN-5136:
--
Summary: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
Key: YARN-5136
URL: https://issues.apache.org/jira/browse/YARN-5136
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

Moving an app caused the RM to exit:
{noformat}
2016-05-24 23:20:47,202 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@ea94c3b does not exist in queue [root.bdp_xx.bdp_mart_xx_formal, demand=, running=, share=, w=]
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:779)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1231)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
at java.lang.Thread.run(Thread.java:745)
2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e04_1464073905025_15410_01_001759 Container Transitioned from ACQUIRED to RELEASED
2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
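[Editor's note] A plausible reading of the trace: a queue move races with the attempt removal, so removal is attempted against the app's old queue and FSLeafQueue.removeApp throws, killing the event-dispatcher thread. A self-contained sketch of the defensive idea behind the usual fix, resolving the attempt's current queue at removal time and tolerating a miss instead of throwing; the class and method names are simplified assumptions, not the exact FairScheduler internals:
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model: apps can move between queues while a removal event is
// in flight. Looking the queue up at removal time, and ignoring a miss,
// keeps the dispatcher thread alive.
public class QueueRegistry {
    private final Map<String, Set<String>> appsByQueue = new ConcurrentHashMap<>();
    private final Map<String, String> queueOfApp = new ConcurrentHashMap<>();

    public void add(String app, String queue) {
        appsByQueue.computeIfAbsent(queue, q -> ConcurrentHashMap.newKeySet()).add(app);
        queueOfApp.put(app, queue);
    }

    public void move(String app, String toQueue) {
        String from = queueOfApp.put(app, toQueue);
        if (from != null) {
            appsByQueue.get(from).remove(app);
        }
        appsByQueue.computeIfAbsent(toQueue, q -> ConcurrentHashMap.newKeySet()).add(app);
    }

    // Defensive removal: consult the current queue, tolerate absence.
    public void removeAttempt(String app) {
        String queue = queueOfApp.remove(app);
        if (queue == null) {
            return; // already gone or moved: log it, don't throw
        }
        Set<String> apps = appsByQueue.get(queue);
        if (apps != null) {
            apps.remove(app);
        }
    }

    public static void main(String[] args) {
        QueueRegistry reg = new QueueRegistry();
        reg.add("appattempt_1", "root.a");
        reg.move("appattempt_1", "root.b"); // move races with removal
        reg.removeAttempt("appattempt_1");  // resolves the current queue: no throw
    }
}
{code}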
[jira] [Created] (YARN-5134) Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
tangshangwen created YARN-5134:
--
Summary: Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
Key: YARN-5134
URL: https://issues.apache.org/jira/browse/YARN-5134
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found the RM cannot handle this event:
{noformat}
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-138100.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-139122.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
{noformat}
and the event queue is very large:
{noformat}
2016-05-24 14:24:07,302 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,295 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
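[Editor's note] These are stale FINISHED_CONTAINERS_PULLED_BY_AM notifications arriving for nodes the restarted RM only knows as NEW, and each one costs a logged exception while the dispatcher queue backs up. One plausible low-risk fix, in the style of the no-op transitions RMNodeImpl's state machine already uses to swallow out-of-order events; a sketch to check against your branch, not the committed patch:
{code:java}
// Hypothetical addition to RMNodeImpl's StateMachineFactory builder:
// accept the stale AM-pull notification on a NEW node as a no-op instead
// of throwing InvalidStateTransitonException for every queued event.
.addTransition(NodeState.NEW, NodeState.NEW,
    RMNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM)
{code}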
[jira] [Created] (YARN-5133) Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
tangshangwen created YARN-5133:
--
Summary: Can't handle this event at current state Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
Key: YARN-5133
URL: https://issues.apache.org/jira/browse/YARN-5133
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found the RM cannot handle this event:
{noformat}
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-138100.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node BJM6-Decare-139122.hadoop.jd.local:50086
2016-05-24 14:24:06,835 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISHED_CONTAINERS_PULLED_BY_AM at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:417)
at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:860)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:844)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
{noformat}
and the event queue is very large:
{noformat}
2016-05-24 14:24:07,302 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,298 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
2016-05-24 14:24:07,295 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 13337000
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279855#comment-15279855 ] tangshangwen commented on YARN-5051:
Yes, thanks [~kshukla]

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273834#comment-15273834 ] tangshangwen commented on YARN-5051:
I think we should also handle the NEW state in the updateMetricsForRejoinedNode method, like this:
{code:title=RMNodeImpl.java|borderStyle=solid}
private void updateMetricsForRejoinedNode(NodeState previousNodeState) {
  ClusterMetrics metrics = ClusterMetrics.getMetrics();
  metrics.incrNumActiveNodes();

  switch (previousNodeState) {
  case LOST:
    metrics.decrNumLostNMs();
    break;
  case REBOOTED:
    metrics.decrNumRebootedNMs();
    break;
  case NEW:
  case DECOMMISSIONED:
    metrics.decrDecommisionedNMs();
    break;
  case UNHEALTHY:
    metrics.decrNumUnhealthyNMs();
    break;
  default:
    LOG.debug("Unexpected previous node state");
  }
}
{code}

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273825#comment-15273825 ] tangshangwen commented on YARN-5051:
When the NodeManager starts, it triggers the AddNodeTransition; for a node in the NEW state, the DecommisionedNMs value is not decremented:
{code:title=RMNodeImpl.java|borderStyle=solid}
public static class AddNodeTransition implements
    SingleArcTransition<RMNodeImpl, RMNodeEvent> {

  @Override
  public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
    // Inform the scheduler
    RMNodeStartedEvent startEvent = (RMNodeStartedEvent) event;
    List<NMContainerStatus> containers = null;

    String host = rmNode.nodeId.getHost();
    if (rmNode.context.getInactiveRMNodes().containsKey(host)) {
      // Old node rejoining
      RMNode previouRMNode = rmNode.context.getInactiveRMNodes().get(host);
      rmNode.context.getInactiveRMNodes().remove(host);
      rmNode.updateMetricsForRejoinedNode(previouRMNode.getState());
    } else {
      ClusterMetrics.getMetrics().incrNumActiveNodes();
{code}

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273819#comment-15273819 ] tangshangwen commented on YARN-5051:
The same problem also occurs when the include hosts file is not empty.

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273815#comment-15273815 ] tangshangwen commented on YARN-5051:
I think we should put the decommissioned node into InactiveRMNodes when its registration is refused:
{code:title=ResourceTrackerService.java|borderStyle=solid}
RMNode rmNode = new RMNodeImpl(nodeId, rmContext, host, cmPort, httpPort,
    resolve(host), capability, nodeManagerVersion);

// Check if this node is a 'valid' node
if (!this.nodesListManager.isValidNode(host)) {
  String message = "Disallowed NodeManager from " + host
      + ", Sending SHUTDOWN signal to the NodeManager.";
  LOG.info(message);
  response.setDiagnosticsMessage(message);
  response.setNodeAction(NodeAction.SHUTDOWN);
  this.rmContext.getInactiveRMNodes().put(rmNode.getNodeID().getHost(), rmNode);
  return response;
}
{code}

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5051:
---
Description: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, execute the command
{noformat}
yarn rmadmin -refreshNodes
{noformat}
start nodemanager , the decommissioned nodes num can not update

was: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, execute the command
{noformat}
yarn rmadmin -refreshNodes
{noformat}
start nodemanager , the decommissioned nodes can not update

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5051:
---
Description: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, execute the command
{noformat}
yarn rmadmin -refreshNodes
{noformat}
start nodemanager , the decommissioned nodes can not update

was: When the RM restart,the RM will refuse the Decommission NodeManager register, and I put the NM host removed from exclude_mapred_host file, and start nodemanager , the decommissioned nodes can not update

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and executed the command
> {noformat}
> yarn rmadmin -refreshNodes
> {noformat}
> then started the NodeManager, but the decommissioned nodes do not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
[ https://issues.apache.org/jira/browse/YARN-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-5051:
---
Attachment: rm.png

> The RM can't update the Decommissioned Nodes Metric
> ---
>
> Key: YARN-5051
> URL: https://issues.apache.org/jira/browse/YARN-5051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: rm.png
>
>
> When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and started the NodeManager, but the decommissioned nodes do not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5051) The RM can't update the Decommissioned Nodes Metric
tangshangwen created YARN-5051:
--
Summary: The RM can't update the Decommissioned Nodes Metric
Key: YARN-5051
URL: https://issues.apache.org/jira/browse/YARN-5051
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

When the RM restarts, it refuses registration from the decommissioned NodeManager. I removed the NM host from the exclude_mapred_host file and started the NodeManager, but the decommissioned nodes count does not update.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
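[Editor's note] To make the reproduction concrete, the decommission/recommission sequence described above is roughly the following; the exclude-file path is illustrative (it is whatever yarn.resourcemanager.nodes.exclude-path points at in your configuration):
{noformat}
# decommission: add the NM host to the exclude file, then refresh
echo nm-host.example.com >> /etc/hadoop/conf/exclude_mapred_host
yarn rmadmin -refreshNodes

# recommission: remove the host again, refresh, and restart the NM
sed -i '/nm-host.example.com/d' /etc/hadoop/conf/exclude_mapred_host
yarn rmadmin -refreshNodes
yarn-daemon.sh start nodemanager

# the bug: the Decommissioned Nodes metric on the RM stays stale
{noformat}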
[jira] [Created] (YARN-5021) -1B of 3 GB physical memory used
tangshangwen created YARN-5021:
--
Summary: -1B of 3 GB physical memory used
Key: YARN-5021
URL: https://issues.apache.org/jira/browse/YARN-5021
Project: Hadoop YARN
Issue Type: Bug
Reporter: tangshangwen
Assignee: tangshangwen

In my cluster, I found the following in the NodeManager log:
{noformat}
2016-05-01 21:02:46,512 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18210 for container-id container_1461592647020_15092_01_79: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18405 for container-id container_1461592647020_15092_01_77: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,545 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18893 for container-id container_1461592647020_15090_01_24: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,561 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18555 for container-id container_1461592647020_15092_01_73: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
2016-05-01 21:02:46,577 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18510 for container-id container_1461592647020_15090_01_20: -1B of 3 GB physical memory used; -1B of 9.3 GB virtual memory used
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
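[Editor's note] A plausible reading of the -1B lines: the monitor's process-tree probe returns a -1 sentinel when it cannot read the process's usage (for instance, the process exited between monitoring ticks), and that sentinel is then formatted as if it were a real byte count. A self-contained sketch of the formatting problem; the sentinel name and the cause are assumptions consistent with the log, not a confirmed diagnosis:
{code:java}
// Sketch: report the "unavailable" sentinel explicitly instead of
// printing it as a byte count ("-1B of 3 GB physical memory used").
public class MemorySampleDemo {
    static final long UNAVAILABLE = -1; // sentinel, as in the process-tree probes

    static String describe(long rssBytes, long limitBytes) {
        if (rssBytes == UNAVAILABLE) {
            return "physical memory usage unavailable (process gone between ticks?)";
        }
        return rssBytes + "B of " + limitBytes + "B physical memory used";
    }

    public static void main(String[] args) {
        System.out.println(describe(UNAVAILABLE, 3L << 30)); // the "-1" case
        System.out.println(describe(1L << 30, 3L << 30));    // a normal sample
    }
}
{code}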
[jira] [Updated] (YARN-4598) Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4598:
---
Attachment: YARN-4598.1.patch

I submitted a patch.

> Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
>
>
> Key: YARN-4598
> URL: https://issues.apache.org/jira/browse/YARN-4598
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: YARN-4598.1.patch
>
>
> In our cluster, I found that the container has some problems in state transitions; here is my log:
> {noformat}
> 2016-01-12 17:42:50,088 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_87 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2016-01-12 17:42:50,088 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [CONTAINER_CLEANEDUP_AFTER_KILL], eventType: [RESOURCE_FAILED]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1127)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1078)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1071)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:744)
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to null
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1452588902899_0001 CONTAINERID=container_1452588902899_0001_01_94
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4598) Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105369#comment-15105369 ] tangshangwen commented on YARN-4598:
I think we should add a transition. Any suggestions?
{noformat}
.addTransition(ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
    ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
    ContainerEventType.RESOURCE_FAILED)
{noformat}

> Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
>
>
> Key: YARN-4598
> URL: https://issues.apache.org/jira/browse/YARN-4598
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
>
> In our cluster, I found that the container has some problems in state transitions; here is my log:
> {noformat}
> 2016-01-12 17:42:50,088 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_87 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2016-01-12 17:42:50,088 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [CONTAINER_CLEANEDUP_AFTER_KILL], eventType: [RESOURCE_FAILED]
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1127)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1078)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1071)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:744)
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to null
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1452588902899_0001 CONTAINERID=container_1452588902899_0001_01_94
> 2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4598) Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
tangshangwen created YARN-4598:
--
Summary: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
Key: YARN-4598
URL: https://issues.apache.org/jira/browse/YARN-4598
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

In our cluster, I found that the container has some problems in state transitions; here is my log:
{noformat}
2016-01-12 17:42:50,088 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_87 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-01-12 17:42:50,088 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [CONTAINER_CLEANEDUP_AFTER_KILL], eventType: [RESOURCE_FAILED]
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: RESOURCE_FAILED at CONTAINER_CLEANEDUP_AFTER_KILL
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1127)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1078)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1071)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:744)
2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to null
2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1452588902899_0001 CONTAINERID=container_1452588902899_0001_01_94
2016-01-12 17:42:50,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1452588902899_0001_01_94 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
{noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen reopened YARN-4539:

> CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
> -
>
> Key: YARN-4539
> URL: https://issues.apache.org/jira/browse/YARN-4539
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: YARN-4539.1.patch
>
>
> When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
> {noformat}
> 2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
> java.io.IOException: Failed to initialize FairScheduler
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
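[Editor's note] The stack trace is the classic stop-before-init hazard in composite services: serviceStop() runs even though serviceInit() never completed, so fields created during init are still null. A minimal sketch of the usual guard, using a stand-in class; the dispatcher field plays the role of whatever CommonNodeLabelsManager.stopDispatcher dereferences:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.yarn.event.AsyncDispatcher;

// Sketch of the defensive pattern: serviceStop() must tolerate being
// called on a service whose serviceInit() failed part-way, so every
// init-created field gets a null check before use.
public class LabelsManagerLike extends AbstractService {
    private AsyncDispatcher dispatcher; // created in serviceInit, may stay null

    public LabelsManagerLike() {
        super("LabelsManagerLike");
    }

    @Override
    protected void serviceInit(Configuration conf) throws Exception {
        dispatcher = new AsyncDispatcher();
        dispatcher.init(conf);
        super.serviceInit(conf);
    }

    @Override
    protected void serviceStop() throws Exception {
        if (dispatcher != null) { // guard: init may never have run to completion
            dispatcher.stop();
        }
        super.serviceStop();
    }
}
{code}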
[jira] [Resolved] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-4539.
Resolution: Duplicate

> CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
> -
>
> Key: YARN-4539
> URL: https://issues.apache.org/jira/browse/YARN-4539
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: YARN-4539.1.patch
>
>
> When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
> {noformat}
> 2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
> java.io.IOException: Failed to initialize FairScheduler
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
> 2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
> at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> {noformat}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15084576#comment-15084576 ] tangshangwen commented on YARN-4539: OK, thanks [~bibinchundatt]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen resolved YARN-4539. Resolution: Fixed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081522#comment-15081522 ] tangshangwen commented on YARN-4539: Yes, thank you for your comment! :D -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4539:
---
Description:
When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
{noformat}
2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
java.io.IOException: Failed to initialize FairScheduler
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.stopDispatcher(CommonNodeLabelsManager.java:251)
        at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:257)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
{noformat}
was:
When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
{noformat}
2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
java.io.IOException: Failed to initialize FairScheduler
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
{noformat}
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081301#comment-15081301 ] tangshangwen commented on YARN-4539: I submitted a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4539: --- Attachment: YARN-4539.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
[ https://issues.apache.org/jira/browse/YARN-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081292#comment-15081292 ] tangshangwen commented on YARN-4539: I think stopDispatcher should check whether the dispatcher is null before stopping it.
{code:title=CommonNodeLabelsManager.java|borderStyle=solid}
// for UT purpose
protected void stopDispatcher() {
  AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher;
  asyncDispatcher.stop();
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
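For reference, a minimal sketch of the guarded version this comment proposes (assuming dispatcher can still be null because serviceInit() never ran when a sibling service's init failed) might look like:
{code:title=CommonNodeLabelsManager.java|borderStyle=solid}
// for UT purpose
protected void stopDispatcher() {
  AsyncDispatcher asyncDispatcher = (AsyncDispatcher) dispatcher;
  // Guard: if serviceInit() never completed, the dispatcher was never
  // created, so stopping the RM's services would otherwise NPE here.
  if (asyncDispatcher != null) {
    asyncDispatcher.stop();
  }
}
{code}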
[jira] [Created] (YARN-4539) CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
tangshangwen created YARN-4539:
--
Summary: CommonNodeLabelsManager throw NullPointerException when the fairScheduler init failed
Key: YARN-4539
URL: https://issues.apache.org/jira/browse/YARN-4539
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Reporter: tangshangwen
Assignee: tangshangwen

When the scheduler initialization fails and the RM stops the CompositeService, the CommonNodeLabelsManager throws a NullPointerException.
{noformat}
2016-01-04 22:19:52,190 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler failed in state INITED; cause: java.io.IOException: Failed to initialize FairScheduler
java.io.IOException: Failed to initialize FairScheduler
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1377)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1394)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
2016-01-04 22:19:52,193 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2016-01-04 22:19:52,194 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
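To see why stop can run against uninitialized state, here is a small self-contained illustration (the class and field names are hypothetical, not the actual Hadoop types): a field assigned only in init() is dereferenced by stop(), so stopping an instance whose init never completed raises the same NullPointerException, and a null guard avoids it.
{code:java}
import java.util.Timer;

// Hypothetical stand-in for a service whose stop() assumes init() ran.
public class LifecycleNpeDemo {
    private Timer dispatcher; // assigned only in init()

    void init() {
        dispatcher = new Timer();
    }

    void stop() {
        dispatcher.cancel(); // NPE if init() never ran
    }

    void safeStop() {
        if (dispatcher != null) { // the guard discussed in this issue
            dispatcher.cancel();
        }
    }

    public static void main(String[] args) {
        LifecycleNpeDemo svc = new LifecycleNpeDemo();
        svc.safeStop(); // fine: no-op when init() was skipped
        svc.stop();     // throws NullPointerException
    }
}
{code}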
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080424#comment-15080424 ] tangshangwen commented on YARN-4530: Hi [~rohithsharma], do I need to write a test case? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075855#comment-15075855 ] tangshangwen commented on YARN-4530: Hi [Rohith Sharma K S|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=rohithsharma], in this patch, if assoc is null we return directly; when completed.get() throws an ExecutionException, assoc will not be null, so I think this patch does not need a new test case.
{code:title=ResourceLocalizationService.java|borderStyle=solid}
try {
  if (null == assoc) {
    LOG.error("Localized unknown resource to " + completed);
    // TODO delete
    return;
  }
  Path local = completed.get();
  LocalResourceRequest key = assoc.getResource().getRequest();
  publicRsrc.handle(new ResourceLocalizedEvent(key, local, FileUtil
      .getDU(new File(local.toUri()))));
  assoc.getResource().unlock();
} catch (ExecutionException e) {
  LOG.info("Failed to download resource " + assoc.getResource(),
      e.getCause());
  LocalResourceRequest req = assoc.getResource().getRequest();
  publicRsrc.handle(new ResourceFailedLocalizationEvent(req,
      e.getMessage()));
  assoc.getResource().unlock();
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4530: --- Attachment: YARN-4530.1.patch I found that 2.7.1 has the same problem; I submitted a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4530: --- Affects Version/s: 2.7.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075711#comment-15075711 ] tangshangwen commented on YARN-4530: I think I can fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4506) Application was killed by a resourcemanager, In the JobHistory Can't see the job detail
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075140#comment-15075140 ] tangshangwen commented on YARN-4506: Ok, I'll try to fix it > Application was killed by a resourcemanager, In the JobHistory Can't see the > job detail > --- > > Key: YARN-4506 > URL: https://issues.apache.org/jira/browse/YARN-4506 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am.rar > > > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2015-12-15 03:08:54,074 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2015-12-15 03:08:54,074 INFO [eventHandlingThread] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue > take interrupted. Returning > 2015-12-15 03:08:54,078 WARN [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId > job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
[ https://issues.apache.org/jira/browse/YARN-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075106#comment-15075106 ] tangshangwen commented on YARN-4530: When assoc is null and completed.get() throws an ExecutionException, this will happen, right? Control jumps to the catch block before the null check runs, and the catch block dereferences assoc.getResource(), which is the NPE.
{code:title=ResourceLocalizationService.java|borderStyle=solid}
try {
  Future<Path> completed = queue.take();
  LocalizerResourceRequestEvent assoc = pending.remove(completed);
  try {
    Path local = completed.get();
    if (null == assoc) {
      LOG.error("Localized unkonwn resource to " + completed);
      // TODO delete
      return;
    }
    LocalResourceRequest key = assoc.getResource().getRequest();
    publicRsrc.handle(new ResourceLocalizedEvent(key, local, FileUtil
        .getDU(new File(local.toUri()))));
    assoc.getResource().unlock();
  } catch (ExecutionException e) {
    LOG.info("Failed to download rsrc " + assoc.getResource(),
        e.getCause());
    LocalResourceRequest req = assoc.getResource().getRequest();
    publicRsrc.handle(new ResourceFailedLocalizationEvent(req,
        e.getMessage()));
    assoc.getResource().unlock();
  } catch (CancellationException e) {
    // ignore; shutting down
  }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
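A compact, self-contained sketch of that sequence (hypothetical names; plain JDK types stand in for the localizer internals). The download Future fails, the pending-map lookup returns null, and it is the method call on the null assoc inside the catch block, not the download failure itself, that raises the NPE:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class LocalizerNpeDemo {
    public static void main(String[] args) {
        // A download task that fails, like the "changed on src filesystem" case.
        FutureTask<String> completed = new FutureTask<>((Callable<String>) () -> {
            throw new java.io.IOException("Resource changed on src filesystem");
        });
        completed.run(); // runs the task; the IOException is captured

        // pending.remove(completed) returns null when the request is not tracked.
        Map<FutureTask<String>, Object> pending = new HashMap<>();
        Object assoc = pending.remove(completed);

        try {
            completed.get(); // throws ExecutionException wrapping the IOException
        } catch (ExecutionException e) {
            // Mirrors the old catch block: a method call on the null assoc
            // (assoc.getResource() in the real code) is what throws the NPE.
            System.out.println("Failed to download rsrc " + assoc.toString());
        }
    }
}
{code}
Moving the null check ahead of completed.get(), as the attached patch does, means the catch block can only be reached with a non-null assoc.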
[jira] [Created] (YARN-4530) LocalizedResource trigger a NPE Cause the NodeManager exit
tangshangwen created YARN-4530:
--
Summary: LocalizedResource trigger a NPE Cause the NodeManager exit
Key: YARN-4530
URL: https://issues.apache.org/jira/browse/YARN-4530
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.2.0
Reporter: tangshangwen

In our cluster, I found that a failed LocalizedResource download triggered an NPE, causing the NodeManager to shut down.
{noformat}
2015-12-29 17:18:33,706 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns3:8020/user/username/projects/user_insight/lookalike/oozie/workflow/conf/hive-site.xml transitioned from DOWNLOADING to FAILED
2015-12-29 17:18:33,708 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/user_insight_pig_udf-0.0.1-SNAPSHOT-jar-with-dependencies.jar, 1451380519635, FILE, null }
2015-12-29 17:18:33,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/unilever_support_udf-0.0.1-SNAPSHOT.jar, 1451380519452, FILE, null },pending,[(container_1451039893865_261670_01_000578)],42332661980495938,DOWNLOADING}
java.io.IOException: Resource hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/unilever_support_udf-0.0.1-SNAPSHOT.jar changed on src filesystem (expected 1451380519452, was 1451380611793)
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:276)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:50)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2015-12-29 17:18:33,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns3/user/username/projects/user_insight/lookalike/oozie/workflow/lib/unilever_support_udf-0.0.1-SNAPSHOT.jar transitioned from DOWNLOADING to FAILED
2015-12-29 17:18:33,710 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting down
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
2015-12-29 17:18:33,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
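The IOException above comes from the localizer's freshness check: the modification time recorded at job submission must still match the file when it is localized. A hedged sketch of that comparison using the public FileSystem API (the path and timestamp are illustrative, and hadoop-common must be on the classpath):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class FreshnessCheckDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://ns3/tmp/example-udf.jar"); // illustrative path
        long expected = 1451380519452L; // timestamp recorded at submission

        FileStatus status = src.getFileSystem(conf).getFileStatus(src);
        if (status.getModificationTime() != expected) {
            // The condition behind "changed on src filesystem": the jar was
            // overwritten between job submission and container localization.
            throw new java.io.IOException("Resource " + src
                + " changed on src filesystem (expected " + expected
                + ", was " + status.getModificationTime() + ")");
        }
    }
}
{code}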
[jira] [Commented] (YARN-4506) Application was killed by a resourcemanager, In the JobHistory Can't see the job detail
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071325#comment-15071325 ] tangshangwen commented on YARN-4506: I found that when the MRAppMaster received a signal, the thread did not copy job_ID.jhist to /user/history/done_intermediate; see my am.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
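One way to confirm this from the history side is to list the user's intermediate-done directory and look for the job's .jhist file. This is a hedged sketch: the directory follows the layout this comment mentions, the actual location is governed by mapreduce.jobhistory.intermediate-done-dir, and "username" is a placeholder:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class JhistCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Layout from this comment; "username" is a placeholder.
        Path dir = new Path("/user/history/done_intermediate/username");
        for (FileStatus st : dir.getFileSystem(conf).listStatus(dir)) {
            if (st.getPath().getName().endsWith(".jhist")) {
                System.out.println("found history file: " + st.getPath());
            }
        }
    }
}
{code}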
[jira] [Commented] (YARN-4506) Application was killed by a resourcemanager, In the JobHistory Can't see the job detail
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071119#comment-15071119 ] tangshangwen commented on YARN-4506: I'm sure it happened in 2.2, because I found the AM was killed by the RM and I can't find the job in the JobHistory.
2015-12-15 02:56:48,916 INFO [main] org.mortbay.log: Extract jar:file:/software/servers/hadoop-2.2.0/share/hadoop/yarn/hadoop-yarn-common-2.2.0.jar!/webapps/mapreduce
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4506) Application killed by the ResourceManager: can't see the job detail in the JobHistory
[ https://issues.apache.org/jira/browse/YARN-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4506: --- Attachment: am.rar I uploaded my am.log. > Application was killed by a resourcemanager, In the JobHistory Can't see the > job detail > --- > > Key: YARN-4506 > URL: https://issues.apache.org/jira/browse/YARN-4506 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am.rar > > > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2015-12-15 03:08:54,074 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2015-12-15 03:08:54,074 INFO [eventHandlingThread] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue > take interrupted. Returning > 2015-12-15 03:08:54,078 WARN [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId > job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4507) Application killed by the ResourceManager: can't see the job detail in the JobHistory
[ https://issues.apache.org/jira/browse/YARN-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4507: --- Description: When the AppMaster was killed by the RM, we can't see the job detail in the JobHistory. This is my log. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close was: 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close > Application was killed by a resourcemanager, In the JobHistory Can't see the > job detail > --- > > Key: YARN-4507 > URL: https://issues.apache.org/jira/browse/YARN-4507 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > > when the AppMaster was killed by RM, we can't see the job detail in > jobhistory,this is my log. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal.
Signaling RMCommunicator and JobHistoryEventHandler. > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > true > 2015-12-15 03:08:54,073 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is true > 2015-12-15 03:08:54,074 INFO [Thread-1] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2015-12-15 03:08:54,074 INFO [eventHandlingThread] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue > take interrupted. Returning > 2015-12-15 03:08:54,078 WARN [Thread-1]
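The sequence in these logs is the MRAppMaster's JVM shutdown hook at work: it flags the last AM retry, tells the JobHistoryEventHandler to force job completion, and drains the outstanding event queue before exit. As a rough illustration of that contract only (the class and method names below are simplified stand-ins, not the real JobHistoryEventHandler API), the shutdown path has to both drain the queue and close and move the history file; skipping the second step is exactly what leaves a job invisible in the JobHistory:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the shutdown-time contract, not MRAppMaster code.
public class HistoryEventDrain {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
  private volatile boolean forceJobCompletion = false;

  // Mirrors "JobHistoryEventHandler notified that forceJobCompletion is true".
  public void notifyIsLastAMRetry(boolean isLastRetry) {
    this.forceJobCompletion = isLastRetry;
  }

  // Called from the shutdown hook ("Stopping JobHistoryEventHandler...").
  public void stop() {
    System.out.println("Stopping. Outstanding queue size is " + queue.size());
    String event;
    while ((event = queue.poll()) != null) {
      write(event); // flush remaining events into the open .jhist file
    }
    if (forceJobCompletion) {
      // Without this step the .jhist file is never closed and moved to
      // done_intermediate, and the JobHistory server never sees the job.
      closeAndMoveHistoryFile();
    }
  }

  private void write(String event) { /* append to the open .jhist file */ }
  private void closeAndMoveHistoryFile() { /* close, then move to done dir */ }
}
{code}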
[jira] [Created] (YARN-4507) Application killed by the ResourceManager: can't see the job detail in the JobHistory
tangshangwen created YARN-4507: -- Summary: Application killed by the ResourceManager: can't see the job detail in the JobHistory Key: YARN-4507 URL: https://issues.apache.org/jira/browse/YARN-4507 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: tangshangwen 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4506) Application killed by the ResourceManager: can't see the job detail in the JobHistory
tangshangwen created YARN-4506: -- Summary: Application killed by the ResourceManager: can't see the job detail in the JobHistory Key: YARN-4506 URL: https://issues.apache.org/jira/browse/YARN-4506 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: tangshangwen 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true 2015-12-15 03:08:54,074 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0 2015-12-15 03:08:54,074 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning 2015-12-15 03:08:54,078 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1449835724839_219910 to have not been closed. Will close -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: (was: am105361log.tar.gz) > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: am105361log.tar.gz I uploaded another AM log. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: am105361log.tar.gz, logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066331#comment-15066331 ] tangshangwen commented on YARN-4324: I found this stack in the jstack output. Is this a JDK epollWait bug? "IPC Client (2118999553) connection to RMHost:8030 from UserName" daemon prio=10 tid=0x7f298c664000 nid=0x6e2d runnable [0x7f297d9a8000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked <0x000785182940> (a sun.nio.ch.Util$2) - locked <0x000785182930> (a java.util.Collections$UnmodifiableSet) - locked <0x000785182718> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:457) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked <0x00078023aa40> (a java.io.BufferedInputStream) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:995) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891) > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
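A thread parked in epollWait under Client$Connection.run is also the IPC client's normal idle state while it waits for the next RPC response, so this stack alone does not prove the JDK epoll spin bug; what distinguishes a spin from a legitimate wait is CPU consumption across successive dumps. A small sketch using the standard JMX thread API, meant to be run inside the suspect JVM (the class name and the one-second threshold are arbitrary choices, not anything from Hadoop):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: flags threads that burn CPU while apparently waiting.
public class EpollSpinCheck {
  public static void main(String[] args) throws Exception {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long[] ids = mx.getAllThreadIds();
    long[] before = new long[ids.length];
    for (int i = 0; i < ids.length; i++) {
      before[i] = mx.getThreadCpuTime(ids[i]); // -1 if the thread has died
    }
    Thread.sleep(5000L);
    for (int i = 0; i < ids.length; i++) {
      long after = mx.getThreadCpuTime(ids[i]);
      if (before[i] < 0 || after < 0) {
        continue; // thread exited between samples
      }
      long deltaNs = after - before[i];
      // A selector thread genuinely blocked in epollWait uses almost no CPU;
      // more than 1s of CPU in a 5s window suggests a busy spin instead.
      if (deltaNs > 1000000000L) {
        ThreadInfo info = mx.getThreadInfo(ids[i]);
        if (info != null) {
          System.out.println("busy: " + info.getThreadName() + " cpuNs=" + deltaNs);
        }
      }
    }
  }
}
{code}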
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057690#comment-15057690 ] tangshangwen commented on YARN-4324: I found the RMContainerAllocator's last contact with the RM in the AM logs, and at that point it had not yet scheduled any reducers: 2015-12-15 02:57:39,893 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:732 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:5773 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:8995 ContRel:3222 HostLocal:5310 RackLocal:338 Then the AM received a kill signal: 2015-12-15 03:01:29,383 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1449835724839_219910_m_001345_1 TaskAttempt Transitioned from NEW to UNASSIGNED 2015-12-15 03:08:54,073 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. I guess the AM did not send a heartbeat to the RM for 10 minutes. The RM logs roll too fast; I will try to get the RM logs and update. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
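For context, the 10-minute window matches the RM-side AM liveness monitor: yarn.am.liveness-monitor.expiry-interval-ms defaults to 600000 ms, and an AM attempt that does not heartbeat within that window is expired and killed. If the allocator thread can legitimately stall that long, raising the interval in yarn-site.xml is only a stopgap while the root cause is found; the value below is just an example:

{code:java}
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <!-- Default is 600000 (10 min); doubled here purely as a mitigation. -->
  <value>1200000</value>
</property>
{code}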
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057601#comment-15057601 ] tangshangwen commented on YARN-4324: Thank you for your attention. I have already uploaded the AM logs. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: logs.rar I uploaded the new jstack and AM logs. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: logs.rar, yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Attachment: yarn-nodemanager-dumpam.log > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > Attachments: yarn-nodemanager-dumpam.log > > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057380#comment-15057380 ] tangshangwen commented on YARN-4324: Because the job failure is random, I dumped the AM's jstack and pstack at the moment the AM transitioned from RUNNING to KILLING, and I uploaded my log. > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
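Capturing that dump automatically is also possible: a shutdown hook can shell out to jstack against the AM's own pid while the SIGTERM is being handled (it cannot help once the RM escalates to SIGKILL). A minimal sketch, assuming a HotSpot JVM where RuntimeMXBean.getName() returns "pid@hostname" and Java 7 for ProcessBuilder redirection; the class name and output path are made up:

{code:java}
import java.io.File;
import java.lang.management.ManagementFactory;

// Hypothetical helper: self-jstack on shutdown. Install early in the AM.
public class DumpOnKill {
  public static void install() {
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      public void run() {
        try {
          // On HotSpot the runtime name is "pid@hostname".
          String pid = ManagementFactory.getRuntimeMXBean()
              .getName().split("@")[0];
          ProcessBuilder pb = new ProcessBuilder("jstack", pid);
          pb.redirectErrorStream(true);
          pb.redirectOutput(new File("/tmp/am-" + pid + ".jstack"));
          pb.start().waitFor(); // best effort; the JVM is already exiting
        } catch (Exception e) {
          // Ignore: a failed dump must not block shutdown.
        }
      }
    }));
  }
}
{code}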
[jira] [Updated] (YARN-4324) AM that hung for more than 10 min was killed by the RM
[ https://issues.apache.org/jira/browse/YARN-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangshangwen updated YARN-4324: --- Description: These are my logs: 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1446203652278_135526_m_001777_1 TaskAttempt Transitioned from UNASSIGNED to KILLED 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT 2015-11-02 01:24:15,851 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. 2015-11-02 01:24:15,851 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled is: true 2015-11-02 01:24:15,851 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true The Hive map phase ran to 100%, then dropped back to 0%, and the job failed! > AM hang more than 10 min was kill by RM > --- > > Key: YARN-4324 > URL: https://issues.apache.org/jira/browse/YARN-4324 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: tangshangwen > > this is my logs > 2015-11-02 01:14:54,175 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2865 > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1446203652278_135526Job Transitioned from RUNNING to COMMITTING > 2015-11-02 01:14:54,176 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1446203652278_135526_m_001777_1 TaskAttempt Transition > ed from UNASSIGNED to KILLED > 2015-11-02 01:14:54,176 INFO [CommitterEvent Processor #1] > org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing > the event EventType: JOB_COMMIT > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a > signal. Signaling RMCommunicator and JobHistoryEventHandler. > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that iSignalled is: true > 2015-11-02 01:24:15,851 INFO [Thread-1] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: true > the hive map run 100% and return map 0% and the job failed! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4324) AM that hung for more than 10 min was killed by the RM
tangshangwen created YARN-4324: -- Summary: AM that hung for more than 10 min was killed by the RM Key: YARN-4324 URL: https://issues.apache.org/jira/browse/YARN-4324 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: tangshangwen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4099) Container whose resource localization took more than 10 min was killed
[ https://issues.apache.org/jira/browse/YARN-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724949#comment-14724949 ] tangshangwen commented on YARN-4099: 2015-08-28 15:10:37,434 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data4/yarn1/local/nmPrivate/container_1440160718082_401272_01_01.tokens. Credentials list: 2015-08-28 15:22:02,578 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for > Container LocalizedResource more than 10min was kill > > > Key: YARN-4099 > URL: https://issues.apache.org/jira/browse/YARN-4099 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0 > Environment: centos 6.5 > datanode 1500+ >Reporter: tangshangwen > > Container LocalizedResource more than 10min was kill,this is AM nodemanager > log: > 82_401272/libjars/UDFGetUserAgent.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/IndexChange.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/UserAgentUtils-1.8.jar transitioned from INIT to > DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/UDFGetEndTime.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource hdfs://ns1/user//.staging/job_1440160718 > 082_401272/libjars/HexadecimalGB.jar transitioned from INIT to DOWNLOADING > 2015-08-28 15:10:37,432 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1440160718082 > _401272_01_01 > 2015-08-28 15:10:37,434 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file /da > ta4/yarn1/local/nmPrivate/container_1440160718082_401272_01_01.tokens. > Credentials list: > 2015-08-28 15:22:02,578 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for appattempt_1440160718082_401272_01 (auth:SIMPLE) > 2015-08-28 15:22:02,580 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for appattempt_1440160718082_401272_0 > 1 (auth:TOKEN) for protocol=interface > org.apache.hadoop.yarn.api.ContainerManagementProtocolPB > 2015-08-28 15:22:02,580 INFO org.apache.hadoop.yarn.s -- This message was sent by Atlassian JIRA (v6.3.4#6332)
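The two lines quoted above bracket the stall: the NodeManager wrote the container's tokens at 15:10:37 and nothing happened until 15:22:02, so the localizer spent over eleven minutes downloading the job's libjars. One way to test whether slow HDFS reads explain the gap is to time a download of one of the same jars from a worker node; a minimal sketch with the FileSystem client API, where the class name is made up and the jar path is whichever one from the log you pass in:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical diagnostic: time a single resource download outside the NM.
public class TimeLocalize {
  public static void main(String[] args) throws Exception {
    // args[0]: an HDFS path from the log, e.g. one of the .staging libjars
    Path src = new Path(args[0]);
    Path dst = new Path("/tmp/" + src.getName());
    FileSystem fs = FileSystem.get(src.toUri(), new Configuration());
    long start = System.currentTimeMillis();
    fs.copyToLocalFile(src, dst);
    System.out.println("downloaded " + src + " in "
        + (System.currentTimeMillis() - start) + " ms");
  }
}
{code}

If the copy is fast, the bottleneck is more likely inside the localizer on that node (for example, disk or lock contention) than in HDFS itself.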
[jira] [Created] (YARN-4099) Container whose resource localization took more than 10 min was killed
tangshangwen created YARN-4099: -- Summary: Container whose resource localization took more than 10 min was killed Key: YARN-4099 URL: https://issues.apache.org/jira/browse/YARN-4099 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: centos 6.5 datanode 1500+ Reporter: tangshangwen A container whose resource localization took more than 10 min was killed. This is the NodeManager log from the AM's node: 82_401272/libjars/UDFGetUserAgent.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/IndexChange.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/UserAgentUtils-1.8.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/UDFGetEndTime.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://ns1/user//.staging/job_1440160718082_401272/libjars/HexadecimalGB.jar transitioned from INIT to DOWNLOADING 2015-08-28 15:10:37,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1440160718082_401272_01_01 2015-08-28 15:10:37,434 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data4/yarn1/local/nmPrivate/container_1440160718082_401272_01_01.tokens. Credentials list: 2015-08-28 15:22:02,578 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1440160718082_401272_01 (auth:SIMPLE) 2015-08-28 15:22:02,580 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1440160718082_401272_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB 2015-08-28 15:22:02,580 INFO org.apache.hadoop.yarn.s -- This message was sent by Atlassian JIRA (v6.3.4#6332)