[jira] [Updated] (YARN-7527) Over-allocate node resource in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7527: --- Attachment: YARN-7527.001.patch Attaching the initial patch for review. > Over-allocate node resource in async-scheduling mode of CapacityScheduler > - > > Key: YARN-7527 > URL: https://issues.apache.org/jira/browse/YARN-7527 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7527.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, node resource may be > over-allocated since the node resource check is ignored. > {{FiCaSchedulerApp#commonCheckContainerAllocation}} will check whether this > node has enough available resource for this proposal and return the check result > (true/false), but this result is ignored in {{CapacityScheduler#accept}} as > below. > {noformat} > commonCheckContainerAllocation(allocation, schedulerContainer); > {noformat} > If {{FiCaSchedulerApp#commonCheckContainerAllocation}} returns false, > {{CapacityScheduler#accept}} should also return false as below: > {noformat} > if (!commonCheckContainerAllocation(allocation, schedulerContainer)) { > return false; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7527) Over-allocate node resource in async-scheduling mode of CapacityScheduler
Tao Yang created YARN-7527: -- Summary: Over-allocate node resource in async-scheduling mode of CapacityScheduler Key: YARN-7527 URL: https://issues.apache.org/jira/browse/YARN-7527 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang Assignee: Tao Yang Currently in async-scheduling mode of CapacityScheduler, node resource may be over-allocated since the node resource check is ignored. {{FiCaSchedulerApp#commonCheckContainerAllocation}} will check whether this node has enough available resource for this proposal and return the check result (true/false), but this result is ignored in {{CapacityScheduler#accept}} as below. {noformat} commonCheckContainerAllocation(allocation, schedulerContainer); {noformat} If {{FiCaSchedulerApp#commonCheckContainerAllocation}} returns false, {{CapacityScheduler#accept}} should also return false as below: {noformat} if (!commonCheckContainerAllocation(allocation, schedulerContainer)) { return false; } {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7525) Incorrect query parameters in cluster nodes REST API document
[ https://issues.apache.org/jira/browse/YARN-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7525: --- Fix Version/s: (was: 2.9.1) (was: 3.0.0-alpha4) > Incorrect query parameters in cluster nodes REST API document > - > > Key: YARN-7525 > URL: https://issues.apache.org/jira/browse/YARN-7525 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7525.001.patch > > > Recently we used the cluster nodes REST API and found that the query parameters (state > and healthy) in the document do not exist. > The query parameters currently in the document are: > {noformat} > * state - the state of the node > * healthy - true or false > {noformat} > The correct query parameter should be: > {noformat} > * states - the states of the node, specified as a comma-separated list. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7525) Incorrect query parameters in cluster nodes REST API document
[ https://issues.apache.org/jira/browse/YARN-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7525: --- Attachment: YARN-7525.001.patch > Incorrect query parameters in cluster nodes REST API document > - > > Key: YARN-7525 > URL: https://issues.apache.org/jira/browse/YARN-7525 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7525.001.patch > > > Recently we used the cluster nodes REST API and found that the query parameters (state > and healthy) in the document do not exist. > The query parameters currently in the document are: > {noformat} > * state - the state of the node > * healthy - true or false > {noformat} > The correct query parameter should be: > {noformat} > * states - the states of the node, specified as a comma-separated list. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7525) Incorrect query parameters in cluster nodes REST API document
Tao Yang created YARN-7525: -- Summary: Incorrect query parameters in cluster nodes REST API document Key: YARN-7525 URL: https://issues.apache.org/jira/browse/YARN-7525 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang Assignee: Tao Yang Priority: Minor Recently we used the cluster nodes REST API and found that the query parameters (state and healthy) in the document do not exist. The query parameters currently in the document are: {noformat} * state - the state of the node * healthy - true or false {noformat} The correct query parameter should be: {noformat} * states - the states of the node, specified as a comma-separated list. {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
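For illustration, a request using the corrected parameter would look like the following (the RM address and the node states shown here are placeholders, not taken from the issue):
{noformat}
GET http://<rm-http-address:port>/ws/v1/cluster/nodes?states=RUNNING,UNHEALTHY
{noformat}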
[jira] [Commented] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256364#comment-16256364 ] Tao Yang commented on YARN-7508: Thanks [~sunilg] and [~bibinchundatt] for your review and comments. Other instances of similar usage seem fine since they can guarantee that {{schedulerContainer.getSchedulerNode().getReservedContainer()}} is not null. > NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated > reserved proposal in async-scheduling mode > > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a non-existent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container
[ https://issues.apache.org/jira/browse/YARN-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7511: --- Attachment: YARN-7511.001.patch Attaching v1 patch for review. > NPE in ContainerLocalizer when localization failed for running container > > > Key: YARN-7511 > URL: https://issues.apache.org/jira/browse/YARN-7511 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7511.001.patch > > > Error log: > {noformat} > 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) > at > java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:834) > 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. > {noformat} > Reproduce this problem: > 1. Container was running and ContainerManagerImpl#localize was called for > this container > 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and > sent out ContainerResourceFailedEvent with null LocalResourceRequest. > 3. NPE when ResourceLocalizationFailedWhileRunningTransition#transition --> > container.resourceSet.resourceLocalizationFailed(null) > I think we can fix this problem through ensuring that request is not null > before remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container
Tao Yang created YARN-7511: -- Summary: NPE in ContainerLocalizer when localization failed for running container Key: YARN-7511 URL: https://issues.apache.org/jira/browse/YARN-7511 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang Assignee: Tao Yang Error log: {noformat} 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:834) 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. {noformat} To reproduce this problem: 1. Container was running and ContainerManagerImpl#localize was called for this container 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and sent out a ContainerResourceFailedEvent with a null LocalResourceRequest. 3. NPE when ResourceLocalizationFailedWhileRunningTransition#transition --> container.resourceSet.resourceLocalizationFailed(null) I think we can fix this problem by ensuring that the request is not null before removing it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
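A minimal sketch of the proposed null check, assuming it lands in ResourceSet#resourceLocalizationFailed (the field name and surrounding logic are simplified assumptions, not the actual patch):
{code}
public void resourceLocalizationFailed(LocalResourceRequest request) {
  // The event may carry a null request when localization fails before a
  // specific resource is resolved; skip the map removal in that case to
  // avoid the NPE inside ConcurrentHashMap#remove.
  if (request == null) {
    return;
  }
  pendingResources.remove(request);
}
{code}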
[jira] [Updated] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7509: --- Attachment: YARN-7509.001.patch Attaching v1 patch for review. > AsyncScheduleThread and ResourceCommitterService are still running after RM > is transitioned to standby > -- > > Key: YARN-7509 > URL: https://issues.apache.org/jira/browse/YARN-7509 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7509.001.patch > > > After RM is transitioned to standby, AsyncScheduleThread and > ResourceCommitterService will receive an interrupt signal. When the thread is > sleeping, it will ignore the interrupt signal since InterruptedException is > caught inside and the interrupt status is cleared. > For AsyncScheduleThread, InterruptedException is caught and ignored in > CapacityScheduler#schedule. > For ResourceCommitterService, InterruptedException is caught and > ignored in ResourceCommitterService#run. > We should let the interrupt signal propagate and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
Tao Yang created YARN-7509: -- Summary: AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby Key: YARN-7509 URL: https://issues.apache.org/jira/browse/YARN-7509 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang After RM is transitioned to standby, AsyncScheduleThread and ResourceCommitterService will receive interrupt signal. When thread is sleeping, it will ignore the interrupt signal since InterruptedException is catched inside and the interrupt signal is cleared. For AsyncScheduleThread, InterruptedException was catched and ignored in CapacityScheduler#schedule. For ResourceCommitterService, InterruptedException was catched inside and ignored in ResourceCommitterService#run. We should let the interrupt signal out and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
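A generic sketch of the intended thread behavior (this is the usual interrupt-handling pattern, not the actual CapacityScheduler code): restore the interrupt status instead of swallowing it, so the run loop can observe it and exit.
{code}
while (!Thread.currentThread().isInterrupted()) {
  try {
    doScheduleOrCommit();   // hypothetical unit of work that may sleep
  } catch (InterruptedException e) {
    // Restore the interrupt status so the loop condition sees it and the
    // thread exits instead of silently continuing after RM failover.
    Thread.currentThread().interrupt();
  }
}
{code}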
[jira] [Assigned] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang reassigned YARN-7509: -- Assignee: Tao Yang > AsyncScheduleThread and ResourceCommitterService are still running after RM > is transitioned to standby > -- > > Key: YARN-7509 > URL: https://issues.apache.org/jira/browse/YARN-7509 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > > After RM is transitioned to standby, AsyncScheduleThread and > ResourceCommitterService will receive an interrupt signal. When the thread is > sleeping, it will ignore the interrupt signal since InterruptedException is > caught inside and the interrupt status is cleared. > For AsyncScheduleThread, InterruptedException is caught and ignored in > CapacityScheduler#schedule. > For ResourceCommitterService, InterruptedException is caught and > ignored in ResourceCommitterService#run. > We should let the interrupt signal propagate and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7508: --- Attachment: YARN-7508.001.patch Uploading v1 patch. [~sunilg], could you help review it, please? > NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated > reserved proposal in async-scheduling mode > > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a non-existent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode
Tao Yang created YARN-7508: -- Summary: NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode Key: YARN-7508 URL: https://issues.apache.org/jira/browse/YARN-7508 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha4, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang YARN-6678 has fixed the IllegalStateException problem, but the debug log it added may cause an NPE when trying to print the containerId of a non-existent reserved container on this node. Replacing {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
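A sketch of the suggested change to the debug log (the message text is illustrative; the point is only that the possibly-null reserved container is no longer dereferenced):
{code}
if (LOG.isDebugEnabled()) {
  // Log the reserved container object itself; it may be null when an
  // outdated reserved proposal is committed, so it must not be dereferenced.
  LOG.debug("Node " + schedulerContainer.getSchedulerNode().getNodeID()
      + " is already reserved by another container: "
      + schedulerContainer.getSchedulerNode().getReservedContainer());
}
{code}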
[jira] [Updated] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7461: --- Attachment: YARN-7461.003.patch Updating the patch to skip the ratio calculation for resource types whose left value and right value are both zero. [~templedf], could you help review it, please? > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch, > YARN-7461.003.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
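A simplified model of the skip logic described above (left/right and numResourceTypes stand for the per-type values of the two Resource objects; this is not the exact patch code):
{code}
float ratio = 0.0f;
for (int i = 0; i < numResourceTypes; i++) {
  long lhs = left[i];
  long rhs = right[i];
  if (lhs == 0 && rhs == 0) {
    // Both values are zero for this resource type: it carries no
    // information, so skip it instead of computing 0/0 = NaN.
    continue;
  }
  // A non-zero lhs with rhs == 0 still yields INFINITY, meaning lhs can
  // never fit into rhs.
  ratio = Math.max(ratio, (float) lhs / rhs);
}
{code}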
[jira] [Updated] (YARN-7489) ConcurrentModificationException in RMAppImpl#getRMAppMetrics
[ https://issues.apache.org/jira/browse/YARN-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7489: --- Attachment: YARN-7489.001.patch > ConcurrentModificationException in RMAppImpl#getRMAppMetrics > > > Key: YARN-7489 > URL: https://issues.apache.org/jira/browse/YARN-7489 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7489.001.patch > > > The REST clients have sometimes failed to query applications through apps > REST API in RMWebService and it happened when iterating > attempts(RMWebServices#getApps --> AppInfo# --> > RMAppImpl#getRMAppMetrics) and meanwhile these attempts > changed(AttemptFailedTransition#transition --> > RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). > Application state changed within the lockup period of writeLock in RMAppImpl, > so that we can add readLock before iterating attempts to fix this problem. > Exception stack: > {noformat} > java.util.ConcurrentModificationException > at > java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) > at > java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.(AppInfo.java:199) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) > at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7489) ConcurrentModificationException in RMAppImpl#getRMAppMetrics
[ https://issues.apache.org/jira/browse/YARN-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7489: --- Description: The REST clients have sometimes failed to query applications through apps REST API in RMWebService and it happened when iterating attempts(RMWebServices#getApps --> AppInfo# --> RMAppImpl#getRMAppMetrics) and meanwhile these attempts changed(AttemptFailedTransition#transition --> RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). Application state changed within the lockup period of writeLock in RMAppImpl, so that we can add readLock before iterating attempts to fix this problem. Exception stack: {noformat} java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.(AppInfo.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) {noformat} was: The REST clients have sometimes failed to 
query applications through apps REST API in RMWebService and it happened when iterating attempts(RMWebServices#getApps --> AppInfo# --> RMAppImpl#getRMAppMetrics) and meanwhile these attempts changed(AttemptFailedTransition#transition --> RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). Application state changed within the lockup period of writeLock in RMAppImpl, so that we can add readLock before iterating attempts to fix this problem. Error logs: {noformat} java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.(AppInfo.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at
[jira] [Created] (YARN-7489) ConcurrentModificationException in RMAppImpl#getRMAppMetrics
Tao Yang created YARN-7489: -- Summary: ConcurrentModificationException in RMAppImpl#getRMAppMetrics Key: YARN-7489 URL: https://issues.apache.org/jira/browse/YARN-7489 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: Tao Yang Assignee: Tao Yang The REST clients have sometimes failed to query applications through the apps REST API in RMWebServices, and it happened when iterating attempts (RMWebServices#getApps --> AppInfo#<init> --> RMAppImpl#getRMAppMetrics) while these attempts were changed concurrently (AttemptFailedTransition#transition --> RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). Application state changes are made while holding the writeLock in RMAppImpl, so we can add the readLock before iterating attempts to fix this problem. Error logs: {noformat} java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.<init>(AppInfo.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
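A sketch of the proposed read-lock usage around the attempt iteration in RMAppImpl#getRMAppMetrics (simplified excerpt; the aggregation shown is illustrative, the point is only the locking structure):
{code}
this.readLock.lock();
try {
  for (RMAppAttempt attempt : attempts.values()) {
    // Iterate attempts under the read lock so a concurrent
    // createAndStartNewAttempt() (which holds the write lock) cannot
    // modify the attempts map mid-iteration.
    RMAppAttemptMetrics attemptMetrics = attempt.getRMAppAttemptMetrics();
    resourcePreempted = Resources.add(
        resourcePreempted, attemptMetrics.getResourcePreempted());
  }
} finally {
  this.readLock.unlock();
}
{code}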
[jira] [Updated] (YARN-7471) queueUsagePercentage is wrongly calculated for applications in zero-capacity queues
[ https://issues.apache.org/jira/browse/YARN-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7471: --- Attachment: YARN-7471.001.patch > queueUsagePercentage is wrongly calculated for applications in zero-capacity > queues > --- > > Key: YARN-7471 > URL: https://issues.apache.org/jira/browse/YARN-7471 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7471.001.patch > > > For applications in zero-capacity queues, queueUsagePercentage is wrongly > calculated to INFINITY with the expression (queueUsagePercentage = usedResource / > (totalPartitionRes * queueAbsMaxCapPerPartition)) when > queueAbsMaxCapPerPartition=0. > We can add a precondition (queueAbsMaxCapPerPartition != 0) before this > calculation to fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7471) queueUsagePercentage is wrongly calculated for applications in zero-capacity queues
Tao Yang created YARN-7471: -- Summary: queueUsagePercentage is wrongly calculated for applications in zero-capacity queues Key: YARN-7471 URL: https://issues.apache.org/jira/browse/YARN-7471 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha4 Reporter: Tao Yang Assignee: Tao Yang For applications in zero-capacity queues, queueUsagePercentage is wrongly calculated to INFINITY with the expression (queueUsagePercentage = usedResource / (totalPartitionRes * queueAbsMaxCapPerPartition)) when queueAbsMaxCapPerPartition=0. We can add a precondition (queueAbsMaxCapPerPartition != 0) before this calculation to fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
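A sketch of the proposed precondition (variable names follow the description above; the surrounding calculation is simplified and not the actual patch):
{code}
float queueUsagePerc = 0.0f;
// Guard against zero-capacity queues: when queueAbsMaxCapPerPartition is 0,
// the division below would produce INFINITY, so skip it and report 0.
if (Math.abs(queueAbsMaxCapPerPartition) > 1e-8) {
  queueUsagePerc = usedResource
      / (totalPartitionRes * queueAbsMaxCapPerPartition);
}
{code}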
[jira] [Commented] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247143#comment-16247143 ] Tao Yang commented on YARN-7461: Thanks [~templedf] for your comments. I wrongly assumed that lhs fits in rhs and ignored the case you mentioned. I think the correct results with zero values for DominantResourceCalculator#ratio should be: <1,1,0> / <1,1,1> = 1; <1,1,1> / <1,1,0> = INFINITY; <1,1,0> / <1,1,0> = 1; Thoughts? > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245157#comment-16245157 ] Tao Yang edited comment on YARN-7461 at 11/9/17 3:33 AM: - Thanks [~templedf] for your comments. The code I added earlier to reproduce our problem is not necessary, thanks for reminding me. Replaced it with {{setupExtraResource()}} in the v2 patch. was (Author: tao yang): Thanks [~templedf] for your comments. The code I added earlier to reproduce our problem is not necessary, so I replaced it with {{setupExtraResource()}} in the v2 patch. > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7461: --- Attachment: YARN-7461.002.patch Thanks [~templedf] for your comments. The code I added earlier to reproduce our problem is not necessary, so I replaced it with {{setupExtraResource()}} in the v2 patch. > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7461: --- Attachment: YARN-7461.001.patch > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
Tao Yang created YARN-7461: -- Summary: DominantResourceCalculator#ratio calculation problem when right resource contains zero value Key: YARN-7461 URL: https://issues.apache.org/jira/browse/YARN-7461 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha4 Reporter: Tao Yang Priority: Minor Currently DominantResourceCalculator#ratio may return a wrong result when the right resource contains a zero value. For example, with three resource types, leftResource=<5, 5, 0> and rightResource=<10, 10, 0>, we expect the result of DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is currently NaN. There should be a verification before the division to ensure that the dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6737) Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146586#comment-16146586 ] Tao Yang edited comment on YARN-6737 at 8/30/17 4:33 AM: - Upload v1 patch for trunk. Sorry to be late for this update. I have scanned all the usages of AbstractYarnScheduler#getApplicationAttempt and CapacityScheduler#getApplicationAttempt and found one potential problem in QueuePriorityContainerCandidateSelector#preChecksForMovingReservedContainerToNode. {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (!app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} NPE should happen here if app is no longer exist, I think we can correct it through adding null check for app like this (the outer caller will skip this invalid reservedContainer): {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (app == null || !app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} [~sunilg] Please help to review this patch. Thanks! was (Author: tao yang): Upload v1 patch for trunk. Sorry to be late for this update. I have scanned all the usages of AbstractYarnScheduler#getApplicationAttempt and CapacityScheduler#getApplicationAttempt and found one potential problem in QueuePriorityContainerCandidateSelector#preChecksForMovingReservedContainerToNode. {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (!app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} NPE should happen here if app is no longer exist, I think we can correct it through adding null check for app like this (the outer caller will skip this invalid reservedContainer): {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (app == null || !app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} [~sunilg] Please help to review this patch. Thanks! > Rename getApplicationAttempt to getCurrentAttempt in > AbstractYarnScheduler/CapacityScheduler > > > Key: YARN-6737 > URL: https://issues.apache.org/jira/browse/YARN-6737 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Priority: Minor > Attachments: YARN-6737.001.patch > > > As discussed in YARN-6714 > (https://issues.apache.org/jira/browse/YARN-6714?focusedCommentId=16052158=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16052158) > AbstractYarnScheduler#getApplicationAttempt is inconsistent to its name, it > discarded application_attempt_id and always return the latest attempt. We > should: 1) Rename it to getCurrentAttempt, 2) Change parameter from attemptId > to applicationId. 3) Took a scan of all usages to see if any similar issue > could happen. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6737) Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6737: --- Attachment: YARN-6737.001.patch Uploading v1 patch for trunk. Sorry to be late with this update. I have scanned all the usages of AbstractYarnScheduler#getApplicationAttempt and CapacityScheduler#getApplicationAttempt and found one potential problem in QueuePriorityContainerCandidateSelector#preChecksForMovingReservedContainerToNode. {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (!app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} An NPE could happen here if the app no longer exists. I think we can correct it by adding a null check for app like this (the outer caller will skip this invalid reservedContainer): {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (app == null || !app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} [~sunilg] Please help to review this patch. Thanks! > Rename getApplicationAttempt to getCurrentAttempt in > AbstractYarnScheduler/CapacityScheduler > > > Key: YARN-6737 > URL: https://issues.apache.org/jira/browse/YARN-6737 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Priority: Minor > Attachments: YARN-6737.001.patch > > > As discussed in YARN-6714 > (https://issues.apache.org/jira/browse/YARN-6714?focusedCommentId=16052158=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16052158) > AbstractYarnScheduler#getApplicationAttempt is inconsistent with its name: it > discards the application_attempt_id and always returns the latest attempt. We > should: 1) Rename it to getCurrentAttempt, 2) Change the parameter from attemptId > to applicationId. 3) Take a scan of all usages to see if any similar issue > could happen. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146470#comment-16146470 ] Tao Yang commented on YARN-7037: Thanks [~djp] for review and commit ! > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Fix For: 2.9.0, 3.0.0-beta1, 2.8.3 > > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141364#comment-16141364 ] Tao Yang commented on YARN-7037: Thanks [~djp] for looking into the issue. I chose to add a new method since this optimization cannot cover all use cases; zero-copy is only suitable for local reads. LogToolUtils#outputContainerLog is used for both the local log, which can be optimized through FileInputStream, and the aggregated log, which can't because it is transferred by DataInputStream from a remote node. > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With the zero-copy approach, > the data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to the pipeline (disk --> read buffer --> socket buffer). > In my local test, the time cost of copying a 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
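A minimal sketch of the zero-copy path for the local-log case, using java.nio FileChannel#transferTo (the logFile, offset, length and outputStream variables are placeholders for the NMWebServices plumbing, which is omitted here):
{code}
try (FileInputStream fis = new FileInputStream(logFile)) {
  FileChannel in = fis.getChannel();
  WritableByteChannel out = Channels.newChannel(outputStream);
  long position = offset;
  long remaining = length;
  while (remaining > 0) {
    // transferTo lets the kernel move bytes from the page cache toward the
    // socket without copying them through a user-space NM buffer.
    long transferred = in.transferTo(position, remaining, out);
    if (transferred <= 0) {
      break;  // reached EOF or the target cannot accept more data
    }
    position += transferred;
    remaining -= transferred;
  }
}
{code}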
[jira] [Updated] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7037: --- Attachment: YARN-7037.001.patch YARN-7037.branch-2.8.001.patch Upload v1 patch for trunk and update v1 patch for branch-2.8(There is no need to close i/o channel). > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7037: --- Attachment: (was: YARN-7037.branch-2.8.001.patch) > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6257: --- Attachment: YARN-6257.002.patch Uploading v2 patch for review. The RM REST document has been updated. > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains duplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch, YARN-6257.002.patch > > > In the response string of the CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo has the duplicate key 'entry' as a > JSON object: > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of the operationsInfo field in the > CapacitySchedulerHealthInfo class should be converted from Map to List. > After converting to List, the operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
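A sketch of what the List-based field in CapacitySchedulerHealthInfo could look like (the field and element type names here are illustrative assumptions; the actual patch may differ):
{code}
// A List serializes as a JSON array of objects, so each operation becomes
// one element of "operationInfos" instead of a repeated "entry" key.
@XmlElement(name = "operationInfos")
private List<OperationInformation> operationInfos = new ArrayList<>();
{code}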
[jira] [Commented] (YARN-6259) Support pagination and optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130303#comment-16130303 ] Tao Yang commented on YARN-6259: Thanks [~djp] for your suggestions. It makes sense to me. I have created YARN-7037 to handle performance improvement and the patch of this issue will be updated later. I noticed that there are many differences between 2.8 and 2.9/trunk, 2.9/trunk supports getting head or tail part of log file. It's close to our requirements but still not enough to support pagination. > Support pagination and optimize data transfer with zero-copy approach for > containerlogs REST API in NMWebServices > - > > Key: YARN-6259 > URL: https://issues.apache.org/jira/browse/YARN-6259 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6259.001.patch > > > Currently containerlogs REST API in NMWebServices will read and send the > entire content of container logs. Most of container logs are large and it's > useful to support pagination. > * Add pagesize and pageindex parameters for containerlogs REST API > {code} > URL: http:///ws/v1/node/containerlogs// > QueryParams: > pagesize - max bytes of one page , default 1MB > pageindex - index of required page, default 0, can be nagative(set -1 will > get the last page content) > {code} > * Add containerlogs-info REST API since sometimes we need to know the > totalSize/pageSize/pageCount info of log > {code} > URL: > http:///ws/v1/node/containerlogs-info// > QueryParams: > pagesize - max bytes of one page , default 1MB > Response example: > {"logInfo":{"totalSize":2497280,"pageSize":1048576,"pageCount":3}} > {code} > Moreover, the data transfer pipeline (disk --> read buffer --> NM buffer --> > socket buffer) can be optimized to pipeline(disk --> read buffer --> socket > buffer) with zero-copy approach. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
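For illustration, the paging arithmetic implied by the proposed pagesize/pageindex parameters and the logInfo example above, as a standalone sketch rather than NMWebServices code. The main method reproduces the example response: totalSize=2497280 and pageSize=1048576 give pageCount=3.
{code}
/**
 * Paging arithmetic sketch for the proposed containerlogs REST parameters.
 */
public class LogPageCalculator {
  /** Returns {offset, length} of the requested page within the log file. */
  public static long[] pageRange(long totalSize, long pageSize, int pageIndex) {
    if (pageSize <= 0) {
      throw new IllegalArgumentException("pagesize must be positive");
    }
    long pageCount = (totalSize + pageSize - 1) / pageSize;  // ceiling division
    // A negative index counts from the end: -1 means the last page.
    long index = pageIndex >= 0 ? pageIndex : pageCount + pageIndex;
    if (index < 0 || index >= pageCount) {
      throw new IllegalArgumentException("pageindex out of range: " + pageIndex);
    }
    long offset = index * pageSize;
    long length = Math.min(pageSize, totalSize - offset);
    return new long[] { offset, length };
  }

  public static void main(String[] args) {
    // Matches the example: totalSize=2497280, pageSize=1048576 -> pageCount=3.
    long[] lastPage = pageRange(2497280L, 1048576L, -1);
    System.out.println("offset=" + lastPage[0] + ", length=" + lastPage[1]);
  }
}
{code}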
[jira] [Updated] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7037: --- Attachment: YARN-7037.branch-2.8.001.patch Upload v1 patch for review. > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
Tao Yang created YARN-7037: -- Summary: Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices Key: YARN-7037 URL: https://issues.apache.org/jira/browse/YARN-7037 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.8.3 Reporter: Tao Yang Assignee: Tao Yang Split this improvement from YARN-6259. It's useful to read container logs more efficiently. With the zero-copy approach, the data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) can be optimized to (disk --> read buffer --> socket buffer). In my local test, the time to copy a 256MB file was reduced from 12 seconds to 2.5 seconds with zero-copy. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129955#comment-16129955 ] Tao Yang commented on YARN-6257: Thanks [~sunilg] and [~leftnoteasy]. it makes sense to me. I will update the document and upload a new patch for review. > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains deplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch > > > In response string of CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo have duplicate key 'entry' as a > JSON object : > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of operationsInfo field in > CapacitySchedulerHealthInfo class should be converted from Map to List. > After convert to List, The operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5683) Support specifying storage type for per-application local dirs
[ https://issues.apache.org/jira/browse/YARN-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-5683: --- Description: h3. Introduction * Some applications of various frameworks (Flink, Spark, MapReduce etc.) that use local storage (checkpoint, shuffle etc.) might require high IO performance. It's useful to allocate local directories on high performance storage media for these applications on heterogeneous clusters. * YARN does not distinguish different storage types and hence applications cannot selectively use storage media with different performance characteristics. Adding awareness of storage media can allow YARN to make better decisions about the placement of local directories. h3. Approach * NodeManager will distinguish storage types for local directories. ** The yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs configurations should allow the cluster administrator to optionally specify the storage type for each local directory. Example: [SSD]/disk1/nm-local-dir,/disk2/nm-local-dir,/disk3/nm-local-dir (equivalent to [SSD]/disk1/nm-local-dir,[DISK]/disk2/nm-local-dir,[DISK]/disk3/nm-local-dir) ** StorageType defines the DISK/SSD storage types and takes DISK as the default storage type. ** StorageLocation separates storage type and directory path, and is used by LocalDirAllocator to be aware of the storage types of local dirs; the default storage type is DISK. ** The getLocalPathForWrite method of LocalDirAllocator will prefer the local directory of the specified storage type, and will fall back to ignoring the storage type if the requirement cannot be satisfied. ** Support for container-related local/log directories by ContainerLaunch. All application frameworks can set the environment variables (LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE) to specify the desired storage type of local/log directories, and can choose not to launch the container on fallback through these environment variables (ENSURE_LOCAL_STORAGE_TYPE and ENSURE_LOG_STORAGE_TYPE). * Allow specifying the storage type for various frameworks (take MapReduce as an example) ** New configurations should allow the application administrator to optionally specify the storage type of local/log directories and the fallback strategy (MapReduce configurations: mapreduce.job.local-storage-type, mapreduce.job.log-storage-type, mapreduce.job.ensure-local-storage-type and mapreduce.job.ensure-log-storage-type). ** Support for container work directories. Set the environment variables, including LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE, according to the configurations above for ContainerLaunchContext and ApplicationSubmissionContext. (MapReduce should update YARNRunner and TaskAttemptImpl) ** Add a storage type prefix to the request path to support other local directories of frameworks (such as shuffle directories for MapReduce). (MapReduce should update YarnOutputFiles, MROutputFiles and YarnChild to support output/work directories) ** Flow diagram for MapReduce framework !flow_diagram_for_MapReduce-2.png! h3. Further Discussion * Scheduling : The requirement of storage type for local/log directories may not be satisfied for a part of the nodes on heterogeneous clusters. To achieve a global optimum, the scheduler should be aware of and manage disk resources. ** Approach-1: Based on node attributes (YARN-3409), the scheduler can allocate containers that require SSD on nodes with attribute:ssd=true. 
** Approach-2: Based on the extended resource model (YARN-3926), it's easy to support scheduling by extending resource models like vdisk and vssd with this feature, but hard to measure for applications and to isolate for non-CFQ based disks. * The fallback strategy still needs to be considered. Certain applications might not work well when the storage type requirement is not satisfied. When no disk of the desired storage type is available, should the container launch fail, or should the AM handle it? We have implemented a fallback strategy that fails the container launch when no disk of the desired storage type is available. Are there better approaches? This feature has been used for half a year to meet the needs of some applications on Alibaba search clusters. Please feel free to give your suggestions and opinions. was: h3. Introduction * Some applications of various frameworks (Flink, Spark and MapReduce etc) using local storage (checkpoint, shuffle etc) might require high IO performance. It's useful to allocate local directories to high performance storage media for these applications on heterogeneous clusters. * YARN does not distinguish different storage types and hence applications cannot selectively use storage media with different performance characteristics. Adding awareness of storage media can allow YARN to make better decisions about the placement of local directories. h3. Approach * NodeManager will distinguish storage types for local directories. **
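For illustration, a sketch of parsing the optional storage-type prefix (e.g. [SSD]/disk1/nm-local-dir) from a configured local-dir entry, as described in the Approach section above. The StorageType enum and parse method are simplified stand-ins, not the YARN-5683 patch.
{code}
/**
 * Sketch of parsing an optional storage-type prefix from a configured
 * local-dir entry. Entries without a prefix fall back to DISK.
 */
public class LocalDirEntry {
  enum StorageType { DISK, SSD }

  final StorageType storageType;
  final String path;

  private LocalDirEntry(StorageType storageType, String path) {
    this.storageType = storageType;
    this.path = path;
  }

  static LocalDirEntry parse(String configured) {
    String trimmed = configured.trim();
    if (trimmed.startsWith("[")) {
      int end = trimmed.indexOf(']');
      if (end > 0) {
        StorageType type =
            StorageType.valueOf(trimmed.substring(1, end).toUpperCase());
        return new LocalDirEntry(type, trimmed.substring(end + 1));
      }
    }
    // No prefix: use the default storage type.
    return new LocalDirEntry(StorageType.DISK, trimmed);
  }

  public static void main(String[] args) {
    for (String dir : "[SSD]/disk1/nm-local-dir,/disk2/nm-local-dir".split(",")) {
      LocalDirEntry e = parse(dir);
      System.out.println(e.storageType + " -> " + e.path);
    }
  }
}
{code}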
[jira] [Updated] (YARN-7004) Add configs cache to optimize refreshQueues performance for large scale of queues
[ https://issues.apache.org/jira/browse/YARN-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7004: --- Summary: Add configs cache to optimize refreshQueues performance for large scale of queues (was: Add configs cache to optimize refreshQueues performance for large scale queues) > Add configs cache to optimize refreshQueues performance for large scale of > queues > - > > Key: YARN-7004 > URL: https://issues.apache.org/jira/browse/YARN-7004 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7004.001.patch > > > We have requirements for large scale queues in our production environment to > serve for many projects. So we did some tests for more than 5000 queues and > found that refreshQueues process took more than 1 minute. The refreshQueues > process costs most of time on iterating over all configurations to get > accessible-node-labels and ordering-policy configs for every queue. > Loading queue configs from cache should be beneficial to reduce time costs > (optimized from 1 minutes to 3 seconds for 5000 queues in our test) when > initializing/reinitializing queues. So I propose to load queue configs into > cache in CapacityScheduler#initializeQueues and > CapacityScheduler#reinitializeQueues. If cache has not be loaded on other > scenes, such as in test cases, it still can get queue configs by iterating > over all configurations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7005) Skip unnecessary sorting and iterating process for child queues without pending resource to optimize schedule performance
Tao Yang created YARN-7005: -- Summary: Skip unnecessary sorting and iterating process for child queues without pending resource to optimize schedule performance Key: YARN-7005 URL: https://issues.apache.org/jira/browse/YARN-7005 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0-alpha4, 2.9.0 Reporter: Tao Yang Currently, even if there is only one pending app in a queue, the scheduling process goes through all queues anyway and spends most of its time sorting and iterating child queues in ParentQueue#assignContainersToChildQueues. IIUIC, queues that have no pending resource can be skipped in the sorting and iterating process to reduce the time cost, especially for a cluster with many queues. Please feel free to correct me if I have missed something. Thanks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
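For illustration, a sketch of the proposed optimization, where only child queues with pending resource are sorted and iterated. ChildQueue is a hypothetical stand-in for the real CSQueue API, so this shows the idea rather than the eventual patch.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Sketch: queues with no pending resource cannot produce an allocation,
 * so they are filtered out before the comparatively expensive sort.
 */
public class PendingAwareAssignment {
  interface ChildQueue {
    long getPendingResource();   // pending resource for this partition
    float getUsedCapacity();     // sort key used by the parent queue
    boolean assignContainers();  // returns true if something was allocated
  }

  static boolean assignToChildQueues(List<ChildQueue> children) {
    List<ChildQueue> candidates = new ArrayList<>();
    for (ChildQueue q : children) {
      if (q.getPendingResource() > 0) {
        candidates.add(q);
      }
    }
    candidates.sort(Comparator.comparingDouble(q -> q.getUsedCapacity()));
    for (ChildQueue q : candidates) {
      if (q.assignContainers()) {
        return true;
      }
    }
    return false;
  }
}
{code}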
[jira] [Updated] (YARN-7004) Add configs cache to optimize refreshQueues performance for large scale queues
[ https://issues.apache.org/jira/browse/YARN-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7004: --- Attachment: YARN-7004.001.patch Uploaded v1 patch for review. > Add configs cache to optimize refreshQueues performance for large scale queues > -- > > Key: YARN-7004 > URL: https://issues.apache.org/jira/browse/YARN-7004 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7004.001.patch > > > We have requirements for large scale queues in our production environment to > serve for many projects. So we did some tests for more than 5000 queues and > found that refreshQueues process took more than 1 minute. The refreshQueues > process costs most of time on iterating over all configurations to get > accessible-node-labels and ordering-policy configs for every queue. > Loading queue configs from cache should be beneficial to reduce time costs > (optimized from 1 minutes to 3 seconds for 5000 queues in our test) when > initializing/reinitializing queues. So I propose to load queue configs into > cache in CapacityScheduler#initializeQueues and > CapacityScheduler#reinitializeQueues. If cache has not be loaded on other > scenes, such as in test cases, it still can get queue configs by iterating > over all configurations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7004) Add configs cache to optimize refreshQueues performance for large scale queues
Tao Yang created YARN-7004: -- Summary: Add configs cache to optimize refreshQueues performance for large scale queues Key: YARN-7004 URL: https://issues.apache.org/jira/browse/YARN-7004 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 3.0.0-alpha4, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang We have a requirement for a large number of queues in our production environment to serve many projects. We ran tests with more than 5000 queues and found that the refreshQueues process took more than 1 minute. The refreshQueues process spends most of its time iterating over all configurations to get the accessible-node-labels and ordering-policy configs for every queue. Loading queue configs from a cache should reduce the time cost (from 1 minute to 3 seconds for 5000 queues in our test) when initializing/reinitializing queues. So I propose to load queue configs into a cache in CapacityScheduler#initializeQueues and CapacityScheduler#reinitializeQueues. If the cache has not been loaded in other scenarios, such as test cases, queue configs can still be obtained by iterating over all configurations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
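For illustration, a sketch of the caching idea in YARN-7004 above: one pass over the configuration builds a snapshot that later per-queue lookups hit in O(1), instead of re-iterating all entries for every queue. Class and method names are illustrative, not the attached patch.
{code}
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a queue-config cache: snapshot all scheduler-related entries once
 * during initialize/reinitialize, then serve per-queue lookups from the map.
 */
public class QueueConfigCache {
  private static final String PREFIX = "yarn.scheduler.capacity.";

  private final Map<String, String> snapshot = new HashMap<>();

  /** Single pass over all configuration entries. */
  public void load(Map<String, String> allConfigs) {
    snapshot.clear();
    for (Map.Entry<String, String> e : allConfigs.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        snapshot.put(e.getKey(), e.getValue());
      }
    }
  }

  /** O(1) lookup, e.g. get("root.a", "accessible-node-labels", null). */
  public String get(String queuePath, String property, String defaultValue) {
    String value = snapshot.get(PREFIX + queuePath + "." + property);
    return value != null ? value : defaultValue;
  }

  /** Callers can fall back to a full scan when the cache was never loaded. */
  public boolean isLoaded() {
    return !snapshot.isEmpty();
  }
}
{code}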
[jira] [Updated] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7003: --- Attachment: YARN-7003.001.patch > DRAINING state of queues can't be recovered after RM restart > > > Key: YARN-7003 > URL: https://issues.apache.org/jira/browse/YARN-7003 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7003.001.patch > > > DRAINING state is a temporary state in RM memory, when queue state is set to > be STOPPED but there are still some pending or active apps in it, the queue > state will be changed to DRAINING instead of STOPPED after refreshing queues. > We've encountered the problem that the state of this queue will aways be > STOPPED after RM restarted, so that it can be removed at any time and leave > some apps in a non-existing queue. > To fix this problem, we could recover DRAINING state in the recovery process > of pending/active apps. I will upload a patch with test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7003: --- Affects Version/s: (was: 3.0.0-alpha3) 3.0.0-alpha4 > DRAINING state of queues can't be recovered after RM restart > > > Key: YARN-7003 > URL: https://issues.apache.org/jira/browse/YARN-7003 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang > > DRAINING state is a temporary state in RM memory, when queue state is set to > be STOPPED but there are still some pending or active apps in it, the queue > state will be changed to DRAINING instead of STOPPED after refreshing queues. > We've encountered the problem that the state of this queue will aways be > STOPPED after RM restarted, so that it can be removed at any time and leave > some apps in a non-existing queue. > To fix this problem, we could recover DRAINING state in the recovery process > of pending/active apps. I will upload a patch with test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7003: --- Affects Version/s: 2.9.0 > DRAINING state of queues can't be recovered after RM restart > > > Key: YARN-7003 > URL: https://issues.apache.org/jira/browse/YARN-7003 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang > > DRAINING state is a temporary state in RM memory, when queue state is set to > be STOPPED but there are still some pending or active apps in it, the queue > state will be changed to DRAINING instead of STOPPED after refreshing queues. > We've encountered the problem that the state of this queue will aways be > STOPPED after RM restarted, so that it can be removed at any time and leave > some apps in a non-existing queue. > To fix this problem, we could recover DRAINING state in the recovery process > of pending/active apps. I will upload a patch with test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
Tao Yang created YARN-7003: -- Summary: DRAINING state of queues can't be recovered after RM restart Key: YARN-7003 URL: https://issues.apache.org/jira/browse/YARN-7003 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha3 Reporter: Tao Yang DRAINING is a temporary state in RM memory: when a queue's state is set to STOPPED but there are still some pending or active apps in it, the queue state will be changed to DRAINING instead of STOPPED after refreshing queues. We've encountered the problem that the state of such a queue will always be STOPPED after the RM restarts, so the queue can be removed at any time, leaving some apps in a non-existent queue. To fix this problem, we could recover the DRAINING state in the recovery process of pending/active apps. I will upload a patch with a test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
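For illustration, a sketch of the proposed recovery fix, where a queue whose configured state is STOPPED is moved back to DRAINING as soon as a pending/active app is recovered into it. The types here are simplified stand-ins, not the CapacityScheduler code.
{code}
/**
 * Sketch: during app recovery, a STOPPED queue that still holds apps is
 * put back into DRAINING, matching its state before the RM restart.
 */
public class DrainingStateRecovery {
  enum QueueState { RUNNING, DRAINING, STOPPED }

  static class Queue {
    QueueState state;
    int numApplications;

    Queue(QueueState state) {
      this.state = state;
    }

    /** Called for every application recovered into this queue. */
    void recoverApplication() {
      numApplications++;
      if (state == QueueState.STOPPED) {
        state = QueueState.DRAINING;
      }
    }
  }

  public static void main(String[] args) {
    Queue q = new Queue(QueueState.STOPPED);
    q.recoverApplication();
    System.out.println(q.state);  // DRAINING
  }
}
{code}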
[jira] [Resolved] (YARN-6044) Resource bar of Capacity Scheduler UI does not show correctly
[ https://issues.apache.org/jira/browse/YARN-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang resolved YARN-6044. Resolution: Duplicate > Resource bar of Capacity Scheduler UI does not show correctly > - > > Key: YARN-6044 > URL: https://issues.apache.org/jira/browse/YARN-6044 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.0 >Reporter: Tao Yang >Priority: Minor > > Test Environment: > 1. NodeLabel > yarn rmadmin -addToClusterNodeLabels "label1(exclusive=false)" > 2. capacity-scheduler.xml > yarn.scheduler.capacity.root.queues=a,b > yarn.scheduler.capacity.root.a.capacity=60 > yarn.scheduler.capacity.root.b.capacity=40 > yarn.scheduler.capacity.root.a.accessible-node-labels=label1 > yarn.scheduler.capacity.root.accessible-node-labels.label1.capacity=100 > yarn.scheduler.capacity.root.a.accessible-node-labels.label1.capacity=100 > In this test case, for queue(root.b) in partition(label1), the resource > bar(representing absolute-max-capacity) should be 100%(default). The scheduler > UI shows correctly after RM started, but when I started an app in > queue(root.b) and partition(label1), the resource bar of this queue was > changed from 100% to 0%. > For the correct queue(root.a), the queueCapacities of partition(label1) was > initialized in the ParentQueue/LeafQueue constructor and > max-capacity/absolute-max-capacity were set with the correct value, because > yarn.scheduler.capacity.root.a.accessible-node-labels is defined in > capacity-scheduler.xml > For the incorrect queue(root.b), the queueCapacities of partition(label1) didn't > exist at first; the max-capacity and absolute-max-capacity were set with the > default value (100%) in PartitionQueueCapacitiesInfo so that the Scheduler UI > could show correctly. When this queue was allocating resource for > partition(label1), the queueCapacities of partition(label1) was created and > only used-capacity and absolute-used-capacity were set in > AbstractCSQueue#allocateResource. max-capacity and absolute-max-capacity have > to use the float default value 0 defined in QueueCapacities$Capacities. > Should max-capacity and absolute-max-capacity have a default > value (100%) in the Capacities constructor to avoid losing the default value when a > caller does not provide one? > Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
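For illustration, the default-value suggestion at the end of the description above, sketched as a simplified Capacities-like class where max-capacity and absolute-max-capacity default to 100% instead of the float default 0. Field names are illustrative only.
{code}
/**
 * Sketch: default max capacities to 100% so a partition that is first
 * touched by allocation does not report 0% in the UI/REST output.
 */
class CapacitiesSketch {
  float usedCapacity = 0f;
  float absoluteUsedCapacity = 0f;
  // Default to 100% instead of the implicit float default 0, so readers see
  // a sensible maximum even if no explicit value was set for the partition.
  float maximumCapacity = 1.0f;
  float absoluteMaximumCapacity = 1.0f;
}
{code}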
[jira] [Commented] (YARN-6044) Resource bar of Capacity Scheduler UI does not show correctly
[ https://issues.apache.org/jira/browse/YARN-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124767#comment-16124767 ] Tao Yang commented on YARN-6044: Thanks [~djp] and [~sunilg] for your reply. The solution makes sense to me. > Resource bar of Capacity Scheduler UI does not show correctly > - > > Key: YARN-6044 > URL: https://issues.apache.org/jira/browse/YARN-6044 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.0 >Reporter: Tao Yang >Priority: Minor > > Test Environment: > 1. NodeLable > yarn rmadmin -addToClusterNodeLabels "label1(exclusive=false)" > 2. capacity-scheduler.xml > yarn.scheduler.capacity.root.queues=a,b > yarn.scheduler.capacity.root.a.capacity=60 > yarn.scheduler.capacity.root.b.capacity=40 > yarn.scheduler.capacity.root.a.accessible-node-labels=label1 > yarn.scheduler.capacity.root.accessible-node-labels.label1.capacity=100 > yarn.scheduler.capacity.root.a.accessible-node-labels.label1.capacity=100 > In this test case, for queue(root.b) in partition(label1), the resource > bar(represents absolute-max-capacity) should be 100%(default). The scheduler > UI shows correctly after RM started, but when I started an app in > queue(root.b) and partition(label1) , the resource bar of this queue is > changed from 100% to 0%. > For corrent queue(root.a), the queueCapacities of partition(label1) was > inited in ParentQueue/LeafQueue constructor and > max-capacity/absolute-max-capacity were setted with correct value, due to > yarn.scheduler.capacity.root.a.accessible-node-labels is defined in > capacity-scheduler.xml > For incorrent queue(root.b), the queueCapacities of partition(label1) didn't > exist at first, the max-capacity and absolute-max-capacity were setted with > default value(100%) in PartitionQueueCapacitiesInfo so that Scheduler UI > could show correctly. When this queue was allocating resource for > partition(label1), the queueCapacities of partition(label1) was created and > only used-capacity and absolute-used-capacity were setted in > AbstractCSQueue#allocateResource. max-capacity and absolute-max-capacity have > to use float default value 0 which are defined in QueueCapacities$Capacities. > Whether max-capacity and absolute-max-capacity should have default > value(100%) in Capacities constructor to avoid losing default value if > somewhere called not given? > Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Attachment: YARN-6629.002.patch Uploaded a new patch with test case. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6629.001.patch, YARN-6629.002.patch > > > I wrote a test case to reproduce another problem for branch-2 and found new > NPE error, log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > {code} > Reproduce this error in chronological order: > 1. AM started and requested 1 container with schedulerRequestKey#1 : > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests > Added schedulerRequestKey#1 into schedulerKeyToPlacementSets > 2. Scheduler allocatd 1 container for this request and accepted the proposal > 3. 
AM removed this request > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests --> > AppSchedulingInfo#addToPlacementSets --> > AppSchedulingInfo#updatePendingResources > Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) > 4. Scheduler applied this proposal > CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> > AppSchedulingInfo#allocate > Throw NPE when called > schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, > type, node); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
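For illustration, a sketch of the defensive check implied by the stack trace above. When a commit is applied after the AM has already removed the resource request, the placement set for that scheduler key may be gone, so the proposal should be rejected instead of dereferencing null. The types are simplified stand-ins for AppSchedulingInfo internals, not the actual patch.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch: reject an outdated proposal when its placement set no longer
 * exists, instead of throwing NullPointerException.
 */
public class AllocateGuardSketch {
  static class PlacementSet {
    void allocate(String schedulerKey) {
      // bookkeeping for the allocated container would happen here
    }
  }

  private final Map<String, PlacementSet> schedulerKeyToPlacementSets =
      new ConcurrentHashMap<>();

  /** Returns false if the request behind this key no longer exists. */
  boolean allocate(String schedulerKey) {
    PlacementSet ps = schedulerKeyToPlacementSets.get(schedulerKey);
    if (ps == null) {
      // Request was removed between the accept and apply phases; skip it.
      return false;
    }
    ps.allocate(schedulerKey);
    return true;
  }
}
{code}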
[jira] [Commented] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124569#comment-16124569 ] Tao Yang commented on YARN-6629: Sorry for the late reply. Thanks [~sunilg] for reviewing this issue. Yes, It's happening in trunk as well. I will write a test case and update the patch later. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6629.001.patch > > > I wrote a test case to reproduce another problem for branch-2 and found new > NPE error, log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > {code} > Reproduce this error in chronological order: > 1. AM started and requested 1 container with schedulerRequestKey#1 : > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests > Added schedulerRequestKey#1 into schedulerKeyToPlacementSets > 2. Scheduler allocatd 1 container for this request and accepted the proposal > 3. 
AM removed this request > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests --> > AppSchedulingInfo#addToPlacementSets --> > AppSchedulingInfo#updatePendingResources > Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) > 4. Scheduler applied this proposal > CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> > AppSchedulingInfo#allocate > Throw NPE when called > schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, > type, node); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124563#comment-16124563 ] Tao Yang commented on YARN-6257: [~leftnoteasy], thanks for the reply. Yes, duplicated keys in JSON object is completely unconsumable by clients. Take the parse results with different json-libs for example, we will get JSONException(Duplicated Key ...) if using org.json, and will get the last entry(lose other entries) if use org.codehaus.jettison > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains deplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch > > > In response string of CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo have duplicate key 'entry' as a > JSON object : > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of operationsInfo field in > CapacitySchedulerHealthInfo class should be converted from Map to List. > After convert to List, The operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122755#comment-16122755 ] Tao Yang commented on YARN-6257: This problem was imported by YARN-3293 (2.8.0+). The operationsInfo can't be correctly used before as it's not follow JSON format. [~vvasudev], Please help to review this issue. > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains deplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch > > > In response string of CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo have duplicate key 'entry' as a > JSON object : > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of operationsInfo field in > CapacitySchedulerHealthInfo class should be converted from Map to List. > After convert to List, The operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.branch-2.005.patch Attached branch-2 patch for cleanly applying. Thanks [~sunilg] and [~leftnoteasy] for commits and reviews. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch, YARN-6678.005.patch, > YARN-6678.branch-2.005.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
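For illustration, a sketch of the extra check described above for re-reservation proposals, confirming that the container currently reserved on the node is the same container being re-reserved. Node and Container are simplified stand-ins for FiCaSchedulerNode and RMContainer, not the patch itself.
{code}
/**
 * Sketch: a re-reservation proposal is only accepted if the node is still
 * reserved by exactly the same container; a first-time reservation is only
 * accepted if the node is not reserved at all.
 */
public class ReReservationCheckSketch {
  static class Container {
    final String id;
    Container(String id) { this.id = id; }
  }

  static class Node {
    Container reservedContainer;
  }

  static boolean acceptReserveProposal(Node node, Container proposed,
      boolean reReservation) {
    Container current = node.reservedContainer;
    if (reReservation) {
      // The node may have been un-reserved and re-used by another app
      // between generating and committing the proposal.
      return current != null && current.id.equals(proposed.id);
    }
    return current == null;
  }
}
{code}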
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.006.patch > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch, > YARN-6714.branch-2.006.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
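For illustration, a sketch of the idea in the last paragraph above: since tryCommit and doneApplicationAttempt both take the scheduler write lock, tryCommit can verify that the attempt behind a proposal is still the current, running attempt before applying it. Names are simplified and hypothetical, not the YARN-6714 patch.
{code}
/**
 * Sketch: drop outdated proposals from a failed or removed app attempt
 * instead of applying them against state that no longer belongs to it.
 */
public class CommitGuardSketch {
  static class AppAttempt {
    final String attemptId;
    volatile boolean stopped;
    AppAttempt(String attemptId) { this.attemptId = attemptId; }
  }

  static class App {
    volatile AppAttempt currentAttempt;
  }

  /** Returns true if the proposal from this attempt may still be applied. */
  static boolean canCommit(App app, AppAttempt proposalAttempt) {
    AppAttempt current = app.currentAttempt;
    return current != null
        && !current.stopped
        && current.attemptId.equals(proposalAttempt.attemptId);
  }
}
{code}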
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.005.patch Sure, upload new patch to resolve the conflict with YARN-6714 in TestCapacitySchedulerAsyncScheduling. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch, YARN-6678.005.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.005.patch > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087164#comment-16087164 ] Tao Yang edited comment on YARN-6714 at 7/14/17 2:14 PM: - Sorry to have misplaced the actual types, and there are more custom generic types should be explicitly specified. Upload a new patch. was (Author: tao yang): Sorry to have misplaced the actual types. Upload a new patch. > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: (was: YARN-6714.branch-2.005.patch) > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.005.patch Sorry to have misplaced the actual types. Upload a new patch. > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.004.patch For the check javac warning, It seems that the custom generic types of SchedulerContainer should be explicitly specified while creating a new instance in branch-2. Upload new patch to add the actual types: SchedulerContainerreservedContainer = ... > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
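For context, the declaration being described above roughly looks as follows; the type arguments and the variable name are an assumption based on how SchedulerContainer is parameterized in the CapacityScheduler code path, not a quote from the patch:
{code}
// Sketch only: spell out the type arguments explicitly so that branch-2 javac
// does not fall back to the raw SchedulerContainer type and emit a warning.
SchedulerContainer<FiCaSchedulerApp, FiCaSchedulerNode> reservedContainer = ...
{code}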
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.003.patch Upload a patch for branch-2. Thanks [~sunilg] for review and committing. > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083824#comment-16083824 ] Tao Yang commented on YARN-6678: I confirmed that it's fine. Thanks [~sunilg] for your help. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: (was: YARN-6678.004.patch) > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.004.patch > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.004.patch Thanks [~sunilg] for your time. As your mentioned, This new patch adds timeout for every where clause, adds nodeId for debug info, and calls MockRM#stop at last of new test case. TestCapacitySchedulerAsyncScheduling can be passed now. Sorry to be late for updating this patch. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
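As a rough illustration of the kind of test hardening described in the comment above (the wait helper, interval and timeout values are assumptions made for this sketch, not the actual patch):
{code}
// Bound every wait so a hung commit cannot stall the suite, and always stop
// the MockRM so later tests start from a clean state.
try {
  GenericTestUtils.waitFor(
      () -> schedulerNode.getReservedContainer() != null,
      100 /* check interval ms */, 10000 /* timeout ms */);
} finally {
  rm.stop();
}
{code}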
[jira] [Created] (YARN-6737) Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler
Tao Yang created YARN-6737: -- Summary: Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler Key: YARN-6737 URL: https://issues.apache.org/jira/browse/YARN-6737 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Tao Yang Priority: Minor As discussed in YARN-6714 (https://issues.apache.org/jira/browse/YARN-6714?focusedCommentId=16052158&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16052158), AbstractYarnScheduler#getApplicationAttempt is inconsistent with its name: it discards the application_attempt_id and always returns the latest attempt. We should: 1) Rename it to getCurrentAttempt, 2) Change the parameter from attemptId to applicationId, 3) Scan all usages to see whether any similar issue could happen. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
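A minimal sketch of what the proposed rename amounts to; the return type shown is the CapacityScheduler case and the exact signatures would be settled in this JIRA:
{code}
// Current: the name suggests an attempt-specific lookup, but the attempt id is
// effectively ignored and the latest attempt is returned.
FiCaSchedulerApp getApplicationAttempt(ApplicationAttemptId applicationAttemptId);

// Proposed: the name and parameter match what the method actually does.
FiCaSchedulerApp getCurrentAttempt(ApplicationId applicationId);
{code}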
[jira] [Updated] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.003.patch Update the patch with adding comments for sanity check of attemptId. Thanks [~sunilg] for your suggestion. > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
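To make the described sanity check concrete, a rough sketch of the idea (illustrative only, not the actual patch; it relies on CapacityScheduler#tryCommit and CapacityScheduler#doneApplicationAttempt both holding the scheduler write lock):
{code}
// In the commit path, reject proposals that belong to an attempt which is no
// longer the current attempt of the application, instead of letting them reach
// FiCaSchedulerNode#unreserveResource and crash the RM.
FiCaSchedulerApp app = getApplicationAttempt(attemptId);
if (app == null || !attemptId.equals(app.getApplicationAttemptId())) {
  // outdated proposal from a removed/failed attempt: drop it
  return;
}
{code}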
[jira] [Comment Edited] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060372#comment-16060372 ] Tao Yang edited comment on YARN-6678 at 6/23/17 4:11 AM: - Thanks [~sunilg] for your comments. {quote} 1. In FiCaSchedulerApp#accept, its better to use RMContainer#equals instead of using != {quote} As [~leftnoteasy] mentioned, it should be enough to use == to compare two instances. Are there some other concerns about this? I noticed that this patch caused several failed tests, but these are all passed when I run it locally. What might be the problem? was (Author: tao yang): Thanks [~sunilg] for your comments. {quote} 1. In FiCaSchedulerApp#accept, its better to use RMContainer#equals instead of using != {quote} As [~leftnoteasy] mentioned, it should be enough to use == to compare two instances. Are there some other concerns about this? > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). 
> // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060372#comment-16060372 ] Tao Yang commented on YARN-6678: Thanks [~sunilg] for your comments. {quote} 1. In FiCaSchedulerApp#accept, its better to use RMContainer#equals instead of using != {quote} As [~leftnoteasy] mentioned, it should be enough to use == to compare two instances. Are there some other concerns about this? > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
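For readers following the discussion, a rough sketch of the stronger check being proposed, extending the snippet quoted above (illustrative only, not the exact patch):
{code}
// For a re-reservation, the node must still be reserved by the very same
// RMContainer that this proposal refers to; comparing references with == is
// enough here because both sides are the RMContainer instance held by the
// scheduler.
RMContainer nodeReserved =
    schedulerContainer.getSchedulerNode().getReservedContainer();
if (schedulerContainer.getRmContainer().getState()
    == RMContainerState.RESERVED) {
  if (nodeReserved != schedulerContainer.getRmContainer()) {
    // the node was unreserved or re-reserved for another container meanwhile
    return false;
  }
  reReservation = true;
} else {
  // first-time reservation: the node must not already be reserved by anyone
  if (nodeReserved != null) {
    return false;
  }
}
{code}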
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.003.patch Updated the patch without adding new method to CapacityScheduler. Thanks [~leftnoteasy] for your suggestion, it's fine to only change the spy target for the test case. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.002.patch Updated the patch with moving test case to TestCapacitySchedulerAsyncScheduling. > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang edited comment on YARN-6714 at 6/20/17 9:47 AM: - Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} I remembered why these test cases are not in TestCapacitySchedulerAsyncScheduling before, these cases are complex and hard to reproduce when async-scheduling enabled, for example, it's hard to allocate multiple containers as we need. Can I move these test cases to TestCapacitySchedulerAsyncScheduling but not enable async-scheduling ? {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D was (Author: tao yang): Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} I remembered why these test cases are not in TestCapacitySchedulerAsyncScheduling before, these cases is complex and hard to reproduce when async-scheduling enabled. Can I move these test cases to TestCapacitySchedulerAsyncScheduling but not enable async-scheduling ? {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang edited comment on YARN-6714 at 6/20/17 9:38 AM: - Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} I remembered why these test cases are not in TestCapacitySchedulerAsyncScheduling before, these cases is complex and hard to reproduce when async-scheduling enabled. Can I move these test cases to TestCapacitySchedulerAsyncScheduling but not enable async-scheduling ? {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D was (Author: tao yang): Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang edited comment on YARN-6714 at 6/20/17 9:08 AM: - Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D was (Author: tao yang): Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on this :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang commented on YARN-6714: Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on this :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055310#comment-16055310 ] Tao Yang commented on YARN-6678: Thanks [~leftnoteasy] for your comments. {quote} Instead of using RmContainer().equals, it should be enough to use == to compare two instances, correct? {quote} Correct, just noticed that as you mentioned. {quote} is there any other way to avoid adding the new method to CapacityScheduler? {quote} It's necessary to add new method if spy on app attempt. I'll try to find another way to test this problem, for example, spy on CapacityScheduler instance > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.002.patch > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: (was: YARN-6678.002.patch) > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.001.patch Attach a patch for review > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
Tao Yang created YARN-6714: -- Summary: RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler Key: YARN-6714 URL: https://issues.apache.org/jira/browse/YARN-6714 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang Currently in async-scheduling mode of CapacityScheduler, after an AM failover unreserves all reserved containers, the scheduler still has a chance to fetch and commit an outdated reserve proposal of the failed app attempt. This problem happened to an app in our cluster: when the app stopped, it unreserved all reserved containers and compared their appAttemptId with the current appAttemptId; on a mismatch it threw an IllegalStateException and crashed the RM. Error log: {noformat} 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler java.lang.IllegalStateException: Trying to unreserve for application appattempt_1495188831758_0121_02 when currently reserved for application application_1495188831758_0121 on node host: node1:45454 #containers=2 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) at java.lang.Thread.run(Thread.java:834) {noformat} When async-scheduling is enabled, CapacityScheduler#doneApplicationAttempt and CapacityScheduler#tryCommit both need to acquire the write lock before executing, so we can check the app attempt state in the commit process to avoid committing outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.002.patch > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Description: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of reserve proposal will be checked only for first-reserve container. We should confirm that reserved container on this node is equal to re-reserve container. was: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. 
nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) {
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Description: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of reserve proposal will be checked only for first-reserve container, not for the re-reserve container. We could check reserved container on this node with re-reserve container to avoid this problem. was: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. 
nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) {
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Description: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of reserve proposal will be checked only for first-reserve container, not for the re-reserve container. I think FiCaSchedulerApp#accept should do this check for all reserve proposal not matter if the container is re-reserve or not. was: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserved proposal-1 2. nm2 has enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer()
[jira] [Assigned] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang reassigned YARN-6678: -- Assignee: Tao Yang > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserved proposal-1 > 2. nm2 has enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container, not for the re-reserve container. > I think FiCaSchedulerApp#accept should do this check for all reserve proposal > not matter if the container is re-reserve or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.001.patch Attach a patch with UT for review. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang > Attachments: YARN-6678.001.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserved proposal-1 > 2. nm2 has enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container, not for the re-reserve container. > I think FiCaSchedulerApp#accept should do this check for all reserve proposal > not matter if the container is re-reserve or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
Tao Yang created YARN-6678: -- Summary: Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler Key: YARN-6678 URL: https://issues.apache.org/jira/browse/YARN-6678 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Tao Yang Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but threw an IllegalStateException when applying Currently the check code for a reserve proposal in FiCaSchedulerApp#accept is as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of a reserve proposal is checked only for a first-time reservation, not for a re-reservation. I think FiCaSchedulerApp#accept should do this check for every reserve proposal, no matter whether the container is a re-reservation or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
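One way to read the last paragraph of that description, together with the review comment about using == for comparison, is that the re-reservation branch should also verify that the container already reserved on the node is the same RMContainer instance the proposal wants to re-reserve. A rough sketch of such a check, reusing the variable names from the snippet quoted above; this is an illustration of the idea, not the committed patch:
{code}
// Sketch of an extended check in FiCaSchedulerApp#accept (assumption, not
// the actual YARN-6678 patch).
RMContainer reservedOnNode =
    schedulerContainer.getSchedulerNode().getReservedContainer();

if (schedulerContainer.getRmContainer().getState()
    == RMContainerState.RESERVED) {
  // Re-reservation proposal: the node must still hold the very same
  // reserved container instance; if another container was reserved on the
  // node in the meantime, this proposal is outdated and must be rejected.
  if (reservedOnNode != schedulerContainer.getRmContainer()) {
    return false;
  }
  reReservation = true;
} else {
  // First-time reservation: the node must not be reserved by anyone else.
  if (reservedOnNode != null) {
    return false;
  }
}
{code}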
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Description: I wrote a test case to reproduce another problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); was: I wrote a test case to reproduce another problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at
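The NPE above comes from dereferencing the result of schedulerKeyToPlacementSets.get(schedulerRequestKey) after the AM has already removed that request. One possible defensive guard in AppSchedulingInfo#allocate is sketched below; this is only an illustration of the idea (the actual fix may instead reject the outdated proposal earlier, e.g. in the accept phase), and the logging and control flow are assumptions:
{code}
// Hypothetical guard in AppSchedulingInfo#allocate (sketch only, not the
// actual YARN-6629 patch). The placement set for this schedulerRequestKey
// may already have been removed if the AM cancelled the request before the
// committer thread applied the allocation proposal.
if (schedulerKeyToPlacementSets.get(schedulerRequestKey) == null) {
  // Request no longer exists: treat the proposal as outdated and skip the
  // per-request bookkeeping instead of throwing a NullPointerException.
  LOG.warn("Skipping allocation for removed request key "
      + schedulerRequestKey);
} else {
  schedulerKeyToPlacementSets.get(schedulerRequestKey)
      .allocate(schedulerKey, type, node);
}
{code}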
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Description: I wrote a test case to reproduce another problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal and wanted to deduct the pending resource CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); was: I wrote a test case to test other problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Description: I wrote a test case to test other problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal and wanted to deduct the pending resource CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); was: Error log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Attachment: YARN-6629.001.patch Attach a patch for review. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6629.001.patch > > > Error log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > {code} > Reproduce this error in chronological order: > 1. AM started and requested 1 container with schedulerRequestKey#1 : > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests > Added schedulerRequestKey#1 into schedulerKeyToPlacementSets > 2. Scheduler allocatd 1 container for this request and accepted the proposal > 3. 
AM removed this request > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests --> > AppSchedulingInfo#addToPlacementSets --> > AppSchedulingInfo#updatePendingResources > Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets > 4. Scheduler applied this proposal and wanted to deduct the pending resource > CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> > AppSchedulingInfo#allocate > An NPE is thrown when calling > schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, > type, node); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
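To make the race above concrete, here is a minimal sketch of the missing null guard, assuming simplified stand-in types rather than the real AppSchedulingInfo/SchedulingPlacementSet classes (this is an illustration of the failure mode, not the YARN-6629 patch):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the AppSchedulingInfo#allocate path from the stack
// trace above: the scheduler key may be removed by the AM between the
// proposal being accepted and being applied.
class AppSchedulingInfoSketch {
  interface PlacementSet { void allocate(); }

  // stands in for schedulerKeyToPlacementSets
  private final Map<String, PlacementSet> schedulerKeyToPlacementSets =
      new ConcurrentHashMap<>();

  boolean allocate(String schedulerRequestKey) {
    PlacementSet ps = schedulerKeyToPlacementSets.get(schedulerRequestKey);
    if (ps == null) {
      // The AM cancelled this request after the proposal was accepted but
      // before it was applied, so the key is gone; skip the pending-resource
      // deduction instead of dereferencing null (the unguarded call is what
      // killed the NODE_UPDATE event dispatcher above).
      return false;
    }
    ps.allocate(); // stands in for allocate(schedulerKey, type, node)
    return true;
  }
}
{code}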
[jira] [Created] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
Tao Yang created YARN-6629: -- Summary: NPE occurred when container allocation proposal is applied but its resource requests are removed before Key: YARN-6629 URL: https://issues.apache.org/jira/browse/YARN-6629 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha2, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang Error log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal and wanted to deduct the pending resource CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958249#comment-15958249 ] Tao Yang commented on YARN-6403: [~jlowe], thanks for review and committing! > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3 > > Attachments: YARN-6403.001.patch, YARN-6403.002.patch, > YARN-6403.004.patch, YARN-6403.branch-2.8.003.patch, > YARN-6403.branch-2.8.004.patch, YARN-6403.branch-2.8.004.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
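As an illustration of the server-side half of that suggestion, a minimal sketch with stand-in types (not the actual ContainerImpl or LocalResourceRequest code) of validating the resource location up front, so only the offending container fails instead of the dispatcher thread:
{code}
import java.net.URISyntaxException;

// Stand-in types only (not the real YARN classes): check the resource URL
// before constructing the request whose constructor NPEs in the code above.
class LocalResourceGuardSketch {
  interface UrlStub { String toPath() throws URISyntaxException; }
  interface LocalResourceStub { UrlStub getResource(); }

  /** Returns an error message for an invalid resource, or null if it is usable. */
  static String validate(String name, LocalResourceStub resource) {
    if (resource == null || resource.getResource() == null) {
      // The caller can fail just this container (e.g. move it to a
      // localization-failed state) instead of letting an NPE escape to
      // the AsyncDispatcher and take down the NM.
      return "Null resource URL for local resource " + name;
    }
    return null;
  }
}
{code}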
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.004.patch YARN-6403.branch-2.8.004.patch Thanks [~jlowe] for your suggestions. Client-side test is moved to TestApplicationClientProtocolRecords now and TestContainerManagerWithLCE is updated to avoid failure. Attach new patches for branch-2.8 and trunk. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch, > YARN-6403.004.patch, YARN-6403.branch-2.8.003.patch, > YARN-6403.branch-2.8.004.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.branch-2.8.003.patch Attach new patch for branch-2.8 > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch, > YARN-6403.branch-2.8.003.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. 
Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950419#comment-15950419 ] Tao Yang commented on YARN-6403: [~jlowe] Thanks for your time! {quote} I believe it's appropriate to throw NPE in our client check code as well rather than a generic RuntimeException. It's a minor point since the net effect will be similar for the client in either case. {quote} Makes sense, sorry for missing the point before. {quote} TestApplicationClientProtocolRecords looks like a decent place since it's already has another test for ContainerLaunchContextPBImpl there. {quote} TestApplicationClientProtocolRecords does not exist in branch-2.8, so is it ok to place the UT for client-side in TestPBImplRecords#testContainerLaunchContextPBImpl? In addition, the error message and unit test code will be improved in the next patch. One patch can't fit all branches, perhaps it's necessary to submit patches for 2.9(branch-2) and 3.0.0-alpha3(trunk)? > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. 
> {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail:
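For reference, a self-contained sketch of the kind of client-side unit test being discussed; the test class and the inlined validation below are hypothetical stand-ins, not the actual TestApplicationClientProtocolRecords or TestPBImplRecords code:
{code}
import static org.junit.Assert.fail;

import java.util.HashMap;
import java.util.Map;
import org.junit.Test;

// Hypothetical test sketch: asserts that a local resource with a null
// location is rejected on the client instead of surfacing as an NM crash.
public class TestNullLocalResourceSketch {

  // Stand-in for the validation the patch adds to the client-side
  // setLocalResources path.
  private static void setLocalResources(Map<String, Object> localResources) {
    for (Map.Entry<String, Object> e : localResources.entrySet()) {
      if (e.getValue() == null) {
        throw new NullPointerException(
            "Null resource URL for local resource " + e.getKey());
      }
    }
  }

  @Test
  public void testNullResourceRejected() {
    Map<String, Object> resources = new HashMap<>();
    resources.put("test", null); // resource whose location is missing
    try {
      setLocalResources(resources);
      fail("Expected NPE for a local resource with no location");
    } catch (NullPointerException expected) {
      // expected: fail fast in the client rather than on the NodeManager
    }
  }
}
{code}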
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.002.patch [~jlowe] Thanks for correcting me. The last server-side change is not proper and I corrected it as your mentioned. For the client-side change, IIUIC the generated protobuf code won't throws NPE for this case actually. Unit tests for both the client and server change is added. Attach a new patch for review, please correct me if I missed something. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.001.patch Attach a patch for review. * Add local resources check in ContainerImpl$RequestResourcesTransition to avoid NM failing, the container with invalid resource will fail to launch in this step. * Add local resources check in ContainerLaunchContextPBImpl#setLocalResources to fail the app with invalid resource early in client, as it's a waste for cluster to launch a bound-to-fail app. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang > Attachments: YARN-6403.001.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946419#comment-15946419 ] Tao Yang commented on YARN-6403: [~Naganarasimha] Yes, I would like to work on this and will submit a patch for review soon. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. 
Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Description: Recently we found this problem on our testing environment. The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: {code} localResources.put("test", LocalResource.newInstance(location, LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, System.currentTimeMillis())); ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(localResources, environment, vargsFinal, null, securityTokens, acls); {code} The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. {code} FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) {code} NPE occured when created LocalResourceRequest instance for invalid resource request. {code} public LocalResourceRequest(LocalResource resource) throws URISyntaxException { this(resource.getResource().toPath(), //NPE occurred here resource.getTimestamp(), resource.getType(), resource.getVisibility(), resource.getPattern()); } {code} We can't guarantee the validity of local resource request now, but we could avoid damaging the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and LocalResourceRequest? Please feel free to give your suggestions. was: Recently we found this problem on our testing environment. 
The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: {code} localResources.put("test", LocalResource.newInstance(location, LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, System.currentTimeMillis())); ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(localResources, environment, vargsFinal, null, securityTokens, acls); {code} The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at
[jira] [Created] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
Tao Yang created YARN-6403: -- Summary: Invalid local resource request can raise NPE and make NM exit Key: YARN-6403 URL: https://issues.apache.org/jira/browse/YARN-6403 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.8.0 Reporter: Tao Yang Recently we found this problem on our testing environment. The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: {code} localResources.put("test", LocalResource.newInstance(location, LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, System.currentTimeMillis())); ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(localResources, environment, vargsFinal, null, securityTokens, acls); {code} The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) {code} NPE occured when created LocalResourceRequest instance for invalid resource request. {code} public LocalResourceRequest(LocalResource resource) throws URISyntaxException { this(resource.getResource().toPath(), //NPE occurred here resource.getTimestamp(), resource.getType(), resource.getVisibility(), resource.getPattern()); } {code} We can't guarantee the validity of local resource request now, but we could avoid damaging the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and LocalResourceRequest? Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
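For completeness, a hedged sketch of the application-side fix implied by the description above: derive the URL passed to LocalResource.newInstance from a real file status so it can never be null. This is the conventional client pattern, not code from this JIRA, and the helper names may differ between Hadoop versions:
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalResourceClientSketch {
  // Builds a LocalResource whose location comes from an existing file,
  // so the URL handed to the launch context is never null.
  static LocalResource toLocalResource(FileSystem fs, Path file)
      throws IOException {
    FileStatus status = fs.getFileStatus(file); // throws if the file is missing
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(status.getPath()),
        LocalResourceType.FILE,
        LocalResourceVisibility.PRIVATE,
        status.getLen(),
        status.getModificationTime());
  }
}
{code}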
[jira] [Commented] (YARN-6259) Support pagination and optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891423#comment-15891423 ] Tao Yang commented on YARN-6259: Hi, [~rohithsharma]. Thank you for looking into this issue. {quote} I am not sure about how use cases will be served {quote} One common use case is to request the last part of a log and easily skip to other parts when investigating a problem; compared with loading the entire log, this can save a lot of time. We have an external system that tracks apps and shows container logs; most of these logs are very large, so a pagination function is needed, and the newly added containerlogs-info REST API is part of it. {quote} Instead of adding new LogInfo file, there is ContainerLogInfo file which can be used for pageSize and pageIndex. {quote} ContainerLogInfo does not seem to exist in branch-2.8; perhaps it's for a higher version? > Support pagination and optimize data transfer with zero-copy approach for > containerlogs REST API in NMWebServices > - > > Key: YARN-6259 > URL: https://issues.apache.org/jira/browse/YARN-6259 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6259.001.patch > > > Currently the containerlogs REST API in NMWebServices reads and sends the > entire content of container logs. Most container logs are large and it's > useful to support pagination. > * Add pagesize and pageindex parameters for the containerlogs REST API > {code} > URL: http:///ws/v1/node/containerlogs// > QueryParams: > pagesize - max bytes of one page, default 1MB > pageindex - index of the required page, default 0, can be negative (set -1 will > get the last page content) > {code} > * Add a containerlogs-info REST API since sometimes we need to know the > totalSize/pageSize/pageCount info of the log > {code} > URL: > http:///ws/v1/node/containerlogs-info// > QueryParams: > pagesize - max bytes of one page, default 1MB > Response example: > {"logInfo":{"totalSize":2497280,"pageSize":1048576,"pageCount":3}} > {code} > Moreover, the data transfer pipeline (disk --> read buffer --> NM buffer --> > socket buffer) can be optimized to the pipeline (disk --> read buffer --> socket > buffer) with a zero-copy approach. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
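To make the pagination and zero-copy ideas concrete, a simplified sketch using plain Java NIO (this is an illustration, not the NMWebServices patch; parameter validation is omitted) that serves one page of a log file with FileChannel#transferTo:
{code}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ContainerLogPagerSketch {
  // Writes one page of the log to the response channel without copying the
  // bytes through a user-space buffer on the NM.
  static void writePage(Path logFile, long pageSize, long pageIndex,
      WritableByteChannel out) throws IOException {
    try (FileChannel in = FileChannel.open(logFile, StandardOpenOption.READ)) {
      long totalSize = in.size();
      long pageCount = (totalSize + pageSize - 1) / pageSize;
      if (pageIndex < 0) {
        pageIndex = pageCount + pageIndex; // e.g. -1 selects the last page
      }
      long position = pageIndex * pageSize;
      long remaining = Math.min(pageSize, totalSize - position);
      // transferTo lets the kernel move file bytes to the socket directly
      // (disk --> read buffer --> socket buffer), which is the zero-copy
      // pipeline described in the issue.
      while (remaining > 0) {
        long sent = in.transferTo(position, remaining, out);
        position += sent;
        remaining -= sent;
      }
    }
  }
}
{code}
With pagesize=1048576 and pageindex=-1, this would stream only the last megabyte of the log, matching the totalSize/pageSize/pageCount response example shown in the containerlogs-info description above.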