[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-10 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121202#comment-16121202 ]

Yuqi Wang edited comment on YARN-6959 at 8/10/17 7:44 AM:
--

I already added a comment about it in the patch:
// TODO: Rename it to getCurrentApplicationAttempt

I think that makes it clear. What do you think?
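
For reference, a minimal sketch of where that comment would live, assuming the branch-2.7 shape of AbstractYarnScheduler#getApplicationAttempt (illustrative, not the exact source):

{code:java}
// TODO: Rename it to getCurrentApplicationAttempt, because this method ignores
// which attempt the given id names and always returns the CURRENT attempt.
public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
  SchedulerApplication<T> app =
      applications.get(applicationAttemptId.getApplicationId());
  return app == null ? null : app.getCurrentAppAttempt();
}
{code}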


was (Author: yqwang):
I already added a comment on it:
// TODO: Rename it to getCurrentApplicationAttempt

I think it is clear. What do you think about it?

> RM may allocate wrong AM Container for new attempt
> --
>
> Key: YARN-6959
> URL: https://issues.apache.org/jira/browse/YARN-6959
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, fairscheduler, scheduler
>Affects Versions: 2.7.1
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>  Labels: patch
> Fix For: 2.7.1, 3.0.0-alpha4
>
> Attachments: YARN-6959.001.patch, YARN-6959.002.patch, 
> YARN-6959.003.patch, YARN-6959.004.patch, YARN-6959.005.patch, 
> YARN-6959-branch-2.7.001.patch, YARN-6959.yarn_nm.log.zip, 
> YARN-6959.yarn_rm.log.zip
>
>
> *Issue Summary:*
> A previous attempt's ResourceRequests may be recorded into the current attempt's
> ResourceRequests. These mis-recorded ResourceRequests may confuse the AM Container
> request and allocation for the current attempt.
> *Issue Pipeline:*
> {code:java}
> // Executing the precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
>
> // The earlier precondition check for the attempt id may be outdated here,
> // i.e. currentAttempt may not be the attempt corresponding to attemptId;
> // for example, attemptId may correspond to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
>
> // The previous attempt's ResourceRequests may be recorded into the current
> // attempt's ResourceRequests.
> currentAttempt.updateResourceRequests(ask) ->
>
> // RM may allocate the wrong AM Container for the current attempt, because its
> // ResourceRequests may come from the previous attempt (which can be any
> // ResourceRequests the previous AM asked for), and there is no matching logic
> // between the original AM Container ResourceRequest and the returned
> // amContainerAllocation below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
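> To make the race concrete, here is a minimal sketch of the kind of re-check that
> would close it, assuming the branch-2.7 CapacityScheduler.allocate shape
> (illustrative only, not necessarily what the attached patches do):
> {code:java}
> // Re-validate, inside the scheduler, that the id from the AM heartbeat still
> // names the current attempt before recording its ResourceRequests.
> FiCaSchedulerApp application = getApplicationAttempt(appAttemptId);
> if (application == null
>     || !appAttemptId.equals(application.getApplicationAttemptId())) {
>   LOG.error("Calling allocate on previous or removed application attempt "
>       + appAttemptId);
>   return EMPTY_ALLOCATION;
> }
> {code}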
> *Patch Correctness:*
> After this patch, RM will always record ResourceRequests from different attempts
> into different SchedulerApplicationAttempt.AppSchedulingInfo objects.
> So, even if RM still records ResourceRequests from an old attempt at any time,
> those ResourceRequests will be recorded in the old AppSchedulingInfo object, which
> will not impact the current attempt's resource requests and allocation.
> *Concerns:*
> The getApplicationAttempt function in AbstractYarnScheduler is confusing; we had
> better rename it to getCurrentApplicationAttempt, and reconsider whether there are
> any other bugs related to getApplicationAttempt.






[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-10 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121193#comment-16121193 ]

Yuqi Wang edited comment on YARN-6959 at 8/10/17 7:34 AM:
--

As with this issue, other places that call getApplicationAttempt may also expect to
get the attempt specified in the argument instead of the current attempt.
If I just renamed getApplicationAttempt to getCurrentApplicationAttempt, it would be
more likely to hide such bugs.
So, for this JIRA only, I will not touch getApplicationAttempt until we have
confirmed that all places using getApplicationAttempt are bug-free.
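
To illustrate why a blind rename could hide bugs (a hypothetical call site, not a
quote from the code base):

{code:java}
// Hypothetical caller that wants the attempt a completed container belonged to,
// not whichever attempt happens to be current.
SchedulerApplicationAttempt attempt =
    scheduler.getApplicationAttempt(completedContainer.getApplicationAttemptId());
// If the application has already switched to a new attempt, 'attempt' refers to
// the NEW attempt; renaming the getter would make this look intentional.
{code}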


was (Author: yqwang):
As with this issue, other places that call getApplicationAttempt may also expect to
get the attempt specified in the argument instead of the current attempt.
If I just renamed getApplicationAttempt to getCurrentApplicationAttempt, it would be
more likely to hide such bugs.
So, for this JIRA only, I will not touch getApplicationAttempt until we have
confirmed that all places using getApplicationAttempt are safe.







[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-10 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121193#comment-16121193 ]

Yuqi Wang edited comment on YARN-6959 at 8/10/17 7:34 AM:
--

As with this issue, other places that call getApplicationAttempt may also expect to
get the attempt specified in the argument instead of the current attempt.
If I just renamed getApplicationAttempt to getCurrentApplicationAttempt, it would be
more likely to hide such bugs.
So, for this JIRA only, I will not touch getApplicationAttempt until we have
confirmed that all places using getApplicationAttempt are safe.


was (Author: yqwang):
As with this issue, other places that call getApplicationAttempt may also expect to
get the attempt specified in the argument instead of the current attempt.
If I just renamed getApplicationAttempt to getCurrentApplicationAttempt, it would be
more likely to hide such bugs.
So, for this fix only, I will not touch getApplicationAttempt until we have
confirmed that all places using getApplicationAttempt are safe.







[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-09 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121060#comment-16121060 ]

Yuqi Wang edited comment on YARN-6959 at 8/10/17 5:10 AM:
--

[~jianhe]
The whole pipeline was:
Step 0. AM sent heartbeats to RM.
Step 1. The AM process crashed with exit code 15 without unregistering with RM.
Step 2-a. The heartbeats sent in Step 0 were being processed by RM between MARK1 and
MARK3.
Step 2-b. NM told RM that the AM container had completed.
Step 3. RM switched to the new attempt.
Step 4. RM recorded the requests in those heartbeats from the previous AM into the
current attempt.

So, it is possible.
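
The interleaving above is a plain check-then-act race. Below is a toy two-thread
model of it (illustrative Java only; the names model RM behavior and are not YARN
code):

{code:java}
import java.util.concurrent.atomic.AtomicReference;

public class AttemptSwitchRace {
  // Toy model of RM's current-attempt pointer.
  static final AtomicReference<String> currentAttempt =
      new AtomicReference<>("attempt_1");

  public static void main(String[] args) throws InterruptedException {
    // Heartbeat thread: validates the attempt id (MARK1), is delayed, then
    // records the asks against whatever attempt is current (MARK3).
    Thread heartbeat = new Thread(() -> {
      String id = "attempt_1";
      if (currentAttempt.get().equals(id)) {
        sleep(100); // RM processing delay between the check and the record
        System.out.println(
            "asks from " + id + " recorded into " + currentAttempt.get());
      }
    });
    // Event thread: NM reports the AM container completed, so RM switches to
    // the new attempt while the heartbeat is still in flight.
    Thread amCompleted = new Thread(() -> {
      sleep(50);
      currentAttempt.set("attempt_2");
    });
    heartbeat.start();
    amCompleted.start();
    heartbeat.join();
    amCompleted.join(); // prints: asks from attempt_1 recorded into attempt_2
  }

  private static void sleep(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException ignored) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}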



was (Author: yqwang):
[~jianhe]
The whole pipeline was:
Step 0. AM sent heartbeats to RM.
Step 1. The AM process crashed with exit code 15 without unregistering with RM.
Step 2-a. NM told RM that the AM container had completed.
Step 2-b. The heartbeats sent in Step 0 were being processed by RM between MARK1 and
MARK3.
Step 3. RM switched to the new attempt.
Step 4. RM recorded the requests in those heartbeats from the previous AM into the
current attempt.

So, it is possible.








[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-09 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121060#comment-16121060 ]

Yuqi Wang edited comment on YARN-6959 at 8/10/17 5:08 AM:
--

[~jianhe]
The whole pipeline was:
Step 0. AM sent heartbeats to RM.
Step 1. The AM process crashed with exit code 15 without unregistering with RM.
Step 2-a. NM told RM that the AM container had completed.
Step 2-b. The heartbeats sent in Step 0 were being processed by RM between MARK1 and
MARK3.
Step 3. RM switched to the new attempt.
Step 4. RM recorded the requests in those heartbeats from the previous AM into the
current attempt.

So, it is possible.



was (Author: yqwang):
[~jianhe]
The whole pipeline was:
Step 0. AM sent heartbeats to RM.
Step 1. The AM process crashed with exit code 15 without unregistering with RM.
Step 2-a. NM told RM that the AM container had completed.
Step 2-b. The heartbeats sent in Step 0 were being processed by RM between MARK1 and
MARK3.
Step 3. RM switched to the new attempt.
Step 4. Those heartbeats recorded requests from the previous AM into the current
attempt.

So, it is possible.








[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-09 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121029#comment-16121029 ]

Yuqi Wang edited comment on YARN-6959 at 8/10/17 4:12 AM:
--

Attached the NM log for this bug:

{code:java}
YARN-6959.yarn_nm.log.zip
{code}



was (Author: yqwang):
Added the NM log for this issue.







[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-08 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119337#comment-16119337 ]

Yuqi Wang edited comment on YARN-6959 at 8/9/17 2:47 AM:
-

Attached the RM log for this bug:

{code:java}
YARN-6959.yarn_rm.log.zip
{code}



was (Author: yqwang):
Attached the RM log for this bug: YARN-6959.yarn_rm.log.zip







[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-08 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119337#comment-16119337 ]

Yuqi Wang edited comment on YARN-6959 at 8/9/17 2:46 AM:
-

Attached the RM log for this bug: YARN-6959.yarn_rm.log.zip


was (Author: yqwang):
RM log for this bug.







[jira] [Comment Edited] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-07 Thread Yuqi Wang (JIRA)

[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116220#comment-16116220 ]

Yuqi Wang edited comment on YARN-6959 at 8/7/17 8:11 AM:
-

Here is the log for the issue:

application_1500967702061_2512 asked for 20 GB for its AM Container and 5 GB for its
Task Containers:
{code:java}
2017-07-31 20:58:49,532 INFO [Container Monitor] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Memory usage of ProcessTree container_e71_1500967702061_2512_01_01 for 
container-id container_e71_1500967702061_2512_01_01: 307.8 MB of 20 GB 
physical memory used; 1.2 GB of 30 GB virtual memory used
{code}

After its first attempt failed, the second attempt was submitted; however, the
second attempt's AM Container was allocated with only 5 GB, as NM's monitoring
shows:
{code:java}
2017-07-31 21:29:46,219 INFO [Container Monitor] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Memory usage of ProcessTree container_e71_1500967702061_2512_02_01 for 
container-id container_e71_1500967702061_2512_02_01: 352.5 MB of 5 GB 
physical memory used; 1.4 GB of 7.5 GB virtual memory used

{code}

Here is the RM log for the second attempt, which also shows the
InvalidStateTransitonException (Invalid event: CONTAINER_ALLOCATED at
ALLOCATED_SAVING):

{code:java}
2017-07-31 21:29:38,510 INFO [ResourceManager Event Processor] 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Application added - appId: application_1500967702061_2512 user: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@57fbb4f5,
 leaf-queue: prod-new #user-pending-applications: 0 #user-active-applications: 
6 #queue-pending-applications: 0 #queue-active-applications: 6
2017-07-31 21:29:38,510 INFO [ResourceManager Event Processor] 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Added Application Attempt appattempt_1500967702061_2512_02 to scheduler 
from user hadoop in queue prod-new
2017-07-31 21:29:38,514 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1500967702061_2512_02 State change from SUBMITTED to SCHEDULED

2017-07-31 21:29:38,517 INFO [Thread-13] 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_e71_1500967702061_2512_02_01 Container Transitioned from NEW to 
ALLOCATED
2017-07-31 21:29:38,517 INFO [Thread-13] 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   
OPERATION=AM Allocated ContainerTARGET=SchedulerApp RESULT=SUCCESS  
APPID=application_1500967702061_2512
CONTAINERID=container_e71_1500967702061_2512_02_01
2017-07-31 21:29:38,517 INFO [Thread-13] 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
assignedContainer application attempt=appattempt_1500967702061_2512_02 
container=Container: [ContainerId: container_e71_1500967702061_2512_02_01, 
NodeId: BN2APS0A98AEA0:10025, NodeHttpAddress: 
Proxy5.Yarn-Prod-Bn2.BN2.ap.gbl:81/proxy/nodemanager/BN2APS0A98AEA0/8042, 
Resource: , Priority: 1, Token: null, ] 
queue=prod-new: capacity=0.7, absoluteCapacity=0.7, usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, 
numContainers=8016 clusterResource= 
type=OFF_SWITCH
2017-07-31 21:29:38,517 INFO [Thread-13] 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 
used= cluster=
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 Sending NMToken for nodeId : BN2APS0A98AEA0:10025 for container : 
container_e71_1500967702061_2512_02_01
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_e71_1500967702061_2512_02_01 Container Transitioned from 
ALLOCATED to ACQUIRED
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 Clear node set for appattempt_1500967702061_2512_02
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Storing attempt: AppId: application_1500967702061_2512 AttemptId: 
appattempt_1500967702061_2512_02 MasterContainer: Container: