[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-11-05 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675623#comment-16675623
 ] 

Botong Huang commented on YARN-8933:


Ah good catch, and thx for reviewing! 

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-11-05 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674898#comment-16674898
 ] 

Bibin A Chundatt commented on YARN-8933:


Thank you [~botong] for patch.

Overall approach looks good to me.

Moving 1 min timeout from LocalityMulticastAMRMProxyPolicy to 
FederationInteceptor and caching the last response.
All the policies should be able to take advantage of the same.
 
One concern is, what happens if due to some crazy GC at AM side , AM doesn't 
set heartbeat for one min. As per the current implementation will never send 
allocate request to secondary subclusters rt ?

To evaluate timeout of subcluster we should  consider the last 
allocate/heartbeat from AM.

Also could you add a test to verify recover case, with 
LocalityMulticastAMRMProxyPolicy i think 
{{AllocationBookkeeper#activeAndEnabledSC}} will be empty always.

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-10-23 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661702#comment-16661702
 ] 

Botong Huang commented on YARN-8933:


TestContainerManager failure is not related and tracked under YARN-8672

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-10-23 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661547#comment-16661547
 ] 

Hadoop QA commented on YARN-8933:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 6 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  3m 
18s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 59s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
54s{color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: 
The patch generated 0 new + 0 unchanged - 1 fixed = 0 total (was 1) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 27s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
13s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 26s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 89m  1s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.containermanager.TestContainerManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8933 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12945300/YARN-8933.v2.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5865de7e571f 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 
14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / a0c0b79 |
| maven