[jira] [Commented] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2023-05-17 Thread Botong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723495#comment-17723495
 ] 

Botong Huang commented on YARN-7720:


Yeah, as I said in the initial description as well as in the v1 patch, I think the 
easiest way is to change the timeout config.
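
For illustration only, a minimal sketch (not the attached patch) of what changing 
the timeout config could look like: raise the AM liveness timeout in the secondary 
sub-clusters above the NM liveness timeout, so the UAMs can outlive the home 
sub-cluster's failover of the first attempt. The 20-minute value is an assumption 
for the example, not a recommendation from this JIRA.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class UamTimeoutConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Home sub-cluster: NM liveness timeout stays at the 10-minute default, so a
    // dead first-attempt node is declared lost after roughly 10 minutes.
    conf.setLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, 10 * 60 * 1000L);
    // Secondary sub-clusters: give the AM (and therefore the UAM) a longer expiry
    // so the UAM survives until the second attempt resumes its heartbeat.
    conf.setLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 20 * 60 * 1000L);
    System.out.println("AM expiry ms: "
        + conf.getLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, -1L));
  }
}
{code}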

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Shilun Fan
>Priority: Major
> Attachments: YARN-7720.v1.patch, YARN-7720.v2.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Comment Edited] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2023-05-17 Thread Botong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572705#comment-17572705
 ] 

Botong Huang edited comment on YARN-7720 at 5/17/23 2:42 PM:
-

I will continue to follow up on this PR.


was (Author: slfan1989):
I will continue to follow up on this PR.

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Shilun Fan
>Priority: Major
> Attachments: YARN-7720.v1.patch, YARN-7720.v2.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Commented] (YARN-7899) [AMRMProxy] Stateful FederationInterceptor for pending requests

2022-12-27 Thread Botong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652348#comment-17652348
 ] 

Botong Huang commented on YARN-7899:


[~walhl] "cancel pending request in one sub-cluster and re-send it to other 
sub-clusters" this is not done yet. 

> [AMRMProxy] Stateful FederationInterceptor for pending requests
> ---
>
> Key: YARN-7899
> URL: https://issues.apache.org/jira/browse/YARN-7899
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>  Labels: amrmproxy, federation
> Fix For: 3.2.0
>
> Attachments: YARN-7899-branch-2.v3.patch, YARN-7899.v1.patch, 
> YARN-7899.v3.patch
>
>
> Today FederationInterceptor (in AMRMProxy for YARN Federation) is stateless 
> in terms of pending (outstanding) requests. Whenever the AM issues new requests, 
> FI simply splits and sends them to the sub-cluster YarnRMs and forgets about 
> them. This JIRA attempts to make FI stateful so that it remembers the pending 
> requests in all relevant sub-clusters. This has two major benefits: 
> 1. It is a prerequisite for FI to be able to cancel pending requests in one 
> sub-cluster and re-send them to other sub-clusters. This is needed for load 
> balancing and to fully comply with the relaxed-locality fallback-to-ANY 
> semantics. When we send a request to one sub-cluster, we have effectively 
> restricted the allocation for this request to that sub-cluster rather than 
> anywhere. If the cluster capacity in this sub-cluster for this app is full, or 
> this YarnRM is overloaded and slow, the request will be stuck there for a long 
> time even if there is free capacity in other sub-clusters. We need FI to 
> remember and adjust the pending requests on the fly. 
> 2. It makes pending-request recovery easier when a YarnRM fails over. Today, 
> whenever one sub-cluster RM fails over, in order to recover the lost pending 
> requests for this sub-cluster we have to propagate the 
> ApplicationMasterNotRegisteredException from the YarnRM back to the AM, 
> triggering a full pending resend from the AM. This resend contains pending 
> requests not only for the failing-over sub-cluster, but for every sub-cluster. 
> Since our split-merge (AMRMProxyPolicy) does not guarantee idempotency, the 
> same request we sent to sub-cluster-1 earlier might be resent to sub-cluster-2. 
> If both of these YarnRMs have not failed over, they will both allocate for this 
> request, leading to over-allocation. These full pending resends also put 
> unnecessary load on every YarnRM in the cluster every time one YarnRM fails 
> over. With a stateful FederationInterceptor, since we remember the pending 
> requests we have sent out earlier, we can shield the 
> ApplicationMasterNotRegisteredException from the AM and resend the pending 
> requests only to the failed-over YarnRM. This eliminates over-allocation and 
> minimizes the recovery overhead. 
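
Purely as an illustration of the bookkeeping described above (class and method 
names here are hypothetical, not the actual FederationInterceptor API), a 
stateful interceptor could keep something like:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: remember which requests were routed to which sub-cluster.
public class PendingRequestBook {
  // sub-cluster id -> requests sent there that have not been satisfied yet
  private final Map<String, List<String>> pending = new HashMap<>();

  // Record a request when it is routed to a sub-cluster.
  public void recordSend(String subClusterId, String requestKey) {
    pending.computeIfAbsent(subClusterId, k -> new ArrayList<>()).add(requestKey);
  }

  // On failover of one sub-cluster RM, resend only that sub-cluster's pending.
  public List<String> pendingFor(String subClusterId) {
    return pending.getOrDefault(subClusterId, new ArrayList<>());
  }

  // Move a stuck request to another sub-cluster (load balancing, relax locality).
  public void move(String from, String to, String requestKey) {
    List<String> fromList = pending.get(from);
    if (fromList != null && fromList.remove(requestKey)) {
      recordSend(to, requestKey);
    }
  }
}
{code}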






[jira] [Commented] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2022-07-28 Thread Botong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572711#comment-17572711
 ] 

Botong Huang commented on YARN-7720:


[~slfan1989] yes please

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7720.v1.patch, YARN-7720.v2.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode

2019-10-22 Thread Botong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957445#comment-16957445
 ] 

Botong Huang commented on YARN-9689:


+1 lgtm

> Router does not support kerberos proxy when in secure mode
> --
>
> Key: YARN-9689
> URL: https://issues.apache.org/jira/browse/YARN-9689
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9689.001.patch
>
>
> When we enable Kerberos in YARN Federation mode, we cannot get a new app, since 
> it will throw the Kerberos exception below, which should be handled!
> {code:java}
> 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2019-07-22,18:43:25,528 WARN 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: 
> Unable to create a new ApplicationId in SubCluster xxx
> java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed 
> on local exception: java.io.IOException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos tgt)]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
> at org.apache.hadoop.ipc.Client.call(Client.java:1506)
> at org.apache.hadoop.ipc.Client.call(Client.java:1416)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
> at org.apache.

[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2019-09-24 Thread Botong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937004#comment-16937004
 ] 

Botong Huang commented on YARN-7599:


Hi [~qiuliang988], did you mean "GPG couldn't parse this XML" from the Router? 
Take a look at _minRouterSuccessCount_ in this patch: by default, only when GPG 
pulls from the Router three times successfully will it go ahead and delete things. 
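
As a rough, hypothetical sketch of that guard (the real patch's field and method 
names may differ), the idea is simply:

{code:java}
// Hypothetical sketch of the minRouterSuccessCount guard; not the actual patch code.
public class RouterSuccessGuard {
  private final int minRouterSuccessCount; // e.g. 3 by default
  private int successCount = 0;

  public RouterSuccessGuard(int minRouterSuccessCount) {
    this.minRouterSuccessCount = minRouterSuccessCount;
  }

  // Called once per attempt to pull the running-application list from the Router.
  public void recordPull(boolean pullSucceeded) {
    if (pullSucceeded) {
      successCount++;
    }
  }

  // Registry deletion only proceeds once enough pulls have succeeded, so a
  // transient parse failure from the Router never triggers a cleanup pass.
  public boolean mayDelete() {
    return successCount >= minRouterSuccessCount;
  }
}
{code}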

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, 
> YARN-7599-YARN-7402.v8.patch
>
>
> In Federation, we need a cleanup service for the StateStore as well as the Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both kinds of cleanup in the 
> ApplicationCleaner in GPG.






[jira] [Assigned] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2019-07-23 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang reassigned YARN-7652:
--

Assignee: Botong Huang  (was: hunshenshi)

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: YARN-7652.v1.patch, YARN-7652.v2.patch
>
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This JIRA proposes to move to an async model, similar 
> to how we operate with allocate invocations.






[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode

2019-07-22 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890402#comment-16890402
 ] 

Botong Huang commented on YARN-9689:


+[~giovanni.fumarola] for help

> Router does not support kerberos proxy when in secure mode
> --
>
> Key: YARN-9689
> URL: https://issues.apache.org/jira/browse/YARN-9689
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Priority: Major
>
> When we enable Kerberos in YARN Federation mode, we cannot get a new app, since 
> it will throw the Kerberos exception below, which should be handled!
> {code:java}
> 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2019-07-22,18:43:25,528 WARN 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: 
> Unable to create a new ApplicationId in SubCluster xxx
> java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. Failed 
> on local exception: java.io.IOException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos tgt)]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564)
> at org.apache.hadoop.ipc.Client.call(Client.java:1506)
> at org.apache.hadoop.ipc.Client.call(Client.java:1416)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2691)
> Caus

[jira] [Updated] (YARN-9108) fix FederationIntercepter merge home and secondary allocate response typo

2018-12-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-9108:
---
Fix Version/s: 3.3.0
   2.10.0

> fix FederationIntercepter merge home and secondary allocate response typo
> -
>
> Key: YARN-9108
> URL: https://issues.apache.org/jira/browse/YARN-9108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.3.0
>Reporter: Morty Zhong
>Assignee: Abhishek Modi
>Priority: Minor
> Fix For: 2.10.0, 3.3.0
>
> Attachments: YARN-9108.001.patch, YARN-9108.002.patch, 
> YARN-9108.003.patch, YARN-9108.004.patch, YARN-9108.005.patch, 
> YARN-9108.006.patch
>
>
> In method 'mergeAllocateResponse' of class FederationInterceptor.java, line 1315, 
> the left-hand variable `par2` should be `par1`:
> {code:java}
> if (par1 != null && par2 != null) {
>   par1.getResourceRequest().addAll(par2.getResourceRequest());
>   par2.getContainers().addAll(par2.getContainers());
> }
> {code}
> should be
> {code:java}
> if (par1 != null && par2 != null) {
>   par1.getResourceRequest().addAll(par2.getResourceRequest());
>   par1.getContainers().addAll(par2.getContainers());//edited line
> }
> {code}






[jira] [Commented] (YARN-9108) fix FederationIntercepter merge home and secondary allocate response typo

2018-12-22 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16727565#comment-16727565
 ] 

Botong Huang commented on YARN-9108:


+1. Committing to trunk and branch-2. Thanks [~Cedar] and [~abmodi] for the 
contribution and [~goiri] for reviewing. 

> fix FederationIntercepter merge home and secondary allocate response typo
> -
>
> Key: YARN-9108
> URL: https://issues.apache.org/jira/browse/YARN-9108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.3.0
>Reporter: Morty Zhong
>Assignee: Abhishek Modi
>Priority: Minor
> Attachments: YARN-9108.001.patch, YARN-9108.002.patch, 
> YARN-9108.003.patch, YARN-9108.004.patch, YARN-9108.005.patch, 
> YARN-9108.006.patch
>
>
> In method 'mergeAllocateResponse' of class FederationInterceptor.java, line 1315, 
> the left-hand variable `par2` should be `par1`:
> {code:java}
> if (par1 != null && par2 != null) {
>   par1.getResourceRequest().addAll(par2.getResourceRequest());
>   par2.getContainers().addAll(par2.getContainers());
> }
> {code}
> should be
> {code:java}
> if (par1 != null && par2 != null) {
>   par1.getResourceRequest().addAll(par2.getResourceRequest());
>   par1.getContainers().addAll(par2.getContainers());//edited line
> }
> {code}






[jira] [Updated] (YARN-9108) fix FederationIntercepter merge home and secondary allocate response typo

2018-12-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-9108:
---
Summary: fix FederationIntercepter merge home and secondary allocate 
response typo  (was: FederationIntercepter merge home and second response local 
variable spell mistake)

> fix FederationIntercepter merge home and secondary allocate response typo
> -
>
> Key: YARN-9108
> URL: https://issues.apache.org/jira/browse/YARN-9108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.3.0
>Reporter: Morty Zhong
>Assignee: Abhishek Modi
>Priority: Minor
> Attachments: YARN-9108.001.patch, YARN-9108.002.patch, 
> YARN-9108.003.patch, YARN-9108.004.patch, YARN-9108.005.patch, 
> YARN-9108.006.patch
>
>
> In method 'mergeAllocateResponse' of class FederationInterceptor.java, line 1315, 
> the left-hand variable `par2` should be `par1`:
> {code:java}
> if (par1 != null && par2 != null) {
>   par1.getResourceRequest().addAll(par2.getResourceRequest());
>   par2.getContainers().addAll(par2.getContainers());
> }
> {code}
> should be
> {code:java}
> if (par1 != null && par2 != null) {
>   par1.getResourceRequest().addAll(par2.getResourceRequest());
>   par1.getContainers().addAll(par2.getContainers());//edited line
> }
> {code}






[jira] [Commented] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-12-05 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710981#comment-16710981
 ] 

Botong Huang commented on YARN-9013:


Thanks [~giovanni.fumarola] for the review. Committing to YARN-7402. 

> [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
> 
>
> Key: YARN-9013
> URL: https://issues.apache.org/jira/browse/YARN-9013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-9013-YARN-7402.v1.patch, 
> YARN-9013-YARN-7402.v2.patch
>
>
> ApplicationCleaner today deletes the entries for all finished (non-running) 
> applications in the Yarn Registry using this logic:
>  # GPG gets the list of running applications from the Router.
>  # GPG gets the full list of applications in the registry.
>  # GPG deletes from the registry every app in 2 that’s not in 1.
> The problem is that jobs that started between 1 and 2 meet the criteria in 
> 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, 
> rather than 1->2->3.
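
As an illustration of the corrected ordering (2 -> 1 -> 3) described above, a 
hypothetical sketch could look like the following; the RegistryClient and 
RouterClient interfaces are placeholders, not the real GPG APIs:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the corrected ApplicationCleaner ordering.
public class ApplicationCleanerSketch {
  void cleanupRegistry(RegistryClient registry, RouterClient router) {
    // Step 2 first: snapshot the registry BEFORE asking the Router, so any app
    // submitted afterwards cannot appear only in the registry snapshot.
    Set<String> appsInRegistry = new HashSet<>(registry.listApplications());
    // Step 1 second: get the currently running applications from the Router.
    Set<String> runningApps = new HashSet<>(router.getRunningApplications());
    // Step 3: delete registry entries only for apps that are not running.
    for (String appId : appsInRegistry) {
      if (!runningApps.contains(appId)) {
        registry.delete(appId);
      }
    }
  }

  // Minimal placeholder interfaces so the sketch is self-contained.
  interface RegistryClient { Set<String> listApplications(); void delete(String appId); }
  interface RouterClient  { Set<String> getRunningApplications(); }
}
{code}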






[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2018-11-30 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7720:
---
Attachment: YARN-7720.v2.patch

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7720.v1.patch, YARN-7720.v2.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2018-11-30 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7720:
---
Attachment: YARN-9013-YARN-7402.v2.patch

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7720.v1.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2018-11-30 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7720:
---
Attachment: (was: YARN-9013-YARN-7402.v2.patch)

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7720.v1.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down

2018-11-30 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7720:
---
Attachment: YARN-7720.v1.patch

> Race condition between second app attempt and UAM timeout when first attempt 
> node is down
> -
>
> Key: YARN-7720
> URL: https://issues.apache.org/jira/browse/YARN-7720
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7720.v1.patch
>
>
> In Federation, multiple attempts of an application share the same UAM in each 
> secondary sub-cluster. When the first attempt fails, we rely on the fact that 
> the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
> (default 10 minutes). When the second attempt comes up in the home sub-cluster, 
> it will pick up the UAM token from the Yarn Registry and resume the UAM 
> heartbeat to the secondary RMs. 
> The default heartbeat timeout for both NM and AM is 10 minutes. The problem is 
> that when the first attempt node goes down or loses connectivity, the home RM 
> will only mark the first attempt as failed after 10 minutes, and only then 
> schedule the second attempt on some other node. By then the UAMs in the 
> secondaries are already timing out, and they might not survive until the second 
> attempt comes up. 






[jira] [Commented] (YARN-9049) Add application submit data to state store

2018-11-29 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703500#comment-16703500
 ] 

Botong Huang commented on YARN-9049:


Understood. Let me ask this way: forget about implementation details; in 
general, why would adding a future app data entry to _ApplicationData_ be easier 
than adding it directly to _ApplicationHomeSubCluster_?

I think at the API/interface level, the latter makes more sense, because 
_ApplicationHomeSubCluster_ should already serve as _ApplicationData_ if not 
renamed to it. If the former is easier for the MySQL implementation (by adding an 
extra layer), then this should be kept implementation-specific and not exposed 
in the API? 

> Add application submit data to state store
> --
>
> Key: YARN-9049
> URL: https://issues.apache.org/jira/browse/YARN-9049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
> Attachments: YARN-9049.001.path
>
>
> As per the discussion in YARN-8898, we need to persist trimmed 
> ApplicationSubmissionContext details to the federation State Store.






[jira] [Comment Edited] (YARN-9049) Add application submit data to state store

2018-11-29 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703500#comment-16703500
 ] 

Botong Huang edited comment on YARN-9049 at 11/29/18 4:57 PM:
--

Understood. Let me ask this way: forget about implementation details; in 
general, why would adding a future app data entry to _ApplicationData_ be easier 
than adding it directly to _ApplicationHomeSubCluster_?

I think at the API/interface level, the latter makes more sense, because 
_ApplicationHomeSubCluster_ should already serve as _ApplicationData_ (if not 
renamed to it). If the former is easier for the MySQL implementation (by adding an 
extra layer), then this should be kept implementation-specific and not exposed 
in the API? 


was (Author: botong):
Understood. Let me ask this way: forget about implementation details; in 
general, why would adding a future app data entry to _ApplicationData_ be easier 
than adding it directly to _ApplicationHomeSubCluster_?

I think at the API/interface level, the latter makes more sense, because 
_ApplicationHomeSubCluster_ should already serve as _ApplicationData_ if not 
renamed to it. If the former is easier for the MySQL implementation (by adding an 
extra layer), then this should be kept implementation-specific and not exposed 
in the API? 

> Add application submit data to state store
> --
>
> Key: YARN-9049
> URL: https://issues.apache.org/jira/browse/YARN-9049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
> Attachments: YARN-9049.001.path
>
>
> As per the discussion in YARN-8898, we need to persist trimmed 
> ApplicationSubmissionContext details to the federation State Store.






[jira] [Commented] (YARN-8934) [GPG] Add JvmMetricsInfo and pause monitor

2018-11-28 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702496#comment-16702496
 ] 

Botong Huang commented on YARN-8934:


Thanks [~BilwaST] for the patch. Committing to YARN-7402. 

> [GPG] Add JvmMetricsInfo and pause monitor
> --
>
> Key: YARN-8934
> URL: https://issues.apache.org/jira/browse/YARN-8934
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8934-001.patch, YARN-8934-YARN-7402.v1.patch, 
> YARN-8934-YARN-7402.v2.patch, YARN-8934-YARN-7402.v3.patch, 
> YARN-8934-YARN-7402.v4.patch, YARN-8934-YARN-7402.v5.patch, 
> image-2018-11-19-15-37-18-647.png
>
>
> Similar to the resourcemanager and nodemanager services, we can add JvmMetricsInfo 
> to the gpg service.






[jira] [Commented] (YARN-9049) Add application submit data to state store

2018-11-28 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702430#comment-16702430
 ] 

Botong Huang commented on YARN-9049:


Thanks [~bibinchundatt] for the patch! I just have one question: why not add 
_ApplicationSubmissionContext_ directly to _ApplicationHomeSubCluster_, instead 
of wrapping it with _ApplicationData_ first? Every entry of 
_ApplicationHomeSubCluster_ stores all the info about an app, including its home 
sub-cluster id. I think it should already serve as _ApplicationData_ if not 
renamed to it. 

> Add application submit data to state store
> --
>
> Key: YARN-9049
> URL: https://issues.apache.org/jira/browse/YARN-9049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
> Attachments: YARN-9049.001.path
>
>
> As per the discussion in YARN-8898, we need to persist trimmed 
> ApplicationSubmissionContext details to the federation State Store.






[jira] [Commented] (YARN-8934) [GPG] Add JvmMetricsInfo and pause monitor

2018-11-27 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700980#comment-16700980
 ] 

Botong Huang commented on YARN-8934:


Overall lgtm, one more small thing besides Bibin's: change 
GPG_WEBAPP_ENABLE_CORS_FILTER to use/start with GPG_WEBAPP_PREFIX

> [GPG] Add JvmMetricsInfo and pause monitor
> --
>
> Key: YARN-8934
> URL: https://issues.apache.org/jira/browse/YARN-8934
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8934-001.patch, YARN-8934-YARN-7402.v1.patch, 
> YARN-8934-YARN-7402.v2.patch, YARN-8934-YARN-7402.v3.patch, 
> YARN-8934-YARN-7402.v4.patch, image-2018-11-19-15-37-18-647.png
>
>
> Similar to the resourcemanager and nodemanager services, we can add JvmMetricsInfo 
> to the gpg service.






[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-13 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685911#comment-16685911
 ] 

Botong Huang commented on YARN-8898:


+[~giovanni.fumarola] as well

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8898.wip.patch
>
>
> FederationInterceptor#mergeAllocateResponses skips 
> application_priority in the returned response.






[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-13 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685637#comment-16685637
 ] 

Botong Huang commented on YARN-8898:


For the record, I was leaning towards Solution 2 later in the discussion :)
{quote}Anyways, it might be cleaner to go for Solution 2.
{quote}
In _FederationStateStore_ there's already an application table 
(_ApplicationHomeSubCluster_) that we can piggyback on. I think for future 
compatibility we should just put the _ApplicationSubmissionContext_ object in 
it, rather than creating a new trimmed type. If by trimming you meant setting 
some of the entries to null, then sure, by all means.

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8898.wip.patch
>
>
> FederationInterceptor#mergeAllocateResponses skips 
> application_priority in the returned response.






[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-11-12 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-9013:
---
Attachment: YARN-9013-YARN-7402.v2.patch

> [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
> 
>
> Key: YARN-9013
> URL: https://issues.apache.org/jira/browse/YARN-9013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-9013-YARN-7402.v1.patch, 
> YARN-9013-YARN-7402.v2.patch
>
>
> ApplicationCleaner today deletes the entries for all finished (non-running) 
> applications in the Yarn Registry using this logic:
>  # GPG gets the list of running applications from the Router.
>  # GPG gets the full list of applications in the registry.
>  # GPG deletes from the registry every app in 2 that’s not in 1.
> The problem is that jobs that started between 1 and 2 meet the criteria in 
> 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, 
> rather than 1->2->3.






[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-11-12 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-9013:
---
Attachment: YARN-9013-YARN-7402.v1.patch

> [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
> 
>
> Key: YARN-9013
> URL: https://issues.apache.org/jira/browse/YARN-9013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-9013-YARN-7402.v1.patch
>
>
> ApplicationCleaner today deletes the entries for all finished (non-running) 
> applications in the Yarn Registry using this logic:
>  # GPG gets the list of running applications from the Router.
>  # GPG gets the full list of applications in the registry.
>  # GPG deletes from the registry every app in 2 that’s not in 1.
> The problem is that jobs that started between 1 and 2 meet the criteria in 
> 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, 
> rather than 1->2->3.






[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-11-12 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-9013:
---
Parent Issue: YARN-7402  (was: YARN-5597)

> [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
> 
>
> Key: YARN-9013
> URL: https://issues.apache.org/jira/browse/YARN-9013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> ApplicationCleaner today deletes the entries for all finished (non-running) 
> applications in the Yarn Registry using this logic:
>  # GPG gets the list of running applications from the Router.
>  # GPG gets the full list of applications in the registry.
>  # GPG deletes from the registry every app in 2 that’s not in 1.
> The problem is that jobs that started between 1 and 2 meet the criteria in 
> 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, 
> rather than 1->2->3.






[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-11-12 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-9013:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-5597

> [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
> 
>
> Key: YARN-9013
> URL: https://issues.apache.org/jira/browse/YARN-9013
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> ApplicationCleaner today deletes the entries for all finished (non-running) 
> applications in the Yarn Registry using this logic:
>  # GPG gets the list of running applications from the Router.
>  # GPG gets the full list of applications in the registry.
>  # GPG deletes from the registry every app in 2 that’s not in 1.
> The problem is that jobs that started between 1 and 2 meet the criteria in 
> 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, 
> rather than 1->2->3.






[jira] [Created] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-11-12 Thread Botong Huang (JIRA)
Botong Huang created YARN-9013:
--

 Summary: [GPG] fix order of steps cleaning Registry entries in 
ApplicationCleaner
 Key: YARN-9013
 URL: https://issues.apache.org/jira/browse/YARN-9013
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


ApplicationCleaner today deletes the entries for all finished (non-running) 
applications in the Yarn Registry using this logic:
 # GPG gets the list of running applications from the Router.
 # GPG gets the full list of applications in the registry.
 # GPG deletes from the registry every app in 2 that’s not in 1.

The problem is that jobs that started between 1 and 2 meet the criteria in 3, 
and thus get deleted by mistake. The fix/right order should be 2->1->3, rather 
than 1->2->3.






[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor

2018-11-11 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682964#comment-16682964
 ] 

Botong Huang commented on YARN-8933:


Thanks [~bibinchundatt] for the comments and review, committing to trunk and 
branch-2

> [AMRMProxy] Fix potential empty fields in allocation response, move 
> SubClusterTimeout to FederationInterceptor
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. They can even be null/zero because the specific response is merged from 
> an empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved the sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that have expired (haven't had a successful allocate response 
> for a while) won't be included in the computation.
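
For illustration, a minimal hypothetical sketch of the "remember the last 
response per sub-cluster and aggregate over the non-expired ones" idea (names 
and fields are assumptions, not the patch's actual code):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: build cluster-wide fields from the last response of every sub-cluster.
public class ClusterWideInfoSketch {
  static class LastResponse {
    long lastResponseMs;  // when the last successful allocate response arrived
    int numClusterNodes;  // NumClusterNodes reported in that response
  }

  private final Map<String, LastResponse> lastResponses = new HashMap<>();
  private final long subClusterTimeoutMs;

  ClusterWideInfoSketch(long subClusterTimeoutMs) {
    this.subClusterTimeoutMs = subClusterTimeoutMs;
  }

  void remember(String subClusterId, int numClusterNodes, long nowMs) {
    LastResponse r = new LastResponse();
    r.lastResponseMs = nowMs;
    r.numClusterNodes = numClusterNodes;
    lastResponses.put(subClusterId, r);
  }

  // Sum NumClusterNodes over all sub-clusters that have not expired.
  int totalClusterNodes(long nowMs) {
    int total = 0;
    for (LastResponse r : lastResponses.values()) {
      if (nowMs - r.lastResponseMs <= subClusterTimeoutMs) {
        total += r.numClusterNodes;
      }
    }
    return total;
  }
}
{code}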






[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor

2018-11-11 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Fix Version/s: 3.3.0
   2.10.0

> [AMRMProxy] Fix potential empty fields in allocation response, move 
> SubClusterTimeout to FederationInterceptor
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. They can even be null/zero because the specific response is merged from 
> an empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved the sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that have expired (haven't had a successful allocate response 
> for a while) won't be included in the computation.






[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.

2018-11-11 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682962#comment-16682962
 ] 

Botong Huang commented on YARN-8980:


I agree. I am also worried about container leaks, since the new attempt (old) 
AM is not even aware of the existing containers from the UAMs. Note that the RM 
only supports one attempt for a UAM, and this UAM attempt is used throughout 
all AM attempts in the home SC.

I think that on top of the (1) you mentioned (clearing the token cache in the 
RM), _FederationInterceptor_ needs to know the _keepContainer_ flag of the 
original AM. If it is false, then after reattaching to the UAMs in 
_registerApplicationMaster_ it needs to release all running containers from the 
UAMs.

> Mapreduce application container start  fail after AM restart.
> -
>
> Key: YARN-8980
> URL: https://issues.apache.org/jira/browse/YARN-8980
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
>
> UAMs to sub-clusters are always launched with keepContainers.
> In AM restart scenarios, the UAM registers again with the RM and receives the 
> running containers with NMTokens. The NMTokens received by the UAM in 
> getPreviousAttemptContainersNMToken are never used by the MapReduce application. 
> The Federation Interceptor should take care of such scenarios too: merge the 
> NMTokens received at registration into the allocate response.
> Container allocation responses on the same node will have an empty NMToken.
>  






[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.

2018-11-10 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682485#comment-16682485
 ] 

Botong Huang commented on YARN-8980:


Thanks [~bibinchundatt] for reporting. This is along the lines of the discussion 
we are having in YARN-8898. Basically, it is better to use the original 
_ApplicationSubmissionContext_ for the app when launching the UAMs. We will 
probably need to go with Solution 2 discussed there: push the 
applicationSubmissionContext to the federationStore at the Router side as well. 
[~subru] what do you think? 

> Mapreduce application container start  fail after AM restart.
> -
>
> Key: YARN-8980
> URL: https://issues.apache.org/jira/browse/YARN-8980
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
>
> UAMs to sub-clusters are always launched with keepContainers.
> In AM restart scenarios, the UAM registers again with the RM and receives the 
> running containers with NMTokens. The NMTokens received by the UAM in 
> getPreviousAttemptContainersNMToken are never used by the MapReduce application. 
> The Federation Interceptor should take care of such scenarios too: merge the 
> NMTokens received at registration into the allocate response.
> Container allocation responses on the same node will have an empty NMToken.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor

2018-11-08 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680622#comment-16680622
 ] 

Botong Huang edited comment on YARN-8933 at 11/8/18 11:53 PM:
--

Good questions, there are several aspects: 
 # When we try to span to a new SC, we deliberately put (current time - 
subcluster timeout) into the map, so that initially the SC is considered 
expired, because the async UAM launch/reattach might fail or take a long time. 
We don't want to consider this SC as available/healthy and start routing 
resource requests there until we know for sure that it is ready (we received a 
heartbeat response from it). In fact, if the UAM launch fails, we will keep 
retrying in the background (triggered by new AM heartbeats). Without being 
initialized as expired, this SC would become a black hole sink for container 
requests (a sketch of this bookkeeping follows after this list). 
 # What you mentioned is possible: in some corner cases, for one AM heartbeat 
we consider the subcluster as expired/unhealthy. However, note that all we do 
then is skip routing new resource requests to this SC for that one heartbeat. 
A heartbeat without new resource requests will still be sent out to this SC, 
and if we get a successful response, it most likely won't be marked as expired 
next time. 
 # Initializing lastAMHeartbeatTime to -1 as a special value would work. I 
didn't do this because _MonotonicClock.getTime()_ can return negative values, 
including -1 (as opposed to System.currentTimeMillis(), which is always 
positive). I think initializing lastAMHeartbeatTime in the constructor is 
easier and would work as well. 
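A minimal sketch of that expiry bookkeeping, assuming a String sub-cluster id 
and a configurable timeout (only _MonotonicClock_ is a real YARN utility; the 
class and method names are illustrative, not the actual FederationInterceptor 
code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.util.MonotonicClock;

/** Sketch of the per-sub-cluster heartbeat/expiry tracking described above. */
public class SubClusterExpiryTracker {

  private final long subClusterTimeoutMs;
  private final MonotonicClock clock = new MonotonicClock();
  private final Map<String, Long> lastSuccessfulHeartbeat =
      new ConcurrentHashMap<>();

  public SubClusterExpiryTracker(long subClusterTimeoutMs) {
    this.subClusterTimeoutMs = subClusterTimeoutMs;
  }

  /**
   * Register a sub-cluster we are about to span to. It is deliberately seeded
   * as already expired: no requests are routed to it until the first
   * successful allocate response proves the UAM is up.
   */
  public void onSpanTo(String subClusterId) {
    lastSuccessfulHeartbeat.putIfAbsent(subClusterId,
        clock.getTime() - subClusterTimeoutMs);
  }

  /** Record a successful allocate response from this sub-cluster. */
  public void onSuccessfulHeartbeat(String subClusterId) {
    lastSuccessfulHeartbeat.put(subClusterId, clock.getTime());
  }

  /** Expired sub-clusters are skipped when routing new resource requests. */
  public boolean isExpired(String subClusterId) {
    Long last = lastSuccessfulHeartbeat.get(subClusterId);
    return last == null || clock.getTime() - last >= subClusterTimeoutMs;
  }
}
{code}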


was (Author: botong):
Good questions, there are several aspects: 
 # When we try to span to a new SC. We deliberately put (current time - 
subcluster timeout) into the map so that initially it is considered expired 
because the async UAM launch/reattach might fail/took a long time. We don't 
want to consider this SC as available/healthy and start routing resource 
requests there until we know for sure that it is ready (received a heartbeat 
response from it). In fact if the UAM launch fails, we will keep trying in the 
background (triggered by new AM heartbeat). Without being initialized as 
expired, this SC will become a black hole sink for container requests. 
 # What you mentioned is possible, that in some corner cases for one AM 
heartbeat, we consider the subcluster as expired/unhealthy. However note that 
all we do is not routing new resource request to this SC for this heartbeat 
only. A heartbeat without new resource request will still be send out to this 
SC and if we get a response successfully, next time it won't be marked as 
expired, most likely. 
 # Initializing the lastheartbeat as -1 as a special value would work. I didn't 
do this because _MonotonicClock.getTime()_ can return negative value as well as 
-1 (as opposed to System.currentTimeMillis() is always positive). I think 
initializing the lastAMHeartbeatTime in constructor easier and would work as 
well. 

> [AMRMProxy] Fix potential empty fields in allocation response, move 
> SubClusterTimeout to FederationInterceptor
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor

2018-11-08 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680622#comment-16680622
 ] 

Botong Huang commented on YARN-8933:


Good questions, there are several aspects: 
 # When we try to span to a new SC, we deliberately put (current time - 
subcluster timeout) into the map, so that initially the SC is considered 
expired, because the async UAM launch/reattach might fail or take a long time. 
We don't want to consider this SC as available/healthy and start routing 
resource requests there until we know for sure that it is ready (we received a 
heartbeat response from it). In fact, if the UAM launch fails, we will keep 
retrying in the background (triggered by new AM heartbeats). Without being 
initialized as expired, this SC would become a black hole sink for container 
requests. 
 # What you mentioned is possible: in some corner cases, for one AM heartbeat 
we consider the subcluster as expired/unhealthy. However, note that all we do 
then is skip routing new resource requests to this SC for that one heartbeat. 
A heartbeat without new resource requests will still be sent out to this SC, 
and if we get a successful response, it most likely won't be marked as expired 
next time. 
 # Initializing the last heartbeat time to -1 as a special value would work. I 
didn't do this because _MonotonicClock.getTime()_ can return negative values, 
including -1 (as opposed to System.currentTimeMillis(), which is always 
positive). I think initializing lastAMHeartbeatTime in the constructor is 
easier and would work as well. 

> [AMRMProxy] Fix potential empty fields in allocation response, move 
> SubClusterTimeout to FederationInterceptor
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor

2018-11-08 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675623#comment-16675623
 ] 

Botong Huang edited comment on YARN-8933 at 11/8/18 12:25 AM:
--

Ah good catch, and thx for reviewing! 

Can you explain a bit what you mean by test recover case? There's already a 
_testRecoverWith(out)AMRMProxyHA_ in _TestFederationInterceptor_. 


was (Author: botong):
Ah good catch, and thx for reviewing! 

> [AMRMProxy] Fix potential empty fields in allocation response, move 
> SubClusterTimeout to FederationInterceptor
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty

2018-11-08 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680065#comment-16680065
 ] 

Botong Huang commented on YARN-8984:


bq. ContainerPBImpl#getAllocationTags() will new a empty hashSet when the tag 
is null. 
We should not assume the implementation (ContainerPBImpl) will do so in 
general. Other implementations of Container in the future might still return 
null. So please keep the null check here. You can remove the isEmpty() check if 
needed. 
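A tiny sketch of that defensive handling, assuming the client keys its 
outstandingSchedRequests lookups by the allocation-tag set (the helper class is 
illustrative; only _Container#getAllocationTags_ is a real API):

{code:java}
import java.util.Collections;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.Container;

/**
 * Sketch: never assume the Container implementation returns a non-null tag
 * set, and treat null and empty tags the same way when matching against
 * outstandingSchedRequests.
 */
public final class AllocationTagKey {

  private AllocationTagKey() {
  }

  public static Set<String> tagsOf(Container container) {
    Set<String> tags = container.getAllocationTags();
    // ContainerPBImpl happens to return an empty set today, but other
    // Container implementations may still return null.
    return tags == null ? Collections.<String>emptySet() : tags;
  }
}
{code}

Normalizing to an empty set keeps the null check in place while letting null 
and empty tags hit the same cleanup path when their containers are allocated.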

> AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
> --
>
> Key: YARN-8984
> URL: https://issues.apache.org/jira/browse/YARN-8984
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
> Attachments: YARN-8984-001.patch, YARN-8984-002.patch, 
> YARN-8984-003.patch
>
>
> In AMRMClient, outstandingSchedRequests should be removed or decreased when 
> container allocated. However, it could not work when allocation tag is null 
> or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-11-07 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Attachment: YARN-8933.v3.patch

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor

2018-11-07 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Summary: [AMRMProxy] Fix potential empty fields in allocation response, 
move SubClusterTimeout to FederationInterceptor  (was: [AMRMProxy] Fix 
potential empty AvailableResource and NumClusterNode in allocation response)

> [AMRMProxy] Fix potential empty fields in allocation response, move 
> SubClusterTimeout to FederationInterceptor
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, 
> YARN-8933.v3.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty

2018-11-07 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678606#comment-16678606
 ] 

Botong Huang edited comment on YARN-8984 at 11/7/18 6:27 PM:
-

Took a quick look. It is expected for AMRMClient to re-send all 
outstanding/pending requests after an RM master-slave switch. When a container 
is allocated, we should remove it from the outstanding list, which is exactly 
what _removeFromOutstandingSchedulingRequests()_ is doing here. If we are not 
cleaning it up properly, it is very likely because RM is not feeding the proper 
allocationTags into the allocated _Container_ object? So we need to fix that 
instead of removing the null check here? 


was (Author: botong):
Took a quick look. It is expected for AMRMClient to re-send all pending request 
after an RM failover. Whenever a container is allocated, we should remove it 
from the pending list, which is exactly what 
_removeFromOutstandingSchedulingRequests()_ is doing here. If we are not 
cleaning it up properly, very likely is it because RM is not feeding in the 
proper allocationTags in the allocated Container? So we need to fix this 
instead of removing the null check here? 

> AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
> --
>
> Key: YARN-8984
> URL: https://issues.apache.org/jira/browse/YARN-8984
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
> Attachments: YARN-8984-001.patch, YARN-8984-002.patch, 
> YARN-8984-003.patch
>
>
> In AMRMClient, outstandingSchedRequests should be removed or decreased when 
> container allocated. However, it could not work when allocation tag is null 
> or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty

2018-11-07 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678606#comment-16678606
 ] 

Botong Huang commented on YARN-8984:


Took a quick look. It is expected for AMRMClient to re-send all pending 
requests after an RM failover. Whenever a container is allocated, we should 
remove it from the pending list, which is exactly what 
_removeFromOutstandingSchedulingRequests()_ is doing here. If we are not 
cleaning it up properly, it is very likely because RM is not feeding the proper 
allocationTags into the allocated Container? So we need to fix that instead of 
removing the null check here? 

> AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
> --
>
> Key: YARN-8984
> URL: https://issues.apache.org/jira/browse/YARN-8984
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
> Attachments: YARN-8984-001.patch, YARN-8984-002.patch, 
> YARN-8984-003.patch
>
>
> In AMRMClient, outstandingSchedRequests should be removed or decreased when 
> container allocated. However, it could not work when allocation tag is null 
> or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty

2018-11-07 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678554#comment-16678554
 ] 

Botong Huang commented on YARN-8984:


+[~kkaranasos]

> AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
> --
>
> Key: YARN-8984
> URL: https://issues.apache.org/jira/browse/YARN-8984
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
> Attachments: YARN-8984-001.patch, YARN-8984-002.patch, 
> YARN-8984-003.patch
>
>
> In AMRMClient, outstandingSchedRequests should be removed or decreased when 
> container allocated. However, it could not work when allocation tag is null 
> or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-06 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677330#comment-16677330
 ] 

Botong Huang commented on YARN-8898:


Technically most of this info is already available to the proxy, in the 
_ContainerToken_ in the _ContainerLaunchContext_ as well as in the 
_AllocateResponse_. This is how the AM gets it and passes it on to its 
containers later. Anyway, it might be cleaner to go for Solution 2. Let's see 
what [~subru] thinks. 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> In case of FederationInterceptor#mergeAllocateResponses skips 
> application_priority in response returned



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-11-05 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675623#comment-16675623
 ] 

Botong Huang commented on YARN-8933:


Ah good catch, and thx for reviewing! 

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-05 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675609#comment-16675609
 ] 

Botong Huang commented on YARN-8898:


bq. Better option could be pushing along with ApplicationHomeSubCluster the 
application Submission Context too. And let interceptor query when AM 
registration happens.
If necessary, yes, I agree this works. But if you are talking about 
ApplicationPriority alone, the change would seem big (Router, StateStore, 
AMRMProxy). Down the line we might need to deal with two sources of truth (the 
StateStore vs. the RM allocate response) as well. On the other hand, the 
existing priority value is in the AllocateResponse, and thus we are relying on 
the RM version rather than the AM version. We can cherry-pick YARN-4170 to 2.7 
if needed. For old RM versions where this value is not fed in, I guess we can 
leave the UAM at the default priority. What do you think? 
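A minimal sketch of that fallback, assuming the interceptor keeps the last home 
allocate response around. _AllocateResponse#getApplicationPriority_, 
_ApplicationSubmissionContext#setPriority_ and _Priority#newInstance_ are real 
APIs; the class itself and the default priority of 0 are assumptions:

{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;

/**
 * Sketch: when launching a UAM, copy the application priority from the last
 * remembered home-RM allocate response, falling back to a default when the
 * home RM (pre-YARN-4170) does not report one.
 */
public final class UamPrioritySetter {

  // Assumed default; the real code may pick a different fallback.
  private static final Priority DEFAULT_PRIORITY = Priority.newInstance(0);

  private UamPrioritySetter() {
  }

  public static void applyHomePriority(AllocateResponse lastHomeResponse,
      ApplicationSubmissionContext uamContext) {
    Priority priority = lastHomeResponse == null
        ? null : lastHomeResponse.getApplicationPriority();
    uamContext.setPriority(priority != null ? priority : DEFAULT_PRIORITY);
  }
}
{code}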

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> In case of FederationInterceptor#mergeAllocateResponses skips 
> application_priority in response returned



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7631) ResourceRequest with different Capacity (Resource) overrides each other in RM and thus lost

2018-11-05 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675566#comment-16675566
 ] 

Botong Huang commented on YARN-7631:


Please consider directly using _ResourceRequestSetKey_ to replace 
_SchedulerRequestKey_ for this, thx!
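For illustration, a composite key along those lines, covering the fields listed 
in the description below so that requests differing only in Resource are kept 
apart. This is a sketch, not the actual ResourceRequestSetKey class:

{code:java}
import java.util.Objects;

import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;

/** Sketch: include capability and execution type in the request key. */
public final class RequestKeySketch {

  private final long allocationRequestId;
  private final Priority priority;
  private final String resourceName;
  private final ExecutionType executionType;
  private final Resource capability;

  public RequestKeySketch(long allocationRequestId, Priority priority,
      String resourceName, ExecutionType executionType, Resource capability) {
    this.allocationRequestId = allocationRequestId;
    this.priority = priority;
    this.resourceName = resourceName;
    this.executionType = executionType;
    this.capability = capability;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof RequestKeySketch)) {
      return false;
    }
    RequestKeySketch other = (RequestKeySketch) o;
    // Capability is part of the key, so RRs with different Resource values
    // never collapse into one entry.
    return allocationRequestId == other.allocationRequestId
        && Objects.equals(priority, other.priority)
        && Objects.equals(resourceName, other.resourceName)
        && executionType == other.executionType
        && Objects.equals(capability, other.capability);
  }

  @Override
  public int hashCode() {
    return Objects.hash(allocationRequestId, priority, resourceName,
        executionType, capability);
  }
}
{code}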

> ResourceRequest with different Capacity (Resource) overrides each other in RM 
> and thus lost
> ---
>
> Key: YARN-7631
> URL: https://issues.apache.org/jira/browse/YARN-7631
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Botong Huang
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: resourcebug.patch
>
>
> Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> 
> Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> 
> ResourceRequestInfo (the actual RR). This means that only RRs with the same 
> (requestId, priority, resourcename, executionType, resource) will be grouped 
> and aggregated together. 
> While in RM side, the mapping is SchedulerRequestKey (RequestId, priority) -> 
> LocalityAppPlacementAllocator (ResourceName -> RR). 
> The issue is that in RM side Resource is not in the key to the RR at all. 
> (Note that executionType is also not in the RM side, but it is fine because 
> RM handles it separately as container update requests.) This means that under 
> the same value of (requestId, priority, resourcename), RRs with different 
> Resource values will be grouped together and override each other in RM. As a 
> result, some of the container requests are lost and will never be allocated. 
> Furthermore, since the two RRs are kept under different keys in AMRMClient 
> side, allocation of RR1 will only trigger cancel for RR1, the pending RR2 
> will not get resend as well. 
> I’ve attached an unit test (resourcebug.patch) which is failing in trunk to 
> illustrate this issue. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-11-02 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Fix Version/s: 2.10.0

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-11-02 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673807#comment-16673807
 ] 

Botong Huang commented on YARN-8893:


Thanks [~giovanni.fumarola] for the review! Committed to branch-2 as well. 

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-11-02 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673748#comment-16673748
 ] 

Botong Huang commented on YARN-8893:


This also performs a proper shutdown and closes the connection in the forceKill 
case, so that after we force-kill the UAM, our local proxy connection inside 
the AMRMClientRelayer won't be left open. 
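A minimal sketch of that shutdown discipline, with _RmProxy_ as a hypothetical 
stand-in for the relayer's RPC client (the actual cleanup in the patch may 
differ):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch: whether the UAM finishes normally or is force-killed, stop the
 * heartbeat thread pool and close the local RM proxy so nothing leaks.
 */
public class RelayerShutdownSketch implements AutoCloseable {

  /** Hypothetical handle for the ApplicationMasterProtocol proxy. */
  public interface RmProxy extends AutoCloseable {
  }

  private final ExecutorService heartbeatExecutor;
  private final RmProxy rmProxy;

  public RelayerShutdownSketch(ExecutorService heartbeatExecutor,
      RmProxy rmProxy) {
    this.heartbeatExecutor = heartbeatExecutor;
    this.rmProxy = rmProxy;
  }

  /** Called both on normal finish and on the forceKill path. */
  @Override
  public void close() throws Exception {
    // Stop the heartbeat threads first so no new calls go out on the proxy.
    heartbeatExecutor.shutdownNow();
    heartbeatExecutor.awaitTermination(5, TimeUnit.SECONDS);
    // Then close the RPC connection itself.
    rmProxy.close();
  }
}
{code}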

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-10-25 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664064#comment-16664064
 ] 

Botong Huang commented on YARN-8898:


I meant I don't have the code. Feel free to take a crack at it. Please do it on 
top of YARN-8933 and use the last responses from all SCs.

Since FederationInterceptor sits between AM and RM, I don't think it can get 
the ApplicationSubmissionContext easily. When AMRMProxy initializes the 
interceptor pipeline for the AM, it has the ContainerLaunchContext for the AM, 
and currently that is not passed into the interceptors either.

I agree that FederationInterceptor needs more information; I think it is better 
to use/add fields in the AM-RM allocate protocol. Generally it should be able 
to figure out all the information it needs by looking at the communication 
between the AM and the (home) RM, e.g. application priority, node label, etc.

If application priority can change over time, then I think we should just 
follow the application priority in the last home RM response (reusing 
YARN-8933). Whenever it detects a priority change in the home SC, perhaps 
FederationInterceptor should change the priority of the UAMs in the secondaries 
as well. I think we may or may not need this last part for now; I am okay with 
both ways. But when we launch the UAM initially, we should definitely make sure 
to submit it with the same priority as the home SC at that moment.

Regarding the Router, the current design is that the Router only tracks the 
home SC for an application. The expansion to (whichever subset of) secondary 
SCs is solely up to the FederationInterceptor, according to the AMRMProxy 
policy; the Router should not be aware of it. So when the client updates the 
priority for the app, the Router should only update it in the home RM and leave 
the rest to the FederationInterceptor.

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> In case of FederationInterceptor#mergeAllocateResponses skips 
> application_priority in response returned



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-10-23 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661702#comment-16661702
 ] 

Botong Huang commented on YARN-8933:


The TestContainerManager failure is not related; it is tracked under YARN-8672.

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-10-23 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661487#comment-16661487
 ] 

Botong Huang commented on YARN-8898:


Hi [~bibinchundatt], good questions. 
bq. Application priority update, what should be the behaviour if subcluster's 
priorities are not same. Should we be considering appriority of home cluster 
always?
Since I missed this priority field when I ported the code to trunk, right now 
the UAMs in secondary sub-clusters don't set this priority at all in their 
ApplicationSubmissionContext in UnmanagedApplicationMaster. From browsing the 
code, my understanding is that once submitted, the app priority won't change 
any more, correct? I think we should submit the UAMs with the right priority in 
the first place. 

bq. Also in case of async response from subcluster we should maintain response 
order too. 
bq. If response from home cluster not received during merge of response, 
probably we have to remember the last response from home cluster.
For these two, I am actually already doing this in YARN-8933. Please take a 
look. I think we should just always take the priority from the last remembered 
home response. 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> In case of FederationInterceptor#mergeAllocateResponses skips 
> application_priority in response returned



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-10-23 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Attachment: YARN-8933.v2.patch

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

2018-10-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Summary: [AMRMProxy] Fix potential empty AvailableResource and 
NumClusterNode in allocation response  (was: [AMRMProxy] Fix potential null 
AvailableResource and NumClusterNode in allocation response)

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in 
> allocation response
> ---
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response

2018-10-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Attachment: YARN-8933.v1.patch

> [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in 
> allocation response
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8933.v1.patch
>
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response

2018-10-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-5597

> [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in 
> allocation response
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response

2018-10-22 Thread Botong Huang (JIRA)
Botong Huang created YARN-8933:
--

 Summary: [AMRMProxy] Fix potential null AvailableResource and 
NumClusterNode in allocation response
 Key: YARN-8933
 URL: https://issues.apache.org/jira/browse/YARN-8933
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


After YARN-8696, the allocate response by FederationInterceptor is merged from 
the responses from a random subset of all sub-clusters, depending on the async 
heartbeat timing. As a result, cluster-wide information fields in the response, 
e.g. AvailableResources and NumClusterNodes, are not consistent at all. It can 
even be null/zero because the specific response is merged from an empty set of 
sub-cluster responses. 

In this patch, we let FederationInterceptor remember the last allocate response 
from all known sub-clusters, and always construct the cluster-wide info fields 
from all of them. We also moved sub-cluster timeout from 
LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that sub-clusters 
that expired (haven't had a successful allocate response for a while) won't be 
included in the computation.
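As a rough illustration of that merge (not the actual patch), the cluster-wide 
fields can be rebuilt from the remembered responses of the sub-clusters that 
have not expired. The String key and the expiry predicate are stand-ins for the 
real sub-cluster id and timeout bookkeeping, and the merged response is assumed 
to be mutable via its setters:

{code:java}
import java.util.Map;
import java.util.function.Predicate;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

/** Sketch: rebuild cluster-wide fields from the last per-SC responses. */
public final class ClusterWideInfoSketch {

  private ClusterWideInfoSketch() {
  }

  public static void fillClusterWideFields(
      Map<String, AllocateResponse> lastResponsePerSubCluster,
      Predicate<String> isExpired,
      AllocateResponse mergedResponse) {
    Resource totalAvailable = Resource.newInstance(0, 0);
    int totalNodes = 0;
    for (Map.Entry<String, AllocateResponse> e
        : lastResponsePerSubCluster.entrySet()) {
      if (isExpired.test(e.getKey())) {
        // Skip sub-clusters without a recent successful allocate response.
        continue;
      }
      AllocateResponse last = e.getValue();
      if (last.getAvailableResources() != null) {
        Resources.addTo(totalAvailable, last.getAvailableResources());
      }
      totalNodes += last.getNumClusterNodes();
    }
    mergedResponse.setAvailableResources(totalAvailable);
    mergedResponse.setNumClusterNodes(totalNodes);
  }
}
{code}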



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response

2018-10-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8933:
---
Component/s: federation
 amrmproxy

> [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in 
> allocation response
> --
>
> Key: YARN-8933
> URL: https://issues.apache.org/jira/browse/YARN-8933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> After YARN-8696, the allocate response by FederationInterceptor is merged 
> from the responses from a random subset of all sub-clusters, depending on the 
> async heartbeat timing. As a result, cluster-wide information fields in the 
> response, e.g. AvailableResources and NumClusterNodes, are not consistent at 
> all. It can even be null/zero because the specific response is merged from an 
> empty set of sub-cluster responses. 
> In this patch, we let FederationInterceptor remember the last allocate 
> response from all known sub-clusters, and always construct the cluster-wide 
> info fields from all of them. We also moved sub-cluster timeout from 
> LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that 
> sub-clusters that expired (haven't had a successful allocate response for a 
> while) won't be included in the computation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-22 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659262#comment-16659262
 ] 

Botong Huang commented on YARN-8893:


The cetest failure in NM is being tracked in YARN-8922

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8862) [GPG] Add Yarn Registry cleanup in ApplicationCleaner

2018-10-18 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655696#comment-16655696
 ] 

Botong Huang commented on YARN-8862:


Committed to YARN-7402. Thanks [~bibinchundatt] and [~giovanni.fumarola] for 
reviewing!

> [GPG] Add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, 
> YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch, 
> YARN-8862-YARN-7402.v6.patch
>
>
> In Yarn Federation, we use Yarn Registry to use the AMToken for UAMs in 
> secondary sub-clusters. Because of potential more app attempts later, 
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt 
> finishes. So similar to the StateStore application table, we need 
> ApplicationCleaner in GPG to cleanup the finished app entries in Yarn 
> Registry. 
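For illustration, a rough sketch of such a cleanup pass. _RegistryFacade_ and 
the set of known applications (fetched from the Router/StateStore) are 
hypothetical stand-ins, not the actual GPG or registry APIs:

{code:java}
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationId;

/** Sketch: drop registry entries for apps the federation no longer knows. */
public class RegistryAppCleanerSketch {

  /** Hypothetical wrapper around the Yarn Registry client. */
  public interface RegistryFacade {
    Iterable<ApplicationId> listAppEntries();

    void deleteAppEntry(ApplicationId appId);
  }

  private final RegistryFacade registry;

  public RegistryAppCleanerSketch(RegistryFacade registry) {
    this.registry = registry;
  }

  /**
   * Remove UAM token entries for applications that are no longer known to
   * the federation, i.e. they have finished in the home sub-cluster.
   */
  public void cleanUp(Set<ApplicationId> knownApplications) {
    for (ApplicationId appId : registry.listAppEntries()) {
      if (!knownApplications.contains(appId)) {
        registry.deleteAppEntry(appId);
      }
    }
  }
}
{code}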



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-17 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v6.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, 
> YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch, 
> YARN-8862-YARN-7402.v6.patch
>
>
> In Yarn Federation, we use Yarn Registry to use the AMToken for UAMs in 
> secondary sub-clusters. Because of potential more app attempts later, 
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt 
> finishes. So similar to the StateStore application table, we need 
> ApplicationCleaner in GPG to cleanup the finished app entries in Yarn 
> Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-17 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654120#comment-16654120
 ] 

Botong Huang commented on YARN-8862:


Thanks [~bibinchundatt] and [~giovanni.fumarola] for the comments! v5 patch 
uploaded. 

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, 
> YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch
>
>
> In Yarn Federation, we use Yarn Registry to use the AMToken for UAMs in 
> secondary sub-clusters. Because of potential more app attempts later, 
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt 
> finishes. So similar to the StateStore application table, we need 
> ApplicationCleaner in GPG to cleanup the finished app entries in Yarn 
> Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-17 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v5.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, 
> YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch
>
>
> In Yarn Federation, we use Yarn Registry to use the AMToken for UAMs in 
> secondary sub-clusters. Because of potential more app attempts later, 
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt 
> finishes. So similar to the StateStore application table, we need 
> ApplicationCleaner in GPG to cleanup the finished app entries in Yarn 
> Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-17 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Attachment: YARN-8893.v2.patch

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-10-17 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653757#comment-16653757
 ] 

Botong Huang commented on YARN-8898:


Good catch, thanks for reporting!

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> In case of FederationInterceptor#mergeAllocateResponses skips 
> application_priority in response returned



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Attachment: YARN-8893.v1.patch

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Component/s: federation
 amrmproxy

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-5597

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)
Botong Huang created YARN-8893:
--

 Summary: [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM 
client
 Key: YARN-8893
 URL: https://issues.apache.org/jira/browse/YARN-8893
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Fix thread leak in AMRMClientRelayer and UAM client used by 
FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v4.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, 
> YARN-8862-YARN-7402.v4.patch
>
>
> In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
> secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
> cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
> similar to the StateStore application table, we need an ApplicationCleaner in 
> GPG to clean up the finished app entries in the Yarn Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8481) AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8481:
---
Issue Type: Sub-task  (was: Bug)
Parent: YARN-5597

> AMRMProxyPolicies should accept heartbeat response from new/unknown 
> subclusters
> ---
>
> Key: YARN-8481
> URL: https://issues.apache.org/jira/browse/YARN-8481
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 2.9.2
>
> Attachments: YARN-8481.v1.patch
>
>
> Currently BroadcastAMRMProxyPolicy assumes that the application only spans 
> the sub-clusters that the policy itself selected via _splitResourceRequests_. 
> However, with AMRMProxy HA, a second attempt of the application might come up 
> already spanning multiple sub-clusters, without consulting the AMRMProxyPolicy 
> at all. This leads to exceptions in _notifyOfResponse_. The policy should 
> simply accept heartbeat responses from new/unknown sub-clusters. 
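
A minimal sketch of the behavioral change, assuming the federation SubClusterId record type; the bookkeeping map and method body are illustrative, not the actual BroadcastAMRMProxyPolicy code:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;

public class LenientNotifySketch {

  // Sub-clusters this policy has seen so far (weights/stats omitted).
  private final Map<SubClusterId, Long> knownSubClusters =
      new ConcurrentHashMap<>();

  /** Accept responses even from sub-clusters the policy never routed to. */
  public void notifyOfResponse(SubClusterId subClusterId,
      AllocateResponse response) {
    // Instead of throwing for an unknown sub-cluster (e.g. one resumed by a
    // second attempt from the Yarn Registry), start tracking it.
    knownSubClusters.putIfAbsent(subClusterId, System.currentTimeMillis());
    // ... continue with normal bookkeeping on the response ...
  }
}
{code}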



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8855) Application fails if one of the sub-clusters is down.

2018-10-10 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645703#comment-16645703
 ] 

Botong Huang commented on YARN-8855:


Hi [~bibinchundatt], yes, YARN-7652 can be the reason. There are other 
possibilities as well, e.g. YARN-8581, depending on the config setup. 

If a sub-cluster has been gone for longer than a certain amount of time, the 
SubclusterCleaner in GPG (YARN-6648) will mark the sub-cluster as LOST in the 
StateStore, and AMRMProxy will eventually pick that up. 

> Application fails if one of the sub-clusters is down.
> 
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Reporter: Rahul Anand
>Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying multiple 
> times and then fails. About 30 failover attempts were found in the logs. 
> Below is the detailed exception. 
> {code:java}
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container 
> container_e03_1538297667953_0005_01_01 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing 
> container_e03_1538297667953_0005_01_01 from application 
> application_1538297667953_0005 | ApplicationImpl.java:512
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> resource-monitoring for container_e03_1538297667953_0005_01_01 | 
> ContainersMonitorImpl.java:932
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering 
> container container_e03_1538297667953_0005_01_01 for log-aggregation | 
> AppLogAggregatorImpl.java:538
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event 
> CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> container container_e03_1538297667953_0005_01_01 | 
> YarnShuffleService.java:295
> 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find 
> container container_e03_1538297667953_0005_01_01 while processing 
> FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
> 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed 
> containers from NM context: [container_e03_1538297667953_0005_01_01] | 
> NodeStatusUpdaterImpl.java:696
> 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 28 failover attempts. Trying to failover after sleeping for 15261ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 29 failover attempts. Trying to failover after sleeping for 21175ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> Federati

[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-10 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v3.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch
>
>
> In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
> secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
> cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
> similar to the StateStore application table, we need an ApplicationCleaner in 
> GPG to clean up the finished app entries in the Yarn Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-09 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v2.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch
>
>
> In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
> secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
> cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
> similar to the StateStore application table, we need an ApplicationCleaner in 
> GPG to clean up the finished app entries in the Yarn Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-09 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v1.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch
>
>
> In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
> secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
> cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
> similar to the StateStore application table, we need an ApplicationCleaner in 
> GPG to clean up the finished app entries in the Yarn Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-09 Thread Botong Huang (JIRA)
Botong Huang created YARN-8862:
--

 Summary: [GPG] add Yarn Registry cleanup in ApplicationCleaner
 Key: YARN-8862
 URL: https://issues.apache.org/jira/browse/YARN-8862
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
similar to the StateStore application table, we need an ApplicationCleaner in 
GPG to clean up the finished app entries in the Yarn Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-09 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-7402

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
> secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
> cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
> similar to the StateStore application table, we need an ApplicationCleaner in 
> GPG to clean up the finished app entries in the Yarn Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2018-10-09 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643789#comment-16643789
 ] 

Botong Huang commented on YARN-7652:


Thanks [~goiri] for the review and commit!

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: YARN-7652.v1.patch, YARN-7652.v2.patch
>
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This jira proposes to move to async similar to how we 
> operate with allocate invocations.
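
A hedged sketch of what "register asynchronously" can look like, with a single worker thread per sub-cluster; registerWithSubCluster is a hypothetical placeholder for the real RPC, not the actual FederationInterceptor code:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;

public class AsyncRegisterSketch {

  private final ExecutorService registerExecutor =
      Executors.newSingleThreadExecutor();

  /**
   * Fire the sub-cluster register off-thread; the AM-facing call can return
   * without blocking on a sub-cluster with stale StateStore info.
   */
  public void registerAsync(final RegisterApplicationMasterRequest request) {
    registerExecutor.submit(() -> {
      try {
        registerWithSubCluster(request);
      } catch (Exception e) {
        // Log and retry later; the failure is not propagated to the AM.
      }
    });
  }

  // Hypothetical placeholder for the actual
  // ApplicationMasterProtocol.registerApplicationMaster call.
  protected void registerWithSubCluster(
      RegisterApplicationMasterRequest request) throws Exception {
  }
}
{code}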



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8855) Application fails if one of the sub-clusters is down.

2018-10-08 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642204#comment-16642204
 ] 

Botong Huang commented on YARN-8855:


Thanks [~rahulanand90] for reporting it! Which federation policy 
(yarn.federation.policy-manager) and code version are you using? This should 
have been fixed in latest trunk and branch-2.

> Application fails if one of the sub-clusters is down.
> 
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rahul Anand
>Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying multiple 
> times and then fails. About 30 failover attempts were found in the logs. 
> Below is the detailed exception. 
> {code:java}
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container 
> container_e03_1538297667953_0005_01_01 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing 
> container_e03_1538297667953_0005_01_01 from application 
> application_1538297667953_0005 | ApplicationImpl.java:512
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> resource-monitoring for container_e03_1538297667953_0005_01_01 | 
> ContainersMonitorImpl.java:932
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering 
> container container_e03_1538297667953_0005_01_01 for log-aggregation | 
> AppLogAggregatorImpl.java:538
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event 
> CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> container container_e03_1538297667953_0005_01_01 | 
> YarnShuffleService.java:295
> 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find 
> container container_e03_1538297667953_0005_01_01 while processing 
> FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
> 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed 
> containers from NM context: [container_e03_1538297667953_0005_01_01] | 
> NodeStatusUpdaterImpl.java:696
> 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 28 failover attempts. Trying to failover after sleeping for 15261ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 29 failover attempts. Trying to failover after sleeping for 21175ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on 

[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2018-10-07 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7652:
---
Attachment: YARN-7652.v2.patch

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7652.v1.patch, YARN-7652.v2.patch
>
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This jira proposes to move to async similar to how we 
> operate with allocate invocations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8837) TestNMProxy.testNMProxyRPCRetry Improvement

2018-10-06 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640751#comment-16640751
 ] 

Botong Huang commented on YARN-8837:


Nvm, it is fixed in YARN-8844 now. Please address Jason's comment as an 
improvement. Thanks! 

> TestNMProxy.testNMProxyRPCRetry Improvement
> ---
>
> Key: YARN-8837
> URL: https://issues.apache.org/jira/browse/YARN-8837
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: YARN-8789.1.patch
>
>
> The unit test 
> {{org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRetry()}}
>  has had some issues in the past. You can search JIRA for it, but one example 
> is [YARN-5104]. I recently had some issues with it myself and found the 
> following change helpful in troubleshooting.
> {code:java|title=Current Implementation}
> } catch (IOException e) {
>   // socket exception should be thrown immediately, without RPC retries.
>   Assert.assertTrue(e instanceof java.net.SocketException);
> }
> {code}
> The issue here is that the assertion is a bare true/false check. The testing 
> framework does not give any feedback about the type of exception that was 
> actually thrown; it just says "assertion failed."
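
One hedged way to get that feedback, sketched as a small helper rather than the actual patch: include the caught exception in the assertion message so a failure reports what was actually thrown.

{code:java}
import java.io.IOException;
import java.net.SocketException;

import org.junit.Assert;

public class NMProxyAssertSketch {

  /**
   * Fail with a message that names the actual exception instead of a bare
   * "assertion failed".
   */
  static void assertIsSocketException(IOException e) {
    Assert.assertTrue(
        "Expected a SocketException but caught: " + e,
        e instanceof SocketException);
  }
}
{code}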



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2018-10-05 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7652:
---
Attachment: YARN-7652.v1.patch

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-7652.v1.patch
>
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This jira proposes to move to async similar to how we 
> operate with allocate invocations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5597) YARN Federation improvements

2018-10-05 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-5597:
---
Attachment: YARN-7652.v1.patch

> YARN Federation improvements
> 
>
> Key: YARN-5597
> URL: https://issues.apache.org/jira/browse/YARN-5597
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>Priority: Major
>
> This umbrella JIRA tracks set of improvements over the YARN Federation MVP 
> (YARN-2915)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5597) YARN Federation improvements

2018-10-05 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-5597:
---
Attachment: (was: YARN-7652.v1.patch)

> YARN Federation improvements
> 
>
> Key: YARN-5597
> URL: https://issues.apache.org/jira/browse/YARN-5597
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>Priority: Major
>
> This umbrella JIRA tracks set of improvements over the YARN Federation MVP 
> (YARN-2915)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8760) [AMRMProxy] Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer

2018-10-01 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634695#comment-16634695
 ] 

Botong Huang commented on YARN-8760:


Thanks [~giovanni.fumarola] for the review and commit!

> [AMRMProxy] Fix concurrent re-register due to YarnRM failover in 
> AMRMClientRelayer
> --
>
> Key: YARN-8760
> URL: https://issues.apache.org/jira/browse/YARN-8760
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8760.v1.patch
>
>
> When the home YarnRM is failing over, the FinishApplicationMaster call from 
> the AM can have multiple retry threads outstanding in FederationInterceptor. 
> When the new YarnRM comes back up, all retry threads will re-register with 
> it. The first one will succeed, but the rest will get an "Application Master 
> is already registered" exception. We should catch and swallow this exception 
> and move on. 
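
A minimal sketch of the catch-and-swallow described above; matching on the exception message is a simplification for illustration, and the actual patch may key off a specific exception type:

{code:java}
import org.apache.hadoop.yarn.api.ApplicationMasterProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

public final class ReRegisterSketch {

  private ReRegisterSketch() {
  }

  /**
   * Re-register after an RM failover, tolerating the case where a concurrent
   * retry thread already won the race.
   */
  static RegisterApplicationMasterResponse reRegister(
      ApplicationMasterProtocol rmClient,
      RegisterApplicationMasterRequest request) throws Exception {
    try {
      return rmClient.registerApplicationMaster(request);
    } catch (YarnException e) {
      if (e.getMessage() != null
          && e.getMessage().contains("already registered")) {
        // Another retry thread re-registered first; safe to ignore.
        return null;
      }
      throw e;
    }
  }
}
{code}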



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8837) TestNMProxy.testNMProxyRPCRetry Improvement

2018-10-01 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634523#comment-16634523
 ] 

Botong Huang commented on YARN-8837:


Besides improving how the exception message is surfaced, can we try to fix this 
unit test as well? It is failing in trunk now. 

> TestNMProxy.testNMProxyRPCRetry Improvement
> ---
>
> Key: YARN-8837
> URL: https://issues.apache.org/jira/browse/YARN-8837
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: YARN-8789.1.patch
>
>
> The unit test 
> {{org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRetry()}}
>  has had some issues in the past. You can search JIRA for it, but one example 
> is [YARN-5104]. I recently had some issues with it myself and found the 
> following change helpful in troubleshooting.
> {code:java|title=Current Implementation}
> } catch (IOException e) {
>   // socket exception should be thrown immediately, without RPC retries.
>   Assert.assertTrue(e instanceof java.net.SocketException);
> }
> {code}
> The issue here is that the assertion is a bare true/false check. The testing 
> framework does not give any feedback about the type of exception that was 
> actually thrown; it just says "assertion failed."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8760) [AMRMProxy] Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer

2018-10-01 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634229#comment-16634229
 ] 

Botong Huang commented on YARN-8760:


TestNMProxy failure is irrelevant and is tracked under YARN-8837

> [AMRMProxy] Fix concurrent re-register due to YarnRM failover in 
> AMRMClientRelayer
> --
>
> Key: YARN-8760
> URL: https://issues.apache.org/jira/browse/YARN-8760
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8760.v1.patch
>
>
> When the home YarnRM is failing over, the FinishApplicationMaster call from 
> the AM can have multiple retry threads outstanding in FederationInterceptor. 
> When the new YarnRM comes back up, all retry threads will re-register with 
> it. The first one will succeed, but the rest will get an "Application Master 
> is already registered" exception. We should catch and swallow this exception 
> and move on. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8760) [AMRMProxy] Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer

2018-09-27 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8760:
---
Attachment: YARN-8760.v1.patch

> [AMRMProxy] Fix concurrent re-register due to YarnRM failover in 
> AMRMClientRelayer
> --
>
> Key: YARN-8760
> URL: https://issues.apache.org/jira/browse/YARN-8760
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8760.v1.patch
>
>
> When the home YarnRM is failing over, the FinishApplicationMaster call from 
> the AM can have multiple retry threads outstanding in FederationInterceptor. 
> When the new YarnRM comes back up, all retry threads will re-register with 
> it. The first one will succeed, but the rest will get an "Application Master 
> is already registered" exception. We should catch and swallow this exception 
> and move on. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-09-27 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630739#comment-16630739
 ] 

Botong Huang commented on YARN-8696:


Thanks [~giovanni.fumarola] for the review and commit!

> [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
> ---
>
> Key: YARN-8696
> URL: https://issues.apache.org/jira/browse/YARN-8696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8696-branch-2.v6.patch, YARN-8696.v1.patch, 
> YARN-8696.v2.patch, YARN-8696.v3.patch, YARN-8696.v4.patch, 
> YARN-8696.v5.patch, YARN-8696.v6.patch
>
>
> Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
> synchronous. After the heartbeat is sent out to the home sub-cluster, the 
> interceptor waits for the home response to come back before merging and 
> returning the (merged) heartbeat result to the AM. If the home sub-cluster is 
> suffering from connection issues, or is down during a YarnRM master-slave 
> switch, all heartbeat threads in _FederationInterceptor_ will be blocked 
> waiting for the home response. As a result, the successful UAM heartbeats 
> from secondary sub-clusters will not be returned to the AM at all. 
> Additionally, because we kept the same heartbeat responseId between the AM 
> and the home RM, a lot of tricky handling is needed for the responseId resync 
> during _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving 
> restart (YARN-6127, YARN-1336), home RM master-slave switch, etc. 
> In this patch, we change the heartbeat to the home sub-cluster to 
> asynchronous, the same way we handle UAM heartbeats in the secondaries, so 
> that a sub-cluster being down or having connection issues won't prevent the 
> AM from getting responses from other sub-clusters. The responseId is also 
> managed separately for the home sub-cluster and the AM, and they increment 
> independently. The resync logic becomes much cleaner. 
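
A hedged sketch of the asynchronous home heartbeat described above; allocateToHome and the single-field response cache are illustrative placeholders, not the actual FederationInterceptor implementation:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;

public abstract class AsyncHomeHeartbeatSketch {

  private final ExecutorService homeHeartbeatExecutor =
      Executors.newSingleThreadExecutor();

  // Latest response from the home sub-cluster, merged lazily into the
  // AM-facing response.
  private volatile AllocateResponse lastHomeResponse;

  /**
   * Send the home heartbeat off-thread so a slow or failing home RM cannot
   * block the merged response returned to the AM.
   */
  public void heartbeatHomeAsync(final AllocateRequest request) {
    homeHeartbeatExecutor.submit(() -> {
      try {
        // The home responseId is tracked inside allocateToHome, separately
        // from the responseId used between the AM and the interceptor.
        lastHomeResponse = allocateToHome(request);
      } catch (Exception e) {
        // Log; the AM still gets responses merged from the secondaries.
      }
    });
  }

  public AllocateResponse getLastHomeResponse() {
    return lastHomeResponse;
  }

  // Hypothetical placeholder for the actual allocate RPC to the home RM.
  protected abstract AllocateResponse allocateToHome(AllocateRequest request)
      throws Exception;
}
{code}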



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-09-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8696:
---
Attachment: YARN-8696-branch-2.v6.patch

> [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
> ---
>
> Key: YARN-8696
> URL: https://issues.apache.org/jira/browse/YARN-8696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8696-branch-2.v6.patch, YARN-8696.v1.patch, 
> YARN-8696.v2.patch, YARN-8696.v3.patch, YARN-8696.v4.patch, 
> YARN-8696.v5.patch, YARN-8696.v6.patch
>
>
> Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
> synchronous. After the heartbeat is sent out to the home sub-cluster, the 
> interceptor waits for the home response to come back before merging and 
> returning the (merged) heartbeat result to the AM. If the home sub-cluster is 
> suffering from connection issues, or is down during a YarnRM master-slave 
> switch, all heartbeat threads in _FederationInterceptor_ will be blocked 
> waiting for the home response. As a result, the successful UAM heartbeats 
> from secondary sub-clusters will not be returned to the AM at all. 
> Additionally, because we kept the same heartbeat responseId between the AM 
> and the home RM, a lot of tricky handling is needed for the responseId resync 
> during _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving 
> restart (YARN-6127, YARN-1336), home RM master-slave switch, etc. 
> In this patch, we change the heartbeat to the home sub-cluster to 
> asynchronous, the same way we handle UAM heartbeats in the secondaries, so 
> that a sub-cluster being down or having connection issues won't prevent the 
> AM from getting responses from other sub-clusters. The responseId is also 
> managed separately for the home sub-cluster and the AM, and they increment 
> independently. The resync logic becomes much cleaner. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-09-22 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8696:
---
Attachment: (was: YARN-8696-branch-2.v6.patch)

> [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
> ---
>
> Key: YARN-8696
> URL: https://issues.apache.org/jira/browse/YARN-8696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8696.v1.patch, YARN-8696.v2.patch, 
> YARN-8696.v3.patch, YARN-8696.v4.patch, YARN-8696.v5.patch, YARN-8696.v6.patch
>
>
> Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
> synchronous. After the heartbeat is sent out to the home sub-cluster, the 
> interceptor waits for the home response to come back before merging and 
> returning the (merged) heartbeat result to the AM. If the home sub-cluster is 
> suffering from connection issues, or is down during a YarnRM master-slave 
> switch, all heartbeat threads in _FederationInterceptor_ will be blocked 
> waiting for the home response. As a result, the successful UAM heartbeats 
> from secondary sub-clusters will not be returned to the AM at all. 
> Additionally, because we kept the same heartbeat responseId between the AM 
> and the home RM, a lot of tricky handling is needed for the responseId resync 
> during _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving 
> restart (YARN-6127, YARN-1336), home RM master-slave switch, etc. 
> In this patch, we change the heartbeat to the home sub-cluster to 
> asynchronous, the same way we handle UAM heartbeats in the secondaries, so 
> that a sub-cluster being down or having connection issues won't prevent the 
> AM from getting responses from other sub-clusters. The responseId is also 
> managed separately for the home sub-cluster and the AM, and they increment 
> independently. The resync logic becomes much cleaner. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-21 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang resolved YARN-7599.

Resolution: Fixed

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, 
> YARN-7599-YARN-7402.v8.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 
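
A rough sketch of the overall cleanup pass, with hypothetical hooks standing in for the Router query, the FederationStateStore and the Yarn Registry client:

{code:java}
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationId;

public abstract class ApplicationCleanerFlowSketch implements Runnable {

  @Override
  public void run() {
    // 1. Ask the Router(s) which applications YARN still knows about.
    Set<ApplicationId> knownApps = getKnownAppsFromRouter();

    // 2. Drop StateStore records for applications no longer known.
    for (ApplicationId appId : getAppsInStateStore()) {
      if (!knownApps.contains(appId)) {
        deleteFromStateStore(appId);
      }
    }

    // 3. Same pass over the Yarn Registry entries left behind by UAMs.
    for (ApplicationId appId : getAppsInRegistry()) {
      if (!knownApps.contains(appId)) {
        deleteFromRegistry(appId);
      }
    }
  }

  // Hypothetical hooks; names do not match the actual GPG code.
  protected abstract Set<ApplicationId> getKnownAppsFromRouter();
  protected abstract Set<ApplicationId> getAppsInStateStore();
  protected abstract Set<ApplicationId> getAppsInRegistry();
  protected abstract void deleteFromStateStore(ApplicationId appId);
  protected abstract void deleteFromRegistry(ApplicationId appId);
}
{code}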



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-21 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624388#comment-16624388
 ] 

Botong Huang commented on YARN-7599:


Committed to YARN-7402. Thanks [~bibinchundatt] for the review!

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, 
> YARN-7599-YARN-7402.v8.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-21 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624053#comment-16624053
 ] 

Botong Huang commented on YARN-7599:


Ah good point. v8 uploaded. Will commit pending on yetus. Thanks!

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, 
> YARN-7599-YARN-7402.v8.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-21 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7599:
---
Attachment: YARN-7599-YARN-7402.v8.patch

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, 
> YARN-7599-YARN-7402.v8.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-09-21 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8696:
---
Attachment: YARN-8696-branch-2.v6.patch

> [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
> ---
>
> Key: YARN-8696
> URL: https://issues.apache.org/jira/browse/YARN-8696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8696-branch-2.v6.patch, YARN-8696.v1.patch, 
> YARN-8696.v2.patch, YARN-8696.v3.patch, YARN-8696.v4.patch, 
> YARN-8696.v5.patch, YARN-8696.v6.patch
>
>
> Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
> synchronous. After the heartbeat is sent out to the home sub-cluster, the 
> interceptor waits for the home response to come back before merging and 
> returning the (merged) heartbeat result to the AM. If the home sub-cluster is 
> suffering from connection issues, or is down during a YarnRM master-slave 
> switch, all heartbeat threads in _FederationInterceptor_ will be blocked 
> waiting for the home response. As a result, the successful UAM heartbeats 
> from secondary sub-clusters will not be returned to the AM at all. 
> Additionally, because we kept the same heartbeat responseId between the AM 
> and the home RM, a lot of tricky handling is needed for the responseId resync 
> during _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving 
> restart (YARN-6127, YARN-1336), home RM master-slave switch, etc. 
> In this patch, we change the heartbeat to the home sub-cluster to 
> asynchronous, the same way we handle UAM heartbeats in the secondaries, so 
> that a sub-cluster being down or having connection issues won't prevent the 
> AM from getting responses from other sub-clusters. The responseId is also 
> managed separately for the home sub-cluster and the AM, and they increment 
> independently. The resync logic becomes much cleaner. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-20 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7599:
---
Attachment: YARN-7599-YARN-7402.v7.patch

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-09-19 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8696:
---
Attachment: YARN-8696.v6.patch

> [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
> ---
>
> Key: YARN-8696
> URL: https://issues.apache.org/jira/browse/YARN-8696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8696.v1.patch, YARN-8696.v2.patch, 
> YARN-8696.v3.patch, YARN-8696.v4.patch, YARN-8696.v5.patch, YARN-8696.v6.patch
>
>
> Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
> synchronous. After the heartbeat is sent out to the home sub-cluster, the 
> interceptor waits for the home response to come back before merging and 
> returning the (merged) heartbeat result to the AM. If the home sub-cluster is 
> suffering from connection issues, or is down during a YarnRM master-slave 
> switch, all heartbeat threads in _FederationInterceptor_ will be blocked 
> waiting for the home response. As a result, the successful UAM heartbeats 
> from secondary sub-clusters will not be returned to the AM at all. 
> Additionally, because we kept the same heartbeat responseId between the AM 
> and the home RM, a lot of tricky handling is needed for the responseId resync 
> during _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving 
> restart (YARN-6127, YARN-1336), home RM master-slave switch, etc. 
> In this patch, we change the heartbeat to the home sub-cluster to 
> asynchronous, the same way we handle UAM heartbeats in the secondaries, so 
> that a sub-cluster being down or having connection issues won't prevent the 
> AM from getting responses from other sub-clusters. The responseId is also 
> managed separately for the home sub-cluster and the AM, and they increment 
> independently. The resync logic becomes much cleaner. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-19 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-7599:
---
Attachment: YARN-7599-YARN-7402.v6.patch

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-19 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621216#comment-16621216
 ] 

Botong Huang commented on YARN-7599:


Thanks [~bibinchundatt] for the comment! v6 patch uploaded. 

bq. I was thinking of disabling cleaner while the GPG service is live
I see. Yeah, let's leave it as future work. For now, restarting GPG will do; it 
is an out-of-band service anyway. 

bq. Can you change to single configuration similar to 
dfs.http.client.retry.policy.spec {min,max,interval}
I already changed the new configs to something like 
application.cleaner.router.min.success. Is this what you meant? 

Somehow the link from the yetus run hasn't worked at all. I think the 
checkstyle run has some build issue. I just rebased the YARN-7402 base branch 
onto the latest trunk; let's see. 
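
For illustration only, the min/max-attempt style knobs could be set like the snippet below; the property names here just follow the pattern mentioned above and are placeholders, not the final keys from the patch:

{code:java}
import org.apache.hadoop.conf.Configuration;

public final class CleanerConfigSketch {

  private CleanerConfigSketch() {
  }

  static Configuration exampleCleanerConf() {
    Configuration conf = new Configuration();
    // Hypothetical keys modeled on "application.cleaner.router.min.success".
    conf.setInt("application.cleaner.router.min.success", 3);
    conf.setInt("application.cleaner.router.max.attempts", 10);
    return conf;
  }
}
{code}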

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in the 
> ApplicationCleaner in GPG. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-09-19 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621060#comment-16621060
 ] 

Botong Huang commented on YARN-8696:


Unit test failure in TestCapacityOverTimePolicy is irrelevant. 

> [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
> ---
>
> Key: YARN-8696
> URL: https://issues.apache.org/jira/browse/YARN-8696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8696.v1.patch, YARN-8696.v2.patch, 
> YARN-8696.v3.patch, YARN-8696.v4.patch, YARN-8696.v5.patch
>
>
> Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
> synchronous. After the heartbeat is sent out to the home sub-cluster, the 
> interceptor waits for the home response to come back before merging and 
> returning the (merged) heartbeat result to the AM. If the home sub-cluster is 
> suffering from connection issues, or is down during a YarnRM master-slave 
> switch, all heartbeat threads in _FederationInterceptor_ will be blocked 
> waiting for the home response. As a result, the successful UAM heartbeats 
> from secondary sub-clusters will not be returned to the AM at all. 
> Additionally, because we kept the same heartbeat responseId between the AM 
> and the home RM, a lot of tricky handling is needed for the responseId resync 
> during _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving 
> restart (YARN-6127, YARN-1336), home RM master-slave switch, etc. 
> In this patch, we change the heartbeat to the home sub-cluster to 
> asynchronous, the same way we handle UAM heartbeats in the secondaries, so 
> that a sub-cluster being down or having connection issues won't prevent the 
> AM from getting responses from other sub-clusters. The responseId is also 
> managed separately for the home sub-cluster and the AM, and they increment 
> independently. The resync logic becomes much cleaner. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


