[jira] [Commented] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723495#comment-17723495 ] Botong Huang commented on YARN-7720: Yeah, as I said in the initial description as well as in the v1 patch, I think the easiest way is to change the timeout config. > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Shilun Fan >Priority: Major > Attachments: YARN-7720.v1.patch, YARN-7720.v2.patch > > > In Federation, multiple attempts of an application share the same UAM in each > secondary sub-cluster. When the first attempt fails, we rely on the fact that > the secondary RM won't kill the existing UAM before the AM heartbeat timeout > (default 10 min). When the second attempt comes up in the home sub-cluster, it > will pick up the UAM token from the Yarn Registry and resume the UAM heartbeat to > the secondary RMs. > The default heartbeat timeouts for both NM and AM are 10 min. The problem is > that when the first attempt's node goes down or loses its connection, only after > 10 min will the home RM mark the first attempt as failed and then schedule > the 2nd attempt on some other node. By then the UAMs in the secondaries are > already timing out, and they might not survive until the second attempt comes > up. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
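As a concrete illustration of the timeout-config mitigation mentioned in the comment above (the value is illustrative only, and this is not necessarily what the v1/v2 patches do): raising the AM expiry interval on the secondary sub-clusters' RMs keeps the UAM alive past the home sub-cluster's 10-minute NM expiry, e.g. in yarn-site.xml:

```xml
<!-- yarn-site.xml on each secondary sub-cluster RM. The key below is the
     standard AM liveness expiry from yarn-default.xml; the 20-minute value
     is only an example and should exceed the home sub-cluster's NM expiry
     (yarn.nm.liveness-monitor.expiry-interval-ms, default 600000 ms). -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>1200000</value>
</property>
```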
[jira] [Comment Edited] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572705#comment-17572705 ] Botong Huang edited comment on YARN-7720 at 5/17/23 2:42 PM: - I will continue to follow up on this pr. was (Author: slfan1989): I will continue to follow up on this pr. > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720
[jira] [Commented] (YARN-7899) [AMRMProxy] Stateful FederationInterceptor for pending requests
[ https://issues.apache.org/jira/browse/YARN-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652348#comment-17652348 ] Botong Huang commented on YARN-7899: [~walhl] "cancel pending request in one sub-cluster and re-send it to other sub-clusters" is not done yet. > [AMRMProxy] Stateful FederationInterceptor for pending requests > --- > > Key: YARN-7899 > URL: https://issues.apache.org/jira/browse/YARN-7899 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Labels: amrmproxy, federation > Fix For: 3.2.0 > > Attachments: YARN-7899-branch-2.v3.patch, YARN-7899.v1.patch, > YARN-7899.v3.patch > > > Today FederationInterceptor (in AMRMProxy for YARN Federation) is stateless > in terms of pending (outstanding) requests. Whenever the AM issues new requests, > FI simply splits and sends them to the sub-cluster YarnRMs and forgets about them. > This JIRA attempts to make FI stateful so that it remembers the pending > requests in all relevant sub-clusters. This has two major benefits: > 1. It is a prerequisite for FI to be able to cancel a pending request in one > sub-cluster and re-send it to other sub-clusters. This is needed for load > balancing and to fully comply with the relax-locality fallback-to-ANY > semantics. When we send a request to one sub-cluster, we have effectively > constrained the allocation for this request to be within this sub-cluster > rather than anywhere. If the cluster capacity in this sub-cluster for this > app is full, or this YarnRM is overloaded and slow, the request will be stuck > there for a long time even if there is free capacity in other sub-clusters. > We need FI to remember and adjust the pending requests on the fly. > 2. It makes pending request recovery easier when a YarnRM fails over. 
Today, > whenever one sub-cluster RM fails over, in order to recover the lost pending > requests for this sub-cluster, > we have to propagate the ApplicationMasterNotRegisteredException from the > YarnRM back to the AM, triggering a full pending resend from the AM. This contains > pending requests not only for the failing-over sub-cluster, but for every sub-cluster. Since our > split-merge (AMRMProxyPolicy) does not guarantee idempotency, the same > request we sent to sub-cluster-1 earlier might be resent to sub-cluster-2. If > neither of these YarnRMs has failed over, they will both allocate for this > request, leading to over-allocation. These full pending resends also > put unnecessary load on every YarnRM in the cluster every time one YarnRM > fails over. With a stateful FederationInterceptor, since we remember the pending > requests we have sent out earlier, we can shield the > ApplicationMasterNotRegisteredException from the AM and resend the pending requests only to > the failed-over YarnRM. This eliminates over-allocation and minimizes the > recovery overhead.
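The per-sub-cluster bookkeeping described above can be sketched as follows. This is a hypothetical illustration (the class and method names are mine, not the actual FederationInterceptor API): because the interceptor remembers what it sent where, an ApplicationMasterNotRegisteredException from one YarnRM can be absorbed and only that sub-cluster's pending set resent, instead of forcing the AM into a full resend that other sub-clusters would double-allocate.

```java
import java.util.*;

// Hypothetical sketch of stateful pending-request tracking per sub-cluster.
class PendingRequestTracker {
  // subClusterId -> requests sent there and not yet satisfied
  private final Map<String, List<String>> pending = new HashMap<>();

  void recordSent(String subCluster, String request) {
    pending.computeIfAbsent(subCluster, k -> new ArrayList<>()).add(request);
  }

  void markSatisfied(String subCluster, String request) {
    List<String> reqs = pending.get(subCluster);
    if (reqs != null) {
      reqs.remove(request);
    }
  }

  // On failover of one sub-cluster's RM, resend only its own pending set;
  // the AM never sees the exception and no other RM gets duplicate requests.
  List<String> toResendAfterFailover(String subCluster) {
    return new ArrayList<>(pending.getOrDefault(subCluster, Collections.emptyList()));
  }

  public static void main(String[] args) {
    PendingRequestTracker t = new PendingRequestTracker();
    t.recordSent("sc-1", "container-req-A");
    t.recordSent("sc-2", "container-req-B");
    t.markSatisfied("sc-2", "container-req-B");
    // sc-1 failed over: only container-req-A needs resending; sc-2 is untouched.
    System.out.println(t.toResendAfterFailover("sc-1")); // prints [container-req-A]
    System.out.println(t.toResendAfterFailover("sc-2")); // prints []
  }
}
```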
[jira] [Commented] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572711#comment-17572711 ] Botong Huang commented on YARN-7720: [~slfan1989] yes please > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720
[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode
[ https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957445#comment-16957445 ] Botong Huang commented on YARN-9689: +1 lgtm > Router does not support kerberos proxy when in secure mode > -- > > Key: YARN-9689 > URL: https://issues.apache.org/jira/browse/YARN-9689 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9689.001.patch > > > When we enable Kerberos in YARN-Federation mode, we cannot get a new app, since > it will throw the Kerberos exception below, which should be handled! > {code:java} > 2019-07-22,18:43:25,523 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > 2019-07-22,18:43:25,528 WARN > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor: > Unable to create a new ApplicationId in SubCluster xxx > java.io.IOException: DestHost:destPort xxx , LocalHost:localPort xxx. 
Failed > on local exception: java.io.IOException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Failed to find any Kerberos tgt)] > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1564) > at org.apache.hadoop.ipc.Client.call(Client.java:1506) > at org.apache.hadoop.ipc.Client.call(Client.java:1416) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy91.getNewApplication(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:274) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy92.getNewApplication(Unknown Source) > at > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.getNewApplication(FederationClientInterceptor.java:252) > at > org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.getNewApplication(RouterClientRMService.java:218) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getNewApplication(ApplicationClientProtocolPBServiceImpl.java:263) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:559) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:525) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:992) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:831) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at org.apache.
[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937004#comment-16937004 ] Botong Huang commented on YARN-7599: Hi [~qiuliang988], did you mean "GPG couldn't parse this XML" from the Router? Take a look at _minRouterSuccessCount_ in this patch: by default, only after GPG has pulled from the Router successfully three times will it go ahead and delete things. > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, > YARN-7599-YARN-7402.v8.patch > > > In Federation, we need a cleanup service for the StateStore as well as the Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanups in the > ApplicationCleaner in GPG
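The _minRouterSuccessCount_ guard mentioned in the comment above can be sketched like this. All names here are illustrative (not the patch's actual API): the cleaner only deletes registry entries after it has fetched the running-application list from the Router successfully N times, so a single bad or unparseable Router response never triggers deletion.

```java
import java.util.*;

// Hypothetical sketch of the minRouterSuccessCount guard in ApplicationCleaner.
class AppCleanerGuard {
  private final int minRouterSuccessCount;
  private int successCount = 0;
  // Union of every app reported as running by any successful pull.
  private final Set<String> knownRunning = new HashSet<>();

  AppCleanerGuard(int minRouterSuccessCount) {
    this.minRouterSuccessCount = minRouterSuccessCount;
  }

  // null models a failed/unparseable pull from the Router; it resets the streak.
  void onRouterPoll(Set<String> runningAppsOrNull) {
    if (runningAppsOrNull == null) {
      successCount = 0;
      return;
    }
    successCount++;
    knownRunning.addAll(runningAppsOrNull);
  }

  // Registry entries safe to delete: computed only once enough consecutive
  // successful pulls have been seen; otherwise delete nothing.
  List<String> appsToDelete(Set<String> registryApps) {
    if (successCount < minRouterSuccessCount) {
      return Collections.emptyList();
    }
    List<String> stale = new ArrayList<>(registryApps);
    stale.removeAll(knownRunning);
    return stale;
  }

  public static void main(String[] args) {
    AppCleanerGuard g = new AppCleanerGuard(3);
    g.onRouterPoll(new HashSet<>(Arrays.asList("app-1")));
    g.onRouterPoll(null); // one bad pull resets the streak
    // Only one success since the failure: not confident, delete nothing.
    System.out.println(g.appsToDelete(new HashSet<>(Arrays.asList("app-1", "app-2")))); // prints []
  }
}
```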
[jira] [Assigned] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang reassigned YARN-7652: -- Assignee: Botong Huang (was: hunshenshi) > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang >Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: YARN-7652.v1.patch, YARN-7652.v2.patch > > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async, similar to how we > handle allocate invocations.
[jira] [Commented] (YARN-9689) Router does not support kerberos proxy when in secure mode
[ https://issues.apache.org/jira/browse/YARN-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890402#comment-16890402 ] Botong Huang commented on YARN-9689: +[~giovanni.fumarola] for help > Router does not support kerberos proxy when in secure mode > -- > > Key: YARN-9689 > URL: https://issues.apache.org/jira/browse/YARN-9689
[jira] [Updated] (YARN-9108) fix FederationIntercepter merge home and secondary allocate response typo
[ https://issues.apache.org/jira/browse/YARN-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-9108: --- Fix Version/s: 3.3.0 2.10.0 > fix FederationIntercepter merge home and secondary allocate response typo > - > > Key: YARN-9108 > URL: https://issues.apache.org/jira/browse/YARN-9108 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.3.0 >Reporter: Morty Zhong >Assignee: Abhishek Modi >Priority: Minor > Fix For: 2.10.0, 3.3.0 > > Attachments: YARN-9108.001.patch, YARN-9108.002.patch, > YARN-9108.003.patch, YARN-9108.004.patch, YARN-9108.005.patch, > YARN-9108.006.patch > > > method 'mergeAllocateResponse' in class FederationIntercepter.java line 1315 > the left variable `par2` should be `par1` > {code:java} > if (par1 != null && par2 != null) { > par1.getResourceRequest().addAll(par2.getResourceRequest()); > par2.getContainers().addAll(par2.getContainers()); > } > {code} > should be > {code:java} > if (par1 != null && par2 != null) { > par1.getResourceRequest().addAll(par2.getResourceRequest()); > par1.getContainers().addAll(par2.getContainers());//edited line > } > {code}
[jira] [Commented] (YARN-9108) fix FederationIntercepter merge home and secondary allocate response typo
[ https://issues.apache.org/jira/browse/YARN-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16727565#comment-16727565 ] Botong Huang commented on YARN-9108: +1. Committing to trunk and branch-2. Thanks [~Cedar] and [~abmodi] for the contribution and [~goiri] for reviewing. > fix FederationIntercepter merge home and secondary allocate response typo > - > > Key: YARN-9108 > URL: https://issues.apache.org/jira/browse/YARN-9108
[jira] [Updated] (YARN-9108) fix FederationIntercepter merge home and secondary allocate response typo
[ https://issues.apache.org/jira/browse/YARN-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-9108: --- Summary: fix FederationIntercepter merge home and secondary allocate response typo (was: FederationIntercepter merge home and second response local variable spell mistake) > fix FederationIntercepter merge home and secondary allocate response typo > - > > Key: YARN-9108 > URL: https://issues.apache.org/jira/browse/YARN-9108
[jira] [Commented] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710981#comment-16710981 ] Botong Huang commented on YARN-9013: Thanks [~giovanni.fumarola] for the review. Committing to YARN-7402. > [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner > > > Key: YARN-9013 > URL: https://issues.apache.org/jira/browse/YARN-9013 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-9013-YARN-7402.v1.patch, > YARN-9013-YARN-7402.v2.patch > > > ApplicationCleaner today deletes the entry for every finished (non-running) > application in the Yarn Registry using this logic: > # GPG gets the list of running applications from the Router. > # GPG gets the full list of applications in the registry. > # GPG deletes from the registry every app in 2 that's not in 1. > The problem is that jobs that started between 1 and 2 meet the criteria in > 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, > rather than 1->2->3.
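The reordering described above can be sketched as follows (illustrative names, not GPG's actual code). Taking the registry snapshot BEFORE the running-application list means an app that starts in between either is absent from the snapshot (so there is nothing to delete) or present in the running list (so it is excluded from deletion); it can no longer fall through the crack.

```java
import java.util.*;

// Minimal sketch of why the 2 -> 1 -> 3 ordering is safe in ApplicationCleaner.
class RegistryCleanerOrder {
  // Step 3: delete only registry entries that are not currently running.
  static List<String> computeDeletions(Set<String> registrySnapshot,
                                       Set<String> runningApps) {
    List<String> toDelete = new ArrayList<>(registrySnapshot);
    toDelete.removeAll(runningApps);
    return toDelete;
  }

  public static void main(String[] args) {
    // Step 2 first: snapshot the registry.
    Set<String> registry = new HashSet<>(Arrays.asList("finishedApp"));
    // Step 1 second: an app submitted *after* the snapshot shows up in the
    // running list but not in the snapshot, so it cannot be deleted.
    Set<String> running = new HashSet<>(Arrays.asList("justStartedApp"));
    System.out.println(computeDeletions(registry, running)); // prints [finishedApp]
  }
}
```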
[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7720: --- Attachment: YARN-7720.v2.patch > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720
[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7720: --- Attachment: YARN-9013-YARN-7402.v2.patch > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720
[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7720: --- Attachment: (was: YARN-9013-YARN-7402.v2.patch) > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720
[jira] [Updated] (YARN-7720) Race condition between second app attempt and UAM timeout when first attempt node is down
[ https://issues.apache.org/jira/browse/YARN-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7720: --- Attachment: YARN-7720.v1.patch > Race condition between second app attempt and UAM timeout when first attempt > node is down > - > > Key: YARN-7720 > URL: https://issues.apache.org/jira/browse/YARN-7720
[jira] [Commented] (YARN-9049) Add application submit data to state store
[ https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703500#comment-16703500 ] Botong Huang commented on YARN-9049: Understood. Let me ask this way: forget about implementation details. In general, why would adding a future app data entry in _ApplicationData_ be easier than adding it directly in _ApplicationHomeSubCluster_? I think at the API/interface level, the latter makes more sense, because _ApplicationHomeSubCluster_ should already serve as _ApplicationData_, if not be renamed to it. If the former is easier for the MySQL implementation (by adding an extra layer), then shouldn't this be kept implementation specific and not exposed in the API? > Add application submit data to state store > -- > > Key: YARN-9049 > URL: https://issues.apache.org/jira/browse/YARN-9049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > Attachments: YARN-9049.001.path > > > As per the discussion in YARN-8898 we need to persist trimmed > ApplicationSubmissionContext details to the federation State Store.
[jira] [Comment Edited] (YARN-9049) Add application submit data to state store
[ https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703500#comment-16703500 ] Botong Huang edited comment on YARN-9049 at 11/29/18 4:57 PM: -- Understood. Let me ask this way: forget about implementation details. In general, why would adding a future app data entry in _ApplicationData_ be easier than adding it directly in _ApplicationHomeSubCluster_? I think at the API/interface level, the latter makes more sense, because _ApplicationHomeSubCluster_ should already serve as _ApplicationData_ (if not be renamed to it). If the former is easier for the MySQL implementation (by adding an extra layer), then shouldn't this be kept implementation specific and not exposed in the API? was (Author: botong): Understood. Let me ask this way: forget about implementation details. In general, why would adding a future app data entry in _ApplicationData_ be easier than adding it directly in _ApplicationHomeSubCluster_? I think at the API/interface level, the latter makes more sense, because _ApplicationHomeSubCluster_ should already serve as _ApplicationData_, if not be renamed to it. If the former is easier for the MySQL implementation (by adding an extra layer), then shouldn't this be kept implementation specific and not exposed in the API? > Add application submit data to state store > -- > > Key: YARN-9049 > URL: https://issues.apache.org/jira/browse/YARN-9049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > Attachments: YARN-9049.001.path > > > As per the discussion in YARN-8898 we need to persist trimmed > ApplicationSubmissionContext details to the federation State Store.
[jira] [Commented] (YARN-8934) [GPG] Add JvmMetricsInfo and pause monitor
[ https://issues.apache.org/jira/browse/YARN-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702496#comment-16702496 ] Botong Huang commented on YARN-8934: Thanks [~BilwaST] for the patch. Committing to YARN-7402. > [GPG] Add JvmMetricsInfo and pause monitor > -- > > Key: YARN-8934 > URL: https://issues.apache.org/jira/browse/YARN-8934 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-8934-001.patch, YARN-8934-YARN-7402.v1.patch, > YARN-8934-YARN-7402.v2.patch, YARN-8934-YARN-7402.v3.patch, > YARN-8934-YARN-7402.v4.patch, YARN-8934-YARN-7402.v5.patch, > image-2018-11-19-15-37-18-647.png > > > Similar to the resourcemanager and nodemanager services, we can add JvmMetricsInfo > to the gpg service.
[jira] [Commented] (YARN-9049) Add application submit data to state store
[ https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702430#comment-16702430 ] Botong Huang commented on YARN-9049: Thanks [~bibinchundatt] for the patch! I just have one question: why not add _ApplicationSubmissionContext_ directly in _ApplicationHomeSubCluster_, instead of wrapping it with _ApplicationData_ first? Every entry of _ApplicationHomeSubCluster_ stores all info about an app, including its home subcluster id. I think it should already serve as _ApplicationData_, if not be renamed to it. > Add application submit data to state store > -- > > Key: YARN-9049 > URL: https://issues.apache.org/jira/browse/YARN-9049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > Attachments: YARN-9049.001.path > > > As per the discussion in YARN-8898 we need to persist trimmed > ApplicationSubmissionContext details to the federation State Store.
[jira] [Commented] (YARN-8934) [GPG] Add JvmMetricsInfo and pause monitor
[ https://issues.apache.org/jira/browse/YARN-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700980#comment-16700980 ] Botong Huang commented on YARN-8934: Overall lgtm, one more small thing besides Bibin's comments: change GPG_WEBAPP_ENABLE_CORS_FILTER to use/start with GPG_WEBAPP_PREFIX > [GPG] Add JvmMetricsInfo and pause monitor > -- > > Key: YARN-8934 > URL: https://issues.apache.org/jira/browse/YARN-8934 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-8934-001.patch, YARN-8934-YARN-7402.v1.patch, > YARN-8934-YARN-7402.v2.patch, YARN-8934-YARN-7402.v3.patch, > YARN-8934-YARN-7402.v4.patch, image-2018-11-19-15-37-18-647.png > > > Similar to the resourcemanager and nodemanager services, we can add JvmMetricsInfo > to the gpg service.
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685911#comment-16685911 ] Botong Huang commented on YARN-8898: +[~giovanni.fumarola] as well > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-8898.wip.patch > > > FederationInterceptor#mergeAllocateResponses skips > application_priority in the returned response
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685637#comment-16685637 ] Botong Huang commented on YARN-8898: For the record, I was leaning towards Solution 2 later in the discussion :) {quote}Anyways, it might be cleaner to go for Solution 2. {quote} In _FederationStateStore_ there's already an application table, so we can piggyback on it (_ApplicationHomeSubCluster_). I think for future compatibility we should just put the _ApplicationSubmissionContext_ object in it, rather than creating a new trimmed type. If by trimming you meant setting some of the entries to null, then sure, by all means. > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-8898.wip.patch > > > FederationInterceptor#mergeAllocateResponses skips > application_priority in the returned response
[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-9013: --- Attachment: YARN-9013-YARN-7402.v2.patch > [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner > > > Key: YARN-9013 > URL: https://issues.apache.org/jira/browse/YARN-9013 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-9013-YARN-7402.v1.patch, > YARN-9013-YARN-7402.v2.patch > > > ApplicationCleaner today deletes the entries for all finished (non-running) > applications in YarnRegistry using this logic: > # GPG gets the list of running applications from the Router. > # GPG gets the full list of applications in the registry. > # GPG deletes from the registry every app in 2 that’s not in 1. > The problem is that jobs that started between 1 and 2 meet the criteria in > 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, > rather than 1->2->3.
[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-9013: --- Attachment: YARN-9013-YARN-7402.v1.patch > [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner > > > Key: YARN-9013 > URL: https://issues.apache.org/jira/browse/YARN-9013 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-9013-YARN-7402.v1.patch > > > ApplicationCleaner today deletes the entries for all finished (non-running) > applications in YarnRegistry using this logic: > # GPG gets the list of running applications from the Router. > # GPG gets the full list of applications in the registry. > # GPG deletes from the registry every app in 2 that’s not in 1. > The problem is that jobs that started between 1 and 2 meet the criteria in > 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, > rather than 1->2->3.
[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-9013: --- Parent Issue: YARN-7402 (was: YARN-5597) > [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner > > > Key: YARN-9013 > URL: https://issues.apache.org/jira/browse/YARN-9013 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > ApplicationCleaner today deletes the entries for all finished (non-running) > applications in YarnRegistry using this logic: > # GPG gets the list of running applications from the Router. > # GPG gets the full list of applications in the registry. > # GPG deletes from the registry every app in 2 that’s not in 1. > The problem is that jobs that started between 1 and 2 meet the criteria in > 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, > rather than 1->2->3.
[jira] [Updated] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-9013: --- Issue Type: Sub-task (was: Task) Parent: YARN-5597 > [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner > > > Key: YARN-9013 > URL: https://issues.apache.org/jira/browse/YARN-9013 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > ApplicationCleaner today deletes the entries for all finished (non-running) > applications in YarnRegistry using this logic: > # GPG gets the list of running applications from the Router. > # GPG gets the full list of applications in the registry. > # GPG deletes from the registry every app in 2 that’s not in 1. > The problem is that jobs that started between 1 and 2 meet the criteria in > 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, > rather than 1->2->3.
[jira] [Created] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
Botong Huang created YARN-9013: -- Summary: [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner Key: YARN-9013 URL: https://issues.apache.org/jira/browse/YARN-9013 Project: Hadoop YARN Issue Type: Task Reporter: Botong Huang Assignee: Botong Huang ApplicationCleaner today deletes the entries for all finished (non-running) applications in YarnRegistry using this logic: # GPG gets the list of running applications from the Router. # GPG gets the full list of applications in the registry. # GPG deletes from the registry every app in 2 that’s not in 1. The problem is that jobs that started between 1 and 2 meet the criteria in 3, and thus get deleted by mistake. The fix/right order should be 2->1->3, rather than 1->2->3.
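The ordering argument above can be sketched in a few lines. This is a hypothetical illustration, not the actual GPG code (a real implementation talks to the Router and Yarn Registry); the point is that taking the registry snapshot before the running-apps snapshot makes the race benign: an app that starts in between is either absent from the registry snapshot (never considered) or present in the running set (never deleted).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the ApplicationCleaner ordering fix (2 -> 1 -> 3).
public class ApplicationCleanerSketch {

  // registrySnapshot must be taken BEFORE runningApps is fetched.
  // Deleting registrySnapshot \ runningApps then only removes apps that
  // were already in the registry and are confirmed not running.
  public static Set<String> appsToDelete(Set<String> registrySnapshot,
      Set<String> runningApps) {
    Set<String> toDelete = new HashSet<>(registrySnapshot);
    toDelete.removeAll(runningApps); // set difference: finished apps only
    return toDelete;
  }
}
```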
[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682964#comment-16682964 ] Botong Huang commented on YARN-8933: Thanks [~bibinchundatt] for the comments and review, committing to trunk and branch-2 > [AMRMProxy] Fix potential empty fields in allocation response, move > SubClusterTimeout to FederationInterceptor > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation.
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Fix Version/s: 3.3.0 2.10.0 > [AMRMProxy] Fix potential empty fields in allocation response, move > SubClusterTimeout to FederationInterceptor > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation.
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682962#comment-16682962 ] Botong Huang commented on YARN-8980: I agree. I am also worried about container leaks, since the new attempt (old) AM is not even aware of the existing containers from the UAMs. Note that the RM only supports one attempt for a UAM, and this UAM attempt is used throughout all AM attempts in the home SC. I think on top of 1 you mentioned (clear token cache in RM), _FederationInterceptor_ needs to know the _keepContainer_ flag of the original AM. If it is false, after reattaching to the UAMs in _registerApplicationMaster_ it needs to release all running containers from the UAMs. > Mapreduce application container start fail after AM restart. > - > > Key: YARN-8980 > URL: https://issues.apache.org/jira/browse/YARN-8980 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Priority: Major > > UAMs to subclusters are always launched with keepContainers. > On AM restart, the UAM registers again with the RM and receives running > containers with NMTokens. The NMTokens received by the UAM in > getPreviousAttemptContainersNMToken are never used by the mapreduce application. > Federation Interceptor should take care of such scenarios too: merge the NMTokens > received at registration into the allocate response. > Container allocation responses on the same node will have an empty NMToken. >
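The cleanup proposed in the comment above can be sketched as follows. All names here are hypothetical illustrations, not the actual FederationInterceptor code: a real implementation would read keepContainersAcrossApplicationAttempts from the app's ApplicationSubmissionContext and issue release requests to each UAM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: after a new AM attempt reattaches to existing UAMs
// in registerApplicationMaster, decide which previous-attempt containers
// reported by the UAMs must be released so none of them leak.
public class UamReattachSketch {

  // keepContainers: the original AM's keep-containers-across-attempts flag.
  // runningUamContainers: container ids the UAMs report as still running.
  public static List<String> containersToRelease(boolean keepContainers,
      List<String> runningUamContainers) {
    if (keepContainers) {
      // Previous-attempt containers survive; nothing to release.
      return new ArrayList<>();
    }
    // The new attempt is unaware of these containers: release them all.
    return new ArrayList<>(runningUamContainers);
  }
}
```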
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682485#comment-16682485 ] Botong Huang commented on YARN-8980: Thanks [~bibinchundatt] for reporting. This is in line with the discussion we are having in YARN-8898. Basically it is better to use the original _ApplicationSubmissionContext_ for the app when launching the UAMs. We will probably need to go with Solution 2 discussed there: push applicationSubmissionContext also to federationStore at the Router side. [~subru] what do you think? > Mapreduce application container start fail after AM restart. > - > > Key: YARN-8980 > URL: https://issues.apache.org/jira/browse/YARN-8980 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Priority: Major > > UAMs to subclusters are always launched with keepContainers. > On AM restart, the UAM registers again with the RM and receives running > containers with NMTokens. The NMTokens received by the UAM in > getPreviousAttemptContainersNMToken are never used by the mapreduce application. > Federation Interceptor should take care of such scenarios too: merge the NMTokens > received at registration into the allocate response. > Container allocation responses on the same node will have an empty NMToken. >
[jira] [Comment Edited] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680622#comment-16680622 ] Botong Huang edited comment on YARN-8933 at 11/8/18 11:53 PM: -- Good questions, there are several aspects: # When we try to span to a new SC, we deliberately put (current time - subcluster timeout) into the map so that initially it is considered expired, because the async UAM launch/reattach might fail/take a long time. We don't want to consider this SC as available/healthy and start routing resource requests there until we know for sure that it is ready (received a heartbeat response from it). In fact, if the UAM launch fails, we will keep trying in the background (triggered by new AM heartbeats). Without being initialized as expired, this SC would become a black-hole sink for container requests. # What you mentioned is possible: in some corner cases, for one AM heartbeat we consider the subcluster expired/unhealthy. However, note that all we do is not route new resource requests to this SC for this heartbeat only. A heartbeat without new resource requests will still be sent out to this SC, and if we get a response successfully, it most likely won't be marked as expired next time. # Initializing the lastAMHeartbeatTime as -1 as a special value would work. I didn't do this because _MonotonicClock.getTime()_ can return a negative value as well as -1 (as opposed to System.currentTimeMillis(), which is always positive). I think initializing the lastAMHeartbeatTime in the constructor is easier and would work as well. was (Author: botong): Good questions, there are several aspects: # When we try to span to a new SC, we deliberately put (current time - subcluster timeout) into the map so that initially it is considered expired, because the async UAM launch/reattach might fail/take a long time.
We don't want to consider this SC as available/healthy and start routing resource requests there until we know for sure that it is ready (received a heartbeat response from it). In fact, if the UAM launch fails, we will keep trying in the background (triggered by new AM heartbeats). Without being initialized as expired, this SC would become a black-hole sink for container requests. # What you mentioned is possible: in some corner cases, for one AM heartbeat we consider the subcluster expired/unhealthy. However, note that all we do is not route new resource requests to this SC for this heartbeat only. A heartbeat without new resource requests will still be sent out to this SC, and if we get a response successfully, it most likely won't be marked as expired next time. # Initializing the lastheartbeat as -1 as a special value would work. I didn't do this because _MonotonicClock.getTime()_ can return a negative value as well as -1 (as opposed to System.currentTimeMillis(), which is always positive). I think initializing the lastAMHeartbeatTime in the constructor is easier and would work as well. > [AMRMProxy] Fix potential empty fields in allocation response, move > SubClusterTimeout to FederationInterceptor > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses.
> In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation.
[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680622#comment-16680622 ] Botong Huang commented on YARN-8933: Good questions, there are several aspects: # When we try to span to a new SC, we deliberately put (current time - subcluster timeout) into the map so that initially it is considered expired, because the async UAM launch/reattach might fail/take a long time. We don't want to consider this SC as available/healthy and start routing resource requests there until we know for sure that it is ready (received a heartbeat response from it). In fact, if the UAM launch fails, we will keep trying in the background (triggered by new AM heartbeats). Without being initialized as expired, this SC would become a black-hole sink for container requests. # What you mentioned is possible: in some corner cases, for one AM heartbeat we consider the subcluster expired/unhealthy. However, note that all we do is not route new resource requests to this SC for this heartbeat only. A heartbeat without new resource requests will still be sent out to this SC, and if we get a response successfully, it most likely won't be marked as expired next time. # Initializing the lastheartbeat as -1 as a special value would work. I didn't do this because _MonotonicClock.getTime()_ can return a negative value as well as -1 (as opposed to System.currentTimeMillis(), which is always positive). I think initializing the lastAMHeartbeatTime in the constructor is easier and would work as well.
> [AMRMProxy] Fix potential empty fields in allocation response, move > SubClusterTimeout to FederationInterceptor > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation.
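The expiry bookkeeping described in points 1-3 of the comment above can be sketched as follows. This is a hypothetical class, not the actual FederationInterceptor code, and it takes timestamps as parameters for testability where the real code would use a MonotonicClock; the key idea is seeding a newly spanned sub-cluster with (now - timeout) so it counts as expired until its first successful heartbeat response.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-sub-cluster heartbeat expiry tracking.
public class SubClusterExpirySketch {
  private final long timeoutMs;
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  public SubClusterExpirySketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Seed as already expired: the async UAM launch/reattach may fail or be
  // slow, and we must not route resource requests there until it responds.
  public void registerSubCluster(String id, long nowMs) {
    lastHeartbeat.put(id, nowMs - timeoutMs);
  }

  // A successful allocate response marks the sub-cluster healthy again.
  public void onHeartbeatResponse(String id, long nowMs) {
    lastHeartbeat.put(id, nowMs);
  }

  // Expired sub-clusters receive heartbeats but no new resource requests.
  public boolean isExpired(String id, long nowMs) {
    Long last = lastHeartbeat.get(id);
    return last == null || nowMs - last >= timeoutMs;
  }
}
```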
[jira] [Comment Edited] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675623#comment-16675623 ] Botong Huang edited comment on YARN-8933 at 11/8/18 12:25 AM: -- Ah good catch, and thx for reviewing! Can you explain a bit what you mean by test recover case? There's already a _testRecoverWith(out)AMRMProxyHA_ in _TestFederationInterceptor_. was (Author: botong): Ah good catch, and thx for reviewing! > [AMRMProxy] Fix potential empty fields in allocation response, move > SubClusterTimeout to FederationInterceptor > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation.
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680065#comment-16680065 ] Botong Huang commented on YARN-8984: bq. ContainerPBImpl#getAllocationTags() will new a empty hashSet when the tag is null. We should not assume the implementation (ContainerPBImpl) will do so in general. Other implementations of Container in the future might still return null. So please keep the null check here. You can remove the isEmpty() check if needed. > AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty > -- > > Key: YARN-8984 > URL: https://issues.apache.org/jira/browse/YARN-8984 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Critical > Attachments: YARN-8984-001.patch, YARN-8984-002.patch, > YARN-8984-003.patch > > > In AMRMClient, outstandingSchedRequests should be removed or decreased when > container allocated. However, it could not work when allocation tag is null > or empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
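The defensive keying argued for in this comment can be sketched as follows — a simplified stand-in for AMRMClient's outstanding-request bookkeeping, not the actual Hadoop code; the map shape and method names are invented for illustration. The point is that a null tag set returned by some future `Container` implementation collapses to the same key as an empty one, instead of leaking an entry forever:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OutstandingRequests {
  // Simplified stand-in for the outstanding scheduling-request map,
  // keyed by the container's allocation tags.
  private final Map<Set<String>, Integer> outstanding = new HashMap<>();

  public void add(Set<String> tags, int count) {
    outstanding.merge(normalize(tags), count, Integer::sum);
  }

  // Decrement on allocation; tolerate a null tag set rather than assuming
  // every Container implementation returns an empty set like ContainerPBImpl.
  public void onContainerAllocated(Set<String> allocationTags) {
    Set<String> key = normalize(allocationTags);
    Integer pending = outstanding.get(key);
    if (pending == null) {
      return; // nothing outstanding under this tag set
    }
    if (pending <= 1) {
      outstanding.remove(key);
    } else {
      outstanding.put(key, pending - 1);
    }
  }

  public int pending(Set<String> tags) {
    return outstanding.getOrDefault(normalize(tags), 0);
  }

  // Null and empty tag sets collapse to the same key.
  private static Set<String> normalize(Set<String> tags) {
    return (tags == null) ? Collections.<String>emptySet() : tags;
  }
}
```

With this shape, keeping the null check costs nothing even when the current protobuf implementation never returns null.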
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Attachment: YARN-8933.v3.patch > [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in > allocation response > --- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Summary: [AMRMProxy] Fix potential empty fields in allocation response, move SubClusterTimeout to FederationInterceptor (was: [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response) > [AMRMProxy] Fix potential empty fields in allocation response, move > SubClusterTimeout to FederationInterceptor > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch, > YARN-8933.v3.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678606#comment-16678606 ] Botong Huang edited comment on YARN-8984 at 11/7/18 6:27 PM: - Took a quick look. It is expected for AMRMClient to re-send all outstanding/pending requests after an RM master-slave switch. When a container is allocated, we should remove it from the outstanding list, which is exactly what _removeFromOutstandingSchedulingRequests()_ is doing here. If we are not cleaning it up properly, it is very likely because the RM is not feeding the proper allocationTags into the allocated _Container_ object. So we need to fix this instead of removing the null check here? was (Author: botong): Took a quick look. It is expected for AMRMClient to re-send all pending request after an RM failover. Whenever a container is allocated, we should remove it from the pending list, which is exactly what _removeFromOutstandingSchedulingRequests()_ is doing here. If we are not cleaning it up properly, very likely is it because RM is not feeding in the proper allocationTags in the allocated Container? So we need to fix this instead of removing the null check here? > AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty > -- > > Key: YARN-8984 > URL: https://issues.apache.org/jira/browse/YARN-8984 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Critical > Attachments: YARN-8984-001.patch, YARN-8984-002.patch, > YARN-8984-003.patch > > > In AMRMClient, outstandingSchedRequests should be removed or decreased when > container allocated. However, it could not work when allocation tag is null > or empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678606#comment-16678606 ] Botong Huang commented on YARN-8984: Took a quick look. It is expected for AMRMClient to re-send all pending requests after an RM failover. Whenever a container is allocated, we should remove it from the pending list, which is exactly what _removeFromOutstandingSchedulingRequests()_ is doing here. If we are not cleaning it up properly, it is very likely because the RM is not feeding the proper allocationTags into the allocated Container. So we need to fix this instead of removing the null check here? > AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty > -- > > Key: YARN-8984 > URL: https://issues.apache.org/jira/browse/YARN-8984 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Critical > Attachments: YARN-8984-001.patch, YARN-8984-002.patch, > YARN-8984-003.patch > > > In AMRMClient, outstandingSchedRequests should be removed or decreased when > container allocated. However, it could not work when allocation tag is null > or empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678554#comment-16678554 ] Botong Huang commented on YARN-8984: +[~kkaranasos] > AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty > -- > > Key: YARN-8984 > URL: https://issues.apache.org/jira/browse/YARN-8984 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Critical > Attachments: YARN-8984-001.patch, YARN-8984-002.patch, > YARN-8984-003.patch > > > In AMRMClient, outstandingSchedRequests should be removed or decreased when > container allocated. However, it could not work when allocation tag is null > or empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677330#comment-16677330 ] Botong Huang commented on YARN-8898: Technically most of this info is already there for the proxy, in the _ContainerToken_ in _ContainerLaunchContext_ as well as _AllocateResponse_. This is how the AM will get it and pass it on to its containers later. Anyways, it might be cleaner to go for Solution 2. Let's see what [~subru] thinks. > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > In case of FederationInterceptor#mergeAllocateResponses skips > application_priority in response returned -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675623#comment-16675623 ] Botong Huang commented on YARN-8933: Ah good catch, and thx for reviewing! > [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in > allocation response > --- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675609#comment-16675609 ] Botong Huang commented on YARN-8898: bq. Better option could be pushing along with ApplicationHomeSubCluster the application Submission Context too. And let interceptor query when AM registration happens. If necessary, yes I agree this works. But if you are talking about ApplicationPriority alone, the change would seem big (Router, StateStore, AMRMProxy). Down the line we might need to deal with two sources-of-truth issues (from StateStore vs RM allocate response) as well. On the other hand, the existing priority value is in AllocateResponse and thus we are relying on the RM version rather than the AM version. We can cherry-pick YARN-4170 to 2.7 if needed. For old RM versions where this value is not fed in, I guess we can leave the UAM at the default priority. What do you think? > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > In case of FederationInterceptor#mergeAllocateResponses skips > application_priority in response returned -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7631) ResourceRequest with different Capacity (Resource) overrides each other in RM and thus lost
[ https://issues.apache.org/jira/browse/YARN-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675566#comment-16675566 ] Botong Huang commented on YARN-7631: Please consider directly using _ResourceRequestSetKey_ to replace _SchedulerRequestKey_ for this, thx! > ResourceRequest with different Capacity (Resource) overrides each other in RM > and thus lost > --- > > Key: YARN-7631 > URL: https://issues.apache.org/jira/browse/YARN-7631 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Botong Huang >Assignee: Szilard Nemeth >Priority: Major > Attachments: resourcebug.patch > > > Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> > Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> > ResourceRequestInfo (the actual RR). This means that only RRs with the same > (requestId, priority, resourcename, executionType, resource) will be grouped > and aggregated together. > While in RM side, the mapping is SchedulerRequestKey (RequestId, priority) -> > LocalityAppPlacementAllocator (ResourceName -> RR). > The issue is that in RM side Resource is not in the key to the RR at all. > (Note that executionType is also not in the RM side, but it is fine because > RM handles it separately as container update requests.) This means that under > the same value of (requestId, priority, resourcename), RRs with different > Resource values will be grouped together and override each other in RM. As a > result, some of the container requests are lost and will never be allocated. > Furthermore, since the two RRs are kept under different keys in AMRMClient > side, allocation of RR1 will only trigger cancel for RR1, the pending RR2 > will not get resent either. > I've attached a unit test (resourcebug.patch) which is failing in trunk to > illustrate this issue.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
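The key mismatch described in YARN-7631 can be demonstrated with a toy map — the key string below is a simplified stand-in for the RM-side SchedulerRequestKey plus resource name, not real YARN code. Because the Resource (capacity) value is not part of the key, the second request silently overwrites the first:

```java
import java.util.HashMap;
import java.util.Map;

public class KeyMismatchDemo {
  // Simulate the RM-side mapping, where the key omits the Resource value.
  static int rmSideEntries() {
    Map<String, String> rmSide = new HashMap<>();
    String key = "requestId=1|priority=0|resourceName=*"; // no Resource in key
    rmSide.put(key, "2GB x 5 containers"); // first request
    rmSide.put(key, "4GB x 3 containers"); // second request overrides the first
    return rmSide.size();
  }

  public static void main(String[] args) {
    // One entry survives: the 2GB request is lost and never allocated.
    System.out.println(rmSideEntries());
  }
}
```

On the client side the two requests live under different keys (Resource is part of the client key), which is why cancelling RR1 after its allocation never triggers a resend of RR2.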
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Fix Version/s: 2.10.0 > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673807#comment-16673807 ] Botong Huang commented on YARN-8893: Thanks [~giovanni.fumarola] for the review! Committed to branch-2 as well. > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673748#comment-16673748 ] Botong Huang commented on YARN-8893: This also performs a proper shutdown and connection close in the forceKill case, so that after we force-kill the UAM, our local proxy connection inside the AMRMClientRelayer won't be left open. > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664064#comment-16664064 ] Botong Huang commented on YARN-8898: I meant I don't have the code. Feel free to take a crack at it. Please do it on top of YARN-8933 and use the last responses from all SCs. Since FederationInterceptor sits between AM and RM, I don't think it can get the ApplicationSubmissionContext easily. When AMRMProxy initializes the interceptor pipeline for the AM, it has the ContainerLaunchContext for the AM, and currently it is not passed into the interceptors either. I agree that FederationInterceptor needs more information; I think it is better to use/add fields in the AM RM allocate protocol. Generally it should figure out all information by looking at the communication between AM and (home) RM, e.g. application priority, node labels etc. If application priority can change over time, then I think we should just follow the application priority in the last home RM response (reuse YARN-8933). Whenever it detects a priority change in the home SC, perhaps FederationInterceptor should change the priority of the UAM in secondaries as well. This last part we may or may not need for now; I am okay with both ways. But when we launch the UAM initially, we should definitely make sure to submit it with the same priority as the home SC at that moment. Regarding Router, the current design is that Router only tracks the home SC for an application. The expansion to (which subset of) secondary SCs is solely up to the FederationInterceptor according to proxy policy; Router should not be aware of it. So when the client updates the priority for the app, Router should only update it in the home RM, and leave the rest to FederationInterceptor.
> Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > In case of FederationInterceptor#mergeAllocateResponses skips > application_priority in response returned -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661702#comment-16661702 ] Botong Huang commented on YARN-8933: TestContainerManager failure is not related and tracked under YARN-8672 > [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in > allocation response > --- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661487#comment-16661487 ] Botong Huang commented on YARN-8898: Hi [~bibinchundatt], good questions. bq. Application priority update, what should be the behaviour if subcluster's priorities are not same. Should we be considering appriority of home cluster always? Since I missed this priority field when I ported the code to trunk, right now UAMs in secondary sub-clusters don't set this priority at all in their ApplicationSubmissionContext in UnmanagedApplicationMaster. By browsing the code, my understanding is that once submitted, the app priority won't change any more, correct? I think we should submit the UAMs with the right priority in the first place. bq. Also in case of async response from subcluster we should maintain response order too. bq. If response from home cluster not received during merge of response, probably we have to remember the last response from home cluster. For these two, I am actually already doing this in YARN-8933. Please take a look. I think we should just always take the priority in the last remembered home response. > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > In case of FederationInterceptor#mergeAllocateResponses skips > application_priority in response returned -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
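The "always take the priority from the last remembered home response" idea could look roughly like the following — a hypothetical sketch with invented names of the bookkeeping FederationInterceptor would need, not the actual patch:

```java
public class UamPrioritySketch {
  private Integer lastHomePriority; // null until the home RM has reported one

  // Called for every allocate response from the home sub-cluster; remembers
  // the priority field when present so later UAM launches can reuse it.
  void onHomeResponse(Integer applicationPriority) {
    if (applicationPriority != null) {
      lastHomePriority = applicationPriority;
    }
  }

  // Priority to put in a new UAM's ApplicationSubmissionContext: the last
  // value seen from home, or a default when the home RM never sent one
  // (e.g. an older RM without YARN-4170).
  int priorityForNewUam(int defaultPriority) {
    return lastHomePriority != null ? lastHomePriority : defaultPriority;
  }
}
```

This matches the comment's conclusion: submit the UAM with the home sub-cluster's priority at launch time, and fall back to a default only when the home RM never fed the value in.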
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Attachment: YARN-8933.v2.patch > [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in > allocation response > --- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Summary: [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response (was: [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response) > [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in > allocation response > --- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Attachment: YARN-8933.v1.patch > [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in > allocation response > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8933.v1.patch > > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Issue Type: Sub-task (was: Task) Parent: YARN-5597 > [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in > allocation response > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses from a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. It can even be null/zero because the specific response is merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response
Botong Huang created YARN-8933: -- Summary: [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response Key: YARN-8933 URL: https://issues.apache.org/jira/browse/YARN-8933 Project: Hadoop YARN Issue Type: Task Reporter: Botong Huang Assignee: Botong Huang After YARN-8696, the allocate response by FederationInterceptor is merged from the responses of a random subset of all sub-clusters, depending on the async heartbeat timing. As a result, cluster-wide information fields in the response, e.g. AvailableResources and NumClusterNodes, are not consistent at all. They can even be null/zero because a specific response may be merged from an empty set of sub-cluster responses. In this patch, we let FederationInterceptor remember the last allocate response from all known sub-clusters, and always construct the cluster-wide info fields from all of them. We also moved the sub-cluster timeout from LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that sub-clusters that have expired (haven't had a successful allocate response for a while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response
[ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8933: --- Component/s: federation amrmproxy > [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in > allocation response > -- > > Key: YARN-8933 > URL: https://issues.apache.org/jira/browse/YARN-8933 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > After YARN-8696, the allocate response by FederationInterceptor is merged > from the responses of a random subset of all sub-clusters, depending on the > async heartbeat timing. As a result, cluster-wide information fields in the > response, e.g. AvailableResources and NumClusterNodes, are not consistent at > all. They can even be null/zero because a specific response may be merged from an > empty set of sub-cluster responses. > In this patch, we let FederationInterceptor remember the last allocate > response from all known sub-clusters, and always construct the cluster-wide > info fields from all of them. We also moved the sub-cluster timeout from > LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that > sub-clusters that have expired (haven't had a successful allocate response for a > while) won't be included in the computation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659262#comment-16659262 ] Botong Huang commented on YARN-8893: The cetest failure in NM is being tracked in YARN-8922 > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
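The thread-leak fix in YARN-8893 comes down to making sure every component that owns worker threads tears them down when the interceptor pipeline is destroyed. A generic sketch of that shutdown pattern follows; the names (RelayerSketch, submitHeartbeat) are hypothetical and this is not the actual AMRMClientRelayer code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic sketch of the shutdown pattern behind YARN-8893: a component
// that owns a thread pool must shut it down when it is destroyed,
// otherwise each destroyed interceptor pipeline leaks its worker
// threads. Class and method names are hypothetical.
class RelayerSketch {
    private final ExecutorService heartbeatPool =
        Executors.newSingleThreadExecutor();
    private volatile boolean shutdown = false;

    public void submitHeartbeat(Runnable hb) {
        if (!shutdown) {
            heartbeatPool.submit(hb);
        }
    }

    // Called when the interceptor pipeline is torn down.
    public void shutdownNow() {
        shutdown = true;
        heartbeatPool.shutdownNow(); // interrupt and stop worker threads
        try {
            heartbeatPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isTerminated() {
        return heartbeatPool.isTerminated();
    }
}
```

Without the shutdownNow/awaitTermination step, the pool's thread outlives the component and accumulates across pipeline restarts, which is the leak the patch addresses.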
[jira] [Commented] (YARN-8862) [GPG] Add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655696#comment-16655696 ] Botong Huang commented on YARN-8862: Committed to YARN-7402. Thanks [~bibinchundatt] and [~giovanni.fumarola] for reviewing! > [GPG] Add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, > YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch, > YARN-8862-YARN-7402.v6.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
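The ApplicationCleaner idea from YARN-8862 above can be illustrated with a small sketch: periodically fetch the set of applications the system still knows about, then delete the registry token entries for everything else. The in-memory map below stands in for the real Yarn Registry client, and all names (ApplicationCleanerSketch, the `/federation/apps/` path) are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Rough sketch of the YARN-8862 ApplicationCleaner idea: the registry
// holds one UAM-token entry per application; periodically compare the
// registry entries against the set of applications the StateStore still
// knows about and delete the rest. The map stands in for the real
// registry; all names here are hypothetical.
class ApplicationCleanerSketch {
    // registry path -> token payload
    private final Map<String, String> registry = new TreeMap<>();

    public void put(String appId, String token) {
        registry.put("/federation/apps/" + appId, token);
    }

    public int size() {
        return registry.size();
    }

    // Remove entries whose appId is no longer known, i.e. the app
    // finished and was purged from the StateStore application table.
    public int clean(Set<String> knownApps) {
        Set<String> toDelete = new HashSet<>();
        for (String path : registry.keySet()) {
            String appId = path.substring(path.lastIndexOf('/') + 1);
            if (!knownApps.contains(appId)) {
                toDelete.add(path);
            }
        }
        for (String path : toDelete) {
            registry.remove(path);
        }
        return toDelete.size();
    }
}
```

The real cleaner runs this comparison on a timer inside GPG; the sketch only shows the set-difference step.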
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Attachment: YARN-8862-YARN-7402.v6.patch > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, > YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch, > YARN-8862-YARN-7402.v6.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654120#comment-16654120 ] Botong Huang commented on YARN-8862: Thanks [~bibinchundatt] and [~giovanni.fumarola] for the comments! v5 patch uploaded. > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, > YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Attachment: YARN-8862-YARN-7402.v5.patch > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, > YARN-8862-YARN-7402.v4.patch, YARN-8862-YARN-7402.v5.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Attachment: YARN-8893.v2.patch > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch, YARN-8893.v2.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse
[ https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653757#comment-16653757 ] Botong Huang commented on YARN-8898: Good catch, thanks for reporting! > Fix FederationInterceptor#allocate to set application priority in > allocateResponse > -- > > Key: YARN-8898 > URL: https://issues.apache.org/jira/browse/YARN-8898 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > FederationInterceptor#mergeAllocateResponses skips the > application_priority field in the returned response. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Attachment: YARN-8893.v1.patch > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Component/s: federation amrmproxy > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Issue Type: Sub-task (was: Task) Parent: YARN-5597 > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
Botong Huang created YARN-8893: -- Summary: [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client Key: YARN-8893 URL: https://issues.apache.org/jira/browse/YARN-8893 Project: Hadoop YARN Issue Type: Task Reporter: Botong Huang Assignee: Botong Huang Fix thread leak in AMRMClientRelayer and UAM client used by FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Attachment: YARN-8862-YARN-7402.v4.patch > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, > YARN-8862-YARN-7402.v4.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8481) AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters
[ https://issues.apache.org/jira/browse/YARN-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8481: --- Issue Type: Sub-task (was: Bug) Parent: YARN-5597 > AMRMProxyPolicies should accept heartbeat response from new/unknown > subclusters > --- > > Key: YARN-8481 > URL: https://issues.apache.org/jira/browse/YARN-8481 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Fix For: 2.10.0, 3.2.0, 2.9.2 > > Attachments: YARN-8481.v1.patch > > > Currently BroadcastAMRMProxyPolicy assumes that we only span the application > to the sub-clusters instructed by itself via _splitResourceRequests_. > However, with AMRMProxy HA, second attempts of the application might come up > with multiple sub-clusters initially without consulting the AMRMProxyPolicy > at all. This leads to exceptions in _notifyOfResponse._ It should simply > allow the new/unknown sub-cluster heartbeat responses. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
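The YARN-8481 behavior change can be pictured as follows: instead of throwing when a heartbeat response arrives from a sub-cluster the policy never routed requests to (which happens when a second attempt resumes existing UAMs on its own), the policy simply starts tracking the new sub-cluster. This is an illustrative sketch with hypothetical names, not the real BroadcastAMRMProxyPolicy code.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the YARN-8481 fix: notifyOfResponse tolerates sub-clusters
// the policy did not route to itself, because a second app attempt can
// resume UAMs in sub-clusters chosen before this policy instance
// existed. Names are hypothetical.
class PolicySketch {
    private final Set<String> knownSubClusters = new HashSet<>();

    // Normal path: the policy routes requests and remembers the target.
    public void splitResourceRequests(String subCluster) {
        knownSubClusters.add(subCluster);
    }

    // Before the fix: threw for unknown sub-clusters.
    // After the fix: accept and start tracking them.
    public void notifyOfResponse(String subCluster) {
        knownSubClusters.add(subCluster);
    }

    public boolean knows(String subCluster) {
        return knownSubClusters.contains(subCluster);
    }
}
```

The essential change is that notifyOfResponse is additive rather than a validation point, so resumed UAM heartbeats from unexpected sub-clusters no longer fail the pipeline.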
[jira] [Commented] (YARN-8855) Application fails if one of the subclusters is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645703#comment-16645703 ] Botong Huang commented on YARN-8855: Hi [~bibinchundatt], yes YARN-7652 can be the reason. There's another possibility as well, say YARN-8581, depending on the config setup. If a sub-cluster is gone for longer than some time, SubclusterCleaner in GPG (YARN-6648) will mark the sub-cluster as LOST in the StateStore. AMRMProxy will eventually pick it up. > Application fails if one of the subclusters is down. > > > Key: YARN-8855 > URL: https://issues.apache.org/jira/browse/YARN-8855 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Reporter: Rahul Anand >Priority: Major > > If one of the sub-clusters is down, the application keeps on retrying multiple times > and then it fails. About 30 failover attempts were found in the logs. Below is the > detailed exception. > {code:java} > 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container > container_e03_1538297667953_0005_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093 > 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing > container_e03_1538297667953_0005_01_01 from application > application_1538297667953_0005 | ApplicationImpl.java:512 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping > resource-monitoring for container_e03_1538297667953_0005_01_01 | > ContainersMonitorImpl.java:932 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering > container container_e03_1538297667953_0005_01_01 for log-aggregation | > AppLogAggregatorImpl.java:538 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event > CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping > container
container_e03_1538297667953_0005_01_01 | > YarnShuffleService.java:295 > 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find > container container_e03_1538297667953_0005_01_01 while processing > FINISH_CONTAINERS event | ContainerManagerImpl.java:1660 > 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed > containers from NM context: [container_e03_1538297667953_0005_01_01] | > NodeStatusUpdaterImpl.java:696 > 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. | > FederationStateStoreFacade.java:258 > 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to > /192.168.0.25:8032 subClusterId cluster2 with protocol > ApplicationClientProtocol as user root (auth:SIMPLE) | > FederationRMFailoverProxyProvider.java:145 > 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | > java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to > node-master1-IYTxR:8032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after > 28 failover attempts. Trying to failover after sleeping for 15261ms. | > RetryInvocationHandler.java:411 > 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. 
| > FederationStateStoreFacade.java:258 > 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to > /192.168.0.25:8032 subClusterId cluster2 with protocol > ApplicationClientProtocol as user root (auth:SIMPLE) | > FederationRMFailoverProxyProvider.java:145 > 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | > java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to > node-master1-IYTxR:8032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after > 29 failover attempts. Trying to failover after sleeping for 21175ms. | > RetryInvocationHandler.java:411 > 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > Federati
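The log above shows RetryInvocationHandler sleeping for growing, randomized intervals (15261ms, then 21175ms) between failover attempts. A minimal sketch of that general exponential-backoff-with-jitter pattern follows; this is illustrative only, Hadoop's actual retry behavior is configured through its RetryPolicies machinery and may differ, and the class name is invented.

```java
import java.util.Random;

// Illustrative exponential backoff with jitter, the general pattern
// behind the growing sleep times in the failover log above. Not
// Hadoop's actual retry-policy implementation; names are hypothetical.
class BackoffSketch {
    private final long baseMs;
    private final long maxMs;
    private final Random random;

    BackoffSketch(long baseMs, long maxMs, long seed) {
        this.baseMs = baseMs;
        this.maxMs = maxMs;
        this.random = new Random(seed);
    }

    // Sleep time before the given (1-based) failover attempt:
    // base * 2^(attempt-1), plus up to 20% random jitter, capped at max.
    public long sleepMsFor(int attempt) {
        long exp = Math.min(maxMs, baseMs << Math.min(attempt - 1, 30));
        long jitter = (long) (exp * 0.2 * random.nextDouble());
        return Math.min(maxMs, exp + jitter);
    }
}
```

The jitter spreads out retries from many clients so a recovering RM is not hammered by synchronized reconnection storms.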
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Attachment: YARN-8862-YARN-7402.v3.patch > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Attachment: YARN-8862-YARN-7402.v2.patch > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch, > YARN-8862-YARN-7402.v2.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Attachment: YARN-8862-YARN-7402.v1.patch > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8862-YARN-7402.v1.patch > > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
Botong Huang created YARN-8862: -- Summary: [GPG] add Yarn Registry cleanup in ApplicationCleaner Key: YARN-8862 URL: https://issues.apache.org/jira/browse/YARN-8862 Project: Hadoop YARN Issue Type: Task Reporter: Botong Huang Assignee: Botong Huang In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in secondary sub-clusters. Because there may be more app attempts later, AMRMProxy cannot kill the UAM and delete the tokens when one local attempt finishes. So, similar to the StateStore application table, we need ApplicationCleaner in GPG to clean up the finished app entries in the Yarn Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8862: --- Issue Type: Sub-task (was: Task) Parent: YARN-7402 > [GPG] add Yarn Registry cleanup in ApplicationCleaner > - > > Key: YARN-8862 > URL: https://issues.apache.org/jira/browse/YARN-8862 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > In Yarn Federation, we use the Yarn Registry to store the AMTokens for UAMs in > secondary sub-clusters. Because there may be more app attempts later, > AMRMProxy cannot kill the UAM and delete the tokens when one local attempt > finishes. So, similar to the StateStore application table, we need > ApplicationCleaner in GPG to clean up the finished app entries in the Yarn > Registry. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643789#comment-16643789 ] Botong Huang commented on YARN-7652: Thanks [~goiri] for the review and commit! > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang >Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: YARN-7652.v1.patch, YARN-7652.v2.patch > > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async similar to how we > operate with allocate invocations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
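The YARN-7652 change described above moves the UAM register calls off the critical path so one unreachable sub-cluster (e.g. stale StateStore info) cannot block the AM's own registration. A rough sketch of the async pattern follows; the names are hypothetical and the sleep stands in for a slow or blocked RPC to a sub-cluster RM.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the YARN-7652 idea: fire UAM register calls to secondary
// sub-clusters asynchronously instead of synchronously, so the caller
// returns immediately even if a sub-cluster is unreachable.
// Names and the simulated "register" call are hypothetical.
class AsyncRegisterSketch {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Returns immediately; the slow register runs in the background.
    public CompletableFuture<String> registerAsync(String subCluster,
                                                   long simulatedDelayMs) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(simulatedDelayMs); // stand-in for a slow RPC
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "registered@" + subCluster;
        }, pool);
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

This mirrors how the interceptor already handles allocate calls: the AM-facing call returns without waiting, and the per-sub-cluster result is consumed whenever it arrives.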
[jira] [Commented] (YARN-8855) Application fails if one of the subclusters is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642204#comment-16642204 ] Botong Huang commented on YARN-8855: Thanks [~rahulanand90] for reporting it! Which federation policy (yarn.federation.policy-manager) and code version are you using? This should have been fixed in the latest trunk and branch-2. > Application fails if one of the subclusters is down. > > > Key: YARN-8855 > URL: https://issues.apache.org/jira/browse/YARN-8855 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rahul Anand >Priority: Major > > If one of the sub-clusters is down, the application keeps on retrying multiple times > and then it fails. About 30 failover attempts were found in the logs. Below is the > detailed exception. > {code:java} > 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container > container_e03_1538297667953_0005_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093 > 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing > container_e03_1538297667953_0005_01_01 from application > application_1538297667953_0005 | ApplicationImpl.java:512 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping > resource-monitoring for container_e03_1538297667953_0005_01_01 | > ContainersMonitorImpl.java:932 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering > container container_e03_1538297667953_0005_01_01 for log-aggregation | > AppLogAggregatorImpl.java:538 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event > CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping > container container_e03_1538297667953_0005_01_01 | > YarnShuffleService.java:295 > 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find > container
container_e03_1538297667953_0005_01_01 while processing > FINISH_CONTAINERS event | ContainerManagerImpl.java:1660 > 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed > containers from NM context: [container_e03_1538297667953_0005_01_01] | > NodeStatusUpdaterImpl.java:696 > 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. | > FederationStateStoreFacade.java:258 > 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to > /192.168.0.25:8032 subClusterId cluster2 with protocol > ApplicationClientProtocol as user root (auth:SIMPLE) | > FederationRMFailoverProxyProvider.java:145 > 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | > java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to > node-master1-IYTxR:8032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after > 28 failover attempts. Trying to failover after sleeping for 15261ms. | > RetryInvocationHandler.java:411 > 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. 
| > FederationStateStoreFacade.java:258 > 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to > /192.168.0.25:8032 subClusterId cluster2 with protocol > ApplicationClientProtocol as user root (auth:SIMPLE) | > FederationRMFailoverProxyProvider.java:145 > 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | > java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to > node-master1-IYTxR:8032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after > 29 failover attempts. Trying to failover after sleeping for 21175ms. | > RetryInvocationHandler.java:411 > 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on
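The retry pattern in the log above — fail over to the sub-cluster's RM, sleep an increasing amount, and try again up to a bounded number of attempts — can be sketched as follows. This is an illustrative stand-in, not the actual RetryInvocationHandler or FederationRMFailoverProxyProvider code; the class and constant names are hypothetical, and the sleep values are shrunk so the sketch runs quickly (the real intervals are seconds, as the log shows).

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the failover-retry loop visible in the log above:
// each connection failure triggers a failover plus a bounded, growing sleep.
public class FailoverRetrySketch {
    static final int MAX_FAILOVER_ATTEMPTS = 30; // roughly what the log shows
    static final long BASE_SLEEP_MS = 10;        // illustrative; real values are seconds
    static final long MAX_SLEEP_MS = 200;

    // Sleep before the given (0-based) failover attempt: doubling, then capped.
    static long backoffMs(int attempt) {
        return Math.min(BASE_SLEEP_MS * (1L << Math.min(attempt, 10)), MAX_SLEEP_MS);
    }

    static <T> T invokeWithFailover(Callable<T> rpc) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < MAX_FAILOVER_ATTEMPTS; attempt++) {
            try {
                return rpc.call();
            } catch (java.net.ConnectException e) {
                last = e; // RM unreachable: a real proxy provider would fail over here
                Thread.sleep(backoffMs(attempt));
            }
        }
        throw last; // sub-cluster stayed down for all attempts: surface the failure
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0}; // simulated RM that accepts the third call
        String result = invokeWithFailover(() -> {
            if (++calls[0] < 3) {
                throw new java.net.ConnectException("Connection refused");
            }
            return "submitted";
        });
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "submitted after 3 attempts"
    }
}
```

With a permanently-down sub-cluster, all 30 attempts are exhausted before the submission fails, which matches the ~30 failover attempts reported above.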
[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7652: --- Attachment: YARN-7652.v2.patch > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang >Priority: Major > Attachments: YARN-7652.v1.patch, YARN-7652.v2.patch > > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async similar to how we > operate with allocate invocations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8837) TestNMProxy.testNMProxyRPCRetry Improvement
[ https://issues.apache.org/jira/browse/YARN-8837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640751#comment-16640751 ] Botong Huang commented on YARN-8837: Nvm, it is fixed in YARN-8844 now. Please address Jason's comment as an improvement. Thanks! > TestNMProxy.testNMProxyRPCRetry Improvement > --- > > Key: YARN-8837 > URL: https://issues.apache.org/jira/browse/YARN-8837 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: YARN-8789.1.patch > > > The unit test > {{org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRetry()}} > has had some issues in the past. You can search JIRA for it, but one example > is [YARN-5104]. I recently had some issues with it myself and found the > following change helpful in troubleshooting. > {code:java|title=Current Implementation} > } catch (IOException e) { > // socket exception should be thrown immediately, without RPC retries. > Assert.assertTrue(e instanceof java.net.SocketException); > } > {code} > The issue here is that the test is true/false. The testing framework does > not give me any feedback regarding the type of exception that was thrown; it > just says "assertion failed." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
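The improvement the reporter is after — a failure message that names the exception type actually thrown, instead of a bare "assertion failed" from `assertTrue(e instanceof ...)` — is what JUnit's `Assert.assertThrows` (4.13+) provides. A dependency-free sketch of the same idea, with hypothetical names (this is not the actual TestNMProxy code):

```java
// Sketch: instead of assertTrue(e instanceof SocketException), compare the
// exception class directly so a failure message names the unexpected type.
// Plain-Java stand-in for JUnit's assertThrows.
public class AssertionSketch {
    interface ThrowingRunnable { void run() throws Exception; }

    static <T extends Exception> T expectThrows(Class<T> expected, ThrowingRunnable body) {
        try {
            body.run();
        } catch (Exception actual) {
            if (expected.isInstance(actual)) {
                return expected.cast(actual); // got exactly what we wanted
            }
            // The failure message now names the actual type -- the feedback
            // the original assertTrue(...) did not provide.
            throw new AssertionError("expected " + expected.getName()
                + " but got " + actual.getClass().getName(), actual);
        }
        throw new AssertionError("expected " + expected.getName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        java.net.SocketException e = expectThrows(java.net.SocketException.class,
            () -> { throw new java.net.SocketException("Connection refused"); });
        System.out.println("caught: " + e.getMessage()); // prints "caught: Connection refused"
    }
}
```

Returning the caught exception also lets the test go on to assert on its message, which `assertTrue` could not do.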
[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor
[ https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7652: --- Attachment: YARN-7652.v1.patch > Handle AM register requests asynchronously in FederationInterceptor > --- > > Key: YARN-7652 > URL: https://issues.apache.org/jira/browse/YARN-7652 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Affects Versions: 2.9.0, 3.0.0 >Reporter: Subru Krishnan >Assignee: Botong Huang >Priority: Major > Attachments: YARN-7652.v1.patch > > > We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in > {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has > outdated info about a _SubCluster_. This is because we handle AM register > requests synchronously. This jira proposes to move to async similar to how we > operate with allocate invocations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5597) YARN Federation improvements
[ https://issues.apache.org/jira/browse/YARN-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-5597: --- Attachment: YARN-7652.v1.patch > YARN Federation improvements > > > Key: YARN-5597 > URL: https://issues.apache.org/jira/browse/YARN-5597 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Major > > This umbrella JIRA tracks set of improvements over the YARN Federation MVP > (YARN-2915) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5597) YARN Federation improvements
[ https://issues.apache.org/jira/browse/YARN-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-5597: --- Attachment: (was: YARN-7652.v1.patch) > YARN Federation improvements > > > Key: YARN-5597 > URL: https://issues.apache.org/jira/browse/YARN-5597 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Major > > This umbrella JIRA tracks set of improvements over the YARN Federation MVP > (YARN-2915) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8760) [AMRMProxy] Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer
[ https://issues.apache.org/jira/browse/YARN-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634695#comment-16634695 ] Botong Huang commented on YARN-8760: Thanks [~giovanni.fumarola] for the review and commit! > [AMRMProxy] Fix concurrent re-register due to YarnRM failover in > AMRMClientRelayer > -- > > Key: YARN-8760 > URL: https://issues.apache.org/jira/browse/YARN-8760 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8760.v1.patch > > > When the home YarnRM is failing over, the FinishApplicationMaster call from the AM can > have multiple retry threads outstanding in FederationInterceptor. When the new > YarnRM comes back up, all retry threads will re-register with the YarnRM. The first > one will succeed, but the rest will get an "Application Master is already > registered" exception. We should catch and swallow this exception and move > on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
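The "catch and swallow" behavior the description proposes can be sketched as below. This is a minimal illustration with hypothetical names, not the actual AMRMClientRelayer code: whichever retry thread registers first wins, and later threads that hit the "already registered" error treat it as success rather than propagating it.

```java
// Sketch of swallowing the duplicate-register error during concurrent
// re-register after a YarnRM failover. Names are hypothetical.
public class ReRegisterSketch {
    static final String ALREADY_REGISTERED = "Application Master is already registered";

    // Simulated RM endpoint: the first registration wins, later ones fail.
    static class MockRM {
        private boolean registered = false;
        synchronized void registerApplicationMaster() throws Exception {
            if (registered) {
                throw new Exception(ALREADY_REGISTERED);
            }
            registered = true;
        }
    }

    // Proposed behavior: treat the duplicate-register error as success, since
    // some other retry thread has already completed the registration.
    static boolean reRegister(MockRM rm) throws Exception {
        try {
            rm.registerApplicationMaster();
            return true;  // this thread won the race
        } catch (Exception e) {
            if (e.getMessage() != null && e.getMessage().contains(ALREADY_REGISTERED)) {
                return false; // swallowed: registration is already done
            }
            throw e;          // anything else is a real failure
        }
    }

    public static void main(String[] args) throws Exception {
        MockRM rm = new MockRM();
        System.out.println(reRegister(rm)); // prints "true": first attempt registers
        System.out.println(reRegister(rm)); // prints "false": duplicate is swallowed, not thrown
    }
}
```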
[jira] [Commented] (YARN-8837) TestNMProxy.testNMProxyRPCRetry Improvement
[ https://issues.apache.org/jira/browse/YARN-8837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634523#comment-16634523 ] Botong Huang commented on YARN-8837: Other than improving how the exception message is surfaced, can we try to fix this unit test as well? It is failing in trunk now. > TestNMProxy.testNMProxyRPCRetry Improvement > --- > > Key: YARN-8837 > URL: https://issues.apache.org/jira/browse/YARN-8837 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: YARN-8789.1.patch > > > The unit test > {{org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRetry()}} > has had some issues in the past. You can search JIRA for it, but one example > is [YARN-5104]. I recently had some issues with it myself and found the > following change helpful in troubleshooting. > {code:java|title=Current Implementation} > } catch (IOException e) { > // socket exception should be thrown immediately, without RPC retries. > Assert.assertTrue(e instanceof java.net.SocketException); > } > {code} > The issue here is that the test is true/false. The testing framework does > not give me any feedback regarding the type of exception that was thrown; it > just says "assertion failed." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8760) [AMRMProxy] Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer
[ https://issues.apache.org/jira/browse/YARN-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634229#comment-16634229 ] Botong Huang commented on YARN-8760: TestNMProxy failure is irrelevant and is tracked under YARN-8837 > [AMRMProxy] Fix concurrent re-register due to YarnRM failover in > AMRMClientRelayer > -- > > Key: YARN-8760 > URL: https://issues.apache.org/jira/browse/YARN-8760 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8760.v1.patch > > > When home YarnRM is failing over, FinishApplicationMaster call from AM can > have multiple retry threads outstanding in FederationInterceptor. When new > YarnRM come back up, all retry threads will re-register to YarnRM. The first > one will succeed but the rest will get "Application Master is already > registered" exception. We should catch and swallow this exception and move > on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8760) [AMRMProxy] Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer
[ https://issues.apache.org/jira/browse/YARN-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8760: --- Attachment: YARN-8760.v1.patch > [AMRMProxy] Fix concurrent re-register due to YarnRM failover in > AMRMClientRelayer > -- > > Key: YARN-8760 > URL: https://issues.apache.org/jira/browse/YARN-8760 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8760.v1.patch > > > When home YarnRM is failing over, FinishApplicationMaster call from AM can > have multiple retry threads outstanding in FederationInterceptor. When new > YarnRM come back up, all retry threads will re-register to YarnRM. The first > one will succeed but the rest will get "Application Master is already > registered" exception. We should catch and swallow this exception and move > on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
[ https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630739#comment-16630739 ] Botong Huang commented on YARN-8696: Thanks [~giovanni.fumarola] for the review and commit! > [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async > --- > > Key: YARN-8696 > URL: https://issues.apache.org/jira/browse/YARN-8696 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8696-branch-2.v6.patch, YARN-8696.v1.patch, > YARN-8696.v2.patch, YARN-8696.v3.patch, YARN-8696.v4.patch, > YARN-8696.v5.patch, YARN-8696.v6.patch > > > Today in _FederationInterceptor_, the heartbeat to home sub-cluster is > synchronous. After the heartbeat is sent out to home sub-cluster, it waits > for the home response to come back before merging and returning the (merged) > heartbeat result back to the AM. If home sub-cluster is suffering from connection > issues, or down during a YarnRM master-slave switch, all heartbeat threads > in _FederationInterceptor_ will be blocked waiting for home response. As a > result, the successful UAM heartbeats from secondary sub-clusters will not be > returned to AM at all. Additionally, because of the fact that we kept the > same heartbeat responseId between AM and home RM, lots of tricky handling is > needed regarding the responseId resync when it comes to > _FederationInterceptor_ (part of AMRMProxy, NM) work preserving restart > (YARN-6127, YARN-1336), home RM master-slave switch etc. > In this patch, we change the heartbeat to home sub-cluster to asynchronous, > same as the way we handle UAM heartbeats in secondaries. So that any > sub-cluster down or connection issues won't impact AM getting responses from > other sub-clusters. The responseId is also managed separately for home > sub-cluster and AM, and they increment independently. 
The resync logic > becomes much cleaner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
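The core of the change described above — the AM-facing call returns immediately while the home heartbeat proceeds on its own thread, with the two responseIds incremented independently — can be sketched like this. All names here are hypothetical stand-ins; the real FederationInterceptor bookkeeping (merging UAM responses, resync, work-preserving restart) is far richer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the async home-heartbeat idea: separate responseId counters for
// the AM side and the home-RM side, incremented independently.
public class AsyncHeartbeatSketch {
    final AtomicInteger amResponseId = new AtomicInteger();
    final AtomicInteger homeResponseId = new AtomicInteger();
    final ExecutorService homeThread = Executors.newSingleThreadExecutor();
    volatile String lastHomeResponse = "<none yet>";

    // AM-facing allocate: enqueue the home heartbeat and return immediately
    // with whatever has been merged so far, instead of blocking on the home RM.
    String allocate() {
        homeThread.submit(() -> {
            int id = homeResponseId.incrementAndGet(); // home-side id advances on its own
            lastHomeResponse = "home-response-" + id;  // stand-in for a real RPC result
        });
        return "am-response-" + amResponseId.incrementAndGet()
            + " (last home: " + lastHomeResponse + ")";
    }

    public static void main(String[] args) throws Exception {
        AsyncHeartbeatSketch interceptor = new AsyncHeartbeatSketch();
        // Returns without waiting for the home RM, so a slow or down home
        // sub-cluster cannot block the AM's heartbeat.
        System.out.println(interceptor.allocate());
        interceptor.homeThread.shutdown();
        interceptor.homeThread.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("home responseId advanced to " + interceptor.homeResponseId.get());
    }
}
```

Because the two counters never have to agree, a home-RM failover only requires resyncing the home-side counter, which is the "much cleaner" resync logic the description mentions.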
[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
[ https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8696: --- Attachment: YARN-8696-branch-2.v6.patch > [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async > --- > > Key: YARN-8696 > URL: https://issues.apache.org/jira/browse/YARN-8696 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8696-branch-2.v6.patch, YARN-8696.v1.patch, > YARN-8696.v2.patch, YARN-8696.v3.patch, YARN-8696.v4.patch, > YARN-8696.v5.patch, YARN-8696.v6.patch > > > Today in _FederationInterceptor_, the heartbeat to home sub-cluster is > synchronous. After the heartbeat is sent out to home sub-cluster, it waits > for the home response to come back before merging and returning the (merged) > heartbeat result to back AM. If home sub-cluster is suffering from connection > issues, or down during an YarnRM master-slave switch, all heartbeat threads > in _FederationInterceptor_ will be blocked waiting for home response. As a > result, the successful UAM heartbeats from secondary sub-clusters will not be > returned to AM at all. Additionally, because of the fact that we kept the > same heartbeat responseId between AM and home RM, lots of tricky handling are > needed regarding the responseId resync when it comes to > _FederationInterceptor_ (part of AMRMProxy, NM) work preserving restart > (YARN-6127, YARN-1336), home RM master-slave switch etc. > In this patch, we change the heartbeat to home sub-cluster to asynchronous, > same as the way we handle UAM heartbeats in secondaries. So that any > sub-cluster down or connection issues won't impact AM getting responses from > other sub-clusters. The responseId is also managed separately for home > sub-cluster and AM, and they increment independently. The resync logic > becomes much cleaner. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
[ https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8696: --- Attachment: (was: YARN-8696-branch-2.v6.patch) > [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async > --- > > Key: YARN-8696 > URL: https://issues.apache.org/jira/browse/YARN-8696 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8696.v1.patch, YARN-8696.v2.patch, > YARN-8696.v3.patch, YARN-8696.v4.patch, YARN-8696.v5.patch, YARN-8696.v6.patch > > > Today in _FederationInterceptor_, the heartbeat to home sub-cluster is > synchronous. After the heartbeat is sent out to home sub-cluster, it waits > for the home response to come back before merging and returning the (merged) > heartbeat result to back AM. If home sub-cluster is suffering from connection > issues, or down during an YarnRM master-slave switch, all heartbeat threads > in _FederationInterceptor_ will be blocked waiting for home response. As a > result, the successful UAM heartbeats from secondary sub-clusters will not be > returned to AM at all. Additionally, because of the fact that we kept the > same heartbeat responseId between AM and home RM, lots of tricky handling are > needed regarding the responseId resync when it comes to > _FederationInterceptor_ (part of AMRMProxy, NM) work preserving restart > (YARN-6127, YARN-1336), home RM master-slave switch etc. > In this patch, we change the heartbeat to home sub-cluster to asynchronous, > same as the way we handle UAM heartbeats in secondaries. So that any > sub-cluster down or connection issues won't impact AM getting responses from > other sub-clusters. The responseId is also managed separately for home > sub-cluster and AM, and they increment independently. The resync logic > becomes much cleaner. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang resolved YARN-7599. Resolution: Fixed > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, > YARN-7599-YARN-7402.v8.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
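The cleanup decision at the heart of the ApplicationCleaner described above can be sketched as follows. This is a hypothetical illustration, not the committed GPG code: it mirrors the later review discussion about requiring a minimum number of successful Router reads (cf. the `application.cleaner.router.min.success`-style config mentioned below) before deleting anything, so that a failed or stale read cannot wipe records for live applications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the ApplicationCleaner's "what is safe to delete" decision.
// Names and structure are hypothetical.
public class ApplicationCleanerSketch {
    // Ask the Router for the known applications several times; only clean up
    // when at least `minSuccess` reads succeeded, and only delete records
    // that no successful read knows about.
    static Set<String> recordsToDelete(List<Set<String>> routerReads,
                                       int minSuccess,
                                       Set<String> storeRecords) {
        List<Set<String>> ok = new ArrayList<>();
        for (Set<String> read : routerReads) {
            if (read != null) {       // null models a failed Router call
                ok.add(read);
            }
        }
        if (ok.size() < minSuccess) {
            return Collections.emptySet(); // not confident enough: clean nothing
        }
        Set<String> known = new HashSet<>();
        for (Set<String> read : ok) {
            known.addAll(read);            // union of reads, to be conservative
        }
        Set<String> stale = new HashSet<>(storeRecords);
        stale.removeAll(known);            // only records no router knows about
        return stale;
    }

    public static void main(String[] args) {
        Set<String> store = new HashSet<>(Arrays.asList("app-1", "app-2", "app-3"));
        List<Set<String>> reads = Arrays.asList(
            new HashSet<>(Arrays.asList("app-1")),
            null,                          // one Router call failed
            new HashSet<>(Arrays.asList("app-1", "app-2")));
        System.out.println(recordsToDelete(reads, 2, store)); // prints [app-3]
    }
}
```

The same set-difference would then drive both deletions the description calls for: old StateStore application records and leftover Yarn Registry entries (YARN-6128).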
[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624388#comment-16624388 ] Botong Huang commented on YARN-7599: Committed to YARN-7402. Thanks [~bibinchundatt] for the review! > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, > YARN-7599-YARN-7402.v8.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624053#comment-16624053 ] Botong Huang commented on YARN-7599: Ah good point. v8 uploaded. Will commit pending on yetus. Thanks! > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, > YARN-7599-YARN-7402.v8.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7599: --- Attachment: YARN-7599-YARN-7402.v8.patch > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, > YARN-7599-YARN-7402.v8.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
[ https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8696: --- Attachment: YARN-8696-branch-2.v6.patch > [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async > --- > > Key: YARN-8696 > URL: https://issues.apache.org/jira/browse/YARN-8696 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8696-branch-2.v6.patch, YARN-8696.v1.patch, > YARN-8696.v2.patch, YARN-8696.v3.patch, YARN-8696.v4.patch, > YARN-8696.v5.patch, YARN-8696.v6.patch > > > Today in _FederationInterceptor_, the heartbeat to home sub-cluster is > synchronous. After the heartbeat is sent out to home sub-cluster, it waits > for the home response to come back before merging and returning the (merged) > heartbeat result to back AM. If home sub-cluster is suffering from connection > issues, or down during an YarnRM master-slave switch, all heartbeat threads > in _FederationInterceptor_ will be blocked waiting for home response. As a > result, the successful UAM heartbeats from secondary sub-clusters will not be > returned to AM at all. Additionally, because of the fact that we kept the > same heartbeat responseId between AM and home RM, lots of tricky handling are > needed regarding the responseId resync when it comes to > _FederationInterceptor_ (part of AMRMProxy, NM) work preserving restart > (YARN-6127, YARN-1336), home RM master-slave switch etc. > In this patch, we change the heartbeat to home sub-cluster to asynchronous, > same as the way we handle UAM heartbeats in secondaries. So that any > sub-cluster down or connection issues won't impact AM getting responses from > other sub-clusters. The responseId is also managed separately for home > sub-cluster and AM, and they increment independently. The resync logic > becomes much cleaner. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7599: --- Attachment: YARN-7599-YARN-7402.v7.patch > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
[ https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8696: --- Attachment: YARN-8696.v6.patch > [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async > --- > > Key: YARN-8696 > URL: https://issues.apache.org/jira/browse/YARN-8696 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8696.v1.patch, YARN-8696.v2.patch, > YARN-8696.v3.patch, YARN-8696.v4.patch, YARN-8696.v5.patch, YARN-8696.v6.patch > > > Today in _FederationInterceptor_, the heartbeat to home sub-cluster is > synchronous. After the heartbeat is sent out to home sub-cluster, it waits > for the home response to come back before merging and returning the (merged) > heartbeat result to back AM. If home sub-cluster is suffering from connection > issues, or down during an YarnRM master-slave switch, all heartbeat threads > in _FederationInterceptor_ will be blocked waiting for home response. As a > result, the successful UAM heartbeats from secondary sub-clusters will not be > returned to AM at all. Additionally, because of the fact that we kept the > same heartbeat responseId between AM and home RM, lots of tricky handling are > needed regarding the responseId resync when it comes to > _FederationInterceptor_ (part of AMRMProxy, NM) work preserving restart > (YARN-6127, YARN-1336), home RM master-slave switch etc. > In this patch, we change the heartbeat to home sub-cluster to asynchronous, > same as the way we handle UAM heartbeats in secondaries. So that any > sub-cluster down or connection issues won't impact AM getting responses from > other sub-clusters. The responseId is also managed separately for home > sub-cluster and AM, and they increment independently. The resync logic > becomes much cleaner. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-7599: --- Attachment: YARN-7599-YARN-7402.v6.patch > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, > YARN-7599-YARN-7402.v6.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621216#comment-16621216 ] Botong Huang commented on YARN-7599: Thanks [~bibinchundatt] for the comment! v6 patch uploaded. bq. I was thinking of disabling cleaner while the GPG service is live I see. Yeah, let's leave it as future work. For now restarting GPG will do, it is an out-of-band service anyways. bq. Can you change to single configuration similar to dfs.http.client.retry.policy.spec {min,max,interval} I already changed the new configs to something like application.cleaner.router.min.success. This is what you meant, right? Somehow the link from the yetus run hasn't worked at all. I think the checkstyle run has some build issue. I just rebased the YARN-7402 base branch to the latest trunk, let's see. > [GPG] ApplicationCleaner in Global Policy Generator > --- > > Key: YARN-7599 > URL: https://issues.apache.org/jira/browse/YARN-7599 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Labels: federation, gpg > Attachments: YARN-7599-YARN-7402.v1.patch, > YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, > YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch > > > In Federation, we need a cleanup service for StateStore as well as Yarn > Registry. For the former, we need to remove old application records. For the > latter, failed and killed applications might leave records in the Yarn > Registry (see YARN-6128). We plan to do both cleanup work in > ApplicationCleaner in GPG -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8696) [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async
[ https://issues.apache.org/jira/browse/YARN-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621060#comment-16621060 ] Botong Huang commented on YARN-8696: Unit test failure in TestCapacityOverTimePolicy is irrelevant. > [AMRMProxy] FederationInterceptor upgrade: home sub-cluster heartbeat async > --- > > Key: YARN-8696 > URL: https://issues.apache.org/jira/browse/YARN-8696 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8696.v1.patch, YARN-8696.v2.patch, > YARN-8696.v3.patch, YARN-8696.v4.patch, YARN-8696.v5.patch > > > Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is > synchronous. After the heartbeat is sent out to the home sub-cluster, it waits > for the home response to come back before merging and returning the (merged) > heartbeat result back to the AM. If the home sub-cluster is suffering from connection > issues, or is down during a YARN RM master-slave switch, all heartbeat threads > in _FederationInterceptor_ will be blocked waiting for the home response. As a > result, successful UAM heartbeats from secondary sub-clusters will not be > returned to the AM at all. Additionally, because we kept the > same heartbeat responseId between the AM and the home RM, a lot of tricky handling is > needed for responseId resync during > _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving restart > (YARN-6127, YARN-1336), home RM master-slave switches, etc. > In this patch, we change the heartbeat to the home sub-cluster to asynchronous, > the same way we handle UAM heartbeats in the secondaries, so that a > sub-cluster outage or connection issue won't prevent the AM from getting responses from > other sub-clusters. The responseId is also managed separately for the home > sub-cluster and the AM, and the two increment independently. The resync logic > becomes much cleaner.
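The asynchronous home heartbeat described in the patch can be sketched as below. This is a simplified illustration with hypothetical names (AsyncHomeHeartbeatSketch, sendHomeHeartbeat, allocate), not the actual FederationInterceptor code: home responses land on a queue instead of blocking the AM-facing call, and the responseId sequences toward the AM and toward the home RM are tracked independently.

```java
import java.util.*;
import java.util.concurrent.*;

public class AsyncHomeHeartbeatSketch {
    // Home RM responses accumulate here; the AM-facing allocate() call
    // drains whatever has arrived instead of blocking on the home RM.
    final BlockingQueue<String> homeResponses = new LinkedBlockingQueue<>();
    final ExecutorService heartbeatThread = Executors.newSingleThreadExecutor();

    // Separate responseId sequences: one toward the AM, one toward the
    // home RM. Because they increment independently, a resync with one
    // side never disturbs the numbering on the other.
    int amResponseId = 0;
    int homeResponseId = 0;

    // Fire-and-forget heartbeat to the home sub-cluster (simulated here by
    // immediately producing a response on the background thread).
    void sendHomeHeartbeat() {
        final int id = homeResponseId++;
        heartbeatThread.submit(() -> homeResponses.offer("home-response-" + id));
    }

    // AM-facing allocate: merge the home responses that have arrived so far
    // with the interceptor's own reply, never waiting on a slow or down home RM.
    List<String> allocate() {
        List<String> merged = new ArrayList<>();
        homeResponses.drainTo(merged);
        merged.add("am-responseId=" + (amResponseId++));
        return merged;
    }

    public static void main(String[] args) throws Exception {
        AsyncHomeHeartbeatSketch fi = new AsyncHomeHeartbeatSketch();
        fi.sendHomeHeartbeat();
        fi.heartbeatThread.shutdown();
        fi.heartbeatThread.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(fi.allocate()); // prints [home-response-0, am-responseId=0]
    }
}
```

The key property is that allocate() uses drainTo rather than take(): if the home sub-cluster is down, the merged result simply contains only the secondary responses, so the AM keeps making progress.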