[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645703#comment-16645703 ]

Botong Huang commented on YARN-8855:

Hi [~bibinchundatt], yes, YARN-7652 can be the reason. There are other possibilities as well, such as YARN-8581, depending on the config setup. If a sub-cluster is gone for longer than a certain time, SubclusterCleaner in GPG (YARN-6648) will mark the sub-cluster as LOST in the StateStore. AMRMProxy will eventually pick it up.

> Application fails if one of the sublcluster is down.
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
> Issue Type: Bug
> Components: federation
> Reporter: Rahul Anand
> Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying and then fails. About 30 failover attempts were found in the logs. Below is the detailed exception.
> {code:java}
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container container_e03_1538297667953_0005_01_01 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing container_e03_1538297667953_0005_01_01 from application application_1538297667953_0005 | ApplicationImpl.java:512
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping resource-monitoring for container_e03_1538297667953_0005_01_01 | ContainersMonitorImpl.java:932
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering container container_e03_1538297667953_0005_01_01 for log-aggregation | AppLogAggregatorImpl.java:538
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping container container_e03_1538297667953_0005_01_01 | YarnShuffleService.java:295
> 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container container_e03_1538297667953_0005_01_01 while processing FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
> 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed containers from NM context: [container_e03_1538297667953_0005_01_01] | NodeStatusUpdaterImpl.java:696
> 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
> 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 failover attempts. Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411
> 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258
> 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 failover attempts. Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 |
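
The cleanup described in the comment above (a sub-cluster that has been gone for too long gets marked LOST in the state store) can be pictured with a minimal sketch. This is not the SubclusterCleaner code from YARN-6648: the class HeartbeatExpirySketch, its nested State and SubCluster types, the markLost method, and the 30-minute expiry are all hypothetical names and values chosen only for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a state-store cleaner; not Hadoop code.
final class HeartbeatExpirySketch {

  // Assumed two-value state model; the real state set is larger.
  enum State { SC_RUNNING, SC_LOST }

  // Assumed minimal record of what the state store knows about a sub-cluster.
  static final class SubCluster {
    final String id;
    State state;
    final Instant lastHeartbeat;

    SubCluster(String id, State state, Instant lastHeartbeat) {
      this.id = id;
      this.state = state;
      this.lastHeartbeat = lastHeartbeat;
    }
  }

  // Marks every RUNNING sub-cluster whose last heartbeat is older than
  // `expiry` as SC_LOST and returns the ids that were changed.
  static List<String> markLost(Map<String, SubCluster> store, Instant now, Duration expiry) {
    List<String> changed = new ArrayList<>();
    for (SubCluster sc : store.values()) {
      boolean stale = Duration.between(sc.lastHeartbeat, now).compareTo(expiry) > 0;
      if (sc.state == State.SC_RUNNING && stale) {
        sc.state = State.SC_LOST;
        changed.add(sc.id);
      }
    }
    return changed;
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2018-10-08T14:22:00Z");
    Map<String, SubCluster> store = new LinkedHashMap<>();
    store.put("cluster1", new SubCluster("cluster1", State.SC_RUNNING, now.minusSeconds(10)));
    store.put("cluster2", new SubCluster("cluster2", State.SC_RUNNING, now.minusSeconds(3600)));
    // Assumed expiry of 30 minutes, for illustration only.
    System.out.println(markLost(store, now, Duration.ofMinutes(30))); // prints [cluster2]
  }
}
```

Until such a pass runs, a crashed sub-cluster keeps its last recorded state, which is consistent with the repeated failover attempts against cluster2 in the log above.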
[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645291#comment-16645291 ]

Bibin A Chundatt commented on YARN-8855:

[~botong], adding one observation: on the path {{FederationStateStoreFacade}} -> {{FederationStateStore}}, the state store filters out inactive sub-clusters using the {{info.getState().isActive()}} check, which tests whether the state is {{SC_RUNNING}}. After a proper RM shutdown this check seems fine. But after a machine failure or a process kill, the state could still be RUNNING even though the RMs are unavailable. This could result in unwanted retries and connection attempts to failed clusters, right?

> Application fails if one of the sublcluster is down.
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rahul Anand
> Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying and then fails. About 30 failover attempts were found in the logs. Below is the detailed exception.
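
The observation in the comment above, that a filter based only on the recorded state keeps returning a sub-cluster whose RM died without updating the store, can be shown with a small sketch. It models only the check quoted in the comment (active means the state is SC_RUNNING); the class ActiveFilterSketch, the method activeSubClusters, and the two-value State enum are hypothetical and are not the FederationStateStore API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a state-only "active" filter; not Hadoop code.
final class ActiveFilterSketch {

  // Assumed two-value state model; isActive() mirrors the check quoted in
  // the comment: a sub-cluster is active when its state is SC_RUNNING.
  enum State {
    SC_RUNNING, SC_LOST;

    boolean isActive() {
      return this == SC_RUNNING;
    }
  }

  // Returns the ids whose recorded state passes the isActive() check.
  // Nothing here verifies that the RM is actually reachable.
  static List<String> activeSubClusters(Map<String, State> recorded) {
    List<String> active = new ArrayList<>();
    for (Map.Entry<String, State> entry : recorded.entrySet()) {
      if (entry.getValue().isActive()) {
        active.add(entry.getKey());
      }
    }
    return active;
  }

  public static void main(String[] args) {
    Map<String, State> recorded = new LinkedHashMap<>();
    recorded.put("cluster1", State.SC_RUNNING);
    // Assumption for the example: cluster2's RM process was killed, so its
    // record was never moved out of SC_RUNNING.
    recorded.put("cluster2", State.SC_RUNNING);
    recorded.put("cluster3", State.SC_LOST);
    System.out.println(activeSubClusters(recorded)); // prints [cluster1, cluster2]
  }
}
```

In this sketch cluster2 is still offered to callers, so any client that picks it will retry against an unreachable RM until some other mechanism changes the recorded state.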
[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645275#comment-16645275 ]

Bibin A Chundatt commented on YARN-8855:

[~rahulanand90], YARN-7652 should solve the issue mentioned. [~botong], correct me if I am wrong.

> Application fails if one of the sublcluster is down.
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rahul Anand
> Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying and then fails. About 30 failover attempts were found in the logs. Below is the detailed exception.
[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642204#comment-16642204 ]

Botong Huang commented on YARN-8855:

Thanks [~rahulanand90] for reporting it! Which federation policy (yarn.federation.policy-manager) and code version are you using? This should have been fixed in the latest trunk and branch-2.

> Application fails if one of the sublcluster is down.
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rahul Anand
> Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying and then fails. About 30 failover attempts were found in the logs. Below is the detailed exception.