[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.

2018-10-10 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645703#comment-16645703
 ] 

Botong Huang commented on YARN-8855:


Hi [~bibinchundatt], yes, YARN-7652 could be the reason. There are other 
possibilities as well, such as YARN-8581, depending on the config setup. 

If a sub-cluster has been gone for longer than a certain period, 
SubclusterCleaner in GPG (YARN-6648) will mark the sub-cluster as LOST in the 
StateStore. AMRMProxy will eventually pick up the change. 
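
The cleanup behaviour described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual GPG code: the class CleanerSketch, the State enum, the heartbeat map, and the 30-second expiration are all hypothetical stand-ins chosen for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; the names here are hypothetical, not Hadoop APIs. */
public class CleanerSketch {

  /** Simplified stand-in for the sub-cluster states kept in the StateStore. */
  enum State { RUNNING, LOST }

  /**
   * Returns a copy of the states in which every RUNNING sub-cluster whose last
   * heartbeat is older than expirationMs (or missing entirely) is marked LOST.
   */
  static Map<String, State> markLost(Map<String, State> states,
      Map<String, Long> lastHeartbeatMs, long nowMs, long expirationMs) {
    Map<String, State> result = new LinkedHashMap<>(states);
    for (Map.Entry<String, State> e : states.entrySet()) {
      Long last = lastHeartbeatMs.get(e.getKey());
      // Assumption: a sub-cluster with no recorded heartbeat counts as expired.
      boolean expired = last == null || nowMs - last > expirationMs;
      if (e.getValue() == State.RUNNING && expired) {
        result.put(e.getKey(), State.LOST);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, State> states = new LinkedHashMap<>();
    states.put("cluster1", State.RUNNING);
    states.put("cluster2", State.RUNNING);

    Map<String, Long> heartbeats = new HashMap<>();
    heartbeats.put("cluster1", 99_000L);  // last heartbeat 1 s before "now"
    heartbeats.put("cluster2", 10_000L);  // last heartbeat 90 s before "now"

    // Assumed expiration of 30 s; the real threshold depends on the GPG config.
    System.out.println(markLost(states, heartbeats, 100_000L, 30_000L));
  }
}
```

Until such a cleaner runs and consumers refresh their view of the store, a dead sub-cluster would still appear as RUNNING, which fits the "eventually" in the comment above.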

> Application fails if one of the sublcluster is down.
> 
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
> Issue Type: Bug
> Components: federation
> Reporter: Rahul Anand
> Priority: Major
>
> If one of the sub-clusters is down, the application keeps retrying multiple 
> times and then fails. About 30 failover attempts were found in the logs. 
> Below is the detailed exception. 
> {code:java}
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container 
> container_e03_1538297667953_0005_01_01 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing 
> container_e03_1538297667953_0005_01_01 from application 
> application_1538297667953_0005 | ApplicationImpl.java:512
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> resource-monitoring for container_e03_1538297667953_0005_01_01 | 
> ContainersMonitorImpl.java:932
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering 
> container container_e03_1538297667953_0005_01_01 for log-aggregation | 
> AppLogAggregatorImpl.java:538
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event 
> CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> container container_e03_1538297667953_0005_01_01 | 
> YarnShuffleService.java:295
> 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find 
> container container_e03_1538297667953_0005_01_01 while processing 
> FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
> 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed 
> containers from NM context: [container_e03_1538297667953_0005_01_01] | 
> NodeStatusUpdaterImpl.java:696
> 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 28 failover attempts. Trying to failover after sleeping for 15261ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 29 failover attempts. Trying to failover after sleeping for 21175ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> 

[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.

2018-10-10 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645291#comment-16645291
 ] 

Bibin A Chundatt commented on YARN-8855:


[~botong], adding one observation. 

When filtering for active sub-clusters ({{FederationStateStoreFacade}} -> 
{{FederationStateStore}}), the state store relies on the 
{{info.getState().isActive()}} check, which tests whether the state is 
{{SC_RUNNING}}. In the case of a proper RM shutdown, this check seems fine.

But in the case of a machine failure or process kill, the state could still be 
RUNNING even though the RMs are unavailable.

This could result in unwanted retries and connection attempts to failed 
clusters, right? 
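
The concern above can be illustrated with a minimal sketch. It is a simplified stand-in, not the Hadoop classes themselves: ActiveFilterSketch, its two-value State enum, and the sub-cluster names are hypothetical, and only the "active means SC_RUNNING" rule is taken from the observation above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; a simplified stand-in, not the Hadoop classes. */
public class ActiveFilterSketch {

  /** Hypothetical subset of sub-cluster states. */
  enum State {
    SC_RUNNING, SC_LOST;

    /** Mirrors the check described above: active means SC_RUNNING. */
    boolean isActive() {
      return this == SC_RUNNING;
    }
  }

  /** Keeps only the sub-clusters whose recorded state is active. */
  static List<String> activeSubClusters(Map<String, State> recorded) {
    List<String> active = new ArrayList<>();
    for (Map.Entry<String, State> e : recorded.entrySet()) {
      if (e.getValue().isActive()) {
        active.add(e.getKey());
      }
    }
    return active;
  }

  public static void main(String[] args) {
    Map<String, State> recorded = new LinkedHashMap<>();
    recorded.put("cluster1", State.SC_RUNNING);
    // Assume cluster2's RM process was killed: nothing updated the store,
    // so the recorded state is still SC_RUNNING.
    recorded.put("cluster2", State.SC_RUNNING);
    // Assume cluster3 has already been marked lost.
    recorded.put("cluster3", State.SC_LOST);

    // cluster2 passes the filter even though its RM is unreachable.
    System.out.println(activeSubClusters(recorded));
  }
}
```

The filter only sees the recorded state, so a crashed sub-cluster stays a candidate until something updates the store.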


[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.

2018-10-10 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645275#comment-16645275
 ] 

Bibin A Chundatt commented on YARN-8855:


[~rahulanand90], YARN-7652 should solve the issue mentioned. [~botong], 
correct me if I am wrong.


[jira] [Commented] (YARN-8855) Application fails if one of the sublcluster is down.

2018-10-08 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642204#comment-16642204
 ] 

Botong Huang commented on YARN-8855:


Thanks [~rahulanand90] for reporting it! Which federation policy 
(yarn.federation.policy-manager) and code version are you using? This should 
have been fixed in the latest trunk and branch-2.
