[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802749#comment-17802749 ]

Shilun Fan commented on YARN-4971:
----------------------------------

Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0.

> RM fails to re-bind to wildcard IP after failover in multi homed clusters
> -------------------------------------------------------------------------
>
>              Key: YARN-4971
>              URL: https://issues.apache.org/jira/browse/YARN-4971
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: resourcemanager
> Affects Versions: 2.7.2
>         Reporter: Wilfred Spiegelenburg
>         Assignee: Wilfred Spiegelenburg
>         Priority: Major
>      Attachments: YARN-4971.1.patch
>
>
> If the RM has {{yarn.resourcemanager.bind-host}} set to 0.0.0.0, binding to
> the wildcard works as expected the first time the service becomes active. If
> the service transitions from active to standby and then becomes active again
> after a failover, it binds to only one of the IP addresses.
> The services inside the RM behave differently: this only seems to happen for
> the services listening on ports 8030 and 8032.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194378#comment-17194378 ]

Wangda Tan commented on YARN-4971:
----------------------------------

I think we should revisit the patch based on the comment from Karthik:
https://issues.apache.org/jira/browse/YARN-4971?focusedCommentId=15281097=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15281097

I also don't quite understand why the following two pieces of code in ClientRMService differ.

One is:
{code:java}
InetSocketAddress getBindAddress(Configuration conf) {
  return conf.getSocketAddr(
      YarnConfiguration.RM_BIND_HOST,
      YarnConfiguration.RM_ADDRESS,
      YarnConfiguration.DEFAULT_RM_ADDRESS,
      YarnConfiguration.DEFAULT_RM_PORT);
}
{code}
And the other is:
{code:java}
clientBindAddress = conf.updateConnectAddr(
    YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}
Basically, serviceInit and serviceStart obtain the RM address in different ways. Is that a potential root cause of the problem?

[~wilfreds], [~shuzirra]
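As a rough illustration of how the two calls quoted above could disagree, the sketch below models them over a plain map. It is not Hadoop's {{Configuration}}: {{MiniConf}}, its method names and the host {{rm1.example.com}} are made up for the example. The model assumes that the init-time lookup lets the bind host replace the host part of the address, while the start-time update keeps the advertised host, takes the listener's port, and writes the result back.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the two address lookups; NOT Hadoop's Configuration class. */
final class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
        props.put(key, value);
    }

    /** Init-time lookup (assumed behaviour): the bind host, if set, replaces the host part. */
    InetSocketAddress bindAddress(String hostKey, String addressKey, String defaultAddress) {
        String[] hostPort = props.getOrDefault(addressKey, defaultAddress).split(":");
        int port = Integer.parseInt(hostPort[1]);
        String bindHost = props.get(hostKey);
        String host = (bindHost == null || bindHost.isEmpty()) ? hostPort[0] : bindHost;
        // Unresolved addresses keep the example independent of DNS.
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Start-time update (assumed behaviour): advertised host plus listener port, stored back. */
    InetSocketAddress updateConnectAddress(String addressKey, String defaultAddress,
                                           InetSocketAddress listener) {
        String host = props.getOrDefault(addressKey, defaultAddress).split(":")[0];
        InetSocketAddress connect = InetSocketAddress.createUnresolved(host, listener.getPort());
        props.put(addressKey, host + ":" + connect.getPort());
        return connect;
    }
}

final class MiniConfDemo {
    /** Returns the host seen at init time and the host returned at start time. */
    static String[] compare() {
        MiniConf conf = new MiniConf();
        conf.set("yarn.resourcemanager.bind-host", "0.0.0.0");
        conf.set("yarn.resourcemanager.address", "rm1.example.com:8032"); // hypothetical host
        InetSocketAddress bind = conf.bindAddress(
            "yarn.resourcemanager.bind-host", "yarn.resourcemanager.address", "0.0.0.0:8032");
        InetSocketAddress connect = conf.updateConnectAddress(
            "yarn.resourcemanager.address", "0.0.0.0:8032", bind);
        return new String[] { bind.getHostString(), connect.getHostString() };
    }
}
```

In this model the init-time value is the wildcard and the start-time value is a specific host. If the second value is stored in the same field as the first, any later start that reuses the field would no longer bind to the wildcard. Whether the RM actually reuses it after a failover is the open question in this thread.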
[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281097#comment-15281097 ]

Karthik Kambatla commented on YARN-4971:
----------------------------------------

I must be missing something, but can't figure out why not setting the variable helps here.

If I understand the code correctly, the individual variables {{clientBindAddress}} and {{masterServiceAddress}} are used only in tests and the one other place in {{DelegationTokenRenewer}} that Daniel pointed out.

Both ClientRMService and ApplicationMasterService are part of RMActiveServices. On transition to standby, both services are inited again to be started when the RM transitions back to active. This code path, in theory at least, shouldn't be different from the first time around.

Am I missing something or misreading the code?
[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258327#comment-15258327 ]

Daniel Templeton commented on YARN-4971:
----------------------------------------

I agree with [~rchiang]'s test suggestion. I'm surprised the patch doesn't break any tests, since it changes the behavior of the {{getBindAddress()}} method, which is used for testing.

I suspect the patch may break with security enabled, because the {{DelegationTokenRenewer.setLocalSecretManagerAndServiceAddr()}} method relies on {{ClientRMService.getBindAddress()}}. Security code generally doesn't like non-specific addresses. It would be good to do a thorough test with Kerberos enabled to verify. I don't know what operation uses {{DelegationTokenRenewer.setLocalSecretManagerAndServiceAddr()}}, but it shouldn't be too hard to figure out.
[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254507#comment-15254507 ]

Ray Chiang commented on YARN-4971:
----------------------------------

+1 (non-binding). The only new test I can think of would be to verify that the member variable holding the address stays at 0.0.0.0 if it is initially 0.0.0.0. That would mainly be useful as a "spec" for the class behavior.
[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249961#comment-15249961 ]

Wilfred Spiegelenburg commented on YARN-4971:
---------------------------------------------

During service init, the bind address is calculated from all the settings and stored as an InetSocketAddress in a local variable. During startup, the server is created using that socket address. This all works.

In the {{ApplicationMasterService}} and the {{ClientRMService}}, the bind address is updated after the service start as part of the config update. This update does not happen in the other services. The new value is the return value of {{updateConnectAddr()}}, whose javadoc states:
bq. The wildcard address is replaced with the local host's address.

Since we store this return value in the bind address, we break binding to the wildcard address in later cycles.

> RM fails to re-bind to wildcard IP after failover in multi homed clusters
> -------------------------------------------------------------------------
>
>              Key: YARN-4971
>              URL: https://issues.apache.org/jira/browse/YARN-4971
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: resourcemanager
> Affects Versions: 3.0.0
>         Reporter: Wilfred Spiegelenburg
>         Assignee: Wilfred Spiegelenburg
>
> If the RM has {{yarn.resourcemanager.bind-host}} set to 0.0.0.0, binding to
> the wildcard works as expected the first time the service becomes active. If
> the service transitions from active to standby and then becomes active again
> after a failover, it binds to only one of the IP addresses.
> The services inside the RM behave differently: this only seems to happen for
> the services listening on ports 8030 and 8032.
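The mechanism described in this comment can be sketched in a few lines of plain Java. This is a toy model, not the RM code: {{BindAddressSketch}} and its methods are hypothetical, and the loopback address stands in for "the local host's address" so that the example does not depend on DNS.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

/** Toy model of a service that keeps its bind address in a field. Not Hadoop code. */
final class BindAddressSketch {
    private InetSocketAddress bindAddress;

    /** Init step: derive the bind address from the configured bind host and port. */
    void init(String bindHost, int port) {
        bindAddress = new InetSocketAddress(bindHost, port);
    }

    /** Stand-in for a call whose return value replaces the wildcard with a specific host. */
    static InetSocketAddress toConnectAddress(InetSocketAddress addr) {
        InetAddress ip = addr.getAddress();
        if (ip != null && ip.isAnyLocalAddress()) {
            // Assumption: loopback plays the role of the local host's address here.
            return new InetSocketAddress(InetAddress.getLoopbackAddress(), addr.getPort());
        }
        return addr;
    }

    /** Start step as described in the comment: the returned value overwrites the field. */
    void startOverwriting() {
        bindAddress = toConnectAddress(bindAddress);
    }

    /** Alternative start step: compute the connect address, leave the bind address alone. */
    InetSocketAddress startPreserving() {
        return toConnectAddress(bindAddress);
    }

    /** True while the stored bind address is still the wildcard. */
    boolean isWildcard() {
        InetAddress ip = bindAddress.getAddress();
        return ip != null && ip.isAnyLocalAddress();
    }
}
```

After {{init("0.0.0.0", 8032)}}, {{isWildcard()}} is true. It becomes false after {{startOverwriting()}} and stays true after {{startPreserving()}}. The sketch only shows that storing the returned value loses the wildcard. It does not establish that the RM reuses the stored field on a later transition to active.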