[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122403#comment-13122403 ]

Todd Lipcon commented on HDFS-1973:
-----------------------------------

+1, LGTM.

> HA: HDFS clients must handle namenode failover and switch over to the new
> active namenode.
> -------------------------------------------------------------------------
>
>          Key: HDFS-1973
>          URL: https://issues.apache.org/jira/browse/HDFS-1973
>      Project: Hadoop HDFS
>   Issue Type: Sub-task
>     Reporter: Suresh Srinivas
>     Assignee: Aaron T. Myers
>  Attachments: HDFS-1973-HDFS-1623.patch, HDFS-1973-HDFS-1623.patch,
>               HDFS-1973-HDFS-1623.patch, HDFS-1973-HDFS-1623.patch,
>               HDFS-1973-HDFS-1623.patch, HDFS-1973-HDFS-1623.patch,
>               HDFS-1973-HDFS-1623.patch, hdfs-1973.0.patch
>
> During failover, a client must detect the failure of the current active
> namenode and switch over to the new active namenode. The switchover might
> make use of IP failover or something more elaborate, such as ZooKeeper, to
> discover the new active namenode.

--
This message is automatically generated by JIRA. If you think it was sent
incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122381#comment-13122381 ]

Hadoop QA commented on HDFS-1973:
---------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12498050/HDFS-1973-HDFS-1623.patch
against trunk revision.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1346//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1346//console

This message is automatically generated.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121784#comment-13121784 ]

Hadoop QA commented on HDFS-1973:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12497950/HDFS-1973-HDFS-1623.patch
against trunk revision.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1342//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1342//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1342//console

This message is automatically generated.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121729#comment-13121729 ]

Hadoop QA commented on HDFS-1973:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12497935/HDFS-1973-HDFS-1623.patch
against trunk revision.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1340//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1340//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1340//console

This message is automatically generated.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121690#comment-13121690 ]

Hadoop QA commented on HDFS-1973:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12497942/HDFS-1973-HDFS-1623.patch
against trunk revision.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1341//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1341//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1341//console

This message is automatically generated.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121662#comment-13121662 ]

Todd Lipcon commented on HDFS-1973:
-----------------------------------

One small thing I didn't notice before (sorry):

- In the test cases, you have new methods verifyFileContents and
  writeContentsToFile. These are very similar to AppendTestUtil.write and
  AppendTestUtil.check. Can you use those instead? If not, you should fix the
  call to {{in.read}} to use {{IOUtils.readFully}}, since {{read}} isn't
  guaranteed to read the whole length.
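The point about {{in.read}} can be sketched in isolation. This is a hypothetical, self-contained illustration (not the Hadoop {{IOUtils}} source): a single `InputStream.read(buf, off, len)` call is allowed to return fewer than `len` bytes, so a verification helper must loop until the buffer is full, which is what a readFully-style helper does.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullySketch {

    // A single read() call may return fewer than `len` bytes. Loop until the
    // requested length has been read, or fail loudly on premature EOF --
    // the same contract as Hadoop's IOUtils.readFully.
    static void readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        while (len > 0) {
            int n = in.read(buf, off, len);
            if (n < 0) {
                throw new EOFException("Premature EOF: " + len + " bytes left");
            }
            off += n;
            len -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "expected file contents".getBytes("UTF-8");
        byte[] buf = new byte[data.length];
        readFully(new ByteArrayInputStream(data), buf, 0, buf.length);
        if (!new String(buf, "UTF-8").equals("expected file contents")) {
            throw new AssertionError("readFully did not fill the buffer");
        }
    }
}
```

A test that compares file contents with a bare `in.read(buf)` can pass or fail nondeterministically depending on how the stream chunks its data; the loop removes that flakiness.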
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119607#comment-13119607 ]

Aaron T. Myers commented on HDFS-1973:
--------------------------------------

Thanks a lot for the review, Todd. Doing what you describe will require
Common changes. I've filed HADOOP-7717 to address these. Once that's
committed, I'll update the patch here to remove the synchronization from
{{ConfiguredFailoverProxyProvider}}.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119567#comment-13119567 ]

Todd Lipcon commented on HDFS-1973:
-----------------------------------

Aaron and I just chatted about this a bit. Here's a summary of what we discussed:

- The if condition in {{performFailover}} was somewhat confusing to me as to
  its purpose. Aaron explained that its purpose is to avoid the case where
  multiple outstanding RPC calls fail and then all call performFailover at the
  same time. If there were an even number of such calls, and you didn't check
  for "already failed over", you'd have a case where you fail over twice and
  end up back at the original proxy object.
- We decided that, rather than trying to handle this situation in the failover
  provider itself, it would be better to do so at the caller. Otherwise, every
  failover provider implementation would have to deal with the same concern.
  So, Aaron is going to update the patch to include a safeguard at the call
  site of {{performFailover}} which checks, before calling performFailover,
  that another thread hasn't already failed over to a new proxy object.
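The call-site safeguard described above can be sketched as follows. This is a minimal hypothetical model (the class and method names are illustrative, not the committed HDFS code): each retrying caller remembers which proxy it saw fail, and failover only happens if that proxy is still the current one, so an even number of concurrent failures cannot bounce the client back to the original namenode.

```java
import java.util.List;

// Hypothetical sketch of the "check before failing over" safeguard: the
// provider rotates to the next proxy only if the caller's failed proxy is
// still the current one. A second thread that saw the same proxy fail will
// find the failover already done and simply retry against the new proxy.
public class FailoverGuardSketch {
    private final List<String> proxies;   // stand-ins for NN RPC proxies
    private int currentIndex = 0;

    public FailoverGuardSketch(List<String> proxies) {
        this.proxies = proxies;
    }

    public synchronized String currentProxy() {
        return proxies.get(currentIndex);
    }

    // Fail over only if `failedProxy` is still current; either way, return
    // the proxy the caller should use for its retry.
    public synchronized String failoverIfStillCurrent(String failedProxy) {
        if (currentProxy() == failedProxy) {
            currentIndex = (currentIndex + 1) % proxies.size();
        }
        return currentProxy();
    }
}
```

Without the identity check, two threads that both observed a failure of the first proxy would each rotate once, landing the client back on the proxy that just failed.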
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118553#comment-13118553 ]

Hadoop QA commented on HDFS-1973:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12497230/HDFS-1973-HDFS-1623.patch
against trunk revision.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1324//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1324//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1324//console

This message is automatically generated.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118478#comment-13118478 ]

Todd Lipcon commented on HDFS-1973:
-----------------------------------

bq. Sure, but the whole of DFSClient is annotated @InterfaceAudience.Private. You still think we should keep it around?

Yes, because unfortunately the Private annotation was added after the 0.20
release. If it's problematic to keep around, we don't have to, but it seemed
easy enough to maintain for now.

{code}
+  @Override
+  public synchronized void performFailover(Object currentProxy) {
+    if (proxies.get(currentProxyIndex) != currentProxy) {
+      currentProxyIndex = (currentProxyIndex + 1) % proxies.size();
+    }
+  }
{code}

This code confuses me -- isn't {{currentProxy}} a proxy object, whereas
{{proxies.get(...)}} is an {{AddressRpcProxyPair}}? Which is to say, they're
always unequal?
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115101#comment-13115101 ]

Hadoop QA commented on HDFS-1973:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12496201/HDFS-1973-HDFS-1623.patch
against trunk revision.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
      org.apache.hadoop.hdfs.TestDfsOverAvroRpc
      org.apache.hadoop.hdfs.server.blockmanagement.TestHost2NodesMap
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1291//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1291//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1291//console

This message is automatically generated.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115036#comment-13115036 ]

Aaron T. Myers commented on HDFS-1973:
--------------------------------------

bq. A general note: this failover proxy provider doesn't get passed enough info to choose different configs based on the "logical URI". For example, if I have two clusters, foo and bar, I'd expect to be able to configure dfs.ha.namenodes.foo separately from dfs.ha.namenodes.bar after setting both of dfs.client.failover.proxy.provider.foo|bar.

I've filed HDFS-2367 to address this issue.
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113045#comment-13113045 ]

Aaron T. Myers commented on HDFS-1973:
--------------------------------------

Thanks a lot for the review, Todd. I'll attach an updated patch addressing most of your comments in just a moment. Comments below.

bq. not sure if removing the DFSClient(Configuration conf) constructor is a good idea - it was deprecated at 0.21 but since we haven't had a well-adopted release since then, seems we should keep it around. (it's trivial to keep, right?)

Sure, but the whole of {{DFSClient}} is annotated {{@InterfaceAudience.Private}}. You still think we should keep it around?

bq. some log messages left in: LOG.info("address of nn1: " + conf.get("dfs.atm.nn1"))

Whoops! Fixed. Any more that you noticed? Or just that one? :)

bq. some of the lines seem to have a lot of columns - try to wrap at 80?

Like which? I wasn't super strict about wrapping at 80, but I'm pretty sure none go over 90.

bq. maybe rename this parameter to nameNodeUri?

Done.

bq. Shouldn't that have a "." at the end? It seems authorities are appended without a "." in the middle.

Yep, great catch. Fixed.

bq. why? It seems we should at least be logging a WARN if this is configured to point to a class that can't be loaded, if not re-throwing as IOE.

Good point. I've changed it to throw an {{IOE}} with a descriptive error message.

bq. I'm not sure that o.a.h.h.protocol is the right package for ConfiguredFailoverProxyProvider - maybe it belongs in the HA package?

Good point. I've moved it to o.a.h.server.namenode.ha.

bq. missing license headers on many of the new files

Fixed. It's just two new files.

bq. these will throw NPE in the case of misconfiguration. Should at least throw an exception indicating what the mistaken config is.

Fixed.

bq. Also, it seems more sensible to have a single config like "dfs.ha.namenode.addresses" with comma-separated URIs, since there's nothing explicitly tied to having exactly 2 NNs here, right?

Good point. Fixed.

bq. A general note: this failover proxy provider doesn't get passed enough info to choose different configs based on the "logical URI". For example, if I have two clusters, foo and bar, I'd expect to be able to configure dfs.ha.namenodes.foo separately from dfs.ha.namenodes.bar after setting both of dfs.client.failover.proxy.provider.foo|bar.

Good point. Mind if we address this in a follow-up JIRA? It seems like the solution would be to replace "dfs.ha.namenode.addresses" with something like "dfs.ha.namenode.addresses.", so as to be able to support configuring the addresses of multiple HA clusters.

bq. Can we separate the addition of Idempotent annotations to a different patch?

Certainly. I've removed all of them from this patch except for {{getBlockLocations}}, which is necessary for the test to work.

bq. file a followup JIRA or use "TODO" so we don't forget about this case?

Changed to:

{noformat}
// TODO(HA): This test should probably be made to fail if a client fails over
// to talk to an NN with a different block pool id. Once failover between
// active/standby in a single block pool is implemented, this test should be
// changed to exercise that.
{noformat}
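The two config points from the exchange above, using one comma-separated address key instead of "first"/"second" keys, and failing with a descriptive error instead of an NPE on misconfiguration, can be sketched together. This is a hypothetical illustration: the key name and error wording are assumptions for the sketch, not the committed HDFS configuration surface.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: parse a single comma-separated list of namenode URIs,
// so the provider is not tied to exactly two NNs, and report missing or
// malformed configuration with a descriptive message rather than an NPE.
public class NamenodeAddressConfigSketch {
    static final String ADDRESSES_KEY = "dfs.ha.namenode.addresses";

    static List<URI> parseAddresses(String configuredValue) {
        if (configuredValue == null || configuredValue.trim().isEmpty()) {
            // conf.get() returns null for an absent key; say which key it was.
            throw new IllegalArgumentException(
                "Missing required config: " + ADDRESSES_KEY);
        }
        List<URI> uris = new ArrayList<URI>();
        for (String s : configuredValue.split(",")) {
            URI uri = URI.create(s.trim());
            if (uri.getAuthority() == null) {
                throw new IllegalArgumentException(
                    "Namenode URI has no authority (host:port): " + s);
            }
            uris.add(uri);
        }
        return uris;
    }
}
```

With this shape, adding a third namenode is purely a configuration change, and a typo in the key name produces an error that names the key instead of an opaque NullPointerException.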
[jira] [Commented] (HDFS-1973) HA: HDFS clients must handle namenode failover and switch over to the new active namenode.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109200#comment-13109200 ]

Todd Lipcon commented on HDFS-1973:
-----------------------------------

- not sure if removing the DFSClient(Configuration conf) constructor is a good idea - it was deprecated at 0.21 but since we haven't had a well-adopted release since then, seems we should keep it around. (it's trivial to keep, right?)
- some log messages left in: LOG.info("address of nn1: " + conf.get("dfs.atm.nn1"))
- some of the lines seem to have a lot of columns - try to wrap at 80?

{code}
-  DFSClient(InetSocketAddress nameNodeAddr, ClientProtocol rpcNamenode,
+  DFSClient(URI nameNodeAddr, ClientProtocol rpcNamenode,
{code}

maybe rename this parameter to {{nameNodeUri}}?

{code}
+  public static final String DFS_CLIENT_FAILOVER_PROXY_PROVIDER_KEY_PREFIX = "dfs.client.failover.proxy.provider";
{code}

Shouldn't that have a "." at the end? It seems authorities are appended without a "." in the middle.

{code}
+    } catch (RuntimeException e) {
+      if (e.getCause() instanceof ClassNotFoundException) {
+        return null;
{code}

why? It seems we should at least be logging a WARN if this is configured to point to a class that can't be loaded, if not re-throwing as IOE.

- I'm not sure that o.a.h.h.protocol is the right package for ConfiguredFailoverProxyProvider - maybe it belongs in the HA package?
- missing license headers on many of the new files

{code}
+    InetSocketAddress first = NameNode.getAddress(
+        new URI(conf.get(CONFIGURED_NAMENODE_ADDRESS_FIRST)).getAuthority());
+    InetSocketAddress second = NameNode.getAddress(
+        new URI(conf.get(CONFIGURED_NAMENODE_ADDRESS_SECOND)).getAuthority());
{code}

these will throw NPE in the case of misconfiguration. Should at least throw an exception indicating what the mistaken config is. Also, it seems more sensible to have a single config like "dfs.ha.namenode.addresses" with comma-separated URIs, since there's nothing explicitly tied to having exactly 2 NNs here, right?

A general note: this failover proxy provider doesn't get passed enough info to choose different configs based on the "logical URI". For example, if I have two clusters, {{foo}} and {{bar}}, I'd expect to be able to configure {{dfs.ha.namenodes.foo}} separately from {{dfs.ha.namenodes.bar}} after setting both of {{dfs.client.failover.proxy.provider.foo|bar}}.

- Can we separate the addition of Idempotent annotations to a different patch?

{code}
+    // This test should probably be made to fail if a client fails over to talk to
+    // an NN with a different block pool id.
{code}

file a followup JIRA or use "TODO" so we don't forget about this case?
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092073#comment-13092073 ] Aaron T. Myers commented on HDFS-1973: -- @Eli, sure I'll file a separate JIRA. It'd certainly be worth enumerating all of the places where HTTP fail-over is an issue. The example you provided is an interesting one. It seems you're assuming that an HA setup would have three nodes - active, standby, and 2NN, with the 2NN failing over to do checkpointing against the standby after a failure of the active. The design document in HDFS-1623 doesn't really address checkpointing. I've heard from Suresh and Todd informally that the intention is probably to make the standby node also capable of performing checkpointing. I'll file a separate JIRA to address this as well.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091525#comment-13091525 ] Eli Collins commented on HDFS-1973: --- @atm - mind filing a separate jira for HTTP fail-over? In some cases we may be able to define away the problem. E.g., now that HDFS-1073 is in, we could modify the 2NN so it doesn't have to fetch files via HTTP (at least when using shared storage).
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089807#comment-13089807 ] Aaron T. Myers commented on HDFS-1973: -- bq. Alternatively client gets the address of both the namenodes. Tries them one at a time until it gets connected to the new active. This is what I was trying to communicate with "Configuration-based client failover. Clients are configured with a set of NN addresses to try until an operation succeeds." I think we're in agreement on this point, just talking past each other a little bit. :) bq. Proxy based client failover is an implementation detail. It still needs to figure out the new active based on one of the schemes above. The proxy process would need to figure out the address of the new active, but clients wouldn't - the clients would just have the address of the proxy. The only thing for the client to do, then, would be to retry the RPC to the same address (the address of the proxy). bq. +1 for logical URI. We could consider merging this requirement with HDFS-2231 to do this. Good point. I'll comment there. bq. Logical URI is needed for identifying a nameservice and not a cluster, since federation supports multiple namenodes within a cluster. Good point. In the above design document: s/cluster/nameservice/g. bq. Why should failover method be based on URI cluster part? Can it be a single mechanism across all the nameservices? Hence change the parameter to dfs.client.ha.failover.method? Imagine that one writes a program which uses absolute URIs to connect to two distinct clusters, one of which is HA-enabled using ZK to resolve the address, and the other is not. In this case we should use some ZK-based {{FailoverProxyProvider}} for the first, and just the normal RPC connection for the second. Thus, the configuration should be per-nameservice.
I suppose we could do something like introduce {{dfs.client.ha.failover.method.}}, but that seems more annoying to configure to me. bq. The scheme you have defined works only for RPC protocols. How about HTTP? Yes, that's certainly true. My thinking there was that, since it's generally less critical for the NN web interfaces to fail over immediately, and since we don't generally control the HTTP clients which access the NN web interface, this could be out of scope for this JIRA. To facilitate it, the operator could either run a standard HTTP proxy, use round-robin DNS, or even change DNS resolution of the NN and wait for clients to get the updated address. bq. I am not sure why logical URI is required for VIP/failover based setup. The value of the logical URI could be the same as the actual URI of the proxy. It would then only be used to configure an appropriate {{FailoverProxyProvider}} which would retry failed RPCs to the same address.
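The per-nameservice selection Aaron argues for — the provider is found by appending the logical URI's authority to a key prefix — can be sketched as below. The `Map` stands in for Hadoop's `Configuration`, and the provider class names are invented placeholders, not real Hadoop classes:

```java
import java.util.HashMap;
import java.util.Map;

public class ProviderSelection {
    // Key prefix per the patch under review; the authority is appended to it.
    static final String PREFIX = "dfs.client.failover.proxy.provider.";

    /** Returns the configured provider class name for an authority, or null. */
    static String providerFor(Map<String, String> conf, String authority) {
        return conf.get(PREFIX + authority);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Two clusters, each with its own failover mechanism (names hypothetical).
        conf.put(PREFIX + "foo", "example.ZKFailoverProxyProvider");
        conf.put(PREFIX + "bar", "example.ConfiguredFailoverProxyProvider");

        System.out.println(providerFor(conf, "foo")); // ZK-based for cluster foo
        System.out.println(providerFor(conf, "bar")); // config-based for cluster bar
        System.out.println(providerFor(conf, "baz")); // null -> plain non-HA RPC proxy
    }
}
```

A null result falls back to a normal connected RPC client, so non-HA clusters need no extra configuration.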
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087225#comment-13087225 ] Suresh Srinivas commented on HDFS-1973: --- Sorry for the late comment. I had been traveling. Before {{Cases to support}}, could we add a section like this: >> On failover, clients need the address of the new active. This could be done by: # Contacting ZooKeeper to get the current active NN. # Alternatively client gets the address of both the namenodes. Tries them one at a time until it gets connected to the new active. # For setups using IP failover, clients always use the same VIP/failover address, which moves to the active. >> Given this, I am not sure about the {{Cases to support}}: Proxy based client failover is an implementation detail. It still needs to figure out the new active based on one of the schemes above. I am not very clear on the configuration-based support. Do you mean here that the client config will be changed to point to the new active? DNS SRV records are also unnecessary given our config would have both the namenode addresses. +1 for logical URI. We could consider merging this requirement with HDFS-2231 to do this. Logical URI is needed for identifying a nameservice and not a cluster, since federation supports multiple namenodes within a cluster. We could use the concept of nameservice, introduced in federation, for that? So the URI would be nameservice1.foo.com; nameservice1 maps to nn1, nn2. As regards viewfs, I think this scheme will work: the viewfs mount tables will point to the logical URI, which in turn will use the mechanism you are proposing. Why should failover method be based on URI cluster part? Can it be a single mechanism across all the nameservices? Hence change the parameter to dfs.client.ha.failover.method? These are my early thoughts. Some questions I am left with are: # The scheme you have defined works only for RPC protocols. How about HTTP?
# I am not sure why logical URI is required for VIP/failover based setup. We could continue to add more details.
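Scheme 2 in the list above — the client holds both NN addresses and tries each in turn until one accepts the call — reduces to a small loop. In this sketch the `Function` is an illustrative stand-in for a real ClientProtocol RPC; none of the names are actual Hadoop API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TryEachNamenode {
    static <T> T invokeOnActive(List<String> addrs, Function<String, T> rpc) {
        RuntimeException last = null;
        for (String addr : addrs) {
            try {
                return rpc.apply(addr); // only the active NN serves the call
            } catch (RuntimeException e) {
                last = e; // standby/unreachable: remember and try the next address
            }
        }
        throw new IllegalStateException("no active namenode among " + addrs, last);
    }

    public static void main(String[] args) {
        List<String> nns = Arrays.asList("nn1:8020", "nn2:8020");
        // Simulate nn1 being down (or standby) and nn2 active.
        String served = invokeOnActive(nns, addr -> {
            if (addr.startsWith("nn1")) throw new RuntimeException("standby");
            return "served-by-" + addr;
        });
        System.out.println(served); // served-by-nn2:8020
    }
}
```

This relies on the fencing assumption discussed later in the thread: a standby either refuses the call or is unreachable, so trying addresses in order cannot return stale results.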
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086524#comment-13086524 ] Aaron T. Myers commented on HDFS-1973: -- Any feedback on the above proposal? Sanjay/Suresh - I'd particularly like your feedback since you guys have a lot of viewfs expertise, and some of this is similar/related.
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079584#comment-13079584 ] Aaron T. Myers commented on HDFS-1973: -- h3. Client Failover overview On failover between active and standby NNs, it's necessary for clients to be redirected to the new active NN. The goal of HDFS-1623 is to provide a framework for HDFS HA which can support multiple underlying mechanisms. As such, the client failover approach should support multiple options. h3. Cases to support # Proxy-based client failover. Clients always communicate with an in-band proxy service which forwards all RPCs on to the correct NN. On failure, a process causes this proxy to begin sending requests to the now-active NN. # Virtual IP-based client failover. Clients always connect to a hostname which resolves to a particular IP address. On failure of the active NN, a process is initiated to move that IP address so that packets intended for it are received by the now-active NN's NIC. (From a client's perspective, this case is equivalent to case #1.) # ZooKeeper-based client failover. The URI to contact the active NN is stored in ZooKeeper or some other highly-available service. Clients look up which NN to talk to by communicating with ZK to discern the currently active NN. On failure, some process causes the address stored in ZK to be changed to point to the now-active NN. # Configuration-based client failover. Clients are configured with a set of NN addresses to try until an operation succeeds. This configuration might exist in client-side configuration files, or perhaps in DNS via a SRV record that lists the NNs with different priorities. h3. Assumptions This proposal assumes that NN fencing works, and that after a failover any standby NN is either unreachable or will throw a {{StandbyException}} on any RPC from a client. That is, a client cannot receive incorrect results if it chooses to contact the wrong NN.
This proposal also presumes that there is no direct coordination required between any central failover coordinator and clients, i.e. there's an intermediate name resolution system of some sort (ZK, DNS, local configuration, etc.) h3. Proposal The commit of HADOOP-7380 already introduced a facility whereby an IPC {{RetryInvocationHandler}} can utilize a {{FailoverProxyProvider}} implementation to perform the appropriate client-side action in the event of failover. At the moment, the only implementation of a {{FailoverProxyProvider}} is the {{DefaultFailoverProxyProvider}}, which does nothing in the case of failover. HADOOP-7380 also added an {{@Idempotent}} annotation which can be used to identify which methods can be safely retried during a failover event. What remains, then, is: # To implement {{FailoverProxyProviders}} which can support the cases outlined above (and perhaps others). # To provide a mechanism to select which {{FailoverProxyProvider}} implementation to use for a given HDFS URI. # To annotate the appropriate HDFS {{ClientProtocol}} interface methods with the {{@Idempotent}} tag. h4. {{FailoverProxyProvider}} implementations Cases 1 and 2 above can be achieved by implementing a single {{FailoverProxyProvider}} which simply attempts to reconnect to the previous hostname/IP address on failover. Cases 3 and 4 can be implemented as distinct custom {{FailoverProxyProviders}}. h4. A mechanism to select the appropriate {{FailoverProxyProvider}} implementation I propose we add a mechanism to configure a mapping from URI authority -> {{FailoverProxyProvider}} implementation. Absolute URIs which previously specified the NN host name will instead contain a logical cluster name (which might be chosen to be identical to one of the NN's host names) which will be used by the chosen {{FailoverProxyProvider}} to determine the appropriate host to connect to.
Introducing the concept of a cluster name will be a useful abstraction in general: if, for example, someone develops a fully-distributed NN in the future, the cluster name will still apply. On instantiation of a {{DFSClient}} (or other user of an HDFS URI, e.g. HFTP), the mapping would be checked to see if there's an entry for the given URI authority. If there is not, then a normal RPC client with a connected socket to the given authority will be created, as is done today with a {{DefaultProxyProvider}}. If there is an entry, then the authority will be treated as a logical cluster name, a {{FailoverProxyProvider}} of the correct type will be instantiated (via a factory class), and an RPC client will be created which utilizes this {{FailoverProxyProvider}}. The various {{FailoverProxyProvider}} implementations are responsible for their own configuration. As a straw man example, consider the following configuration: {code} fs.defaultFS cluster1.foo.com dfs.ha.client.failover-
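The shape of the proposal — a retry handler that asks a {{FailoverProxyProvider}} for the current proxy and calls performFailover() before retrying an idempotent call — can be illustrated with a self-contained sketch. This is reduced far below the real HADOOP-7380 interfaces, whose signatures differ; all names here are stand-ins:

```java
import java.util.Arrays;
import java.util.List;

public class FailoverSketch {
    // Stand-in for one idempotent ClientProtocol method.
    interface Namenode { String getFileInfo(String path); }

    // Stand-in for Hadoop's FailoverProxyProvider.
    interface FailoverProxyProvider { Namenode getProxy(); void performFailover(); }

    /** Rotates through a fixed list of proxies, as a configured provider might. */
    static class ConfiguredProvider implements FailoverProxyProvider {
        private final List<Namenode> proxies;
        private int current = 0;
        ConfiguredProvider(List<Namenode> proxies) { this.proxies = proxies; }
        public Namenode getProxy() { return proxies.get(current); }
        public void performFailover() { current = (current + 1) % proxies.size(); }
    }

    /** Retries an idempotent call once after failing over, like RetryInvocationHandler. */
    static String getFileInfoWithFailover(FailoverProxyProvider p, String path) {
        try {
            return p.getProxy().getFileInfo(path);
        } catch (RuntimeException standby) {
            p.performFailover();          // switch to the other configured NN
            return p.getProxy().getFileInfo(path);
        }
    }

    public static void main(String[] args) {
        Namenode standby = path -> { throw new RuntimeException("StandbyException"); };
        Namenode active = path -> "info:" + path;
        FailoverProxyProvider provider =
            new ConfiguredProvider(Arrays.asList(standby, active));
        System.out.println(getFileInfoWithFailover(provider, "/user/foo"));
        // info:/user/foo
    }
}
```

Cases 1 and 2 would use a provider whose performFailover() is a no-op (reconnect to the same address); cases 3 and 4 swap in ZK-lookup or configured-list providers behind the same interface.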
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046987#comment-13046987 ] Hari A V commented on HDFS-1973: Hi Aaron, Thanks for the answer. I will watch these issues to get more information :-) -Hari
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046767#comment-13046767 ] Aaron T. Myers commented on HDFS-1973: -- Hi Hari, bq. Can you please elaborate a little bit on your area of interest with ZOOKEEPER-1080? As noted in Sanjay's design doc, one proposal for detecting NN failure would be to use an external ZK service. The HDFS proposal doesn't go into great detail on this, but it suggests using ZK with a heartbeat mechanism to see if the NN is still alive. I personally like the ZK recipe better (i.e. using ephemeral + sequence nodes). Another possible use for ZK in the implementation of NN HA would be to use ZK as the source of truth for clients to determine the active NN. This would seem to flow naturally from the part of the ZK recipe which says "Applications may consider creating a separate znode to acknowledge that the leader has executed the leader procedure." If NN HA were to utilize an implementation of the ZK leader election recipe, then perhaps this "leader-procedure-complete znode" could store the IP or hostname of the active NN which clients could use. I haven't read the design doc posted on ZOOKEEPER-1080 yet. I'll go ahead and do that and post my comments there. I should also mention that we have not settled upon what strategy we'll take to do NN failure detection or client failover. As noted in Sanjay's design doc, we're also strongly considering using virtual IPs for client failover.
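The "leader-procedure-complete znode" idea above amounts to one level of indirection: clients never hard-code the active NN's address but re-read it from a coordination service on failure. In this sketch a plain `Map` simulates a znode read; real code would use the ZooKeeper client API instead, and the path is an invented example:

```java
import java.util.HashMap;
import java.util.Map;

public class ZkLookupSketch {
    // Hypothetical znode written by the newly-elected active NN.
    static final String ACTIVE_NN_PATH = "/hdfs/ha/active-nn";

    static String resolveActive(Map<String, String> zk) {
        String addr = zk.get(ACTIVE_NN_PATH);
        if (addr == null) {
            throw new IllegalStateException("no active NN registered at " + ACTIVE_NN_PATH);
        }
        return addr;
    }

    public static void main(String[] args) {
        Map<String, String> zk = new HashMap<>();
        zk.put(ACTIVE_NN_PATH, "nn1.foo.com:8020");
        System.out.println(resolveActive(zk)); // nn1.foo.com:8020

        // After failover, the new active rewrites the znode as the last step of
        // the leader procedure; clients simply re-resolve and reconnect.
        zk.put(ACTIVE_NN_PATH, "nn2.foo.com:8020");
        System.out.println(resolveActive(zk)); // nn2.foo.com:8020
    }
}
```

The same indirection is what a ZK-based {{FailoverProxyProvider}} would perform inside performFailover().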
[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046352#comment-13046352 ] Hari A V commented on HDFS-1973: Hi Aaron, Can you please elaborate a little bit on your area of interest with ZOOKEEPER-1080? If it is for NN HA, I would be very happy to get your comments on the design which I have suggested, so that I can consider them for the patch. I have referred to the NameNode+HA_v2_1.pdf uploaded at HDFS-1623 while preparing the design doc (that helped me a lot for my design doc). The use cases are almost matching. I am currently in the process of making the patch ready for submitting. - Hari