[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183724#comment-14183724 ]

Allen Wittenauer commented on HDFS-7231:

bq. (unable to continue)

This is the part I've been trying to get more information on as well. :D From what I've been told, the first few directories were converted to the new format but not all of them. This seems to imply that the DN process was brought down at some point in time.

bq. This as you know is a recipe for disaster

I can neither confirm nor deny this statement. ;)

bq. Before you go on to 2.4.1, if you do finalize of upgrade what happens?

It appears to work as expected. So this whole condition appears to trigger with non-finalized systems.

> rollingupgrade needs some guard rails
> -------------------------------------
>
>                 Key: HDFS-7231
>                 URL: https://issues.apache.org/jira/browse/HDFS-7231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Allen Wittenauer
>            Priority: Blocker
>
> See first comment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183013#comment-14183013 ]

Suresh Srinivas commented on HDFS-7231:
---

[~aw], can you please respond to the comments?
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180312#comment-14180312 ]

Suresh Srinivas commented on HDFS-7231:
---

Allen, I just rewrote the steps with additional details to clarify:
# Upgrade the 2.0.5 cluster to 2.2.
# Do not -finalizeUpgrade.
# Install 2.4.1 binaries on the cluster machines. Start the datanodes on 2.4.1.
# Start the namenode with the -upgrade option.
# Namenode start fails because the 2.0.5 to 2.2 upgrade is still in progress.
# Leave the 2.4.1 DNs running.
# Install the 2.2 binaries on the NN.
# Start the NN on 2.2 with no upgrade-related options.

So far things are clear. Then you go on to say the following:

bq. DNs now do a partial roll-forward, rendering them unable to continue

What do you mean by this?

bq. admins manually repair version files on those broken directories

This, as you know, is a recipe for disaster.

Let me ask you a question. Before you go on to 2.4.1, if you do finalize of upgrade, what happens?
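The sequence above can be read as a small state model: an upgrade leaves the cluster in a pending state until it is finalized, and a second upgrade is refused while that state persists (step 5). The following is a toy sketch of that reading only, not HDFS code; the class `UpgradeState` and its methods are hypothetical names introduced for illustration.

```python
class UpgradeState:
    """Toy model of the non-rolling upgrade lifecycle (hypothetical, not HDFS code)."""

    def __init__(self, version):
        self.version = version
        self.previous = None  # version kept for rollback until finalized

    def upgrade(self, new_version):
        # Mirrors step 5: a new upgrade is refused while an earlier one is pending.
        if self.previous is not None:
            raise RuntimeError(
                f"upgrade {self.previous} -> {self.version} is not finalized"
            )
        self.previous = self.version
        self.version = new_version

    def finalize(self):
        # Discards the rollback state; afterwards a new upgrade may start.
        self.previous = None


cluster = UpgradeState("2.0.5")
cluster.upgrade("2.2")            # steps 1-2: upgrade, do not finalize
try:
    cluster.upgrade("2.4.1")      # steps 3-5: refused while 2.0.5 -> 2.2 is pending
except RuntimeError as err:
    print("refused:", err)
cluster.finalize()
cluster.upgrade("2.4.1")          # accepted once the earlier upgrade is finalized
```

The last two calls correspond to the finalize-first path asked about at the end of the comment. The model says nothing about what the 2.4.1 DNs do while the NN is back on 2.2, which is the part of the scenario that remains unexplained in this thread.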
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169634#comment-14169634 ]

Allen Wittenauer commented on HDFS-7231:

Verified the same crappy experience exists in the 2.6 branch. Marking this as a blocker since this will be the last release for everyone's precious JDK 1.6 support.

I'd love to hear some options from the peanut gallery on how to improve this so users aren't left with a potential time bomb on their hands. Alias -upgrade to -rollingupgrade? Bring nn -finalize back? Auto-finalize?
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167620#comment-14167620 ]

Allen Wittenauer commented on HDFS-7231:

Argh, another point: with namenode -finalize being taken away, this scenario is pretty much unsolvable.
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167545#comment-14167545 ]

Allen Wittenauer commented on HDFS-7231:

One other thing about rolling while there is a finalize dir there: it's worth pointing out that -rollback becomes a "please shoot me now" command. At least, I don't want to think about the consequences.
[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails
[ https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167535#comment-14167535 ]

Allen Wittenauer commented on HDFS-7231:

We had an admin perform an upgrade that went south due to rolling upgrade interfering with the previous method of upgrading. The series of events as given to me (I was out of town, so didn't witness it firsthand) was:

# Build a 2.0.5/2.1.0/etc cluster where rolling upgrade is not an option.
# Upgrade it to 2.2
# Do not \-finalizeUpgrade
# Upgrade binaries to 2.4.1
# Run namenode \-upgrade
# Watch it fail.
# Leave 2.4.1 DNs running
# Downgrade binaries on NN to 2.2
# Start NN
# DNs now do a partial roll-forward, rendering them unable to continue
# Admins manually repair version files on those broken directories
...

There were clearly a few mistakes made in the above procedure, most of which were driven by a belief that the NN *had* to be out of safemode to do a finalize. So they attempted to do that, which of course led to other things going wrong.

I'm not sure what triggered the DNs to basically render their VERSION files broken. I haven't been able to duplicate it, but I've only tried on a much smaller scale, so that might be related. I also suspect there was an attempt to roll back the binaries on the DNs, and I haven't tried that yet either.

My own testing of this scenario has given me a few insights.

* DNs should not start rolling while there is a directory there to finalize. If you do what appears (to me, at least) to be the proper thing to get the system back up (bring down the namenode, namenode \-finalize, bring up namenode), hdfs dfsadmin \-finalizeUpgrade afterward doesn't appear to send the message to the DNs to clean up their space, requiring manual intervention.
* Surprise! DNs exit if the 'proper' NN is brought up with \-upgrade. Doing the 2.2 NN \-finalize and then bringing the 2.4.1 NN up with \-upgrade results in the 2.4.1 DNs all coming down. This was a bit of a surprise given they were perfectly happy staying up with a broken 2.2 NN in \-upgrade mode before.

I'm sure there are other things here, but these are the two big ones that stuck out. I'm doing some other manual testing using the above procedures with a few other changes to see what else sticks out.
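One way to picture the first guard rail suggested above (do not start rolling while there is still something to finalize) is a pre-flight check over the DN storage directories. The sketch below is illustrative only and is not taken from HDFS: it assumes an unfinalized upgrade is visible on disk as a `previous` directory, either directly under a storage directory or under `current/BP-*/`, and the function names `unfinalized_dirs` and `may_start_rolling_upgrade` are hypothetical.

```python
from pathlib import Path


def unfinalized_dirs(storage_dirs):
    """Return the storage directories that still hold a 'previous' snapshot.

    Assumption (not confirmed by this thread): a pending finalize shows up as
    <dir>/previous or <dir>/current/BP-*/previous.
    """
    pending = []
    for raw in storage_dirs:
        d = Path(raw)
        candidates = [d / "previous"]
        current = d / "current"
        if current.is_dir():
            # Block-pool level snapshots, one per BP-* directory.
            candidates.extend(current.glob("BP-*/previous"))
        if any(c.is_dir() for c in candidates):
            pending.append(raw)
    return pending


def may_start_rolling_upgrade(storage_dirs):
    """Hypothetical guard: allow rolling only when nothing is left to finalize."""
    return not unfinalized_dirs(storage_dirs)
```

A caller would pass the directories configured for the DN and refuse to proceed, with a message naming the offending directories, when `may_start_rolling_upgrade` returns `False`. Whether such a check belongs in the DN, the NN, or the admin tooling is exactly the open question in this issue.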