[ https://issues.apache.org/jira/browse/HDFS-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864039#comment-13864039 ]
Todd Lipcon commented on HDFS-5138: ----------------------------------- General: - thanks for the description in the above JIRA comment. Can you transfer this comment somewhere into the docs, perhaps hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm or a new page? Perhaps with a slightly more user-facing angle. - what would happen if the admin called finalizeUpgrade() when neither node had yet transitioned to active? I don't see any sanity check here.. is it possible you'd end up leaving the shared in an orphaned "upgrading" state and never end up finalizing it? Similarly, what happens if you start one NN with -upgrade, and you start the other one without -upgrade. It seems to me it should check for the upgrade lock file in the shared dir and say "looks like an upgrade is in progress, please start the SBN with -upgrade". - there are a few TODOs in the code that probably need to be addressed - nothing big, just a few things you may have missed. JournalManager.java: - would be good to add Javadoc on the new methods, so that JM implementors know what the upgrade process looks like. i.e what is pre-upgrade, etc? QuorumJournalManager.java: - "Could not perform upgrade or more JournalNodes" error message has some missing words in it. + throw new IOException("Failed to lock shared log."); - this line should be unreachable, right? maybe an AssertionError("Unreachable code") would make more sense? Also this same exception message is used down below in canRollBack which isn't quite right. Journal: - when you upgrade the journal, I'd think you'd to copy over all the data from the PersistentLongFiles into the new dir? FileJournalManager: - worth considering a protobuf for the shared log lock, in case we want to add other fields to it later (instead of the custom serialization you do now) - need try...finally around the code where you write the shared log lock. On the read side you're also forgetting to close it. - the creation of the shared log lock file is non-atomic... I'm worried that we may hit the race in practice, since the AtomicFileOutputStream implies an fsync, which means that between the exists() check and the rename to the lock file, you may really have a decently long time window. Maybe we can use locking code like Storage does? Feel free to punt to a follow-up. FSNamesystem.java: - can you add a doc on doUpgradeOfSharedLogOnTransitionToActive()? NNUpgradeUtil.java: - why are some of the functions package-private and others are public? - make it an abstract class or give it a private constructor so it can't be instantiated, since it's just static methods - brief javadocs would be nice for these methods, even though they're straight refactors of existing code. FSEditLog.java: - in canRollBack(), you throw an exception if there is no shared log. That doesn't seem right... - capitalization of "RollBack" vs "Rollback" is a little inconsistent. Looks like Rollback is consistently used prior to this patch, so probably best to stick with that. FSImage.java: - in the switch statement on the startup option, I think you should keep the ROLLBACK case, but just have it throw AssertionError -- just to make sure we don't accidentally have some case where we're passing it there but shouldn't be.GA > Support HDFS upgrade in HA > -------------------------- > > Key: HDFS-5138 > URL: https://issues.apache.org/jira/browse/HDFS-5138 > Project: Hadoop HDFS > Issue Type: Bug > Affects Versions: 2.1.1-beta > Reporter: Kihwal Lee > Assignee: Aaron T. Myers > Priority: Blocker > Attachments: HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch > > > With HA enabled, NN wo't start with "-upgrade". Since there has been a layout > version change between 2.0.x and 2.1.x, starting NN in upgrade mode was > necessary when deploying 2.1.x to an existing 2.0.x cluster. But the only way > to get around this was to disable HA and upgrade. > The NN and the cluster cannot be flipped back to HA until the upgrade is > finalized. If HA is disabled only on NN for layout upgrade and HA is turned > back on without involving DNs, things will work, but finaliizeUpgrade won't > work (the NN is in HA and it cannot be in upgrade mode) and DN's upgrade > snapshots won't get removed. > We will need a different ways of doing layout upgrade and upgrade snapshot. > I am marking this as a 2.1.1-beta blocker based on feedback from others. If > there is a reasonable workaround that does not increase maintenance window > greatly, we can lower its priority from blocker to critical. -- This message was sent by Atlassian JIRA (v6.1.5#6160)