[ https://issues.apache.org/jira/browse/HDFS-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882096#comment-13882096 ]
Suresh Srinivas commented on HDFS-5138:
---------------------------------------

@Todd, I have had some conversations with [~atm] related to this jira. I had brought up one issue about potentially losing editlogs on a JournalNode, and I thought that would be addressed before this jira could be committed. I have been very busy and have not been able to provide all my comments. Reviewing this patch has been quite tricky. Here are my almost complete review comments. While some of the issues are minor nits, I do not think this patch and the documentation are ready.

I am adding information about the design, the way I understand it. Let me know if I got it wrong.

*Upgrade preparation:*
# New bits are installed on the cluster nodes.
# The cluster is brought down.

*Upgrade:* For an HA setup, choose one of the namenodes to initiate the upgrade on and start it with the -upgrade flag.
# NN performs preupgrade for all non-shared storage directories by moving current to previous.tmp and creating a new current.
#* Failure here is fine. NN startup fails, and on the next upgrade attempt the storage directories are recovered.
# NN performs preupgrade of shared edits (NFS/JournalNodes) over RPC. On each JournalNode, current is moved to previous.tmp and a new current is created.
#* If preupgrade fails on one of the JNs and the upgrade is reattempted, the editlog directory could be lost on that JN. Restarting the JN does not fix the issue.
# NN performs upgrade of non-shared edits by writing the new CTIME to current and moving previous.tmp to previous.
#* If upgrade fails on one of the JNs and the upgrade is reattempted, the editlog directory could be lost on that JN. Restarting the JN does not fix the issue.
# NN performs upgrade of shared edits (NFS/JournalNodes) over RPC. On each JournalNode, current gets the new CTIME and previous.tmp is moved to previous.
# We need to document that all the JournalNodes must be up. If a JN is irrecoverably lost, the configuration must be changed to exclude that JN.
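To make the failure window concrete, here is a minimal sketch of the preupgrade/upgrade directory moves described in the steps above. This is not the actual NNStorage/JNStorage code; the class and method names are hypothetical, and only the current / previous.tmp / previous transitions from the steps are modeled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; names are hypothetical, not Hadoop's real classes.
public class UpgradeSketch {

    // Preupgrade: current -> previous.tmp, then a fresh current is created.
    // If this fails partway, "current" may already have been renamed away,
    // which is why a blindly reattempted upgrade can lose the editlog
    // directory on a JournalNode unless the storage is recovered first.
    static void doPreUpgrade(Path storageRoot) throws IOException {
        Path current = storageRoot.resolve("current");
        Path previousTmp = storageRoot.resolve("previous.tmp");
        Files.move(current, previousTmp);   // current -> previous.tmp
        Files.createDirectory(current);     // new, empty current
    }

    // Upgrade: record the new cluster creation time (CTIME) in current,
    // then rename previous.tmp -> previous to mark the upgrade as complete.
    static void doUpgrade(Path storageRoot, long newCTime) throws IOException {
        Path current = storageRoot.resolve("current");
        Files.write(current.resolve("VERSION"),
                    ("cTime=" + newCTime + "\n").getBytes());
        Files.move(storageRoot.resolve("previous.tmp"),
                   storageRoot.resolve("previous"));
    }
}
```

The hazard flagged above is visible in the sketch: between the two `Files.move` calls there is a window where only previous.tmp holds the old edits, so a retry that re-runs `doPreUpgrade` without first recovering previous.tmp would clobber or orphan them.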
*Rollback:* NN is started with the -rollback flag.
# For all the non-shared directories, the NN checks canRollBack, essentially ensuring that a previous directory with the right layout version exists.
# For all the shared directories, the NN checks canRollBack, essentially ensuring that a previous directory with the right layout version exists.
# NN performs rollback for shared directories (moving previous to current).
#* If rollback fails on one of the JNs, the directories are left in an inconsistent state. I think any attempt at retrying rollback will fail and will require manually moving files around. I do not think restarting the JN fixes this.
# We need to document that all the JournalNodes must be up. If a JN is irrecoverably lost, the configuration must be changed to exclude that JN.

*Finalize:* A DFSAdmin command is run to finalize the upgrade.
# The active NN performs finalization of the editlog. If the JNs fail to finalize, the active NN fails to finalize. However, it is possible that the standby finalizes, leaving the cluster in an inconsistent state.
# We need to document that all the JournalNodes must be up. If a JN is irrecoverably lost, the configuration must be changed to exclude that JN.

Comments on the code in the patch (this is almost complete):
# Minor nit: there are some whitespace changes.
# assertAllResultsEqual - the for loop can just start with i = 1. Also, if the collection is of size zero or one, the method can return early. Is there a need to do objects.toArray() before these early checks? With that, perhaps the findbugs exclude may not be necessary.
# Unit tests can be added for the methods isAtLeastOneActive, getRpcAddressesForNameserviceId and getProxiesForAllNameNodesInNameservice (I am okay if this is done in a separate jira).
# Finalizing upgrade is quite tricky.
Consider the following scenarios:
#* One NN is active and the other is standby - works fine.
#* One NN is active and the other is down, or all NNs are down - the finalize command throws an exception, and the user will not know whether it has succeeded or failed, or what to do next.
#* No active NN - throws an exception: cannot finalize with no active NN.
# The BlockPoolSliceStorage.java change seems unnecessary.
# Why is {{throw new AssertionError("Unreachable code.");}} in QuorumJournalManager.java methods?
# FSImage#doRollBack() - when canRollBack is false after checking whether the non-shared directories can roll back, an exception must be thrown immediately, instead of going on to check the shared editlog. Also, printing a Log.info when storages can be rolled back will help in debugging.
# FSEditLog#canRollBackSharedLog should accept StorageInfo instead of Storage.
# QuorumJournalManager#canRollBack and getJournalCTime can throw AssertionError (from DFSUtil.assertAllResultsEqual()). Is that the right exception to expose, or should it be IOException?
# Namenode startup throws AssertionError with the -rollback option. I think we should throw IOException, which is how all the other failures are indicated.

> Support HDFS upgrade in HA
> --------------------------
>
>                 Key: HDFS-5138
>                 URL: https://issues.apache.org/jira/browse/HDFS-5138
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.1.1-beta
>            Reporter: Kihwal Lee
>            Assignee: Aaron T. Myers
>            Priority: Blocker
>         Attachments: HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, hdfs-5138-branch-2.txt
>
> With HA enabled, the NN won't start with "-upgrade". Since there has been a layout version change between 2.0.x and 2.1.x, starting the NN in upgrade mode was necessary when deploying 2.1.x to an existing 2.0.x cluster. But the only way to get around this was to disable HA and upgrade.
> The NN and the cluster cannot be flipped back to HA until the upgrade is finalized. If HA is disabled only on the NN for the layout upgrade and HA is turned back on without involving the DNs, things will work, but finalizeUpgrade won't work (the NN is in HA and it cannot be in upgrade mode) and the DNs' upgrade snapshots won't get removed.
> We will need a different way of doing layout upgrades and upgrade snapshots.
> I am marking this as a 2.1.1-beta blocker based on feedback from others. If there is a reasonable workaround that does not increase the maintenance window greatly, we can lower its priority from blocker to critical.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)