[ 
https://issues.apache.org/jira/browse/HDFS-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864039#comment-13864039
 ] 

Todd Lipcon commented on HDFS-5138:
-----------------------------------

General:
- thanks for the description in the above JIRA comment. Can you transfer this 
comment somewhere into the docs, perhaps 
hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm 
or a new page? Perhaps with a slightly more user-facing angle.

- what would happen if the admin called finalizeUpgrade() when neither node had 
yet transitioned to active? I don't see any sanity check here.. is it possible 
you'd end up leaving the shared in an orphaned "upgrading" state and never end 
up finalizing it? Similarly, what happens if you start one NN with -upgrade, 
and you start the other one without -upgrade. It seems to me it should check 
for the upgrade lock file in the shared dir and say "looks like an upgrade is 
in progress, please start the SBN with -upgrade".

- there are a few TODOs in the code that probably need to be addressed - 
nothing big, just a few things you may have missed.

JournalManager.java:
- would be good to add Javadoc on the new methods, so that JM implementors know 
what the upgrade process looks like. i.e what is pre-upgrade, etc?

QuorumJournalManager.java:
- "Could not perform upgrade or more JournalNodes" error message has some 
missing words in it.

+    throw new IOException("Failed to lock shared log.");
- this line should be unreachable, right? maybe an AssertionError("Unreachable 
code") would make more sense? Also this same exception message is used down 
below in canRollBack which isn't quite right.

Journal:
- when you upgrade the journal, I'd think you'd to copy over all the data from 
the PersistentLongFiles into the new dir?

FileJournalManager:
- worth considering a protobuf for the shared log lock, in case we want to add 
other fields to it later (instead of the custom serialization you do now)
- need try...finally around the code where you write the shared log lock. On 
the read side you're also forgetting to close it.
- the creation of the shared log lock file is non-atomic... I'm worried that we 
may hit the race in practice, since the AtomicFileOutputStream implies an 
fsync, which means that between the exists() check and the rename to the lock 
file, you may really have a decently long time window. Maybe we can use locking 
code like Storage does? Feel free to punt to a follow-up.


FSNamesystem.java:
- can you add a doc on doUpgradeOfSharedLogOnTransitionToActive()?

NNUpgradeUtil.java:
- why are some of the functions package-private and others are public?
- make it an abstract class or give it a private constructor so it can't be 
instantiated, since it's just static methods
- brief javadocs would be nice for these methods, even though they're straight 
refactors of existing code.

FSEditLog.java:
- in canRollBack(), you throw an exception if there is no shared log. That 
doesn't seem right...
- capitalization of "RollBack" vs "Rollback" is a little inconsistent. Looks 
like Rollback is consistently used prior to this patch, so probably best to 
stick with that.

FSImage.java:
- in the switch statement on the startup option, I think you should keep the 
ROLLBACK case, but just have it throw AssertionError -- just to make sure we 
don't accidentally have some case where we're passing it there but shouldn't 
be.GA


> Support HDFS upgrade in HA
> --------------------------
>
>                 Key: HDFS-5138
>                 URL: https://issues.apache.org/jira/browse/HDFS-5138
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.1.1-beta
>            Reporter: Kihwal Lee
>            Assignee: Aaron T. Myers
>            Priority: Blocker
>         Attachments: HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch
>
>
> With HA enabled, NN wo't start with "-upgrade". Since there has been a layout 
> version change between 2.0.x and 2.1.x, starting NN in upgrade mode was 
> necessary when deploying 2.1.x to an existing 2.0.x cluster. But the only way 
> to get around this was to disable HA and upgrade. 
> The NN and the cluster cannot be flipped back to HA until the upgrade is 
> finalized. If HA is disabled only on NN for layout upgrade and HA is turned 
> back on without involving DNs, things will work, but finaliizeUpgrade won't 
> work (the NN is in HA and it cannot be in upgrade mode) and DN's upgrade 
> snapshots won't get removed.
> We will need a different ways of doing layout upgrade and upgrade snapshot.  
> I am marking this as a 2.1.1-beta blocker based on feedback from others.  If 
> there is a reasonable workaround that does not increase maintenance window 
> greatly, we can lower its priority from blocker to critical.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to