[ 
https://issues.apache.org/jira/browse/HDFS-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771100#comment-13771100
 ] 

Todd Lipcon commented on HDFS-5223:
-----------------------------------

To expand a little on Aaron's summary of our discussion above:

*Proposal 1*:
- note that we already include a version number in the header of the edit log 
and image formats. So, within a single image or edits directory, you might 
now have different edit log segments or images with different version numbers 
-- the ones written post-upgrade would have a higher version number.
- note that this allows for in-place software upgrade, but not in-place 
software downgrade. Once you've written an edit log with the new version, you 
couldn't downgrade the NN back to the previous version, because it would refuse 
to read the higher-versioned edit log segment.
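
As a rough illustration of the upgrade-only behavior in Proposal 1, here is a minimal sketch. The class and method names are hypothetical, not existing HDFS code, and it assumes versions are plain integers where a larger value means a newer format:

```java
import java.util.List;

// Hypothetical sketch of the Proposal 1 check; not actual HDFS classes.
// Assumption: a larger integer means a newer on-disk format.
final class SegmentVersionCheck {
    /** A reader can handle any segment whose header version it knows about. */
    static boolean canRead(int segmentVersion, int maxSupportedVersion) {
        return segmentVersion <= maxSupportedVersion;
    }

    /** The NN can start only if every segment in the directory is readable. */
    static boolean canStart(List<Integer> segmentVersions, int maxSupportedVersion) {
        for (int version : segmentVersions) {
            if (!canRead(version, maxSupportedVersion)) {
                return false; // a higher-versioned segment blocks the downgrade
            }
        }
        return true;
    }
}
```

With segments at versions 1, 1 and 2 in one directory, software supporting version 2 can start, while software supporting only version 1 cannot, which is the downgrade restriction described above.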

bq. and we would require that changes made to the format of existing 
fsimage/edit log entries be done in a backward compatible fashion

This isn't quite the case -- because the new edit log segments would have a new 
version number, we have the same ability to evolve opcodes as today. I verified 
with Aaron that he mis-stated this above.

*Proposal 2*:
- This is basically the way that file systems such as ext3 handle version 
compatibility. Every ext3 filesystem's superblock contains a set of flags which 
determine which features have been enabled for it. Similarly, we'd add 
something to the edit log and fsimage headers with a set of feature names. 
Here are the docs from Documentation/filesystems/ext2.txt in the kernel tree:

{code}
These feature flags have specific meanings for the kernel as follows:

A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent).  This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later).  The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.

An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format).  However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would leading to inconsistent bitmaps.  An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.

An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it.  FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data.  The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.
{code}
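
Translated to the edit log and fsimage headers, the same three classes of flag might be checked along these lines. This is only a sketch: the type names, the feature-name strings, and the idea of tagging each feature with its compatibility class in the header are all assumptions for illustration.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an ext2-style feature check; not actual HDFS code.
final class FeatureFlagCheck {
    enum Compat { COMPAT, RO_COMPAT, INCOMPAT }
    enum Access { READ_WRITE, READ_ONLY, REFUSE }

    /**
     * Decides what software that understands only `knownFeatures` may do
     * with a file whose header lists `headerFeatures`.
     */
    static Access allowedAccess(Map<String, Compat> headerFeatures,
                                Set<String> knownFeatures) {
        Access result = Access.READ_WRITE;
        for (Map.Entry<String, Compat> entry : headerFeatures.entrySet()) {
            if (knownFeatures.contains(entry.getKey())) {
                continue; // understood features never restrict access
            }
            switch (entry.getValue()) {
                case INCOMPAT:
                    return Access.REFUSE;      // unreadable by older software
                case RO_COMPAT:
                    result = Access.READ_ONLY; // safe to read, unsafe to write
                    break;
                case COMPAT:
                    break;                     // safe to ignore entirely
            }
        }
        return result;
    }
}
```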

This would allow us to do rolling upgrades, run mixed-version clusters, and 
still retain the ability to roll back to a prior version until the new feature 
was used. So, to take the example of a feature like snapshots which required a 
metadata change, the admin workflow would be:

# Shut down standby node
# Upgrade standby software version
# Start standby node, fail over to it
# Shut down and upgrade the old active, start it back up.
# Note: at this point, the format for the edit logs and images is identical to 
the pre-upgrade format, so the user could still roll back. Trying to create a 
snapshot at this point would fail with an error like "Snapshots not enabled for 
this filesystem. Run dfsadmin -enableFeature snapshots to enable"
# User runs the above command, which forces an edit log roll. The new edit logs 
contain the flag indicating that snapshots are enabled, and may use the new 
opcodes (or add new fields to the old opcodes as necessary)
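
The last two steps of that workflow could be modeled roughly as below. All names here are hypothetical, and "-enableFeature" is the command proposed in this comment rather than an existing dfsadmin option; the sketch only shows that the feature is refused until explicitly enabled, and that enabling it rolls the edit log and ends the rollback window.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the "explicit enable" workflow; not actual HDFS code.
final class FeatureGate {
    private final Set<String> enabledFeatures = new LinkedHashSet<>();
    private int editLogRolls = 0;

    /** Models the proposed "dfsadmin -enableFeature <name>" command. */
    void enableFeature(String name) {
        if (enabledFeatures.add(name)) {
            editLogRolls++; // new segments carry the flag in their header
        }
    }

    /** Rollback stays possible only while no new feature has been enabled. */
    boolean canRollBack() {
        return enabledFeatures.isEmpty();
    }

    /** Fails with the error from step 5 until snapshots are enabled. */
    void createSnapshot() {
        if (!enabledFeatures.contains("snapshots")) {
            throw new IllegalStateException(
                "Snapshots not enabled for this filesystem. "
                + "Run dfsadmin -enableFeature snapshots to enable");
        }
        // A real implementation would log the new snapshot opcode here.
    }

    int editLogRolls() {
        return editLogRolls;
    }
}
```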

If the "explicit enable" doesn't sit well with people, we could also add a 
slightly simpler version like "-enableAllNewFeatures" or whatever, which a user 
can use after an upgrade with the understanding that it will prevent rollback.


I personally prefer Proposal 2 -- it helps a lot with the HA upgrade scenario per 
above, allows rollback, and also has the nice property that it will allow us to 
selectively backport features between software versions without bizarre 
non-linear version numbering hacks like we have today.
                
> Allow edit log/fsimage format changes without changing layout version
> ---------------------------------------------------------------------
>
>                 Key: HDFS-5223
>                 URL: https://issues.apache.org/jira/browse/HDFS-5223
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.1.1-beta
>            Reporter: Aaron T. Myers
>
> Currently all HDFS on-disk formats are versioned by the single layout version. 
> This means that even for changes which might be backward compatible, like the 
> addition of a new edit log op code, we must go through the full `namenode 
> -upgrade' process which requires coordination with DNs, etc. HDFS should 
> support a lighter weight alternative.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
