[ https://issues.apache.org/jira/browse/HDFS-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771100#comment-13771100 ]
Todd Lipcon commented on HDFS-5223:
-----------------------------------

To expand a little on Aaron's summary of our discussion above:

*Proposal 1*:
- note that we already include a version number in the header of the edit log and image formats. So, within a single image or edits directory, you might now have different edit log segments or images with different version numbers -- the ones written post-upgrade would have a higher version number.
- note that this allows for in-place software upgrade, but not in-place software downgrade. Once you've written an edit log with the new version, you couldn't downgrade the NN back to the previous version, because it would refuse to read the higher-versioned edit log segment.

bq. and we would require that changes made to the format of existing fsimage/edit log entries be done in a backward compatible fashion

This isn't quite the case -- because the new edit log segments would have a new version number, we have the same ability to evolve opcodes as today. I verified with Aaron that he mis-stated this above.

*Proposal 2*:
- This is basically the way that file systems such as ext3 handle version compatibility. Every ext3 filesystem's superblock contains a set of flags which determine which features have been enabled for it. Similarly, we'd add something to the edit log and fsimage headers with a set of feature names.

Here are the docs from Documentation/filesystems/ext2.txt in the kernel tree:
{code}
These feature flags have specific meanings for the kernel as follows:

A COMPAT flag indicates that a feature is present in the filesystem, but the on-disk format is 100% compatible with older on-disk formats, so a kernel which didn't know anything about this feature could read/write the filesystem without any chance of corrupting the filesystem (or even making it inconsistent).
This is essentially just a flag which says "this filesystem has a (hidden) feature" that the kernel or e2fsck may want to be aware of (more on e2fsck and feature flags later). The ext3 HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply a regular file with data blocks in it so the kernel does not need to take any special notice of it if it doesn't understand ext3 journaling.

An RO_COMPAT flag indicates that the on-disk format is 100% compatible with older on-disk formats for reading (i.e. the feature does not change the visible on-disk format). However, an old kernel writing to such a filesystem would/could corrupt the filesystem, so this is prevented. The most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because sparse groups allow file data blocks where superblock/group descriptor backups used to live, and ext2_free_blocks() refuses to free these blocks, which would leading to inconsistent bitmaps. An old kernel would also get an error if it tried to free a series of blocks which crossed a group boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.

An INCOMPAT flag indicates the on-disk format has changed in some way that makes it unreadable by older kernels, or would otherwise cause a problem if an old kernel tried to mount it. FILETYPE is an INCOMPAT flag because older kernels would think a filename was longer than 256 characters, which would lead to corrupt directory listings. The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel doesn't understand compression, you would just get garbage back from read() instead of it automatically decompressing your data. The ext3 RECOVER flag is needed to prevent a kernel which does not understand the ext3 journal from mounting the filesystem without replaying the journal.
{code}

This would allow us to do rolling upgrades, run mixed-version clusters, and still retain the ability to roll back to a prior version until the new feature was used.
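
As a rough illustration of how a reader of an edit log or fsimage header could apply such flags, here is a minimal sketch. The class, enum, and feature names are hypothetical and not part of HDFS, and treating snapshots as an INCOMPAT feature is an assumption made only for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names exist in HDFS.
public class FeatureFlagSketch {
    /** Compatibility classes, mirroring the ext2/ext3 scheme quoted above. */
    enum Compat { COMPAT, RO_COMPAT, INCOMPAT }

    /** What a given software version may do with an edit log segment or image. */
    enum Access { READ_WRITE, READ_ONLY, REFUSE }

    /**
     * @param headerFeatures feature names recorded in the header, with their class
     * @param supported      feature names this software version understands
     */
    static Access check(Map<String, Compat> headerFeatures, Set<String> supported) {
        Access result = Access.READ_WRITE;
        for (Map.Entry<String, Compat> e : headerFeatures.entrySet()) {
            if (supported.contains(e.getKey())) {
                continue; // known feature: no restriction
            }
            switch (e.getValue()) {
                case INCOMPAT:
                    return Access.REFUSE;      // unknown format change: do not load
                case RO_COMPAT:
                    result = Access.READ_ONLY; // safe to read, unsafe to write
                    break;
                default:
                    break;                     // COMPAT: safe to ignore
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> oldVersion = Collections.emptySet();
        Set<String> newVersion = Collections.singleton("snapshots");
        Map<String, Compat> beforeEnable = Collections.emptyMap();
        Map<String, Compat> afterEnable =
            Collections.singletonMap("snapshots", Compat.INCOMPAT);

        System.out.println(check(beforeEnable, oldVersion)); // READ_WRITE
        System.out.println(check(afterEnable, oldVersion));  // REFUSE
        System.out.println(check(afterEnable, newVersion));  // READ_WRITE
    }
}
```

In this sketch the old software can still load everything until the feature is enabled, which is the rollback property described above.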
So, to take the example of a feature like snapshots which required a metadata change, the admin workflow would be:
# Shut down the standby node
# Upgrade the standby software version
# Start the standby node, fail over to it
# Shut down and upgrade the old active, start it back up.
# Note: at this point, the format for the edit logs and images is identical to the pre-upgrade format, so the user could still roll back. Trying to create a snapshot at this point would fail with an error like "Snapshots not enabled for this filesystem. Run dfsadmin -enableFeature snapshots to enable"
# User runs the above command, which forces an edit log roll. The new edit logs contain the flag indicating that snapshots are enabled, and may use the new opcodes (or add new fields to the old opcodes as necessary)

If the "explicit enable" doesn't sit well with people, we could also add a slightly simpler version like "-enableAllNewFeatures" or whatever, which a user can use after an upgrade with the understanding that it will prevent rollback.

I personally prefer option 2 -- it helps a lot with the HA upgrade scenario per above, allows rollback, and also has the nice property that it will allow us to selectively backport features between software versions without bizarre non-linear version numbering hacks like we have today.

> Allow edit log/fsimage format changes without changing layout version
> ---------------------------------------------------------------------
>
>                 Key: HDFS-5223
>                 URL: https://issues.apache.org/jira/browse/HDFS-5223
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.1.1-beta
>            Reporter: Aaron T. Myers
>
> Currently all HDFS on-disk formats are versioned by the single layout version.
> This means that even for changes which might be backward compatible, like the
> addition of a new edit log op code, we must go through the full `namenode
> -upgrade' process which requires coordination with DNs, etc. HDFS should
> support a lighter weight alternative.