Hi everybody,

most of the bugs/issues/etc. that I found while upgrading from CDH 5 to BigTop 1.4 are fixed. I am now testing (as was also suggested here) the upgrade/rollback procedures for HDFS (everything is written down in https://phabricator.wikimedia.org/T244499, and I promise to add documentation about this at the end).

I initially followed [1][2] in my Test cluster, choosing the rolling upgrade, but when I tried to roll back (days after the initial upgrade) I ended up in an inconsistent state and I wasn't able to recover the previous HDFS state. I didn't save the exact error messages, but the situation was more or less the following:

FS-Image-rollback (created at the time of the upgrade) - up to transaction X
FS-Image-current - up to transaction Y, with Y = X + 10000 (number totally made up for the example)
QJM cluster - first available transaction Z = X + 10000 + 1

When I tried a rolling rollback, the Namenode complained about a hole in the transaction log, namely at X + 1, so it refused to start. I tried to force a regular rollback, but the Namenode refused again, saying that there was no FS Image available to roll back to. I checked the Hadoop code, and indeed the Namenode saves the FS image under a different name/path depending on whether the upgrade is rolling or regular. Both refusals make sense, especially the first one, since there was indeed a hole between the last transaction of FS-Image-rollback and the first transaction available to replay on the QJM cluster.

I chose the rolling upgrade initially since it was appealing: it promises to bring the Namenodes back to their previous version while keeping the data modified between upgrade and rollback. I then found [3], which says that with QJM everything is more complicated and a regular rollback is the only option available. What I think this means is that, because the edit log is spread among multiple nodes, a rollback that keeps the data modified between upgrade and rollback is not available, so in the worst case the data modified during that timeframe is lost. Not a big deal in my case, but I want to triple check with you whether this is the correct interpretation, or whether there is another tutorial/guide/etc. that I haven't read with a different procedure :)

Is my interpretation correct?
If not, is there anybody with experience in HDFS upgrades that could shed some light on the subject?

Thanks in advance!

Luca

[1] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Upgrade_and_Rollback
[2] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html
[3] https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#HDFS_UpgradeFinalizationRollback_with_HA_Enabled
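
P.S. To make the transaction hole described above concrete, here is a toy model of it in Python. This is only a sketch of my understanding, not Hadoop code: the function name, the (first, last) segment representation and all the numbers are made up for illustration, and the real Namenode check is certainly more involved.

```python
from typing import List, Optional, Tuple

# An edit log segment as (first_txid, last_txid), both inclusive.
# Hypothetical representation, chosen only for this example.
Segment = Tuple[int, int]


def first_missing_txid(image_txid: int, segments: List[Segment]) -> Optional[int]:
    """Return the first transaction id that cannot be replayed on top of an
    FS image ending at image_txid, or None if the available segments
    continue from the image without a hole."""
    next_txid = image_txid + 1
    for first, last in sorted(segments):
        if first > next_txid:
            # The next segment starts too late: next_txid is missing.
            return next_txid
        next_txid = max(next_txid, last + 1)
    return None


x = 5000                        # last txid in FS-Image-rollback (made up)
y = x + 10000                   # last txid in FS-Image-current
qjm_segments = [(y + 1, y + 200)]  # QJM only has edits from Z = Y + 1 onwards

print(first_missing_txid(x, qjm_segments))  # 5001, i.e. the hole at X + 1
print(first_missing_txid(y, qjm_segments))  # None, the current image is fine
```

In this model the rollback image can only be used if the edits from X + 1 onwards are still available somewhere, which was not the case in my test.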
