Hi Experts, I have questions about upgrading and rolling back HDFS with QJM HA enabled.
The documentation at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#HDFS_UpgradeFinalizationRollback_with_HA_Enabled says:

"To perform a rollback of an upgrade, both NNs should first be shut down. The operator should run the roll back command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, either NFS or on the JNs. Afterward, this NN should be started and the operator should run '-bootstrapStandby' on the other NN to bring the two NNs in sync with this rolled-back file system state."

Currently I expect the steps to be (please correct me if I am wrong):

    NN1 -> hadoop namenode -rollback
    NN1 -> hadoop namenode   // In our env, the rolled-back namenode shuts down right after it finishes -rollback, so it needs to be started again.
    NN2 -> hadoop namenode -bootstrapStandby
    hadoop datanode -rollback   // on all datanodes

[Question 1]: One thing I don't know is when the JournalNodes should be started and/or stopped. It seems they need to be running for hadoop namenode -rollback. Should they also be restarted at some point?

[Question 2]: Another issue occurs after the upgrade and before the rollback starts: the standby NN process uses a lot of CPU and is somehow eating up disk space (without the space actually showing up as used by any files). This caused "No space left on device" errors during the rollback process. As soon as I killed the namenode process, the free disk space immediately went back to a reasonable amount. What might cause the NN process to occupy so much disk space in this hidden way?

Thanks!
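
P.S. To make the order I have in mind unambiguous, here it is as a small dry-run sketch. It only prints the commands in sequence and does not run anything. The host names nn1, nn2, dn1, and dn2 are placeholders I made up for illustration, and the sequence is my current understanding, not a confirmed procedure (in particular, it says nothing about the JournalNodes, which is exactly what Question 1 is about).

```shell
#!/bin/sh
# Dry-run sketch: prints the rollback order I expect; it executes nothing.
# NN1, NN2, and DATANODES are hypothetical host names used only for illustration.
NN1="nn1"
NN2="nn2"
DATANODES="dn1 dn2"

plan_rollback() {
    # Step 1: roll back on the NN where the upgrade was initiated.
    echo "$NN1: hadoop namenode -rollback"
    # Step 2: start that NN again (in our env it exits after -rollback).
    echo "$NN1: hadoop namenode"
    # Step 3: bring the other NN in sync with the rolled-back state.
    echo "$NN2: hadoop namenode -bootstrapStandby"
    # Step 4: roll back every datanode.
    for dn in $DATANODES; do
        echo "$dn: hadoop datanode -rollback"
    done
}

plan_rollback
```

If this order is wrong, or if the JournalNodes need to be started or restarted between any of these steps, please point out where.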