Kihwal Lee created HDFS-7443:
--------------------------------

Summary: Datanode upgrade to BLOCKID_BASED_LAYOUT sometimes fails
Key: HDFS-7443
URL: https://issues.apache.org/jira/browse/HDFS-7443
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Kihwal Lee
Priority: Blocker

When we upgraded a medium-sized cluster from 2.5 to 2.6, about 4% of the datanodes did not come up. They attempted the data file layout upgrade to BLOCKID_BASED_LAYOUT introduced in HDFS-6482, but it failed. All failures were caused by {{NativeIO.link()}} throwing an IOException reporting {{EEXIST}}.

The datanodes did not die right away. Instead, the upgrade was retried each time block pool initialization was retried, which happened whenever {{BPServiceActor}} tried to register with the namenode. After many retries, the datanodes terminated.

This left {{previous.tmp}}, and a {{current}} with no {{VERSION}} file, in the block pool slice storage directory. Although {{previous.tmp}} contained the old {{VERSION}} file, its content was in the new layout and the subdirs were all newly created ones. This should not have happened, because the upgrade-recovery logic in {{Storage}} removes {{current}} and renames {{previous.tmp}} to {{current}} before retrying. All successfully upgraded volumes had the old state preserved in their {{previous}} directory.

In summary, there were two observed issues:
- The upgrade failed because {{link()}} failed with {{EEXIST}}.
- {{previous.tmp}} contained not the content of the original {{current}}, but a half-upgraded one.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
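
For readers unfamiliar with the two mechanisms mentioned in the report, here is a minimal Python sketch, not Hadoop code. It shows (a) that a hard-link call fails with {{EEXIST}} when the destination path already exists, and (b) the recovery step as described above: remove {{current}}, then rename {{previous.tmp}} back to {{current}}. The helper names {{link_block}}, {{recover_upgrade}}, and {{demo}}, and the file name {{blk_1}}, are hypothetical. The sketch does not claim to reproduce the root cause of this bug; it only illustrates one way {{EEXIST}} can arise and what a clean rollback is expected to look like.

```python
import errno
import os
import shutil
import tempfile


def link_block(src, dst):
    """Hard-link src to dst. Return 0 on success, or the errno on failure."""
    try:
        os.link(src, dst)
        return 0
    except OSError as e:
        return e.errno


def recover_upgrade(storage_dir):
    """Roll back an interrupted upgrade (illustrative only).

    Mirrors the recovery described in the report: if previous.tmp exists,
    remove current and rename previous.tmp to current.
    """
    current = os.path.join(storage_dir, "current")
    previous_tmp = os.path.join(storage_dir, "previous.tmp")
    if os.path.isdir(previous_tmp):
        shutil.rmtree(current, ignore_errors=True)
        os.rename(previous_tmp, current)


def demo():
    """Simulate a half-done upgrade, a duplicate link, and a rollback."""
    root = tempfile.mkdtemp()
    try:
        prev = os.path.join(root, "previous.tmp")
        cur = os.path.join(root, "current")
        os.makedirs(prev)
        os.makedirs(cur)

        # Old block file, as it would sit in previous.tmp during an upgrade.
        src = os.path.join(prev, "blk_1")
        with open(src, "w") as f:
            f.write("data")

        dst = os.path.join(cur, "blk_1")
        first = link_block(src, dst)   # destination is free
        second = link_block(src, dst)  # destination already exists

        recover_upgrade(root)
        restored = os.path.exists(os.path.join(root, "current", "blk_1"))
        leftover = os.path.exists(os.path.join(root, "previous.tmp"))
        return first, second, restored, leftover
    finally:
        shutil.rmtree(root, ignore_errors=True)
```

In this sketch the second link attempt returns {{errno.EEXIST}}, and after {{recover_upgrade}} the original block file is back under {{current}} with no {{previous.tmp}} left behind. The second observed issue in the report is precisely that the real directories did not end up in that state.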