[ https://issues.apache.org/jira/browse/HDFS-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767980#action_12767980 ]
Ashutosh Chauhan commented on HDFS-107:
---------------------------------------

I saw this issue on our small 6-node cluster too. It took a while to identify the root cause; the symptoms were the same as described here. In our case both 0.18 and 0.20 are installed on the cluster, but we only run 0.20. A user saw the HDFS exception for their job, so they stopped 0.20, tried to go back to 0.18 and start it, and then switched back to 0.20 again. In the process the version files of the datanodes and the namenode got messed up, and the DNs and the NN ended up with different information in their version files. Apart from this peculiar use case, as things currently stand in HDFS, I think even one small misstep while upgrading the cluster can result in this bug, as reported in previous comments. At cluster startup the namenode and datanodes should also exchange the information contained in their version files and, in case of a mismatch, reconcile the differences, potentially asking for user input when the choice is not safe to make automatically. A few workarounds were suggested in previous comments. Which of these is the recommended one?

> Data-nodes should be formatted when the name-node is formatted.
> ---------------------------------------------------------------
>
>          Key: HDFS-107
>          URL: https://issues.apache.org/jira/browse/HDFS-107
>      Project: Hadoop HDFS
>   Issue Type: Bug
>     Reporter: Konstantin Shvachko
>
> The upgrade feature HADOOP-702 requires data-nodes to persistently store the namespaceID
> in their version files and to verify during startup that it matches the one stored on the name-node.
> When the name-node reformats, it generates a new namespaceID.
> Now if the cluster starts with the reformatted name-node and not-reformatted data-nodes,
> the data-nodes will fail with
> java.io.IOException: Incompatible namespaceIDs ...
> Data-nodes should be reformatted whenever the name-node is.
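The startup verification the issue describes can be sketched as follows. This is a hypothetical illustration, not actual Hadoop code; the method name `verifyNamespaceId` and the example ID values are assumptions, but the behavior mirrors the reported failure: a datanode whose stored namespaceID differs from the namenode's refuses to start with an "Incompatible namespaceIDs" IOException.

```java
// Hypothetical sketch (not the real Hadoop implementation) of the
// namespaceID check a datanode performs against the namenode at startup.
public class NamespaceCheck {
    // Compares the ID persisted in the datanode's version file with the
    // ID reported by the namenode; a mismatch aborts startup.
    static void verifyNamespaceId(int datanodeNsId, int namenodeNsId)
            throws java.io.IOException {
        if (datanodeNsId != namenodeNsId) {
            throw new java.io.IOException(
                "Incompatible namespaceIDs: datanode has " + datanodeNsId
                + " but namenode reports " + namenodeNsId);
        }
    }

    public static void main(String[] args) throws Exception {
        // Matching IDs: startup proceeds normally.
        verifyNamespaceId(1394102, 1394102);
        // A reformatted namenode generates a fresh namespaceID,
        // so the old datanode ID no longer matches.
        try {
            verifyNamespaceId(1394102, 99021);
            throw new AssertionError("expected IOException");
        } catch (java.io.IOException expected) {
            System.out.println("datanode refuses to start: "
                + expected.getMessage());
        }
    }
}
```

This is also why the manual workarounds in earlier comments amount to making the two IDs agree again, either by reformatting the datanode storage or by editing the stored ID.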
> I see two approaches here:
> 1) To reformat the cluster we call "start-dfs -format" or make a special script "format-dfs".
>    This would format the cluster components all together. The question is whether it should
>    start the cluster after formatting.
> 2) Format the name-node only. When data-nodes connect to the name-node, it will tell them to
>    format their storage directories if it sees that the namespace is empty and its cTime=0.
>    The drawback of this approach is that we can lose the blocks of a data-node from another
>    cluster if it connects by mistake to the empty name-node.
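Approach 2 above can be sketched as a registration-time decision on the namenode. This is a minimal illustration under assumed names (`decide`, the `Action` enum, and the example values are hypothetical, not Hadoop APIs); it also makes the stated drawback visible: any datanode with a mismatched ID is told to format, including one that belongs to a different cluster.

```java
// Hypothetical sketch (not actual Hadoop code) of approach 2: on
// registration, the namenode tells a datanode with a stale namespaceID to
// reformat only when the namenode's own namespace is empty and cTime == 0.
public class RegistrationDecision {
    enum Action { ACCEPT, FORMAT, REJECT }

    static Action decide(int dnNsId, int nnNsId,
                         long nnBlockCount, long nnCTime) {
        if (dnNsId == nnNsId) {
            return Action.ACCEPT;            // IDs match: normal startup
        }
        if (nnBlockCount == 0 && nnCTime == 0) {
            return Action.FORMAT;            // freshly formatted namenode
        }
        return Action.REJECT;                // mismatch against a live
                                             // namespace: refuse, keep blocks
    }

    public static void main(String[] args) {
        System.out.println(decide(1394102, 1394102, 0, 0));   // ACCEPT
        System.out.println(decide(1394102, 99021, 0, 0));     // FORMAT
        System.out.println(decide(1394102, 99021, 5000, 1L)); // REJECT
    }
}
```

The drawback noted in the issue shows up in the FORMAT branch: a datanode from another cluster that connects to the empty namenode by mistake would be told to format and would lose its blocks, which is why user confirmation, as suggested in the comment above, may be warranted.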