[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052933#comment-13052933 ]

Eli Collins commented on HDFS-2026:
---

+1. Link in jira for the 2nn -format command?

> 1073: 2NN needs to handle case of reformatted NN better
> ---
>
> Key: HDFS-2026
> URL: https://issues.apache.org/jira/browse/HDFS-2026
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: name-node
> Affects Versions: Edit log branch (HDFS-1073)
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: Edit log branch (HDFS-1073)
>
> Attachments: hdfs-2026.txt, hdfs-2026.txt
>
>
> Currently in the 1073 branch, the following steps end up with a very confused 2NN:
> - format NN, run NN
> - start 2NN, perform some checkpoints
> - reformat NN, start NN on new namespace
> - restart same 2NN
> The 2NN currently saves the new VERSION info into its local storage directory
> but doesn't clear out the old checkpoint or edits files. This is obviously
> wrong and might lead to a corrupt checkpoint getting uploaded.
> If the 2NN has storage directories with VERSION info, and connects to an NN
> with different VERSION info, there are two alternatives:
> a) refuse to perform any checkpoints until the operator issues a
> "secondarynamenode -format" command (this is similar to how the
> backupnode/checkpointnode works)
> b) clear the current contents of the storage directory and save the new NN's
> VERSION info.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052908#comment-13052908 ]

Todd Lipcon commented on HDFS-2026:
---

bq. Can we remove the commented-out Checkpointer#uploadCheckpoint?
Added TODO -- the CN/BN will be addressed separately.

bq. testReformatNNBetweenCheckpoints method comment is missing a period
Fixed.

bq. The new call to sd.read in SecondaryNameNode#recoverCreate could use a comment
Added:
{code}
case NORMAL:
  // Read the VERSION file. This verifies that:
  // (a) the VERSION file for each of the directories is the same,
  // and (b) when we connect to a NN, we can verify that the remote
  // node matches the same namespace that we ran on previously.
  sd.read();
  break;
{code}

bq. As an aside, readVersionFile would be a better name for that method
I agree, but we should do that separately -- this function gets used throughout all of HDFS (e.g. also on the DN side).

bq. Not your change, but it would be good to add a comment to uploadImageFromStorage indicating it doesn't actually post an image but the 2NN posts to the NN asking it to get an image
Added the following javadoc:
{code}
/**
 * Requests that the NameNode download an image from this node.
 *
 * @param fsName the http address for the remote NN
 * @param imageListenAddress the host/port where the local node is running an
 *                           HTTPServer hosting GetImageServlet
 * @param storage the storage directory to transfer the image from
 * @param txid the transaction ID of the image to be uploaded
 */
{code}
and this comment:
{code}
// this doesn't directly upload an image, but rather asks the NN
// to connect back to the 2NN to download the specified image.
TransferFsImage.getFileClient(fsName, fileid, null, false);
...
{code}
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052837#comment-13052837 ]

Eli Collins commented on HDFS-2026:
---

Looks great. Some small stuff:
* Can we remove the commented-out Checkpointer#uploadCheckpoint? (Mark it TODO if it will be addressed in a follow-on.)
* testReformatNNBetweenCheckpoints method comment is missing a period.
* The new call to sd.read in SecondaryNameNode#recoverCreate could use a comment (it's not clear why we need to read the version file there). As an aside, readVersionFile would be a better name for that method.
* Not your change, but it would be good to add a comment to uploadImageFromStorage indicating that it doesn't actually post an image; rather, the 2NN posts to the NN asking it to get an image.
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052719#comment-13052719 ]

Eli Collins commented on HDFS-2026:
---

Agree with option (a) as well. In the future we can make a single CheckpointNode support multiple distinct Namenodes; for now we should be explicit about the coupling.
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049606#comment-13049606 ]

Todd Lipcon commented on HDFS-2026:
---

I should note that the patch does *not* add a "-format" command to the 2NN yet. I'll do that separately to keep this an easy review.
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044052#comment-13044052 ]

Todd Lipcon commented on HDFS-2026:
---

Created HDFS-2032 for the idea to add namespace id info to image and edits headers.
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044047#comment-13044047 ]

Todd Lipcon commented on HDFS-2026:
---

I agree with that as well. I'll file a new JIRA for it -- I feel like it's out of scope for this one.
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044042#comment-13044042 ]

Sanjay Radia commented on HDFS-2026:
---

On a related note, each edits file should have the same info in its header so that we don't accidentally use mismatching edits and images. This can occur if our future checkpoint service simply copies the new image to the dir.
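The header idea in this comment can be illustrated with a minimal sketch. It assumes, purely for illustration, that the namespace ID is written as the first four bytes of both the image and the edits file; NamespaceHeader and all of its methods are hypothetical names, not existing HDFS classes, and the real on-disk layout is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical header helper for illustration; not an existing HDFS class. */
final class NamespaceHeader {
    private NamespaceHeader() {}

    /** Writes the namespace ID as the first four bytes of a stream. */
    static void write(DataOutputStream out, int namespaceId) throws IOException {
        out.writeInt(namespaceId);
    }

    /** Reads the header and fails if it names a different namespace. */
    static void verify(DataInputStream in, int expectedNamespaceId) throws IOException {
        int found = in.readInt();
        if (found != expectedNamespaceId) {
            throw new IOException("Namespace ID mismatch: expected "
                + expectedNamespaceId + " but file header has " + found);
        }
    }

    /** Builds an in-memory "file" that contains only the header. */
    static byte[] headerBytes(int namespaceId) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            write(out, namespaceId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Returns true if the bytes start with the expected namespace ID. */
    static boolean headerMatches(byte[] fileBytes, int expectedNamespaceId) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(fileBytes))) {
            verify(in, expectedNamespaceId);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With a check like this, an edits file copied in from a different namespace would be rejected when it is opened rather than silently applied to a mismatched image.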
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044034#comment-13044034 ]

Todd Lipcon commented on HDFS-2026:
---

Great. I'll move forward with option 1 -- yes, the idea is that we'll check namespace ID, cluster ID, layout version, etc. If anything doesn't match, we'll bail out and make the user run secondarynamenode -format. (It doesn't really make sense to "upgrade" the secondary namenode for an out-of-date layout.)
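The check described in this comment could look roughly like the sketch below. It is not the patch itself: VersionInfo and CheckpointGuard are hypothetical stand-ins for whatever the 2NN reads from its VERSION file and receives from the NN, and the three fields are simply the ones named in the comment.

```java
import java.util.Objects;

/** Hypothetical stand-in for the fields stored in a VERSION file. */
final class VersionInfo {
    final int namespaceId;
    final String clusterId;
    final int layoutVersion;

    VersionInfo(int namespaceId, String clusterId, int layoutVersion) {
        this.namespaceId = namespaceId;
        this.clusterId = clusterId;
        this.layoutVersion = layoutVersion;
    }

    /** True only if every identifying field is equal. */
    boolean matches(VersionInfo other) {
        return other != null
            && namespaceId == other.namespaceId
            && Objects.equals(clusterId, other.clusterId)
            && layoutVersion == other.layoutVersion;
    }
}

/** Hypothetical guard run by the 2NN before it attempts a checkpoint. */
final class CheckpointGuard {
    private CheckpointGuard() {}

    /**
     * Option (a): refuse to checkpoint on any mismatch instead of
     * clearing the local storage directory.
     *
     * @param local  info from the 2NN's storage directory, or null if
     *               the directory has never been used
     * @param remote info reported by the NN the 2NN connected to
     */
    static void verifySameNamespace(VersionInfo local, VersionInfo remote) {
        if (local == null) {
            return; // fresh storage directory: nothing to conflict with
        }
        if (!local.matches(remote)) {
            throw new IllegalStateException(
                "Local VERSION info does not match the NameNode; "
                + "run \"secondarynamenode -format\" before checkpointing");
        }
    }
}
```

Failing loudly keeps the old checkpoint files on disk, which is what makes them usable as a backup if the NN was reformatted by accident.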
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044029#comment-13044029 ]

Sanjay Radia commented on HDFS-2026:
---

I also prefer option 1. Shouldn't we be using the namespaceid to address this issue? BTW, we have the same problem in the alternate solution (after 1073) where the fsimage is simply copied over to the NN's dirs. (The good news is that no images are deleted as part of that.)
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043119#comment-13043119 ]

Todd Lipcon commented on HDFS-2026:
---

fwiw, in trunk/0.20, the behavior is that the 2NN will happily start checkpointing the new namespace -- since each checkpoint moves the current/ dir to previous.checkpoint/, it's basically starting fresh each time.
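The rotation described in this comment can be sketched as follows. This is only an illustration of the "move current/ aside, start with an empty current/" pattern using java.nio.file; CheckpointDirRotation is a hypothetical class, and the actual trunk/0.20 code may use additional intermediate directories that are not modeled here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration of per-checkpoint directory rotation. */
final class CheckpointDirRotation {
    private CheckpointDirRotation() {}

    /**
     * Moves current/ to previous.checkpoint/ (replacing any older copy)
     * and leaves an empty current/ for the next checkpoint.
     */
    static void rotate(Path storageRoot) throws IOException {
        Path current = storageRoot.resolve("current");
        Path previous = storageRoot.resolve("previous.checkpoint");

        deleteRecursively(previous);
        if (Files.exists(current)) {
            Files.move(current, previous);
        }
        Files.createDirectories(current);
    }

    /** Deletes a directory tree, children before parents. */
    private static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Because current/ is empty at the start of every checkpoint, files from an older namespace never mix with new ones, which is why that code path tolerates a reformatted NN.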
[jira] [Commented] (HDFS-2026) 1073: 2NN needs to handle case of reformatted NN better
[ https://issues.apache.org/jira/browse/HDFS-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043115#comment-13043115 ]

Todd Lipcon commented on HDFS-2026:
---

I prefer option (a) since it's generally safer. It also is helpful in that, if the operator accidentally formatted their NN, the checkpoint would provide a fairly recent backup for them.