[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476248#comment-13476248 ]

Bertrand Dechoux commented on HADOOP-4885:
------------------------------------------

Sorry for the misunderstanding, but it seems it was indeed added (in 1.0.2, as the backport says), so maybe this JIRA could once again be updated to reflect it. (I can't modify the fix versions.)

I will probably test this feature out of curiosity soon.

> Try to restore failed replicas of Name Node storage (at checkpoint time)
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-4885
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4885
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Boris Shkolnik
>            Assignee: Boris Shkolnik
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch,
> HADOOP-4885-3.patch, HADOOP-4885.branch-1.patch,
> HADOOP-4885.branch-1.patch.2, HADOOP-4885.branch-1.patch.3,
> HADOOP-4885.patch, HADOOP-4885.patch
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465706#comment-13465706 ]

Bertrand Dechoux commented on HADOOP-4885:
------------------------------------------

grep -R "dfs.namenode.name.dir.restore" *
src/hdfs/org/apache/hadoop/hdfs/DFSConfigKeys.java:  public static final String  DFS_NAMENODE_NAME_DIR_RESTORE_KEY = "dfs.namenode.name.dir.restore";

Great! I will test it. The documentation does not seem to have been updated, but that's a detail (the same goes for the description of the JIRA).
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465698#comment-13465698 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

{quote}I did a grep -R "dfs.name.dir.restore" src on a downloaded version of Hadoop 1.0.3 and found no match.{quote}

The property name is dfs.namenode.name.dir.restore.
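For readers following the property-name mix-up, here is a minimal sketch of how a boolean flag under that key might be read. Only the key string comes from this thread (and from the DFSConfigKeys.java grep output); the class name, the helper method, and the use of java.util.Properties as a stand-in for Hadoop's configuration object are assumptions made so the example runs without Hadoop on the classpath. The default of false reflects the statement elsewhere in this thread that automatic restore is disabled by default.

```java
import java.util.Properties;

// Illustrative only: java.util.Properties stands in for Hadoop's
// configuration object; nothing here is Hadoop API.
class RestoreFlagSketch {
    // Key name exactly as quoted in this thread.
    static final String RESTORE_KEY = "dfs.namenode.name.dir.restore";

    // Hypothetical helper: restore is treated as enabled only when the key
    // is explicitly set to "true"; an absent key means disabled.
    static boolean isRestoreEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(RESTORE_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("unset:   " + isRestoreEnabled(conf));

        // The older-looking name from the failed grep is a different key,
        // so setting it has no effect on the lookup above.
        conf.setProperty("dfs.name.dir.restore", "true");
        System.out.println("old key: " + isRestoreEnabled(conf));

        conf.setProperty(RESTORE_KEY, "true");
        System.out.println("new key: " + isRestoreEnabled(conf));
    }
}
```

The middle case is the point of the sketch: a value stored under dfs.name.dir.restore is simply never consulted when the code looks up dfs.namenode.name.dir.restore.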
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465661#comment-13465661 ]

Harsh J commented on HADOOP-4885:
---------------------------------

Let's visit HDFS-3075 for the backport. I removed the versioning from here as it was erroneous.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465633#comment-13465633 ]

Bertrand Dechoux commented on HADOOP-4885:
------------------------------------------

I did a grep -R "dfs.name.dir.restore" src on a downloaded version of Hadoop 1.0.3 and found no match.
Maybe the fix version should be updated.

> Try to restore failed replicas of Name Node storage (at checkpoint time)
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-4885
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4885
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Boris Shkolnik
>            Assignee: Boris Shkolnik
>             Fix For: 1.0.3, 0.21.0
>
>         Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch,
> HADOOP-4885-3.patch, HADOOP-4885.branch-1.patch,
> HADOOP-4885.branch-1.patch.2, HADOOP-4885.branch-1.patch.3,
> HADOOP-4885.patch, HADOOP-4885.patch
>
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235194#comment-13235194 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

bq. since we haven't started the edits.new file yet, something may actually block??

Filesystem modification can be blocked. This step should be optimized. By default, the automatic restore is disabled.

bq. It would be better if the resyncing up to the last closed edit log is done asynchronously.

This could be a way to optimize the operation. Another way is not to copy the files over but to wait for checkpoint processing to populate the new image and edit logs. For the second approach, the storage directories being restored should have a new state (e.g., formatted or restoring) rather than "active".

bq. Ideally any exceptions from dealing with the removed dirs should be ignored.

Agree.
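The idea of a separate state for directories under restoration can be sketched as a small state machine. This is only an illustration of the suggestion in the comment above, not code from any patch on this issue: the enum values, class names, and method names are all hypothetical, and the real NameNode storage classes are not used.

```java
// Hypothetical model of storage-directory states; none of these names are
// taken from the Hadoop code base.
class StorageStateSketch {
    enum State { FAILED, RESTORING, ACTIVE }

    static final class StorageDir {
        final String path;
        State state;

        StorageDir(String path, State state) {
            this.path = path;
            this.state = state;
        }
    }

    // A failed directory that is picked up for restoration is marked
    // RESTORING rather than ACTIVE, because it holds no image or edits yet.
    static void beginRestore(StorageDir dir) {
        if (dir.state == State.FAILED) {
            dir.state = State.RESTORING;
        }
    }

    // Only a finished checkpoint promotes the directory to ACTIVE; a failed
    // checkpoint sends it back to FAILED instead of leaving it "active".
    static void onCheckpoint(StorageDir dir, boolean succeeded) {
        if (dir.state == State.RESTORING) {
            dir.state = succeeded ? State.ACTIVE : State.FAILED;
        }
    }

    public static void main(String[] args) {
        StorageDir dir = new StorageDir("/data/name2", State.FAILED);
        beginRestore(dir);
        System.out.println(dir.path + " after beginRestore: " + dir.state);
        onCheckpoint(dir, true);
        System.out.println(dir.path + " after checkpoint:   " + dir.state);
    }
}
```

With a distinct intermediate state, a UI or JMX view would no longer have to show an empty, freshly re-added directory under the same label as a fully populated one.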
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234460#comment-13234460 ]

Kihwal Lee commented on HADOOP-4885:
------------------------------------

- It would be better if the resyncing up to the last closed edit log is done asynchronously. That way, the NN only needs to sync one or two edits while rolling the log.
- It seems that if a restore fails, rollEditLog() also fails even if there are healthy directories. Ideally any exceptions from dealing with the removed dirs should be ignored.
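The second point, that a failure while restoring one removed directory should not fail the whole log roll, might look roughly like the sketch below. It is a standalone illustration under stated assumptions: the Dir interface, the FakeDir test double, and restoreAll are hypothetical names invented here, and the real rollEditLog() code path is not reproduced.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: per-directory restore failures are recorded and
// skipped instead of being propagated to the caller.
class TolerantRestoreSketch {
    interface Dir {
        String name();
        void restore() throws IOException;
    }

    // Test double that either restores cleanly or always fails.
    static final class FakeDir implements Dir {
        private final String name;
        private final boolean broken;

        FakeDir(String name, boolean broken) {
            this.name = name;
            this.broken = broken;
        }

        public String name() { return name; }

        public void restore() throws IOException {
            if (broken) {
                throw new IOException("cannot restore " + name);
            }
        }
    }

    // Moves every successfully restored directory from 'removed' to 'active'.
    // Directories that throw stay in 'removed'; their names are returned so
    // the caller can log them and carry on with the healthy directories.
    static List<String> restoreAll(List<Dir> removed, List<Dir> active) {
        List<String> failed = new ArrayList<>();
        Iterator<Dir> it = removed.iterator();
        while (it.hasNext()) {
            Dir dir = it.next();
            try {
                dir.restore();
                active.add(dir);
                it.remove();
            } catch (IOException e) {
                failed.add(dir.name());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Dir> removed = new ArrayList<>();
        removed.add(new FakeDir("name1", false));
        removed.add(new FakeDir("name2", true));
        List<Dir> active = new ArrayList<>();

        List<String> failed = restoreAll(removed, active);
        System.out.println("restored=" + active.size()
            + " stillRemoved=" + removed.size() + " failed=" + failed);
    }
}
```

Returning the failed names rather than throwing keeps the decision about logging or alerting with the caller, while the roll itself proceeds on whatever directories are healthy.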
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233868#comment-13233868 ]

Nathan Roberts commented on HADOOP-4885:
----------------------------------------

Thanks for the response, Brandon. My real concern is whether or not the namenode can continue completely normal operation during a long-running restoration (several minutes for an image of 10s of GB). Or, since we haven't started the edits.new file yet, something may actually block??
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233633#comment-13233633 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

Hi Nathan,

It could be slow if the image is very large, though currently the image size is limited by the memory size.

Thanks,
Brandon
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233469#comment-13233469 ]

Nathan Roberts commented on HADOOP-4885:
----------------------------------------

Quick question on this patch. Are there any negative effects if the images being restored are very large or the restore is otherwise very slow? Just wondering because at first glance it looks like the restoration is being done after closing the current edits log and before starting edits.new.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230483#comment-13230483 ]

Eli Collins commented on HADOOP-4885:
-------------------------------------

Ah, that makes sense. Thanks for the explanation!

+1 to the latest patch
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230297#comment-13230297 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

I did manual tests before including the unit tests in the backport patch.

The new edit log is immediately created in the new storage directory, but the rolled edit log doesn't exist in the recovered storage directory. In the NN UI, the healthy storage directory and the recovered directory both have "Active" status. This is why I said it's "misleading".

The problem would be more obvious when the storage directory is an fsimage-only directory. From the NN UI/JMX, the administrator can't tell which "Active" storage directory has an fsimage inside and which doesn't. The same "Active" state here means different things at different times for different directories.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229929#comment-13229929 ]

Eli Collins commented on HADOOP-4885:
-------------------------------------

bq. The format-addStorageDir solution makes the failed directory "active" immediately even if it's not a real active state. The state is visible from the nn UI and JMX. If the checkpoint fails, the fake "Active" state can be misleading.

Not sure I'm following. When you roll the log and it restores the storage directory, it creates a new empty storage dir, the directory is added to the list of storage dirs, and a new edit log is immediately created on it (see FSEditLog#rollEditLog), i.e. it is immediately "active", right?

Have you done any testing of this patch aside from running the unit tests?
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229819#comment-13229819 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

The format-addStorageDir solution makes the failed directory "active" immediately even if it's not a real active state. The state is visible from the nn UI and JMX. If the checkpoint fails, the fake "Active" state can be misleading. The copy-over solution may do some extra work, but it puts the recovered storage directories in a real active state.

I agree that those 3 JIRA issues you mentioned should be backported too, to branch 1.02 (the backport patch here is for branch-1, not 1.02).

It's a good point about the network mount problem. :-) It's also a problem with the original patch: the "format-addStorageDir" approach creates the storage directory if it doesn't exist. However, if this storage directory is a mount point, it shouldn't be created automatically. HDFS-3095 is filed for this issue.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229792#comment-13229792 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-4885:
------------------------------------------------

Hi Eli,

Brandon addressed everything in [your earlier comment|https://issues.apache.org/jira/browse/HADOOP-4885?focusedCommentId=13228915&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13228915] last night. I did not see your further comment, so I committed the patch.

You made some good points in [your previous comment|https://issues.apache.org/jira/browse/HADOOP-4885?focusedCommentId=13229774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13229774]. As always, we could file a JIRA for them. Does that sound good?
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229774#comment-13229774 ]

Eli Collins commented on HADOOP-4885:
-------------------------------------

bq. I didn't get your second question: my patch uses addStorageDir too.

What I meant was the trunk patch does the following, which is much shorter:

{code}
sd.clearDirectory();
addStorageDir(sd);
{code}

and leverages the fact that checkpoint populates the directory. Why not use the same approach here?

- I'd test with a real NFS mount and disconnect/reconnect the network. I found some bugs that way when backporting this a while back. Also discovered HDFS-2701, HDFS-2702, HDFS-2703 via testing with a real build instead of the unit tests.
- Nit: s/"may should be mounted"/"may be a network mount"/
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229496#comment-13229496 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-4885:
------------------------------------------------

+1 The patch also looks good to me.

I will commit this in HDFS-3075.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229457#comment-13229457 ]

Jitendra Nath Pandey commented on HADOOP-4885:
----------------------------------------------

+1 the patch for branch-1 looks good to me.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229452#comment-13229452 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

The new patch has the restart test. Thanks!
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229442#comment-13229442 ]

Jitendra Nath Pandey commented on HADOOP-4885:
----------------------------------------------

The patch looks good to me. It would be great to add a few lines to test that the namenode can restart with just a restored edits/image directory.
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229004#comment-13229004 ]

Brandon Li commented on HADOOP-4885:
------------------------------------

Hi Eli,

Thanks for the comments!

The code base in branch-1 is slightly different from 0.21. The addition of directories to removedStorageDirs from the original patch is already in branch-1.

I didn't get your second question: my patch uses addStorageDir too.

The same test with minor modifications (e.g., comparing md5 instead of length for edits files) is included in the backport patch.

Thanks.
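To illustrate why comparing MD5 digests is a stronger check than comparing lengths, here is a small standalone sketch. It is not the test from the backport patch: the class and method names are hypothetical, and it uses only the JDK's java.security.MessageDigest and java.nio.file.Files. The file variant reads the whole file into memory, which is acceptable for a sketch but may not suit very large edits files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not taken from the patch under discussion.
class Md5CompareSketch {
    // Returns the 16-byte MD5 digest of the given bytes.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True when both inputs have the same MD5 digest.
    static boolean sameDigest(byte[] a, byte[] b) {
        return MessageDigest.isEqual(md5(a), md5(b));
    }

    // File variant: reads each file fully, then compares digests.
    static boolean sameDigest(Path a, Path b) {
        try {
            return sameDigest(Files.readAllBytes(a), Files.readAllBytes(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path first = Files.createTempFile("edits", ".a");
        Path second = Files.createTempFile("edits", ".b");
        try {
            // Same length, different content: a length check cannot tell
            // these apart, a digest check can.
            Files.write(first, new byte[] {1, 2, 3});
            Files.write(second, new byte[] {3, 2, 1});
            System.out.println("same length: " + (Files.size(first) == Files.size(second)));
            System.out.println("same md5:    " + sameDigest(first, second));
        } finally {
            Files.deleteIfExists(first);
            Files.deleteIfExists(second);
        }
    }
}
```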
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228915#comment-13228915 ]

Eli Collins commented on HADOOP-4885:
-------------------------------------

- In the trunk patch we're also adding directories to removedStorageDirs; it seems like we'll need those additions here, right?
- The trunk version uses addStorageDir; any reason that it's done differently here?
- Testing?
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228136#comment-13228136 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-4885:
------------------------------------------------

The getRestoreRemovedDirs() below should be removed.

{code}
+  boolean getRestoreRemovedDirs() {
+    return this.restoreRemovedDirs;
+  }
{code}
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
    [ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13023028#comment-13023028 ]

Hudson commented on HADOOP-4885:
--------------------------------

Integrated in Hadoop-Hdfs-22-branch #35 (See [https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/])
[jira] [Commented] (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
[ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022466#comment-13022466 ] Hudson commented on HADOOP-4885: Integrated in Hadoop-Hdfs-trunk #643 (See [https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/]) > Try to restore failed replicas of Name Node storage (at checkpoint time) > > > Key: HADOOP-4885 > URL: https://issues.apache.org/jira/browse/HADOOP-4885 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Fix For: 0.21.0 > > Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch, > HADOOP-4885-3.patch, HADOOP-4885.patch, HADOOP-4885.patch
[jira] Commented: (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
[ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992596#comment-12992596 ] Hudson commented on HADOOP-4885: Integrated in Hadoop-Hdfs-trunk-Commit #539 (See [https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/539/]) > Try to restore failed replicas of Name Node storage (at checkpoint time) > > > Key: HADOOP-4885 > URL: https://issues.apache.org/jira/browse/HADOOP-4885 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Fix For: 0.21.0 > > Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch, > HADOOP-4885-3.patch, HADOOP-4885.patch, HADOOP-4885.patch
[jira] Commented: (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
[ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991683#comment-12991683 ] Boris Shkolnik commented on HADOOP-4885: Fix submitted in HDFS-1602. > Try to restore failed replicas of Name Node storage (at checkpoint time) > > > Key: HADOOP-4885 > URL: https://issues.apache.org/jira/browse/HADOOP-4885 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Fix For: 0.21.0 > > Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch, > HADOOP-4885-3.patch, HADOOP-4885.patch, HADOOP-4885.patch
[jira] Commented: (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
[ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988363#action_12988363 ] Konstantin Boudnik commented on HADOOP-4885: After a bit more investigation I noticed (duh!) the new config parameter {{dfs.name.dir.restore}}, which triggers restoration of removed storage. The fsimage files on both volumes (NFS-mounted and non-NFS), as well as the secondary NN's checkpoints, have the same md5sums. So it seems (as Hairong pointed out elsewhere) that without HDFS-903 this feature kind of works. > Try to restore failed replicas of Name Node storage (at checkpoint time) > > > Key: HADOOP-4885 > URL: https://issues.apache.org/jira/browse/HADOOP-4885 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Fix For: 0.21.0 > > Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch, > HADOOP-4885-3.patch, HADOOP-4885.patch, HADOOP-4885.patch
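For readers who want to repeat the md5sum comparison mentioned in the comment above, here is a minimal sketch in plain Java. It is not part of the patch and does not use any Hadoop API: the class name {{FsImageChecksums}} and its methods are hypothetical, and the files to compare (for example the fsimage copy in each storage directory plus the secondary NN's checkpoint) are assumed to be passed on the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: compares MD5 digests of files.
class FsImageChecksums {

    /** Returns the MD5 digest of the given file as a lowercase hex string. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Reading through the stream feeds the bytes into the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true if every file in the list has the same MD5 digest. */
    static boolean allMatch(List<Path> files) throws IOException, NoSuchAlgorithmException {
        if (files.isEmpty()) {
            return true;
        }
        String first = md5Hex(files.get(0));
        for (Path p : files.subList(1, files.size())) {
            if (!first.equals(md5Hex(p))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: FsImageChecksums <file> <file> [<file> ...]");
            return;
        }
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            Path p = Paths.get(arg);
            files.add(p);
            System.out.println(md5Hex(p) + "  " + p);
        }
        System.out.println(allMatch(files) ? "all copies match" : "MISMATCH");
    }
}
```

Matching digests only show that the copies are byte-identical at the moment of the check; they say nothing about whether the restored directory keeps receiving later updates.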
[jira] Commented: (HADOOP-4885) Try to restore failed replicas of Name Node storage (at checkpoint time)
[ https://issues.apache.org/jira/browse/HADOOP-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987912#action_12987912 ] Konstantin Boudnik commented on HADOOP-4885: This feature does not seem to be working completely: a storage volume that was once removed is not added back as expected (see HDFS-1496). I'd suggest disabling this new feature for now. > Try to restore failed replicas of Name Node storage (at checkpoint time) > > > Key: HADOOP-4885 > URL: https://issues.apache.org/jira/browse/HADOOP-4885 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Boris Shkolnik >Assignee: Boris Shkolnik > Fix For: 0.21.0 > > Attachments: HADOOP-4885-1.patch, HADOOP-4885-3.patch, > HADOOP-4885-3.patch, HADOOP-4885.patch, HADOOP-4885.patch