[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209811#comment-13209811 ]

Todd Lipcon commented on HDFS-2781:
-----------------------------------

We can move this to be a non-HA ticket right?

Add client protocol and DFSadmin for command to restore failed storage
----------------------------------------------------------------------

Key: HDFS-2781
URL: https://issues.apache.org/jira/browse/HDFS-2781
Project: Hadoop HDFS
Issue Type: Sub-task
Components: ha
Affects Versions: HA branch (HDFS-1623)
Reporter: Eli Collins
Assignee: Eli Collins

Per HDFS-2769, it's important that an admin be able to ask the NN to try to restore failed storage since we may drop into SM until the shared edits dir is restored (w/o having to wait for the next checkpoint). There's currently an API (and usage in DFSAdmin) to flip the flag indicating whether the NN should try to restore failed storage but not that it should actually attempt to do so. This jira is to add one. This is useful outside HA but doing as an HDFS-1623 sub-task since it's motivated by HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204812#comment-13204812 ]

Bikas Saha commented on HDFS-2781:
----------------------------------

Is this JIRA still valid? If I understand right, the premise was that the NN would fall into standby mode when the shared edits dir fails. After the shared edits dir is restored, the admin could use the command proposed in this JIRA to refresh the dirs.

But current policy is for the NN to shut down on shared edits dir failure. When the dir is brought back online, the NN will pick it up on being restarted. When the NN moves to active or standby state, the FSEditLog.journalSet is refreshed and will refresh the storage dirs upon the next log roll (if the restore flag is set).

Perhaps we are better off restoring directories as part of moving from active/standby states (when we re-init the JournalSet) instead of as an explicit command. Seems more natural and one less thing to do for the admin.
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204816#comment-13204816 ]

Bikas Saha commented on HDFS-2781:
----------------------------------

Or perhaps storage dirs could be restored when the dfsAdmin -restoreFailedStorage command sets the option to true (as part of the command). This would handle the non-HA cases.
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204918#comment-13204918 ]

Eli Collins commented on HDFS-2781:
-----------------------------------

If we change the behavior such that the NN drops into SM if it can't access the shared edits dir (we decided that's the desired behavior, right?) then we'll still need this. We could make restoreFailedStorage (which flips the flag) also have the side effect of trying to restore shared storage, though I'm not sure that's user friendly, e.g. if storage restoration is already enabled you might not think that you should try to enable it to get this side effect.
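To make the flag-versus-action distinction in this comment concrete, here is a minimal Java sketch. It is an illustration only, not NameNode code: `StorageRestorer`, `setRestoreEnabled`, `onLogRoll`, and `restoreNow` are hypothetical names, and the health probe passed to the constructor is an assumption. It shows that flipping the flag changes nothing until the next log roll, whereas an explicit action attempts restoration immediately.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of the flag-versus-action distinction; not Hadoop code. */
class StorageRestorer {
    private final Set<String> failedDirs = new LinkedHashSet<>();
    private final Predicate<String> isUsable; // assumption: caller supplies the health probe
    private boolean restoreEnabled;

    StorageRestorer(Predicate<String> isUsable) {
        this.isUsable = isUsable;
    }

    /** Records a directory as failed. */
    void markFailed(String dir) {
        failedDirs.add(dir);
    }

    /** Flips the flag only; no restoration is attempted here. */
    void setRestoreEnabled(boolean enabled) {
        restoreEnabled = enabled;
    }

    /** Models the periodic path: restoration is attempted only if the flag is set. */
    List<String> onLogRoll() {
        return restoreEnabled ? restoreNow() : new ArrayList<>();
    }

    /** Models the proposed explicit action: try to restore right away. */
    List<String> restoreNow() {
        List<String> restored = new ArrayList<>();
        failedDirs.removeIf(dir -> {
            if (isUsable.test(dir)) {
                restored.add(dir);
                return true; // usable again, so drop it from the failed set
            }
            return false;    // still unusable, keep it marked as failed
        });
        return restored;
    }

    /** Directories currently considered failed. */
    Set<String> failedDirs() {
        return failedDirs;
    }
}
```

With a separate `restoreNow`-style entry point, an admin who already has restoration enabled does not need to re-enable it just to trigger an attempt, which is the usability concern raised above.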
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204931#comment-13204931 ]

Eli Collins commented on HDFS-2781:
-----------------------------------

Can we define this away? E.g., if a standby loses connection to shared storage it should probably shut down gracefully rather than keep running, in which case we only restore failed storage on an active, and if the active has lost its connection to shared storage it will be in SM (or not running), in which case restoring shared storage should cause it to come back to life.
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205052#comment-13205052 ]

Bikas Saha commented on HDFS-2781:
----------------------------------

bq. friendly, eg if storage restoration is already enabled you might not think that you should try to enable it to get this side effect.

In that case, rolling logs will restore the directories just as it does now. HA imposes stricter requirements than what works today, so we might need to do special stuff for HA only, such as trying to restore failed directories in the process of transitioning to active (maybe also standby).

From what I read of the code, the standby doesn't seem to bother with setting failed directories since its operations are all read only. So there might be no need for the standby to shut down gracefully.

If the active moves to SM because of a bad required directory then it should restore all required directories when it goes out of safe mode, or else complain and stay in safe mode. All this should happen after the admin has done the necessary pre-requisites and issued a -safeMode leave command.

bq. There's some interaction with fencing, here, though... one likely reason that the NN will lose touch with the shared storage is that another node has requested that the NAS device fence the host. Then, after the failover, the administrator might unfence the host from the NAS, and we don't want the NN to automatically come back to life at this point.

Does the NN come back out of safemode automatically or only after an admin command?
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201591#comment-13201591 ]

Todd Lipcon commented on HDFS-2781:
-----------------------------------

Currently, if the shared edits goes away, we don't drop into safe mode, but rather abort the NN completely. So we probably need a different task (non-HA-specific) to allow the NN to drop to safemode instead of aborting.
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201830#comment-13201830 ]

Bikas Saha commented on HDFS-2781:
----------------------------------

I renamed the shared edits dir. The following happened:
1) Active moved to safe mode. So it seems the above observation has already been fixed.
2) Standby crashed with NPE [HDFS-2905|https://issues.apache.org/jira/browse/HDFS-2905]

Also, when the shared edits dir is brought back online (by renaming it back) and the active is moved out of safe mode, it starts re-using that directory when the standby rolls the edits.
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201834#comment-13201834 ]

Bikas Saha commented on HDFS-2781:
----------------------------------

Actually the active goes into safe mode on my machine because it thinks there is not enough space.

12/02/06 15:47:19 WARN namenode.FSNamesystem: NameNode low on available disk space. Already in safe mode.
12/02/06 15:47:19 INFO hdfs.StateChange: STATE* Safe mode is ON. Resources are low on NN. Safe mode must be turned off manually.
12/02/06 15:47:24 WARN namenode.NameNodeResourceChecker: Space available on volume '/dev/disk0s2' is 0, which is below the configured reserved amount 104857600

So it might be that if the space constraint is removed then it might abort differently.
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201937#comment-13201937 ]

Bikas Saha commented on HDFS-2781:
----------------------------------

When I break on that code line, the change to safe mode is being triggered by the NameNodeResourceChecker returning false for resources available. So DF returning 0 is what is causing the safe mode transition to occur.

What do you mean by parse error? Are you suggesting that the check for available space be replaced by something else when the available space == 0, something that will actually check whether the directory exists or not?
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201945#comment-13201945 ]

Todd Lipcon commented on HDFS-2781:
-----------------------------------

I think if you were continuously writing to the active NN when the disk went offline, you'd see it abort. Doing a deletion of the directory allows the logs to still fsync (since the vnode still exists in memory despite not having any file system links to it anymore). On the next roll you'd probably see it abort with a FATAL message, rather than go into safe mode, so long as the roll happened before the periodic resource check interval.
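The fsync behaviour described in this comment can be observed outside HDFS. The Java sketch below is an illustration only, not NameNode code: the class name and the file name `edits_inprogress_demo` are made up, and it assumes a POSIX-like file system where an open file can be unlinked (on Windows the delete is expected to fail instead). It opens a file, removes both the file and its directory, and then writes and forces data through the still-open channel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo: writes to an open file keep working after it is unlinked. */
class UnlinkedFsyncDemo {

    /**
     * Returns true if a write and force() succeeded after the file and its
     * directory were deleted, and the directory is indeed gone afterwards.
     */
    static boolean writeAfterUnlink() {
        try {
            Path dir = Files.createTempDirectory("edits-demo");
            Path log = dir.resolve("edits_inprogress_demo");
            try (FileChannel ch = FileChannel.open(log,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
                Files.delete(log); // remove the only directory entry for the open file
                Files.delete(dir); // the now-empty directory can be removed too
                // The channel still refers to the open file, so these calls do not fail.
                ch.write(ByteBuffer.wrap("txn".getBytes(StandardCharsets.UTF_8)));
                ch.force(true);
            }
            return Files.notExists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("write + fsync after unlink succeeded: " + writeAfterUnlink());
    }
}
```

This is why a test that merely renames or deletes the edits directory may not surface a write failure until something tries to create a new file in it, such as the next log roll.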
[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage
[ https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201944#comment-13201944 ]

Aaron T. Myers commented on HDFS-2781:
--------------------------------------

bq. What do you mean by parse error?

Sorry, calling it a parse error isn't quite accurate. I've just always found it a little suspect that DF returns 0 for space available if the given path doesn't exist. It should probably throw an error or something along those lines.

Also, note that the NameNodeResourceChecker can't really be considered helpful in this case, since it runs asynchronously. I.e., the NN might continue running outside SM for a while (a minute by default, I think) before the NNResourceChecker runs and moves the NN into SM.
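One way to read the suggestion above is that a resource check should distinguish "directory is missing" from "directory is low on space" instead of folding both into an available-space value of 0. The Java sketch below is a hedged illustration of that idea, not the NameNodeResourceChecker or DF implementation: `VolumeCheck`, `Status`, and `check` are hypothetical names. It relies on `java.io.File.getUsableSpace()`, which has a similar ambiguity (it is documented to return 0L when the path does not name a partition), so the existence test is made first.

```java
import java.io.File;

/** Hypothetical check that separates a missing directory from low disk space. */
class VolumeCheck {

    /** Possible outcomes; MISSING is reported instead of a misleading "0 bytes free". */
    enum Status { OK, LOW_SPACE, MISSING }

    // 100 MB, matching the reserved amount shown in the log quoted earlier in this thread.
    static final long RESERVED_BYTES = 104857600L;

    /**
     * Classifies a storage directory. The existence test comes first because
     * getUsableSpace() alone cannot tell a full volume from a nonexistent path.
     */
    static Status check(File dir, long reservedBytes) {
        if (!dir.isDirectory()) {
            return Status.MISSING;
        }
        return dir.getUsableSpace() >= reservedBytes ? Status.OK : Status.LOW_SPACE;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(dir + ": " + check(dir, RESERVED_BYTES));
    }
}
```

A caller could then react differently to the two conditions, for example treating `MISSING` as a storage failure rather than as a reason to enter safe mode for low resources. This does not address the asynchrony concern: a periodic check still leaves a window before the condition is noticed.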