[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-16 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209811#comment-13209811
 ] 

Todd Lipcon commented on HDFS-2781:
---

We can move this to be a non-HA ticket, right?

 Add client protocol and DFSadmin for command to restore failed storage
 --

                 Key: HDFS-2781
                 URL: https://issues.apache.org/jira/browse/HDFS-2781
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: ha
    Affects Versions: HA branch (HDFS-1623)
            Reporter: Eli Collins
            Assignee: Eli Collins

 Per HDFS-2769, it's important that an admin be able to ask the NN to try to 
 restore failed storage, since we may drop into safe mode (SM) until the shared 
 edits dir is restored (w/o having to wait for the next checkpoint). There's 
 currently an API (and usage in DFSAdmin) to flip the flag indicating whether 
 the NN should try to restore failed storage, but none to tell it to actually 
 attempt to do so. This jira is to add one. This is useful outside HA, but I'm 
 doing it as an HDFS-1623 sub-task since it's motivated by HA.
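
 To make the distinction in the description concrete, here is a minimal Java 
 sketch of "flip the flag" versus "attempt the restore now". It is only an 
 illustration: StorageRestoreSketch and all of its methods are hypothetical 
 names invented for this example, not the NameNode's real classes or the 
 ClientProtocol API, and the isUsable probe stands in for whatever check the 
 NN would actually perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the NN's storage bookkeeping; not the real Hadoop classes.
class StorageRestoreSketch {
    private boolean restoreEnabled = false;
    private final List<String> activeDirs = new ArrayList<>();
    private final List<String> failedDirs = new ArrayList<>();

    void addDir(String dir) { activeDirs.add(dir); }

    // Move a directory from the active list to the failed list.
    void markFailed(String dir) {
        if (activeDirs.remove(dir)) {
            failedDirs.add(dir);
        }
    }

    // What exists today, per the description: this only flips the flag.
    void setRestoreFailedStorage(boolean enabled) { restoreEnabled = enabled; }

    boolean isRestoreEnabled() { return restoreEnabled; }

    // What the ticket asks for: try the restore right away instead of waiting
    // for the next checkpoint. Returns how many directories came back.
    int attemptRestoreNow(Predicate<String> isUsable) {
        List<String> restored = new ArrayList<>();
        for (String dir : failedDirs) {
            if (isUsable.test(dir)) {
                restored.add(dir);
            }
        }
        failedDirs.removeAll(restored);
        activeDirs.addAll(restored);
        return restored.size();
    }

    int activeCount() { return activeDirs.size(); }

    int failedCount() { return failedDirs.size(); }
}
```

 In this model, calling setRestoreFailedStorage(true) leaves the failed 
 directory where it is; only attemptRestoreNow moves it back.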

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-09 Thread Bikas Saha (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204812#comment-13204812
 ] 

Bikas Saha commented on HDFS-2781:
--

Is this JIRA still valid? If I understand right, the premise was that the NN 
would fall into standby mode when the shared edits dir fails. After the shared 
edits dir is restored, the admin could use the command proposed in this JIRA to 
refresh the dirs.
But the current policy is for the NN to shut down on shared edits dir failure. 
When the dir is brought back online, the NN will pick it up on being restarted.
When the NN moves to the active or standby state, FSEditLog.journalSet is 
refreshed and will refresh the storage dirs upon the next log roll (if the 
restore flag is set). Perhaps we are better off restoring directories as part 
of transitioning between active/standby states (when we re-init the JournalSet) 
instead of as an explicit command. That seems more natural and is one less 
thing for the admin to do.






[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-09 Thread Bikas Saha (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204816#comment-13204816
 ] 

Bikas Saha commented on HDFS-2781:
--

Or perhaps storage dirs could be restored when the dfsadmin 
-restoreFailedStorage command sets the option to true (as part of the command).
This would handle the non-HA cases.





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-09 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204918#comment-13204918
 ] 

Eli Collins commented on HDFS-2781:
---

If we change the behavior such that the NN drops into SM if it can't access the 
shared edits dir (we decided that's the desired behavior, right?) then we'll 
still need this. We could make restoreFailedStorage (which flips the flag) also 
have the side effect of trying to restore shared storage, though I'm not sure 
that's user friendly, e.g. if storage restoration is already enabled you might 
not think that you should try to enable it to get this side effect.





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-09 Thread Eli Collins (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204931#comment-13204931
 ] 

Eli Collins commented on HDFS-2781:
---

Can we define this away? E.g. if a standby loses its connection to shared 
storage it should probably shut down gracefully rather than keep running, in 
which case we only restore failed storage on an active; and if the active has 
lost its connection to shared storage it will be in SM (or not running), in 
which case restoring shared storage should cause it to come back to life.





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-09 Thread Bikas Saha (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205052#comment-13205052
 ] 

Bikas Saha commented on HDFS-2781:
--

bq. friendly, eg if storage restoration is already enabled you might not think 
that you should try to enable it to get this side effect.
In that case, rolling the logs will restore the directories, just as it does 
today.
HA imposes stricter requirements than what works today, so we might need to do 
special handling for HA only. That might mean trying to restore failed 
directories in the process of transitioning to active (maybe also standby).
From what I read of the code, the standby doesn't seem to bother with setting 
failed directories since its operations are all read-only. So there might be 
no need for the standby to shut down gracefully.
If the active moves to SM because of a bad required directory, then it should 
restore all required directories when it leaves safe mode, or else complain 
and stay in safe mode. All this should happen after the admin has completed the 
necessary prerequisites and issued a -safemode leave command.
bq. There's some interaction with fencing, here, though... one likely reason 
that the NN will lose touch with the shared storage is that another node has 
requested that the NAS device fence the host. Then, after the failover, the 
administrator might unfence the host from the NAS, and we don't want the NN to 
automatically come back to life at this point.
Does the NN come back out of safe mode automatically, or only after an admin 
command?
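
The suggestion above, that the active should restore every required directory 
when asked to leave safe mode and otherwise complain and stay put, could be 
sketched roughly as follows. This is a toy model under my own assumptions: 
SafeModeGateSketch, its method names, and the tryRestore probe are all 
hypothetical and do not correspond to FSNamesystem or any other Hadoop class.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of the proposal; names are illustrative, not Hadoop APIs.
class SafeModeGateSketch {
    private final Set<String> failedRequiredDirs = new LinkedHashSet<>();
    private boolean inSafeMode = false;

    // A bad required directory moves the (modelled) active NN into safe mode.
    void reportRequiredDirFailure(String dir) {
        failedRequiredDirs.add(dir);
        inSafeMode = true;
    }

    // Models an admin-issued "leave safe mode" request: first try to restore
    // each failed required directory, then leave only if none remain failed.
    boolean leaveSafeMode(Predicate<String> tryRestore) {
        failedRequiredDirs.removeIf(tryRestore);
        if (!failedRequiredDirs.isEmpty()) {
            System.err.println("Staying in safe mode, could not restore: "
                    + failedRequiredDirs);
            return false;
        }
        inSafeMode = false;
        return true;
    }

    boolean isInSafeMode() { return inSafeMode; }
}
```

Because the restore only runs inside the admin-triggered leaveSafeMode call, 
this model never comes back to life on its own, which seems relevant to the 
fencing concern quoted above.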





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-06 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201591#comment-13201591
 ] 

Todd Lipcon commented on HDFS-2781:
---

Currently, if the shared edits dir goes away, we don't drop into safe mode but 
rather abort the NN completely. So we probably need a different 
(non-HA-specific) task to allow the NN to drop into safe mode instead of 
aborting.





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-06 Thread Bikas Saha (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201830#comment-13201830
 ] 

Bikas Saha commented on HDFS-2781:
--

I renamed the shared edits dir. The following happened:
1) The active moved to safe mode, so it seems the above observation has 
already been fixed.
2) The standby crashed with an NPE 
([HDFS-2905|https://issues.apache.org/jira/browse/HDFS-2905]).

Also, when the shared edits dir is brought back online (by renaming it back) 
and the active is moved out of safe mode, it starts re-using that directory 
when the standby rolls the edits.






[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-06 Thread Bikas Saha (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201834#comment-13201834
 ] 

Bikas Saha commented on HDFS-2781:
--

Actually, the active goes into safe mode on my machine because it thinks there 
is not enough space.

12/02/06 15:47:19 WARN namenode.FSNamesystem: NameNode low on available disk space. Already in safe mode.
12/02/06 15:47:19 INFO hdfs.StateChange: STATE* Safe mode is ON. Resources are low on NN. Safe mode must be turned off manually.
12/02/06 15:47:24 WARN namenode.NameNodeResourceChecker: Space available on volume '/dev/disk0s2' is 0, which is below the configured reserved amount 104857600

So if the space constraint were removed, it might well abort instead.





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-06 Thread Bikas Saha (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201937#comment-13201937
 ] 

Bikas Saha commented on HDFS-2781:
--

When I break on that code line, the change to safe mode is triggered by the 
NameNodeResourceChecker returning false for resources available.
So DF returning 0 is what causes the safe mode transition.

What do you mean by parse error? Are you suggesting that the check for 
available space be replaced by something else when the available space == 0, 
something that will actually check whether the directory exists?
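
For what it's worth, a check that tells "directory is gone" apart from "disk 
is full" might look something like the sketch below. It is plain JDK code, not 
Hadoop's DF or NameNodeResourceChecker; VolumeCheckSketch, its Status enum, 
and the check method are names made up for this example. Note that 
java.io.File.getUsableSpace() is documented to return 0 when the path does not 
name a partition, which is the same ambiguity being discussed here, so the 
existence test has to come first.

```java
import java.io.File;

// Hypothetical helper; not part of Hadoop.
class VolumeCheckSketch {
    enum Status { OK, LOW_SPACE, MISSING }

    // Classify a volume: report a missing directory explicitly instead of
    // letting it show up as "0 bytes available".
    static Status check(File dir, long reservedBytes) {
        if (!dir.isDirectory()) {
            return Status.MISSING;
        }
        long usable = dir.getUsableSpace();
        return usable < reservedBytes ? Status.LOW_SPACE : Status.OK;
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        // 104857600 is the reserved amount that appears in the log output earlier in this thread.
        System.out.println(tmp + " -> " + check(tmp, 104857600L));
        File gone = new File(tmp, "no-such-shared-edits-dir");
        System.out.println(gone + " -> " + check(gone, 104857600L));
    }
}
```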







[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-06 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201945#comment-13201945
 ] 

Todd Lipcon commented on HDFS-2781:
---

I think if you were continuously writing to the active NN when the disk went 
offline, you'd see it abort. Deleting the directory still allows the logs to 
fsync (since the vnode still exists in memory despite no longer having any 
file system links to it). On the next roll you'd probably see it abort with a 
FATAL message, rather than go into safe mode, so long as the roll happened 
before the next periodic resource check.
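
The fsync behaviour described above is easy to reproduce outside the NN. The 
following stand-alone Java sketch (UnlinkedFsyncSketch is a made-up name, and 
it uses a temp file rather than an edits directory) opens a file, removes its 
directory entry, and then writes and syncs through the still-open descriptor. 
It assumes a POSIX-style file system; on platforms that refuse to delete open 
files the method simply returns false.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop.
class UnlinkedFsyncSketch {
    // Returns true if a write plus fsync succeeds after the file was unlinked.
    static boolean syncAfterUnlink() throws IOException {
        File f = File.createTempFile("edits-sketch", ".log");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("txn-1\n".getBytes(StandardCharsets.UTF_8));
            if (!f.delete()) {
                // Some platforms will not delete a file that is still open.
                return false;
            }
            // The directory entry is gone, but the open descriptor is still valid.
            out.write("txn-2\n".getBytes(StandardCharsets.UTF_8));
            out.getFD().sync();
            return !f.exists();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("sync after unlink succeeded: " + syncAfterUnlink());
    }
}
```

A writer like this gets no error until it next has to open a new file under 
the missing path, which matches the point about the abort showing up at the 
next roll.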





[jira] [Commented] (HDFS-2781) Add client protocol and DFSadmin for command to restore failed storage

2012-02-06 Thread Aaron T. Myers (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201944#comment-13201944
 ] 

Aaron T. Myers commented on HDFS-2781:
--

bq. What do you mean by parse error?

Sorry, calling it a parse error isn't quite accurate. I've just always found 
it a little suspect that DF returns 0 for space available if the given path 
doesn't exist. It should probably throw an error or something along those lines.

Also, note that the NameNodeResourceChecker can't really be considered helpful 
in this case, since it runs asynchronously, i.e. the NN might keep running 
outside SM for a while (a minute by default, I think) before the 
NNResourceChecker runs and moves the NN into SM.
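
To put a number on that window, here is a small back-of-the-envelope sketch in 
Java. It assumes checks fire at fixed multiples of the interval, starting at 
time 0, and uses 60000 ms only because of the "a minute by default, I think" 
guess above. CheckLatencySketch and its methods are hypothetical names, not 
anything in the NN.

```java
// Hypothetical arithmetic helper; not part of Hadoop.
class CheckLatencySketch {
    // Time of the first periodic check at or after the failure, assuming
    // checks run at 0, intervalMs, 2 * intervalMs, and so on.
    static long detectionTimeMs(long failureMs, long intervalMs) {
        long checksElapsed = (failureMs + intervalMs - 1) / intervalMs; // ceiling division
        return checksElapsed * intervalMs;
    }

    // How long the failure goes unnoticed by the periodic checker.
    static long undetectedWindowMs(long failureMs, long intervalMs) {
        return detectionTimeMs(failureMs, intervalMs) - failureMs;
    }

    public static void main(String[] args) {
        long intervalMs = 60_000L; // assumed interval, see the note above
        System.out.println(undetectedWindowMs(1L, intervalMs));
        System.out.println(undetectedWindowMs(60_000L, intervalMs));
    }
}
```

Under these assumptions a failure just after a check goes unnoticed for almost 
the whole interval, so any edit log roll inside that window hits the missing 
directory before the checker can move the NN into SM.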
