[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-08 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451386#comment-13451386
 ] 

Harsh J commented on HDFS-3886:
---

s/caveat/issue :|

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-08 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451385#comment-13451385
 ] 

Harsh J commented on HDFS-3886:
---

Yeah, sorry if my 'Shutdown requests' in title was too ambiguous. I am mostly 
talking of the NN shutdown itself. I'd thought the simple init.d framework 
itself had a kill -9'er (like a supervisor would) but I was wrong.

bq. This would indeed make an RPC to the NN to enter safemode, perform a save 
namespace, and then shut itself down.

So this is currently a simple:
{code}
sudo -u hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs hdfs dfsadmin -saveNamespace
{code}

However, my only caveat with this is that client logs for clients still active, 
will start showing SafeModeExceptions rather than Connection Refused/etc.. Is 
that fine to have, when doing a clean-stop? Would it be better if we shut off 
the client RPC and then issued a namespace save?

I've also not thought of this in HA terms yet.

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-04 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447866#comment-13447866
 ] 

Aaron T. Myers commented on HDFS-3886:
--

bq. I don't think you could easily do much with init.d as that is initiated by 
the OS when it's doing a shutdown and it may be unrolling large parts of the 
system: fast shutdowns are always appreciated before the monitoring layers 
escalate.

I wasn't suggesting we modify the existing behavior of `/etc/init.d/* stop', 
but rather that we add an extra, optional command along the lines of 
`/etc/init.d/* clean-stop'. This would indeed make an RPC to the NN to enter 
safemode, perform a save namespace, and then shut itself down. This wouldn't 
affect the behavior of an OS shutdown, since that would still just use the 
'stop' command. Does that make sense?

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446916#comment-13446916
 ] 

Steve Loughran commented on HDFS-3886:
--

I don't think you could easily do much with init.d as that is initiated by the 
OS when it's doing a shutdown and it may be unrolling large parts of the 
system: fast shutdowns are always appreciated before the monitoring layers 
escalate. Same for Linux clustering resource agents: the slower the shutdown, 
the longer it takes to migrate a service to a new node in the HA cluster.

Perhaps a way could be provided over RPC to tell the NN to block & checkpoint; 
dfsAdmin could be the gateway to this. If you could do this without even 
stopping the process, you have something you can test more easily and a better 
ops experience: you just issue a {{hadoop dfsadmin --checkpoint}} command, your 
NN goes into safe mode briefly, the logs are sorted out and things continue. 

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-01 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446883#comment-13446883
 ] 

Aaron T. Myers commented on HDFS-3886:
--

Interesting idea. Perhaps we could add a "clean shutdown" dfsadmin command, and 
then add an extra action to the init.d script which a cautious admin can choose 
to run? That way we preserve the shutdown behavior that Steve is concerned 
about, but give the admin an option to have guaranteed-good metadata? Just 
thinking out loud.

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-01 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446659#comment-13446659
 ] 

Harsh J commented on HDFS-3886:
---

Thanks much Steve. Perhaps instead can we have the shutdown scripts call a 
savenamespace pre signal? That way we sorta achieve the same?

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3886) Shutdown requests can possibly check for checkpoint issues (corrupted edits) and save a good namespace copy before closing down?

2012-09-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446653#comment-13446653
 ] 

Steve Loughran commented on HDFS-3886:
--

currently a kill -2 is the event sent from init.d to trigger a managed 
shutdown, but it needs to complete within a bounded period, otherwise robust 
init.d/ linux HA scripts will escalate to a -9; this is because they need to 
reliably shut down the system.

Any change that reverts service scripts from having timeout+escalation would be 
counterproductive from a service management perspective.

Now, if there were another signal handler that triggered lock up and system 
save, that could be good -but that would lie outside init.d land

> Shutdown requests can possibly check for checkpoint issues (corrupted edits) 
> and save a good namespace copy before closing down?
> 
>
> Key: HDFS-3886
> URL: https://issues.apache.org/jira/browse/HDFS-3886
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> HDFS-3878 sorta gives me this idea. Aside of having a method to download it 
> to a different location, we can also lock up the namesystem (or deactivate 
> the client rpc server) and save the namesystem before we complete up the 
> shutdown.
> The init.d/shutdown scripts would have to work with this somehow though, to 
> not kill -9 it when in-process. Also, the new image may be stored in a 
> shutdown.chkpt directory, to not interfere in the regular dirs, but still 
> allow easier recovery.
> Obviously this will still not work if all directories are broken. So maybe we 
> could have some configs to tackle that as well?
> I haven't thought this through, so let me know what part is wrong to do :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira