[ 
https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508264
 ] 

Doug Cutting commented on HADOOP-1486:
--------------------------------------

> whether to have a monitoring daemon that restarts namenode automatically

It seems safe to restart the namenode in this case.  I'd simply add a loop to 
NameNode.main() that creates and starts a new NameNode when the existing 
namenode exits unexpectedly.  We should only restart if it's stopping due to an 
error, and not due to an explicit call to stop().  So perhaps NameNode#join() 
could return a boolean indicating whether it's exiting normally or should be 
restarted, and the catch in the ReplicationMonitor should call a NameNode 
method to trigger that kind of exit.  Does this sound workable?
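
A minimal sketch of that loop, to make the proposal concrete. Everything here is an assumption for illustration: Server is a stand-in for NameNode, and stopForRestart(), startMonitor() and runWithRestarts() are hypothetical names, not existing Hadoop methods. The monitor's work returning normally stands in for an explicit stop().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class RestartLoopSketch {

    /** Hypothetical stand-in for the NameNode; not the real Hadoop class. */
    static class Server {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile boolean restartRequested = false;

        /** Explicit, normal shutdown: join() will return false. */
        void stop() {
            done.countDown();
        }

        /** Error shutdown, called from a worker's catch block: join() will return true. */
        void stopForRestart() {
            restartRequested = true;
            done.countDown();
        }

        /** Blocks until the server exits; returns true if it should be restarted. */
        boolean join() throws InterruptedException {
            done.await();
            return restartRequested;
        }

        /** Runs the work on a monitor thread and traps any Throwable it throws. */
        void startMonitor(Runnable work) {
            Thread t = new Thread(() -> {
                try {
                    work.run();
                    stop();            // sketch only: finished work counts as a normal stop
                } catch (Throwable e) {
                    stopForRestart();  // unexpected failure: ask the main loop to restart
                }
            }, "monitor");
            t.setDaemon(true);
            t.start();
        }
    }

    /** The loop that main() would run; returns how many servers were started. */
    static int runWithRestarts(Runnable monitorWork) throws InterruptedException {
        int starts = 0;
        boolean restart;
        do {
            Server server = new Server();
            starts++;
            server.startMonitor(monitorWork);
            restart = server.join();
        } while (restart);
        return starts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a monitor that fails twice and then runs cleanly.
        AtomicInteger failuresLeft = new AtomicInteger(2);
        int starts = runWithRestarts(() -> {
            if (failuresLeft.getAndDecrement() > 0) {
                throw new IllegalArgumentException("Unexpected non-existing data node");
            }
        });
        System.out.println("servers started: " + starts);
    }
}
```

The point of having join() return the boolean is that the restart decision stays in one place, the main loop, while the monitor thread only reports that it failed.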

> ReplicationMonitor thread goes away 
> ------------------------------------
>
>                 Key: HADOOP-1486
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1486
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Koji Noguchi
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: catchThrowable2.patch
>
>
> Saw many over/under replicated blocks in fsck output.
> .out file showed
> Exception in thread "[EMAIL PROTECTED]" java.lang.IllegalArgumentException: Unexpected non-existing data node: /99.9.99.0/99.9.99.42:99999
>   at org.apache.hadoop.net.NetworkTopology.checkArgument(NetworkTopology.java:379)
>   at org.apache.hadoop.net.NetworkTopology.isOnSameRack(NetworkTopology.java:424)
>   at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2853)
>   at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2816)
>   at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2658)
>   at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1774)
>   at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1723)
>   at java.lang.Thread.run(Thread.java:619)
> (same as HADOOP-1232)
> And, jstack showed no ReplicationMonitor thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
