[ http://issues.apache.org/jira/browse/HADOOP-442?page=comments#action_12450506 ]

eric baldeschwieler commented on HADOOP-442:
--------------------------------------------
Current proposal:
- Add a config variable that points to a file containing the list of nodes HDFS should expect (the slaves file) (optional config).
- Add a config variable that points to a file containing a list of excluded nodes, a subset of the previous list (optional config).
- The namenode reads these files on startup (iff configured). It keeps a list of included nodes and another of excluded nodes. If the include list is configured, it is checked when a node registers or heartbeats; a node not on the list is told to shut down in the response. If the exclude list is configured, then a listed node is likewise shut down.
- We will add an admin command to re-read the inclusion and exclusion files.
- The job tracker will also read these lists and gain a new admin command to re-read the files.

> slaves file should include an 'exclude' section, to prevent "bad" datanodes
> and tasktrackers from disrupting a cluster
> ---------------------------------------------------------------------------
>
>          Key: HADOOP-442
>          URL: http://issues.apache.org/jira/browse/HADOOP-442
>      Project: Hadoop
>   Issue Type: Bug
>   Components: conf
>     Reporter: Yoram Arnon
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh,
> but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes because of the ssh issue, so I was
> helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from
> the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and
> over, what I'd like is to be able to prevent rogue processes from connecting
> to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would
> list nodes that shouldn't be accessed, and that should be ignored if they
> try to connect.
> That would also help in configuring the slaves file for a large cluster -
> I'd list the full range of machines in the cluster, then list the ones that
> are down in the 'exclude' section.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
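The include/exclude decision the proposal describes (include list checked only when configured, exclude list always winning) could be sketched roughly as below. This is a minimal illustration, not Hadoop's actual code: the class name `HostFilter`, the method `isAllowed`, and the use of an empty set to mean "list not configured" are all assumptions for the sake of the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed namenode-side check. A node that
// fails this check would be told to shut down in the registration or
// heartbeat response.
public class HostFilter {
    private final Set<String> include; // empty = include list not configured
    private final Set<String> exclude; // empty = exclude list not configured

    public HostFilter(Set<String> include, Set<String> exclude) {
        this.include = include;
        this.exclude = exclude;
    }

    /** A node may register/heartbeat iff it passes both lists. */
    public boolean isAllowed(String host) {
        // If an include list is configured, the node must appear on it.
        if (!include.isEmpty() && !include.contains(host)) {
            return false;
        }
        // An excluded node is shut down even if it is on the include list.
        return !exclude.contains(host);
    }

    public static void main(String[] args) {
        HostFilter f = new HostFilter(
            new HashSet<>(Arrays.asList("node1", "node2")),
            new HashSet<>(Arrays.asList("node2")));
        System.out.println(f.isAllowed("node1")); // on include, not excluded
        System.out.println(f.isAllowed("node2")); // excluded, so rejected
        System.out.println(f.isAllowed("node3")); // not on include list
    }
}
```

The admin command to re-read the files would then simply rebuild such a filter from the two files and swap it in, so changes take effect without restarting the namenode.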
