[ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472390 ]

dhruba borthakur commented on HADOOP-442:
-----------------------------------------

1. It would be nice if the descriptions of dfs.hosts and dfs.hosts.exclude
   said "Full path name of file ..."

2. The FSNamesystem.close() function should have a dnthread.join() call.
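
   A rough sketch of the shutdown ordering I have in mind (fsRunning and
   the interrupt call are my assumptions about how the patch stops the
   decommission-monitor thread, not something I verified in it):

     public void close() {
       fsRunning = false;            // ask worker threads to exit
       try {
         if (dnthread != null) {
           dnthread.interrupt();     // wake it up if it is sleeping
           dnthread.join();          // don't return until it has exited
         }
       } catch (InterruptedException ie) {
         // we are shutting down anyway, so just fall through
       }
     }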

3. Can we make FSNamesystem.refreshNodes() package private? i.e. remove the
   "public" keyword from its definition.

4. The method FSNamesystem.refreshNodes might need to be synchronized because
   it traverses the datanodeMap. However, the first line in this method (the
   one that invokes "hostReader.refresh") should preferably be outside this
   synchronization. It is good to read in the contents of the hosts files
   outside the global FSNamesystem lock.
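
   Roughly like this, as a sketch (I am assuming hostsReader.refresh()
   re-reads both files, and that nodes failing the checks get marked for
   decommission; the exact names may differ from the patch):

     void refreshNodes() throws IOException {
       // Re-read dfs.hosts and dfs.hosts.exclude outside the lock;
       // file I/O should not stall the whole namesystem.
       hostsReader.refresh();
       synchronized (this) {
         // Only the datanodeMap traversal needs the global lock.
         for (Iterator it = datanodeMap.values().iterator(); it.hasNext();) {
           DatanodeDescriptor node = (DatanodeDescriptor) it.next();
           if (!inHostsList(node) || inExcludedHostsList(node)) {
             startDecommission(node);   // assumed helper in the patch
           }
         }
       }
     }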

5. The methods inExcludedHostsList() and inHostsList() could be unified if we
   pass the relevant host list as a parameter to a single shared method.
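
   Something along these lines (the HostsFileReader accessors are my
   assumption about the patch; the empty-list special case belongs only
   on the include side):

     // Shared membership test; callers pass whichever host set applies.
     private static boolean inList(Set hostsList, DatanodeID node) {
       return hostsList.contains(node.getHost())
           || hostsList.contains(node.getName());
     }

     boolean inHostsList(DatanodeID node) {
       Set hosts = hostsReader.getHosts();
       // An empty include list means every node is allowed.
       return hosts.isEmpty() || inList(hosts, node);
     }

     boolean inExcludedHostsList(DatanodeID node) {
       // An empty exclude list excludes nobody; contains() covers that.
       return inList(hostsReader.getExcludedHosts(), node);
     }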


> slaves file should include an 'exclude' section, to prevent "bad" datanodes 
> and tasktrackers from disrupting  a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>         Attachments: hadoop-442-8.patch
>
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh,
> but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was
> helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when the bad nodes are removed
> from the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and
> over, what I'd like is to be able to prevent rogue processes from connecting
> to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would
> list nodes that shouldn't be accessed, and which should be ignored if they
> try to connect. That would also help in configuring the slaves file for a
> large cluster - I'd list the full range of machines in the cluster, then
> list the ones that are down in the 'exclude' section.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
