[ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469320 ]

Yoram Arnon commented on HADOOP-442:
------------------------------------

So perhaps a simpler approach, more in line with the original request:

* a single slaves file
* no new config variables, since the location of the slaves file is already 
configurable via the hadoop script(s) and environment variables
* lines in the slaves file that start with the word 'exclude' indicate that 
the nodes listed on that line are to be excluded (see the example after this 
list)
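
For illustration, a slaves file under this scheme might look like the 
following (the hostnames are invented, and I'm assuming comment lines are 
skipped the way the scripts skip them today):

    # all nodes in the cluster
    node001
    node002
    node003
    node004

    # dead nodes - overrides the list above
    exclude node002
    exclude node003 node004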

Pros of this approach:
* completely backwards compatible
* very simple for the administrator (and implementor) - no config, no 
confusion, a single slaves file to edit

The intent was to simplify the administrator's life. I'd list *all* the nodes 
in the cluster in the slaves file, then list a handful of nodes under an 
'exclude' section. The exclude statement is an override of whatever is listed 
elsewhere in the file.
The behavior is:
* datanode/tasktracker daemons are launched on startup on nodes that appear 
in the slaves file and are not excluded (see the sketch after this list)
* any node may connect to the master unless it is excluded (for backwards 
compatibility; I'd actually prefer to allow only included nodes to connect, 
but that will wait, perhaps indefinitely)
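
To make those two rules concrete, here is a minimal sketch of how the master 
side might load such a file and apply it. None of this is Hadoop's actual 
code; the class and method names are made up for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical helper: parses a slaves file in which any line starting
    // with the word 'exclude' lists nodes to exclude, and every other
    // non-comment line lists cluster nodes.
    public class SlavesFile {
        public final Set<String> included = new HashSet<String>();
        public final Set<String> excluded = new HashSet<String>();

        public static SlavesFile parse(String path) throws IOException {
            SlavesFile result = new SlavesFile();
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.length() == 0 || line.startsWith("#")) {
                        continue;  // skip blank lines and comments
                    }
                    String[] words = line.split("\\s+");
                    if (words[0].equals("exclude")) {
                        // every node named after the keyword is excluded
                        for (int i = 1; i < words.length; i++) {
                            result.excluded.add(words[i]);
                        }
                    } else {
                        for (String host : words) {
                            result.included.add(host);
                        }
                    }
                }
            } finally {
                in.close();
            }
            return result;
        }

        // Startup rule: launch daemons only on listed, non-excluded nodes.
        public boolean shouldLaunch(String host) {
            return included.contains(host) && !excluded.contains(host);
        }

        // Connect-time rule: accept any node that is not explicitly
        // excluded (the backwards-compatible behavior described above).
        public boolean mayConnect(String host) {
            return !excluded.contains(host);
        }
    }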

Nodes are excluded for a reason, basically because they're dead, and:
* their presence clutters up the startup/shutdown script output and the error 
logs
* if they are included in the cluster, they may degrade its behavior or 
performance, so they should be actively ignored

I wouldn't worry about the replicas of the data they contain - they're long 
dead and their blocks have been re-replicated. If they're alive, I won't 
exclude them.

I love the admin command to reread the file.
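
Building on the hypothetical SlavesFile sketch above (again, these names are 
not Hadoop's actual API), the reread could simply re-parse the file and swap 
the result in atomically, so connection checks always see a consistent 
snapshot:

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical admin hook: re-reads the slaves file on demand.
    public class NodeFilter {
        private final String path;
        private final AtomicReference<SlavesFile> current =
            new AtomicReference<SlavesFile>();

        public NodeFilter(String path) throws IOException {
            this.path = path;
            refresh();
        }

        // Invoked by the admin 'reread' command.
        public void refresh() throws IOException {
            current.set(SlavesFile.parse(path));  // atomic swap, no locks
        }

        public boolean mayConnect(String host) {
            return current.get().mayConnect(host);
        }
    }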


> slaves file should include an 'exclude' section, to prevent "bad" datanodes 
> and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh 
> but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes because of the ssh issue, so I was 
> helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even after removing the bad nodes from 
> the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and 
> over, what I'd like is a way to keep rogue processes from connecting to the 
> masters.
> Ideally, the slaves file would contain an 'exclude' section listing nodes 
> that shouldn't be accessed and should be ignored if they try to connect. 
> That would also help in configuring the slaves file for a large cluster - 
> I'd list the full range of machines in the cluster, then list the ones that 
> are down in the 'exclude' section.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
