[ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468848 ]

Wendy Chien commented on HADOOP-442:
------------------------------------

Some changes:
1. There are actually going to be 4 new config variables: 2 for dfs and 2 for 
mapred.  The current names are:
dfs.client.include
dfs.client.exclude
mapred.client.include
mapred.client.exclude

These variables are going to be commented out in hadoop-default.xml, so by 
default they will not be configured (a sketch of how they might be read 
follows below).
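
For example, client code could read them with the standard 
org.apache.hadoop.conf.Configuration API. This is only a sketch, not the 
patch itself, and it assumes each variable names a file listing hosts; an 
empty default stands in for "left commented out, no filtering configured":

import org.apache.hadoop.conf.Configuration;

public class IncludeExcludeConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Empty string = variable left commented out in hadoop-default.xml,
    // i.e. no include/exclude filtering configured.
    String dfsInclude    = conf.get("dfs.client.include", "");
    String dfsExclude    = conf.get("dfs.client.exclude", "");
    String mapredInclude = conf.get("mapred.client.include", "");
    String mapredExclude = conf.get("mapred.client.exclude", "");
    System.out.println("dfs include file:    " + dfsInclude);
    System.out.println("dfs exclude file:    " + dfsExclude);
    System.out.println("mapred include file: " + mapredInclude);
    System.out.println("mapred exclude file: " + mapredExclude);
  }
}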

2. I created a generic gatekeeper class in hadoop.util, which is used by both 
dfs and mapred (sketched below).
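
Roughly, something like the following. This is a simplified sketch, not the 
actual patch: the class name Gatekeeper and its method names are illustrative 
guesses, and it assumes the include/exclude files list one host per line.

package org.apache.hadoop.util;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Gatekeeper {
  private final Set<String> includes = new HashSet<String>();
  private final Set<String> excludes = new HashSet<String>();

  public Gatekeeper(String includeFile, String excludeFile) throws IOException {
    readHosts(includeFile, includes);
    readHosts(excludeFile, excludes);
  }

  // Load one host name per line; a null or empty filename means the
  // corresponding config variable was left unconfigured.
  private static void readHosts(String filename, Set<String> hosts)
      throws IOException {
    if (filename == null || filename.length() == 0) {
      return;
    }
    BufferedReader in = new BufferedReader(new FileReader(filename));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0) {
          hosts.add(line);
        }
      }
    } finally {
      in.close();
    }
  }

  // An empty include list means "allow any host that is not excluded".
  public boolean isAllowed(String host) {
    if (excludes.contains(host)) {
      return false;
    }
    return includes.isEmpty() || includes.contains(host);
  }

  public boolean isExcluded(String host) {
    return excludes.contains(host);
  }
}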

3. Currently, the include list will only be consulted at registration.  It 
will not be checked for heartbeats, block reports, etc.  The exclude list will 
be checked for everything.  The reason is that if a node can't register, it 
can't send heartbeats, so we don't need to check the include list again (see 
the sketch below).
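
In callers, that might look roughly like this. Again a sketch only, using the 
hypothetical Gatekeeper from point 2; the NodeAdmission class and its method 
names are illustrative, not code from the patch.

import java.io.IOException;

public class NodeAdmission {
  private final Gatekeeper gatekeeper;  // from the sketch in point 2

  public NodeAdmission(Gatekeeper gatekeeper) {
    this.gatekeeper = gatekeeper;
  }

  // Registration consults both lists (include and exclude).
  public void register(String host) throws IOException {
    if (!gatekeeper.isAllowed(host)) {
      throw new IOException(host + " is not permitted to register");
    }
    // ... proceed with normal registration ...
  }

  // Heartbeats, block reports, etc. consult only the exclude list:
  // a node that never passed the include check never registered, so it
  // never gets here, while a node excluded after registering is still
  // cut off.
  public void heartbeat(String host) throws IOException {
    if (gatekeeper.isExcluded(host)) {
      throw new IOException(host + " has been excluded");
    }
    // ... process the heartbeat ...
  }
}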
 
4. The JobTracker will not have an admin command to refresh the node lists, 
because such a mechanism does not yet exist.

Comments welcome.


> slaves file should include an 'exclude' section, to prevent "bad" datanodes 
> and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh, 
> but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was 
> helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from 
> the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and 
> over, what I'd like is to be able to prevent rogue processes from connecting 
> to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would 
> list nodes that shouldn't be accessed and should be ignored if they try to 
> connect. That would also help in configuring the slaves file for a large 
> cluster - I'd list the full range of machines in the cluster, then list the 
> ones that are down in the 'exclude' section.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
