[ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469272 ]
Sameer Paranjpye commented on HADOOP-442:
-----------------------------------------
A couple of points/questions:
- The [dfs|mapred].client prefixes are a little confusing. They could be
construed as lists of machines from which HDFS clients are permitted to
connect, which is not the intent here. I'd also drop the 'include' suffix,
since that list is used to constrain the set of nodes that can participate. I
propose calling them:
[dfs|mapred].hosts and [dfs|mapred].hosts.exclude
(a sketch of the corresponding hadoop-default.xml entries follows this list)
- What is the format of the include and exclude files? It needs to be specified
here.
- The variables should be set to reasonable default values in
hadoop-default.xml, not omitted from it. Setting them to empty strings by
default is probably fine.
- What happens when a node is taken off the include list but not placed on the
exclude list? Do heartbeats and block reports from it continue to be
processed? Perhaps the 'refreshNodes' command should examine the collection of
DatanodeDescriptors and prune it to only those that are on the new include
list (a rough illustration of that pruning follows this list).
- What are the semantics of exclude? How does it differ from decommission? Can
we unify the two and have exclude mean decommission? What happens if all
replicas of some block reside only on excluded nodes?
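
To make the naming and default proposals concrete, here is a rough sketch of
what the entries could look like in hadoop-default.xml. The descriptions and
the "empty means no restriction" behaviour are my assumptions, not a settled
spec; mapred.hosts and mapred.hosts.exclude would be analogous.

  <property>
    <name>dfs.hosts</name>
    <value></value>
    <description>Names a file listing hosts that are permitted to register as
    datanodes. An empty value means all hosts are permitted.</description>
  </property>

  <property>
    <name>dfs.hosts.exclude</name>
    <value></value>
    <description>Names a file listing hosts that are not permitted to register
    as datanodes. An empty value means no hosts are excluded.</description>
  </property>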
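
As an illustration of the 'refreshNodes' pruning suggested above, here is a
small, self-contained Java sketch. None of the names below are actual Hadoop
APIs; it only demonstrates the intended behaviour of re-reading the lists and
dropping hosts that are no longer permitted.

  import java.util.*;

  public class RefreshNodesSketch {
    // An empty include set means "all hosts permitted", matching the proposed
    // empty-string default.
    static boolean allowed(String host, Set<String> includes, Set<String> excludes) {
      return (includes.isEmpty() || includes.contains(host)) && !excludes.contains(host);
    }

    public static void main(String[] args) {
      Set<String> includes = new HashSet<String>(Arrays.asList("node1", "node2", "node3"));
      Set<String> excludes = new HashSet<String>(Arrays.asList("node3"));

      // Hosts of datanodes currently registered with the namenode.
      List<String> registered = new ArrayList<String>(
          Arrays.asList("node1", "node2", "node3", "node4"));

      // On refreshNodes, drop any host that is no longer permitted so that
      // later heartbeats and block reports from it are rejected.
      for (Iterator<String> it = registered.iterator(); it.hasNext();) {
        if (!allowed(it.next(), includes, excludes)) {
          it.remove();
        }
      }

      System.out.println(registered);  // prints [node1, node2]
    }
  }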
> slaves file should include an 'exclude' section, to prevent "bad" datanodes
> and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-442
> URL: https://issues.apache.org/jira/browse/HADOOP-442
> Project: Hadoop
> Issue Type: Bug
> Components: conf
> Reporter: Yoram Arnon
> Assigned To: Wendy Chien
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh,
> but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was
> helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from
> the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from repeatedly launching on the same bad
> nodes, what I'd like is to be able to prevent rogue processes from connecting
> to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would list
> nodes that shouldn't be accessed and should be ignored if they try to
> connect. That would also help in configuring the slaves file for a large
> cluster - I'd list the full range of machines in the cluster, then list the
> ones that are down in the 'exclude' section.