[ http://issues.apache.org/jira/browse/HADOOP-442?page=comments#action_12427632 ]

Marco Nicosia commented on HADOOP-442:
--------------------------------------
Would it be better to choose one or the other to be authoritative for all operations?

1] The namenode/jobtrackers maintain the slaves file. Membership changes and other administrative functions are performed via API calls to the process, which modifies a file on disk. That file is read, but never modified, by slaves.sh, etc. If the file is still text, it can be edited between process restarts.

2] The namenode/jobtrackers observe and respect the contents of a file on disk. Standard tools can modify it, but the processes would have to poll the file to see whether it has changed.

I personally prefer #1, though I'd hope that any API is open (XML-RPC, REST, SOAP...) instead of RMI, so that any set of sysadmin automation can talk to it.

> slaves file should include an 'exclude' section, to prevent "bad" datanodes
> and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: http://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Yoram Arnon
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh,
> but were still running their java processes.
> tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was
> helpless until I could actually power down these nodes.
> restarting the cluster doesn't help, even when removing the bad nodes from
> the slaves file - they just reconnect and are accepted.
> while we plan to avoid tasks from launching on the same nodes over and over,
> what I'd like is to be able to prevent rogue processes from connecting to the
> masters.
> Ideally, the slaves file will contain an 'exclude' section, which will list
> nodes that shouldn't be accessed, and should be ignored if they try to
> connect. That would also help in configuring the slaves file for a large
> cluster - I'd list the full range of machines in the cluster, then list the
> ones that are down in the 'exclude' section
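For illustration only, option 2] in the comment (a master that respects a file on disk and polls it for changes) might look roughly like the sketch below, applied to the 'exclude' list requested in the issue. Everything here is an assumption made for the example: the `ExcludeList` class, the one-hostname-per-line format with `#` comments, and the use of file mtime and size to detect edits. None of it is taken from Hadoop itself.

```python
import os


class ExcludeList:
    """Hypothetical exclude list: hosts a master should refuse to accept.

    The file is assumed to hold one hostname per line, with '#' comments.
    """

    def __init__(self, path):
        self.path = path
        self._stamp = None          # (mtime_ns, size) seen at the last load
        self._hosts = frozenset()
        self.refresh()

    def refresh(self):
        """Re-read the file if it changed; return True when the list was reloaded."""
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            # A missing file is treated as "exclude nothing".
            changed = self._stamp is not None
            self._stamp, self._hosts = None, frozenset()
            return changed
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp == self._stamp:
            return False            # unchanged since the last poll
        with open(self.path) as f:
            # Strip comments and whitespace, drop empty lines.
            lines = (line.split("#", 1)[0].strip() for line in f)
            self._hosts = frozenset(h.lower() for h in lines if h)
        self._stamp = stamp
        return True

    def is_excluded(self, host):
        """Case-insensitive membership test for a connecting host."""
        return host.lower() in self._hosts
```

A master process would call `refresh()` on a timer, or just before handling a registration request, and ignore any datanode or tasktracker for which `is_excluded()` returns true. The cost the comment points out is visible here: the process has to poll, and an edit only takes effect at the next poll. Under option 1] the same set would instead be changed through an API call, with the process writing the file itself.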
