[
https://issues.apache.org/jira/browse/HADOOP-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Duxbury updated HADOOP-7103:
----------------------------------
Affects Version/s: 0.20.2
> When rack awareness script returns nothing, cluster stops working
> -----------------------------------------------------------------
>
> Key: HADOOP-7103
> URL: https://issues.apache.org/jira/browse/HADOOP-7103
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: Bryan Duxbury
>
> This was an interesting one. Our rack awareness script contains a 1-1 mapping
> from host/ip to rack. We added a new rack's worth of machines without
> updating the awareness script, and when the script was called, it returned
> absolutely no results for the new machines.
> This resulted in the surprising result that basically the entire cluster
> stopped working. Even tasks or blocks assigned to nodes with a valid rack
> seemed to fail. The errors were only detectable by looking in the namenode
> and jobtracker logs, making it take a while before we could figure out the
> problem. After fixing the rack awareness script, everything returned to
> normal operation.
> It seems to me that either the error should be raised more aggressively, or a
> "default" rack should be assumed. This would keep simple mistakes from making
> the entire cluster unusable.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.