[jira] Updated: (HADOOP-7103) When rack awareness script returns nothing, cluster stops working

Bryan Duxbury (JIRA) Thu, 13 Jan 2011 10:06:12 -0800

     [ https://issues.apache.org/jira/browse/HADOOP-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Bryan Duxbury updated HADOOP-7103:
----------------------------------

    Affects Version/s: 0.20.2

> When rack awareness script returns nothing, cluster stops working
> -----------------------------------------------------------------
>
>                 Key: HADOOP-7103
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7103
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Bryan Duxbury
>
> This was an interesting one. Our rack awareness script contains a one-to-one
> mapping from host/IP to rack. We added a new rack's worth of machines without
> updating the awareness script, and when the script was called, it returned
> no output at all for the new machines.
> Surprisingly, this caused essentially the entire cluster to stop working.
> Even tasks or blocks assigned to nodes with a valid rack seemed to fail. The
> errors were only detectable by looking in the namenode and jobtracker logs,
> so it took a while to figure out the problem. After fixing the rack
> awareness script, everything returned to normal operation.
> It seems to me that either the error should be raised more aggressively, or a 
> "default" rack should be assumed. This would keep simple mistakes from making 
> the entire cluster unusable.
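
Until Hadoop itself handles this, the "default rack" idea can be approximated
inside the awareness script. The following is a minimal sketch, not the
reporter's actual script: it assumes the script is written in Python, that
hosts arrive as command-line arguments, and that one rack per argument is
printed on stdout. The table contents and the names RACK_BY_HOST and
resolve_racks are hypothetical; "/default-rack" is assumed here as the
fallback name because it is what Hadoop uses for unmapped nodes.

```python
#!/usr/bin/env python
import sys

# Assumed fallback rack name for hosts that have no entry in the table.
DEFAULT_RACK = "/default-rack"

# Hypothetical host/IP-to-rack table; a real script might load this from a file.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}


def resolve_racks(hosts, table=RACK_BY_HOST, default=DEFAULT_RACK):
    """Return exactly one rack per input host, in input order."""
    racks = []
    for host in hosts:
        rack = table.get(host)
        if rack is None:
            # Report the gap on stderr, but still emit a rack so the
            # script never returns nothing for a host.
            sys.stderr.write("no rack mapping for %s, using %s\n" % (host, default))
            rack = default
        racks.append(rack)
    return racks


if __name__ == "__main__":
    # One rack per argument, space-separated.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

With this fallback, a newly added machine that is missing from the table lands
in the default rack and produces a warning, rather than an empty answer.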

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
