[
https://issues.apache.org/jira/browse/HADOOP-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun C Murthy updated HADOOP-1246:
----------------------------------
Description:
Ignoring IOExceptions while fetching map outputs in
MapOutputLocation.java:getFile (e.g. when the content-length doesn't match the
actual data received) leads to hung reduces, since the MapOutputCopier puts the
host in the penalty box and retries forever.
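The kind of check that should fail loudly here, rather than be swallowed, looks
roughly like the following. This is only a sketch; the names (FetchUtils,
copyChecked) are made up for illustration and are not the actual
MapOutputLocation code:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class FetchUtils {
  /**
   * Copy a fetched map output, verifying that the number of bytes received
   * matches the advertised Content-Length. A mismatch is surfaced as an
   * IOException so the caller can react, instead of being silently ignored.
   */
  static void copyChecked(InputStream in, OutputStream out, long contentLength)
      throws IOException {
    byte[] buf = new byte[64 * 1024];
    long received = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
      received += n;
    }
    if (contentLength >= 0 && received != contentLength) {
      throw new IOException("Incomplete map output: expected " + contentLength
          + " bytes, received " + received);
    }
  }
}
{code}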
Possible steps:
a) Distinguish between failure to fetch output vs. lost maps. (related to
HADOOP-1158)
b) Ensure the reduce doesn't keep fetching from 'lost maps'. (related to
HADOOP-1183)
c) On detecting a 'failure to fetch' we should probably use exponential
back-offs (versus the flat back-offs we have currently) for hosts in the
'penalty box'; a sketch follows this list.
d) If fetches still fail, say, 4 times (after exponential back-offs), we
should declare the reduce 'failed'.
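To make (c) and (d) concrete, here's a rough sketch of what the penalty-box
bookkeeping could look like. The names (PenaltyBox, BASE_DELAY_MS,
MAX_FETCH_FAILURES) are hypothetical, not from the current code:
{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class PenaltyBox {
  private static final long BASE_DELAY_MS = 1000L;  // first back-off interval
  private static final int MAX_FETCH_FAILURES = 4;  // step (d) threshold

  private final Map<String, Integer> failures = new HashMap<String, Integer>();

  /** Record a failed fetch from host; return how long to penalize it. */
  long recordFailure(String host) throws IOException {
    Integer prev = failures.get(host);
    int n = (prev == null) ? 1 : prev.intValue() + 1;
    failures.put(host, Integer.valueOf(n));
    if (n > MAX_FETCH_FAILURES) {
      // Step (d): stop retrying forever; let the caller fail the reduce.
      throw new IOException("Fetch from " + host + " failed " + n + " times");
    }
    // Step (c): exponential back-off -- 1s, 2s, 4s, 8s instead of a flat delay.
    return BASE_DELAY_MS << (n - 1);
  }

  /** Clear the penalty once a fetch from host succeeds. */
  void recordSuccess(String host) {
    failures.remove(host);
  }
}
{code}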
This situation could also arise from a full disk on the reducer, which makes
it impossible to save the map output locally (say, for large map outputs).
Thoughts?
was:
Ignoring exceptions during fetching of map outputs in
MapOutputLocation.java:getFile (e.g. content-length doesn't match actual data
received) leads to hung reduces since the MapOutputCopier just ignores them,
puts the host in the penalty box and retries forever.
Possible steps:
a) Distinguish between failure to fetch output vs. lost maps. (related to
HADOOP-1158)
b) Ensure the reduce doesn't keep fetching from 'lost maps'. (related to
HADOOP-1183)
c) On detecting a 'failure to fetch' we should probably use exponential
back-offs (versus the flat back-offs we have currently) for hosts in the
'penalty box'.
d) If fetches still fail, say, 4 times (after exponential back-offs), we
should declare the reduce 'failed'.
This situation could also arise from a full disk on the reducer, which makes
it impossible to save the map output locally (say, for large map outputs).
Thoughts?
Summary: Ignored IOExceptions from MapOutputLocation.java:getFile lead
to hung reduces (was: Ignored exceptions from MapOutputLocation.java:getFile
lead to hung reduces)
> Ignored IOExceptions from MapOutputLocation.java:getFile lead to hung reduces
> -----------------------------------------------------------------------------
>
> Key: HADOOP-1246
> URL: https://issues.apache.org/jira/browse/HADOOP-1246
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.12.3
> Reporter: Arun C Murthy
>
> Ignoring IOExceptions while fetching map outputs in
> MapOutputLocation.java:getFile (e.g. when the content-length doesn't match
> the actual data received) leads to hung reduces, since the MapOutputCopier
> puts the host in the penalty box and retries forever.
> Possible steps:
> a) Distinguish between failure to fetch output vs. lost maps. (related to
> HADOOP-1158)
> b) Ensure the reduce doesn't keep fetching from 'lost maps'. (related to
> HADOOP-1183)
> c) On detecting a 'failure to fetch' we should probably use exponential
> back-offs (versus the flat back-offs we have currently) for hosts in the
> 'penalty box'.
> d) If fetches still fail, say, 4 times (after exponential back-offs), we
> should declare the reduce 'failed'.
> This situation could also arise from a full disk on the reducer, which
> makes it impossible to save the map output locally (say, for large map
> outputs).
> Thoughts?