[jira] Commented: (HADOOP-1252) Disk problems should be handled better by the MR framework

Devaraj Das (JIRA) Thu, 12 Apr 2007 23:23:35 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488584 ]


Devaraj Das commented on HADOOP-1252:
-------------------------------------

Some points:
1) In the shuffle code, if a map output file is being written to a disk and 
that disk develops a problem that causes write failures, the reduce task 
currently won't notice and will keep on fetching/writing, and so hangs on it 
(very similar to HADOOP-1246).
2) A check for sufficient disk space can be done before trying to copy a map 
output to the reduce side, since we know the length of the map output file 
a priori. There are some catches, though. If multiple entities (maps and 
reduces) are writing to the same disk, we might still run out of disk space 
later on (we don't know a priori what the size of a map's output will be). In 
any case, this check will not make things worse. If a task hits the problem 
later, we kill it.
3) getLocalPath needs to change to accommodate point (2). Currently, it 
computes a hash of the path that we want to read from or write to and maps it 
to a specific disk; the checks mentioned in (2) have to be introduced for both 
reads and writes.
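
A rough sketch of how points (2) and (3) could fit together, not actual Hadoop 
code: start from the disk that the path hash selects, and fall back to the 
other configured disks when that one is missing, unwritable, or too small for 
the known map output length. The class name LocalDirChooser, the method 
choose, and its parameters are hypothetical, and the free-space test uses the 
plain JDK call java.io.File.getUsableSpace() (Java 6+) as a stand-in for 
whatever the framework would really use.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper; illustrates the idea only, not the real getLocalPath. */
public class LocalDirChooser {

    /**
     * Picks a local directory for a file whose size is known up front.
     * Probing starts at the directory selected by hashing the path (the
     * behaviour described in point 3), then moves on to the others until one
     * exists, is writable, and reports at least requiredBytes of usable space.
     */
    public static File choose(File[] localDirs, String pathForHash, long requiredBytes)
            throws IOException {
        if (localDirs.length == 0) {
            throw new IOException("No local directories configured");
        }
        // Mask the sign bit so the index is never negative.
        int start = (pathForHash.hashCode() & Integer.MAX_VALUE) % localDirs.length;
        for (int i = 0; i < localDirs.length; i++) {
            File dir = localDirs[(start + i) % localDirs.length];
            // A failed or unmounted disk typically shows up as a directory
            // that is missing or not writable; skip it rather than hang on it.
            if (dir.isDirectory() && dir.canWrite()
                    && dir.getUsableSpace() >= requiredBytes) {
                return dir;
            }
        }
        throw new IOException("No local directory has " + requiredBytes
                + " bytes free for " + pathForHash);
    }
}
```

As noted in (2), this is only a best-effort check: other writers can fill the 
disk after it passes, so the task still has to treat a later write failure as 
an error rather than retrying forever.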

Comments? Did I miss anything?

> Disk problems should be handled better by the MR framework
> ----------------------------------------------------------
>
>                 Key: HADOOP-1252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1252
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>             Fix For: 0.13.0
>
>
> The MR framework should recover from disk failures without causing jobs to 
> hang. Note that this issue is about a short-term solution to the problem, 
> for example, looking at the code and improving the exception handling (to 
> better detect faulty disks and missing files). The long-term approach might 
> be to have a FS layer that takes care of failed disks and makes them 
> transparent to the tasks. That will be a separate issue by itself.
> Some of the issues that have been reported are HADOOP-1087 and a comment by 
> Koji on HADOOP-1200 (not sure whether those are all). Please add to this 
> issue as many details as possible on disk failures leading to hung jobs 
> (details like relevant exception traces, ways to reproduce, etc.).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
