[jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files

Chris Douglas (JIRA) Sun, 27 Apr 2008 17:08:36 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592703#action_12592703 ]


Chris Douglas commented on HADOOP-3144:
---------------------------------------

bq. one of the founding principles of map-reduce as described in the google 
paper (and perhaps one of the most remarkable differences with general database 
systems) was the notion of being tolerant of bad data. if u see a few rows of 
bad data - skip it.

It's been a while since I read the paper, but isn't recovery effected by 
skipping the record that caused a failure on the map (HADOOP-153)? Recovery 
from corrupted data without re-executing the map sounds like a solution for a 
less generic format than LineRecordReader; detecting and failing/discarding a 
map because its output is corrupt is application code, I agree, and it looks 
like Zheng has a very reasonable, general workaround (more below).

Given the re-execution model, the "correct" and more general fix would be to 
fail the map (with an OOM exception) and skip the range that had already been 
read. If it read into the following split, then it need not be rescheduled 
because we know that another task had already scanned up to the next record 
boundary (or failed trying). If one wants to fail the task earlier, then 
specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a 
point: a property that controls special cases for TextInputFormat is more 
usable. Without HADOOP-153, the point is moot, and perhaps this fix is more 
pressing as a consequence.

bq. Zheng's fix does skip to the next available record (if it falls within the 
split). Otherwise an EOF is returned.

That's not a full description of what it does, though. I took a closer look, 
and it doesn't do what I had assumed, i.e. define both a max line length and 
force a hard limit for reading into the following split (which is why the 
archive format didn't seem like a non sequitur). It adds a single new 
property that sets the maximum line length, which prevents the situation in 
this JIRA by terminating the record reader if it's past the end of the split, 
having consumed the maximum line length. Since it takes the maximum of what 
remains in the split and the aforementioned length as the limit, the situation 
I asked after (i.e. returning the trailing part of a record as a single record) 
doesn't occur. Since it defaults to Long.MAX_VALUE, there's no issue with 
existing code. That's all I was trying to determine. The API change (changing 
the return type of readLine from {{int}} to {{long}}) makes more sense in this 
context, but it still seems unnecessary.
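
To make the limit rule above concrete, here is a rough sketch of how I read it. 
This is not the code from the attached patch: the class and method names 
(BoundedLineSketch, readLimit, readSplit) are made up for illustration, the 
split is assumed to start at offset 0, and the input is an in-memory byte 
array rather than a stream.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the HADOOP-3144 patch. */
class BoundedLineSketch {

    /** Per-line byte limit: the larger of what remains in the split and the max line length. */
    static long readLimit(long pos, long splitEnd, long maxLineLength) {
        long remaining = Math.max(0L, splitEnd - pos);
        return Math.max(remaining, maxLineLength);
    }

    /**
     * Returns the lines that start before splitEnd (split assumed to start at 0).
     * A line that reaches its limit without a newline ends the reader, so the
     * tail of an oversized line is never returned as a record of its own.
     */
    static List<String> readSplit(byte[] data, long splitEnd, long maxLineLength) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (pos < splitEnd && pos < data.length) {
            long limit = readLimit(pos, splitEnd, maxLineLength);
            StringBuilder line = new StringBuilder();
            boolean sawNewline = false;
            long consumed = 0;
            while (consumed < limit && pos < data.length) {
                byte b = data[pos++];
                consumed++;
                if (b == '\n') {
                    sawNewline = true;
                    break;
                }
                line.append((char) b);
            }
            if (!sawNewline && pos < data.length) {
                // Limit reached with no record boundary: stop, as if at EOF.
                break;
            }
            records.add(line.toString());
        }
        return records;
    }
}
```

With maxLineLength left at Long.MAX_VALUE the limit never applies, so a line 
that runs past the end of the split is still read to its newline, which is why 
existing jobs would be unaffected.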

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while - we encounter corrupted text files (corrupted at 
> source prior to copying into hadoop). inevitably - some of the data looks 
> like a really really long line and hadoop trips over trying to stuff it into 
> an in memory object and gets outofmem error. Code looks same way in trunk as 
> well .. 
> so looking for an option to the textinputformat (and like) to ignore long 
> lines. ideally - we would just skip errant lines above a certain size limit.

