[
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592703#action_12592703
]
Chris Douglas commented on HADOOP-3144:
---------------------------------------
bq. one of the founding principles of map-reduce as described in the google
paper (and perhaps one of the most remarkable differences with general database
systems) was the notion of being tolerant of bad data. if u see a few rows of
bad data - skip it.
It's been a while since I read the paper, but isn't recovery effected by
skipping the record that caused a failure on the map (HADOOP-153)? Recovery
from corrupted data without re-executing the map sounds like a solution for a
less generic format than LineRecordReader. Detecting and failing/discarding a
map because its output is corrupt is application code, I agree, and it looks
like Zheng has a very reasonable, general workaround (more below).
Given the re-execution model, the "correct" and more general fix would be to
fail the map (with an OOM exception) and skip the range that had already been
read. If it read into the following split, then it need not be rescheduled
because we know that another task had already scanned up to the next record
boundary (or failed trying). If one wants to fail the task earlier, then
specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a
point: a property that controls special cases for TextInputFormat is more
usable. Without HADOOP-153, the point is moot, and perhaps this fix is more
pressing as a consequence.
bq. Zheng's fix does skip to the next available record (if it falls within the
split). Otherwise an EOF is returned.
That's not a full description of what it does, though. I took a closer look,
and it doesn't do what I had assumed, i.e. define both a max line length and
force a hard limit for reading into the following split (which is why the
archive format didn't seem like a non sequitur). It adds a single new
property that sets the maximum line length, which prevents the situation in
this JIRA by terminating the record reader if it's past the end of the split,
having consumed the maximum line length. Since it takes the maximum of what
remains in the split and the aforementioned length as the limit, the situation
I asked after (i.e. returning the trailing part of a record as a single record)
doesn't occur. Since it defaults to Long.MAX_VALUE, there's no issue with
existing code. That's all I was trying to determine. The API change (changing
the return type of readLine from {{int}} to {{long}}) makes more sense in this
context, but it still seems unnecessary.
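To make the limit described above concrete, here is a minimal sketch of how I read the behavior. It is not the code from the attached patch: the class BoundedLineSketch and the methods consumeLine and readSplit are hypothetical names, the input is a plain byte array rather than a stream, and the split is assumed to start on a record boundary. The one piece taken from the description is that each read is capped at the larger of what remains in the split and the maximum line length.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the code from the attached patch.
class BoundedLineSketch {

    // Hypothetical helper: scan one line starting at 'from', consuming at most
    // 'maxBytesToConsume' bytes. Returns the bytes consumed, including the
    // newline if one was reached; 0 means there was nothing left to read.
    static long consumeLine(byte[] data, long from, long maxBytesToConsume) {
        long consumed = 0;
        while (consumed < maxBytesToConsume && from + consumed < data.length) {
            byte b = data[(int) (from + consumed)];
            consumed++;
            if (b == '\n') {
                break;
            }
        }
        return consumed;
    }

    // Returns the records that begin in [start, end). Assumes 'start' is
    // already on a record boundary.
    static List<String> readSplit(byte[] data, long start, long end, long maxLineLength) {
        List<String> records = new ArrayList<>();
        long pos = start;
        while (pos < end) {
            // The larger of "what remains in the split" and the max line length.
            long limit = Math.max(end - pos, maxLineLength);
            long n = consumeLine(data, pos, limit);
            if (n == 0) {
                break; // end of data
            }
            boolean terminated = data[(int) (pos + n - 1)] == '\n';
            boolean atEof = pos + n == data.length;
            long length = terminated ? n - 1 : n;
            if ((terminated || atEof) && length <= maxLineLength) {
                records.add(new String(data, (int) pos, (int) length, StandardCharsets.UTF_8));
            }
            // Otherwise the line is oversized or was cut off at the limit: drop it.
            pos += n;
        }
        return records;
    }
}
```

With maxLineLength set to Long.MAX_VALUE the cap never applies, so every line is returned as before. With a small value, an unterminated run that starts near the end of the split costs at most maxLineLength bytes of scanning before the reader gives up, and its trailing part is never returned as a record.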
> better fault tolerance for corrupted text files
> -----------------------------------------------
>
> Key: HADOOP-3144
> URL: https://issues.apache.org/jira/browse/HADOOP-3144
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.15.3
> Reporter: Joydeep Sen Sarma
> Assignee: Zheng Shao
> Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while - we encounter corrupted text files (corrupted at
> source prior to copying into hadoop). inevitably - some of the data looks
> like a really really long line and hadoop trips over trying to stuff it into
> an in memory object and gets outofmem error. Code looks same way in trunk as
> well ..
> so looking for an option to the textinputformat (and like) to ignore long
> lines. ideally - we would just skip errant lines above a certain size limit.