[jira] Commented: (HADOOP-3144) better fault tolerance for corrupted text files

Joydeep Sen Sarma (JIRA) Sat, 26 Apr 2008 09:21:33 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592583#action_12592583
 ]


Joydeep Sen Sarma commented on HADOOP-3144:
-------------------------------------------

One of the founding principles of map-reduce as described in the Google paper 
(and perhaps one of the most remarkable differences from general database 
systems) was the notion of being tolerant of bad data: if you see a few rows 
of bad data, skip them. 

We try to do this in application land as much as possible. However, there is 
nothing we can do there if Hadoop itself throws an out-of-memory error while 
reading a record. Hence this fix belongs in Hadoop core. Zheng's fix skips to 
the next available record (if it falls within the split); otherwise an EOF is 
returned. 
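
To make the skip-or-EOF behaviour concrete, here is a minimal sketch in plain 
java.io. It is not Zheng's patch and does not use Hadoop's reader classes; the 
class name BoundedLineSketch, the method readLines, and the parameters 
splitEnd and maxLineLength are all hypothetical names chosen for 
illustration. The idea it shows: stop buffering once a line exceeds the 
limit, keep consuming bytes up to the next newline, and stop as soon as the 
next line would start at or past the end of the split.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code from the attached patches.
class BoundedLineSketch {

    /**
     * Returns the lines that start before splitEnd and are at most
     * maxLineLength bytes long. Longer lines are consumed byte by byte
     * but never buffered beyond maxLineLength, so memory stays bounded.
     */
    static List<String> readLines(InputStream in, long splitEnd, int maxLineLength)
            throws IOException {
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        long pos = 0; // bytes consumed so far, i.e. offset of the next line
        while (pos < splitEnd) {
            buf.reset();
            boolean tooLong = false;
            boolean sawAny = false;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                sawAny = true;
                if (b == '\n') {
                    break;
                }
                if (buf.size() < maxLineLength) {
                    buf.write(b);
                } else {
                    tooLong = true; // keep consuming, stop buffering
                }
            }
            if (!sawAny) {
                break; // end of stream: nothing left to return
            }
            if (!tooLong) {
                lines.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
            if (b == -1) {
                break; // last line had no trailing newline
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder("ok\n");
        for (int i = 0; i < 50; i++) {
            sb.append('x'); // simulated corrupted, over-long line
        }
        sb.append("\nfine\nlast\n");
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        // The split ends at byte 59, exactly where "last" begins.
        System.out.println(readLines(new ByteArrayInputStream(data), 59, 8));
        // prints [ok, fine]
    }
}
```

In this sketch a line that starts inside the split but runs past its end is 
still read to completion, and "EOF" is modelled simply by the loop ending. A 
real reader would use buffered reads rather than single-byte reads; the 
single-byte loop is only there to keep the example short. 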

3307 is way out there, man: it's a solution for small files. If the file were 
small, we wouldn't have a problem to begin with (as you say, the input is 
bounded). This problem only affects large files. If you read the description 
of 3307 carefully, you will notice it says that it has no impact on 
map-reduce. The problem we are trying to solve is a map-reduce problem: it 
applies whether the file comes from an archive (3307), the local file system, 
HDFS, or any other file system for that matter. 

> better fault tolerance for corrupted text files
> -----------------------------------------------
>
>                 Key: HADOOP-3144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3144
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>         Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the 
> source prior to copying into Hadoop). Inevitably, some of the data looks 
> like a really, really long line, and Hadoop trips over trying to stuff it 
> into an in-memory object and gets an out-of-memory error. The code looks the 
> same in trunk as well. 
> So we are looking for an option on TextInputFormat (and the like) to ignore 
> long lines. Ideally, we would just skip errant lines above a certain size 
> limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
