[
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592554#action_12592554
]
Joydeep Sen Sarma commented on HADOOP-3144:
-------------------------------------------
we did not point the text reader at a binary file. we had a corrupted text file
filled with a long section of junk.
given that this is a problem that can happen to anyone (we just happen to be
the lucky first) - and everyone uses textinputformat to read text files - why
shouldn't the safeguard be built into textinputformat? what's the downside?
(does it make sense to buy insurance after an accident? - we wait for people to
hit such a problem and then say - oh, but u should have used
'SafeTextInputFormat'?)
portable across what? i looked at 3307 earlier today. i don't know how it's
remotely related. enlighten us.
i am sorry - but i am mildly irritated by the comments here. we are aware of
the concept of subclassing. and we can write our own inputformat - thank u so
much. the whole point of going through this procedure is to contribute back to
the community something that is of general benefit. Either the argument is that
this is not of general benefit - or that the cost outweighs the benefit.
Neither argument has been made.
> better fault tolerance for corrupted text files
> -----------------------------------------------
>
> Key: HADOOP-3144
> URL: https://issues.apache.org/jira/browse/HADOOP-3144
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.15.3
> Reporter: Joydeep Sen Sarma
> Assignee: Zheng Shao
> Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while - we encounter corrupted text files (corrupted at
> source prior to copying into hadoop). inevitably - some of the data looks
> like a really really long line and hadoop trips over trying to stuff it into
> an in-memory object and hits an OutOfMemoryError. the code looks the same in
> trunk as well.
> so looking for an option to the textinputformat (and the like) to ignore long
> lines. ideally - we would just skip errant lines above a certain size limit.
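
For illustration only: a minimal sketch of the "skip lines above a size limit" idea from the description above, written against plain java.io rather than Hadoop's classes. It is not the attached patch and not an existing Hadoop API. The class name BoundedLineReader, the maxLineLength parameter and the getSkippedLines counter are all hypothetical. The sketch assumes '\n'-terminated lines, UTF-8 text, and a limit measured in bytes. The point it shows is that an oversized line is discarded byte by byte, so memory use stays bounded by the limit no matter how long the junk section is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: reads '\n'-terminated lines and
// silently drops any line longer than maxLineLength bytes.
class BoundedLineReader {
    private final InputStream in;
    private final int maxLineLength;   // assumed limit, in bytes
    private long skippedLines = 0;

    BoundedLineReader(InputStream in, int maxLineLength) {
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    /** Returns the next line of at most maxLineLength bytes, or null at end of stream. */
    String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean tooLong = false;     // true while discarding an oversized line
        boolean sawAnyByte = false;  // distinguishes an unterminated last line from EOF
        int b;
        while ((b = in.read()) != -1) {
            sawAnyByte = true;
            if (b == '\n') {
                if (!tooLong) {
                    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
                }
                // The oversized line has ended: count it and start on the next line.
                skippedLines++;
                tooLong = false;
                sawAnyByte = false;
                buf.reset();
                continue;
            }
            if (tooLong) {
                continue;            // discard without buffering
            }
            if (buf.size() >= maxLineLength) {
                tooLong = true;      // limit exceeded: drop what was buffered
                buf.reset();
            } else {
                buf.write(b);
            }
        }
        if (tooLong) {
            skippedLines++;          // oversized line ran to end of stream
            return null;
        }
        return sawAnyByte ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    long getSkippedLines() {
        return skippedLines;
    }
}
```

A caller would wrap the source in a BufferedInputStream (the sketch reads one byte at a time) and loop on readLine() until it returns null. Counting the skipped lines rather than dropping them silently lets a job report how much input was ignored.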