[
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592550#action_12592550
]
Chris Douglas commented on HADOOP-3144:
---------------------------------------
bq. That it used to be sufficient does not mean it will be sufficient in the
future - that's why we have open64. The cost of using a long instead of an int
is minimal, and we avoid potential overflow problems.
True, but it's accumulating bytes read from a text file into memory for a
single record. It's not at all obvious to me that this requires a long.
Future-proofing a case that will be a total disaster for the rest of the
framework seems premature, particularly when the change is to a generic text
parser. If someone truly needs to slurp >2GB of text data _per record_, surely
their requirements justify a less general RecordReader. It's not the cost of
the int that concerns me, but the API change to support a case that's not
only degenerate but implausible.
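The int-vs-long tradeoff can be made concrete: a reader can keep accumulating into an int and simply fail fast once the count would otherwise wrap, instead of widening the API to long. This is an illustrative sketch in plain java.io, not Hadoop's actual LineRecordReader:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: accumulate the bytes read for one record into an int,
// failing fast instead of overflowing if a "line" exceeds 2GB.
// Illustrative code only; not Hadoop's LineRecordReader.
public class IntSafeLineReader {
    // Reads one '\n'-terminated line into `out`, returning the number
    // of bytes consumed, or -1 at EOF. Throws rather than letting the
    // int counter wrap around.
    public static int readLine(InputStream in, StringBuilder out) throws IOException {
        int consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (consumed == Integer.MAX_VALUE) {
                throw new IOException("record longer than 2GB; refusing to overflow int");
            }
            consumed++;
            if (b == '\n') {
                return consumed;  // newline counted, not stored
            }
            out.append((char) b);
        }
        return consumed == 0 ? -1 : consumed;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello\nworld\n".getBytes("UTF-8"));
        StringBuilder line = new StringBuilder();
        int n = readLine(in, line);
        System.out.println(n + " " + line); // 6 hello
    }
}
```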
bq. The reason for "maxBytesToConsume" is to tell readLine the end of this
block - there is no reason for readLine to go through tens of gigs of data
searching for an end of line when the current block is only 128MB.
A far more portable solution for what this expresses would be an InputFormat
generating a subclass of FileSplit annotated with a hard limit enforced by the
RecordReader (i.e. returns EOF at some position within the file). Some of this
will inevitably be done as part of the Hadoop archive work (HADOOP-3307). As a
workaround, don't point text readers at binary data. ;)
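The hard-limit idea above can be sketched without any Hadoop types: wrap the split's stream so the reader sees EOF at a fixed byte position, which is roughly what a bounded FileSplit subclass plus an enforcing RecordReader would amount to. The class name here is hypothetical, not a Hadoop API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of "a hard limit enforced by the RecordReader": an
// InputStream wrapper that reports EOF once `limit` bytes have been
// consumed, so a line reader can never run past its split.
// Illustrative only; not an actual Hadoop class.
public class BoundedStream extends InputStream {
    private final InputStream in;
    private long remaining;  // bytes still allowed to be read

    public BoundedStream(InputStream in, long limit) {
        this.in = in;
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;  // synthetic EOF at the split boundary
        }
        int b = in.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    public static void main(String[] args) throws IOException {
        // A 5-byte "split" over a longer stream: the reader sees only
        // "split", never the trailing data.
        InputStream s = new BoundedStream(
                new ByteArrayInputStream("split-and-much-more".getBytes("UTF-8")), 5);
        StringBuilder seen = new StringBuilder();
        int b;
        while ((b = s.read()) != -1) {
            seen.append((char) b);
        }
        System.out.println(seen); // split
    }
}
```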
> better fault tolerance for corrupted text files
> -----------------------------------------------
>
> Key: HADOOP-3144
> URL: https://issues.apache.org/jira/browse/HADOOP-3144
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.15.3
> Reporter: Joydeep Sen Sarma
> Assignee: Zheng Shao
> Attachments: 3144-ignore-spaces-2.patch, 3144-ignore-spaces-3.patch
>
>
> every once in a while we encounter corrupted text files (corrupted at the
> source prior to copying into hadoop). inevitably, some of the data looks
> like a really really long line, and hadoop trips over trying to stuff it into
> an in-memory object and gets an out-of-memory error. the code looks the same
> in trunk as well.
> so we're looking for an option to the textinputformat (and the like) to ignore
> long lines. ideally we would just skip errant lines above a certain size limit.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.