[
https://issues.apache.org/jira/browse/HADOOP-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593938#action_12593938
]
Chris Douglas commented on HADOOP-3144:
---------------------------------------
This looks good, but I don't understand this change:
{noformat}
@@ -166,7 +207,7 @@
boolean skipFirstLine = false;
if (codec != null) {
in = new LineReader(codec.createInputStream(fileIn), job);
- end = Long.MAX_VALUE;
+ end = Integer.MAX_VALUE;
} else {
if (start != 0) {
skipFirstLine = true;
{noformat}
Is this to avoid overflow in the cast to int in next()?
Instead, consider the current call to readLine in next():
{noformat}
+ int newSize = in.readLine(value, maxLineLength,
+ (int)Math.max(end-pos, (long)maxLineLength));
{noformat}
It might be better to use (with appropriate casts):
{noformat}
+      int newSize = in.readLine(value, maxLineLength,
+          Math.max(Math.min(end - pos, Integer.MAX_VALUE),
+                   maxLineLength));
{noformat}
That would keep end = Long.MAX_VALUE correct while avoiding the overflow, right?
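For what it's worth, a quick sketch of the overflow in question. This is not the patch itself; the helper name maxBytesToConsume is just illustrative. It shows that a bare cast of Long.MAX_VALUE to int truncates to -1 (a nonsense byte count), while the min/max clamp saturates at Integer.MAX_VALUE instead:

```java
public class ClampDemo {
    // Illustrative helper mirroring the suggested expression:
    // clamp end - pos into the range [maxLineLength, Integer.MAX_VALUE].
    static int maxBytesToConsume(long pos, long end, int maxLineLength) {
        return (int) Math.max(Math.min(end - pos, Integer.MAX_VALUE),
                              maxLineLength);
    }

    public static void main(String[] args) {
        // Naive cast: the low 32 bits of Long.MAX_VALUE are all ones,
        // so the result is -1.
        System.out.println((int) Long.MAX_VALUE);                         // -1
        // Clamped: a huge remaining range saturates at Integer.MAX_VALUE.
        System.out.println(maxBytesToConsume(0L, Long.MAX_VALUE, 4096));  // 2147483647
        // Small remaining range is still floored at maxLineLength.
        System.out.println(maxBytesToConsume(100L, 150L, 4096));          // 4096
    }
}
```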
> better fault tolerance for corrupted text files
> -----------------------------------------------
>
> Key: HADOOP-3144
> URL: https://issues.apache.org/jira/browse/HADOOP-3144
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.15.3
> Reporter: Joydeep Sen Sarma
> Assignee: Zheng Shao
> Attachments: 3144-4.patch, 3144-ignore-spaces-2.patch,
> 3144-ignore-spaces-3.patch
>
>
> Every once in a while we encounter corrupted text files (corrupted at the
> source, prior to copying into Hadoop). Inevitably, some of the data looks
> like a really, really long line, and Hadoop trips over trying to stuff it
> into an in-memory object and gets an out-of-memory error. The code looks
> the same in trunk as well.
> So we are looking for an option to TextInputFormat (and the like) to ignore
> long lines; ideally, we would just skip errant lines above a certain size limit.