[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046138#comment-14046138 ]

Jason Lowe commented on HADOOP-9867:
------------------------------------

Actually I agree with Rushabh that there are at least two somewhat different 
problems here.  The original problem reported in the JIRA has to do with 
records being dropped with uncompressed inputs.  We should fix that issue so we 
don't drop data when using an uncompressed input.  I'm assuming Rushabh's patch 
solves that issue, but I haven't looked at it in detail just yet.

There's another issue related to mistaken record delimiter recognition, where 
the subsequent split reader can accidentally think it found a delimiter when in 
fact the real record delimiter is somewhere else.  For example, suppose the 
delimiter is 'xxx'.  If the subsequent split reader sees 'xxxxyzxxx' at the 
beginning of its split then it will toss out everything up to and including the 
first delimiter match (i.e.: the leading 'xxx') and then read 'xyz' as the next 
record.  That may or may not be the correct behavior, because with that kind of 
delimiter and data the correct behavior depends upon the _previous_ split's 
data.  If the previous split ended with 'abc' then the behavior was correct and 
there are two records in the stream: 'abc' and 'xyz'.  If the previous split 
ended with 'abcx' then the behavior was incorrect: the records should be 'abc' 
and 'xxyz', but the second split reader will report an 'xyz' record that 
shouldn't exist.
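
To make the ambiguity concrete, here is a minimal standalone sketch (plain 
Java; this is not the actual LineRecordReader code, and parse() uses a simple 
greedy indexOf scan as a stand-in for the real delimiter matching).  The 
second reader's discard-then-read heuristic reports 'xyz' no matter what the 
previous split ended with:

    import java.util.ArrayList;
    import java.util.List;

    public class DelimiterAmbiguityDemo {
        // Greedy left-to-right record scan, the way a reader that owns the
        // start of the stream would parse it.  A trailing partial record is
        // ignored for brevity.
        static List<String> parse(String data, String delim) {
            List<String> records = new ArrayList<>();
            int pos = 0, next;
            while ((next = data.indexOf(delim, pos)) >= 0) {
                records.add(data.substring(pos, next));
                pos = next + delim.length();
            }
            return records;
        }

        // What the second split's reader does today: discard everything up
        // to and including the first delimiter match, then parse the rest.
        static List<String> parseSecondSplit(String split, String delim) {
            int first = split.indexOf(delim);
            return parse(split.substring(first + delim.length()), delim);
        }

        public static void main(String[] args) {
            String delim = "xxx", secondSplit = "xxxxyzxxx";
            // Previous split ended with "abc": true records are abc, xyz.
            System.out.println(parse("abc" + secondSplit, delim));    // [abc, xyz]
            // Previous split ended with "abcx": true records are abc, xxyz.
            System.out.println(parse("abcx" + secondSplit, delim));   // [abc, xxyz]
            // The second reader reports xyz either way -- wrong in the
            // second case.
            System.out.println(parseSecondSplit(secondSplit, delim)); // [xyz]
        }
    }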

To solve that problem either a split reader would have to examine the prior 
split's data to distinguish this case, or the split reader would have to 
realize it's an ambiguous situation and leave the record processing to the 
previous split reader to handle.  The former can be very expensive if the prior 
split is compressed, as the reader may have to decompress the entire split.  
Also this can get very tricky, and a reader may need to read more than one 
other split to resolve it.  For example, if the data stream is 
'axxxxxxxxxxxxx......xxxxxxbxxxxxx......xxxxxcxxxxxx' then the first reader may 
have to scan far down into subsequent splits, since it is the only one that 
knows where the true record boundaries are.  Simply tacking on an extra 
character at the beginning of that input can change where the record boundaries 
are and what the record contents are, even in the last split of the input.  
Solving this requires a different high-level approach to split processing than 
what we have today (i.e.: throw away the first record and go), so I believe 
that's something better left to a followup JIRA.
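
A second toy sketch (again standalone Java with a simple indexOf-based scan, 
not the real reader) shows why local bytes aren't enough: with delimiter 
'xxx', how many x's end up in the record containing 'b' depends on the length 
of the preceding run of x's modulo 3, yet a reader whose split begins inside 
that run sees identical bytes in both cases:

    import java.util.ArrayList;
    import java.util.List;

    public class MidRunDemo {
        public static void main(String[] args) {
            String delim = "xxx";
            // Seven x's between 'a' and 'b' yields records [a, , xb];
            // eight x's yields [a, , xxb].  The difference is invisible
            // to a reader that starts in the middle of the x-run.
            for (String stream : new String[] { "axxxxxxxb" + delim,
                                                "axxxxxxxxb" + delim }) {
                List<String> records = new ArrayList<>();
                int pos = 0, next;
                while ((next = stream.indexOf(delim, pos)) >= 0) {
                    records.add(stream.substring(pos, next));
                    pos = next + delim.length();
                }
                System.out.println(records);
            }
        }
    }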

It'd be nice to solve the dropped-record problem for scenarios where we don't 
have to worry about mistaken record delimiter recognition in the data, as 
that's an incremental improvement from where we are today.  I'll try to get 
some time to review the latest patch and provide comments soon.

> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
> delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9867
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Vinayakumar B
>            Priority: Critical
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, 
> HADOOP-9867.patch
>
>
> Having defined a record delimiter of multiple bytes in a new InputFileFormat 
> sometimes has the effect of skipping records from the input.
> This happens when the input splits are split off just after a record 
> separator. The starting point for the next split would be non-zero and 
> skipFirstLine would be true. A seek into the file is done to start - 1 and 
> the text until the first record delimiter is ignored (due to the presumption 
> that this record is already handled by the previous map task). Since the 
> record delimiter is multibyte, the seek only got the last byte of the 
> delimiter into scope, and it's not recognized as a full delimiter. So the 
> text is skipped until the next delimiter (ignoring a full record!!)
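
For illustration, here is a rough standalone sketch (a hypothetical 
simplification, not the real LineRecordReader) of the seek-to-(start - 1) 
skip logic described above, showing how landing on the last byte of a 
multibyte delimiter causes a whole record to be dropped:

    public class SkipFirstRecordDemo {
        // Simplified version of the skip described above: seek to
        // (splitStart - 1), then discard bytes until a full delimiter
        // match, assuming the previous task already emitted that record.
        static int positionAfterSkip(byte[] data, int splitStart, byte[] delim) {
            int pos = Math.max(0, splitStart - 1);
            int matched = 0;
            while (pos < data.length && matched < delim.length) {
                if (data[pos] == delim[matched]) {
                    matched++;
                } else {
                    matched = (data[pos] == delim[0]) ? 1 : 0;
                }
                pos++;
            }
            return pos; // first byte this reader treats as its own data
        }

        public static void main(String[] args) {
            byte[] delim = "\r\n".getBytes();
            byte[] data = "rec1\r\nrec2\r\nrec3".getBytes();
            // The split boundary falls at offset 6, just after "rec1\r\n".
            // Seeking to 5 lands on the delimiter's last byte ('\n'), so
            // the matcher never sees a full "\r\n" there and keeps skipping
            // until the delimiter after "rec2".
            int pos = positionAfterSkip(data, 6, delim);
            System.out.println(new String(data, pos, data.length - pos));
            // prints "rec3" -- "rec2" is never read by either task
        }
    }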


