[jira] [Commented] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

Dustin Cote (JIRA) Mon, 16 Nov 2015 08:08:02 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006839#comment-15006839
 ]


Dustin Cote commented on MAPREDUCE-6549:
----------------------------------------

[~wilfreds]

Are you sure your fix is the right one?  I changed the tests because they
validate an incomplete record at the end of the input (that is the part of my
fix that breaks the tests in the mapred package, since I forgot to change
those).  I'm asking because I would expect the following:
Input: abcdefghij++kl++mno
Output records: 1) abcdefghij 2) kl

It looks like your tests do the same thing.  mno doesn't have a delimiter at
the end, so isn't it garbage data, i.e. an incomplete record?  Dropping it is
the behavior I would expect as a user of the API, but I don't see any real
documentation of this for multibyte delimiters.  If we're going to commit to
treating trailing data without a delimiter as a record, then that should be
documented as well.  Otherwise, I'd rather merge our patches together and
verify that the above scenario is what happens, instead of pulling in
undelimited data at the end of the file.
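
To make the two candidate behaviors concrete, here is a minimal, self-contained
sketch. It is not LineRecordReader and does not model splits; the class
DelimiterSemantics, the method split, and the keepTrailing flag are hypothetical
names used only to illustrate the question above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a plain in-memory splitter, not Hadoop code.
public class DelimiterSemantics {

    // Splits input on a (possibly multibyte) delimiter.
    // keepTrailing = false: data after the last delimiter is dropped as incomplete.
    // keepTrailing = true:  data after the last delimiter is emitted as a record.
    static List<String> split(String input, String delimiter, boolean keepTrailing) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = input.indexOf(delimiter, start)) >= 0) {
            records.add(input.substring(start, idx));
            start = idx + delimiter.length();
        }
        if (keepTrailing && start < input.length()) {
            records.add(input.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "abcdefghij++kl++mno";
        System.out.println(split(input, "++", false)); // [abcdefghij, kl]
        System.out.println(split(input, "++", true));  // [abcdefghij, kl, mno]
    }
}
```

Whichever of the two results we pick is what the tests should assert and what
the documentation should state.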

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6549
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv1, mrv2
>    Affects Versions: 2.7.2
>            Reporter: Dustin Cote
>            Assignee: Wilfred Spiegelenburg
>         Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain
> scenarios such as:
> 1) input string: "abc+++def++ghi++"
> delimiter string: "+++"
> test passes with all sizes of the split
> 2) input string: "abc++def+++ghi++"
> delimiter string: "+++"
> test fails with a split size of 4
> 3) input string: "abc+++def++ghi++"
> delimiter string: "++"
> test fails with a split size of 5
> 4) input string: "abc+++defg++hij++"
> delimiter string: "++"
> test fails with a split size of 4
> 5) input string: "abc++def+++ghi++"
> delimiter string: "++"
> test fails with a split size of 9



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
