[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-21 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828993#comment-13828993
 ] 

Jason Lowe commented on HADOOP-9622:


bq. There are already two TestLineRecordReader test classes, in the mapred and 
mapreduce.lib.input packages of the hadoop-mapreduce-client-jobclient project. 
It would be better to move the included tests into these classes instead of 
creating additional classes.

I'd much rather keep the unit tests for LineRecordReader in the same module as 
the code, so that when the code is updated Jenkins will run the tests to catch 
errors.  If we move these unit tests to the jobclient module, then a patch that 
touches only LineRecordReader in the core module won't run them, since they're 
in a different module.

Instead I'd rather rename the TestLineRecordReader tests in the jobclient 
module to something like TestLineRecordReaderJobs.  Those tests are really 
integration tests rather than unit tests, since they're running a job for each 
test rather than just the LineRecordReader in isolation.

 bzip2 codec can drop records when reading data in splits
 

 Key: HADOOP-9622
 URL: https://issues.apache.org/jira/browse/HADOOP-9622
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 2.0.4-alpha, 0.23.8
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: HADOOP-9622-2.patch, HADOOP-9622-testcase.patch, 
 HADOOP-9622.patch, blockEndingInCR.txt.bz2, blockEndingInCRThenLF.txt.bz2


 Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when 
 reading them in splits based on where record delimiters occur relative to 
 compression block boundaries.
 Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-21 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829028#comment-13829028
 ] 

Vinay commented on HADOOP-9622:
---

That sounds better, Jason. 
+1 for the existing patch then.


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-20 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828448#comment-13828448
 ] 

Vinay commented on HADOOP-9622:
---

Thanks Jason for the patch for this tricky issue.
Patch looks good to me.

One small nit.
There are already two TestLineRecordReader test classes, in the mapred and 
mapreduce.lib.input packages of the hadoop-mapreduce-client-jobclient project. 
It would be better to move the included tests into these classes instead of 
creating additional classes.


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-19 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826352#comment-13826352
 ] 

Chris Douglas commented on HADOOP-9622:
---

bq. I'm tempted to handle this as a separate JIRA since I believe this will be 
an issue only with uncompressed inputs after this patch.

Yeah, that makes sense, particularly since this issue covers the codec and the 
custom delimiter bug is in the text processing. Thanks for looking into it.

bq. With this patch I think we have this case covered for compressed input due 
to the needAdditionalRecordAfterSplit logic.

I... think that's true. We can think about it in the followup.


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826706#comment-13826706
 ] 

Jason Lowe commented on HADOOP-9622:


Turns out there's already a followup for multibyte custom delimiters at 
HADOOP-9867, so I'll add the testcase and relevant details to that JIRA.

Thanks for the review, Chris.  Given your earlier +1 I think this is now ready 
to go as-is.  If there are no objections I'll commit this in the next few days.


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825854#comment-13825854
 ] 

Jason Lowe commented on HADOOP-9622:


bq. {{inDelimiter}} is insufficient because {{LineReader::readDefaultLine}} 
will match \n, while {{LineReader::readCustomLine}} would consider a partial 
match incomplete and require an extra line?

Yes, the crux of the issue is that the default delimiter handling accepts parts 
of the delimiter as valid delimiters in their own right (i.e., \r\n is a 
delimiter, but so is \r or \n).  The custom delimiter support does not allow a 
subset of the specified delimiter to act as a delimiter, so it won't recognize 
a partial delimiter at the start of the split as a delimiter and will read an 
extra line before starting.
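
The difference between the two matching modes can be illustrated with a small, self-contained sketch. This is a simplified model written for this discussion, not Hadoop's {{LineReader}}; the class and method names ({{DelimiterMatchSketch}}, {{splitDefault}}, {{splitCustom}}) are hypothetical, and it works on an in-memory String rather than a stream.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the two delimiter-matching modes; not Hadoop's LineReader. */
final class DelimiterMatchSketch {

    /** Default mode: "\r\n", a lone "\r", or a lone "\n" each end a record. */
    static List<String> splitDefault(String data) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\r' || c == '\n') {
                // Treat "\r\n" as one delimiter rather than two.
                if (c == '\r' && i + 1 < data.length() && data.charAt(i + 1) == '\n') {
                    i++;
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    /** Custom mode: only the complete delimiter ends a record. */
    static List<String> splitCustom(String data, String delimiter) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (pos < data.length()) {
            int hit = data.indexOf(delimiter, pos);
            if (hit < 0) {
                records.add(data.substring(pos)); // trailing record without delimiter
                break;
            }
            records.add(data.substring(pos, hit));
            pos = hit + delimiter.length();
        }
        return records;
    }

    public static void main(String[] args) {
        // A reader that starts on the "\n" half of a "\r\n" pair:
        System.out.println(splitDefault("\nb").size());        // default mode sees a delimiter first
        System.out.println(splitCustom("\nb", "\r\n").size()); // custom mode sees no delimiter at all
    }
}
```

In this model, a reader that begins on the trailing half of a delimiter sees an (empty) record boundary in default mode, while in custom mode the leftover bytes are simply absorbed into the next record.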

bq. I looked briefly at the custom delimiter code, and I'm not seeing how it 
handles splits that start in the middle of a delimiter. I must be missing 
something obvious...

Yeah, it does look like there's a problem with the handling of custom record 
delimiters on uncompressed input.  For this to work properly we need the 
consumer of the previous split to handle all bytes up to and including the 
first full record delimiter that starts at or after its split ends.  With this 
patch I think we have this case covered for compressed input due to the 
needAdditionalRecordAfterSplit logic.  However since the custom delimiter line 
reader seems to be returning the size of the record and subsequent delimiter 
bytes as the bytes consumed, I think we will end up reporting the end of the 
split too early to the LineRecordReader for uncompressed data in the case where 
the delimiter straddles the split boundary.
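
The ownership rule described above (the previous split consumes through the first full delimiter that starts at or after its end) can be sketched as follows. This is a toy model under stated assumptions, not the HADOOP-9622 patch: the class {{SplitOwnershipSketch}} and its {{readSplit}} method are hypothetical, the input is an in-memory ASCII String so character indices stand in for byte offsets, and compression is ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the split-ownership rule; not the HADOOP-9622 patch itself. */
final class SplitOwnershipSketch {

    /** Returns the records owned by the split [start, end) of data. */
    static List<String> readSplit(String data, String delim, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Offset where the delimiter preceding the next record started.
        // The first split has no preceding delimiter, so use offset 0.
        int delimStart = 0;
        if (start != 0) {
            // The previous split owns everything through the first full
            // delimiter that starts at or after our start, so skip past it.
            delimStart = data.indexOf(delim, start);
            if (delimStart < 0) {
                return records; // no complete delimiter left: nothing is ours
            }
            pos = delimStart + delim.length();
        }
        // A record is ours only if the delimiter before it started before `end`.
        while (delimStart < end && pos < data.length()) {
            int hit = data.indexOf(delim, pos);
            if (hit < 0) {
                records.add(data.substring(pos)); // trailing record without delimiter
                break;
            }
            records.add(data.substring(pos, hit));
            pos = hit + delim.length();
            delimStart = hit;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "abcxxxdefxxxghixxx";
        // Offset 4 falls in the middle of the first "xxx" delimiter.
        System.out.println(readSplit(data, "xxx", 0, 4));
        System.out.println(readSplit(data, "xxx", 4, data.length()));
    }
}
```

The key design choice in this sketch is that the stopping condition looks at where the last delimiter started, not at how many bytes have been consumed, so a delimiter that straddles the boundary is attributed to exactly one split.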

To verify there's a problem, I ran a simple wordcount on the following input 
data:

{noformat}

abcxxx
defxxx
ghixxx
jklxxx
mnoxxx
pqrxxx
stuxxx
vw xxx
xyzxxx
{noformat}

and then I ran it with the options 
{{-Dmapreduce.input.fileinputformat.split.maxsize=34 
-Dtextinputformat.record.delimiter=xxx}}.  The resulting output looked like 
this:

{noformat}
abc 1
def 1
ghi 1
jkl 1
mno 1
stu 1
vw  1
xyz 1
{noformat}

So we dropped the pqr record.  Not good.
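
A minimal sketch of how a record can go missing in this scenario is given below. It rests on several assumptions that are not confirmed above: that the input file begins with the empty line shown in the listing (64 bytes in total), that the job produced two splits, [0, 34) and [34, 64), and that the reader stops based on bytes consumed while a non-first split discards everything through its first delimiter. The class {{DroppedRecordSketch}} is a hypothetical simplified model, not Hadoop's {{LineRecordReader}}.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a bytes-consumed stopping rule; not Hadoop's LineRecordReader. */
final class DroppedRecordSketch {
    static final String DELIM = "xxx";

    // Assumption: the file starts with the empty line shown in the listing.
    static final String INPUT =
        "\nabcxxx\ndefxxx\nghixxx\njklxxx\nmnoxxx\npqrxxx\nstuxxx\nvw xxx\nxyzxxx\n";

    /** Offset just past the next full delimiter at or after pos, or end of data. */
    static int afterNextDelimiter(String data, int pos) {
        int hit = data.indexOf(DELIM, pos);
        return hit < 0 ? data.length() : hit + DELIM.length();
    }

    /** Reads split [start, end) and returns the non-empty trimmed records. */
    static List<String> readSplit(String data, int start, int end) {
        List<String> words = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // A later split discards its first, presumably partial, record.
            pos = afterNextDelimiter(data, pos);
        }
        // Stop once the bytes consumed carry the position past the split end.
        while (pos <= end && pos < data.length()) {
            int hit = data.indexOf(DELIM, pos);
            int recordEnd = hit < 0 ? data.length() : hit;
            String word = data.substring(pos, recordEnd).trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
            pos = hit < 0 ? data.length() : hit + DELIM.length();
        }
        return words;
    }

    public static void main(String[] args) {
        int splitSize = 34;
        System.out.println(readSplit(INPUT, 0, splitSize));
        System.out.println(readSplit(INPUT, splitSize, INPUT.length()));
    }
}
```

Under these assumptions, offset 34 is the last byte of the delimiter after mno. The first reader finishes at offset 35, just past the boundary, and stops; the second reader starts at 34, cannot see a full delimiter there, and discards everything through the delimiter that follows pqr.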

I'm tempted to handle this as a separate JIRA since I believe this will be an 
issue only with uncompressed inputs after this patch.


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-11-16 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1382#comment-1382
 ] 

Chris Douglas commented on HADOOP-9622:
---

+1 The unit test clearly demonstrates the bug and dropping records is severe 
enough that this case should be fixed.

The comment is helpful, thanks for including it. {{inDelimiter}} is 
insufficient because {{LineReader::readDefaultLine}} will match \n, while 
{{LineReader::readCustomLine}} would consider a partial match incomplete and 
require an extra line? I looked briefly at the custom delimiter code, and I'm 
not seeing how it handles splits that start in the middle of a delimiter. I 
must be missing something obvious...


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-06-18 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687158#comment-13687158
 ] 

Nathan Roberts commented on HADOOP-9622:


Thanks for adding the detailed comment. Well explained!
+1 for the patch.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits

2013-06-12 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681668#comment-13681668
 ] 

Nathan Roberts commented on HADOOP-9622:


Looked over patch and approach seems reasonable. I'm not sure there is much 
else you can do without changing the codecs themselves, or forcing everything 
to move one byte at a time (so that we can know precisely when the codec moves 
beyond the split). 

It might help to add some comments somewhere in the code which specifically 
illustrate the boundary conditions. As you said, it's tricky (clearly, since 
both Pig and MR got it wrong in slightly different edge cases), so it certainly 
couldn't hurt to add some more commentary in this area.

 bzip2 codec can drop records when reading data in splits
 

 Key: HADOOP-9622
 URL: https://issues.apache.org/jira/browse/HADOOP-9622
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 2.0.4-alpha, 0.23.8
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: blockEndingInCRThenLF.txt.bz2, blockEndingInCR.txt.bz2, 
 HADOOP-9622.patch, HADOOP-9622-testcase.patch


 Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when 
 reading them in splits based on where record delimiters occur relative to 
 compression block boundaries.
 Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.
