[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2017-01-05 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6481:
--
Fix Version/s: 2.8.0

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> 
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1
>
> Attachments: MAPREDUCE-6481.000.patch
>
>
> LineRecordReader may sometimes return an incomplete record and wrong 
> position/key information for uncompressed input.
> There are two issues:
> # LineRecordReader may return an incomplete record: some characters are cut 
> off at the end of the record.
> # LineRecordReader may return wrong position/key information.
> The first issue occurs only with a custom delimiter and is caused by the 
> following code in {{LineReader#readCustomLine}}:
> {code}
> if (appendLength > 0) {
>   if (ambiguousByteCount > 0) {
>     str.append(recordDelimiterBytes, 0, ambiguousByteCount);
>     // appending the ambiguous characters (refer case 2.2)
>     bytesConsumed += ambiguousByteCount;
>     ambiguousByteCount = 0;
>   }
>   str.append(buffer, startPosn, appendLength);
>   txtLength += appendLength;
> }
> {code}
> If {{appendLength}} is 0 and {{ambiguousByteCount}} is not 0, the bug is 
> triggered: the ambiguous bytes are only appended inside the 
> {{appendLength > 0}} branch, so they are dropped. For example, with input 
> "123456789aab", custom delimiter "ab", bufferSize 10 and splitLength 12, the 
> correct record is "123456789a" with length 10, but the current code returns 
> the incomplete record "123456789" with length 9.
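
The append step can be isolated in a small standalone sketch. This is not the Hadoop source and not the committed patch: the class name, the method names and the use of StringBuilder in place of Text are assumptions made for illustration, and the bytesConsumed/txtLength bookkeeping is left out. It replays the state from the example (record so far "123456789", one ambiguous byte pending, appendLength 0) through the quoted logic and through a hypothetical variant that flushes the ambiguous bytes independently of the buffer append.

```java
import java.nio.charset.StandardCharsets;

public class AmbiguousAppendSketch {

  // Mirrors the append step quoted in the issue: the ambiguous bytes are
  // flushed only when appendLength > 0. Returns the remaining ambiguous count.
  static int appendAsQuoted(StringBuilder str, byte[] delimiter,
      int ambiguousByteCount, byte[] buffer, int startPosn, int appendLength) {
    if (appendLength > 0) {
      if (ambiguousByteCount > 0) {
        str.append(new String(delimiter, 0, ambiguousByteCount,
            StandardCharsets.UTF_8));
        ambiguousByteCount = 0;
      }
      str.append(new String(buffer, startPosn, appendLength,
          StandardCharsets.UTF_8));
    }
    return ambiguousByteCount;
  }

  // Hypothetical variant: flush the ambiguous bytes whenever appendLength is
  // not negative. A negative appendLength is taken here to mean that the
  // delimiter was completed across the buffer boundary, in which case the
  // pending bytes really were delimiter bytes and must not be appended.
  static int appendHoisted(StringBuilder str, byte[] delimiter,
      int ambiguousByteCount, byte[] buffer, int startPosn, int appendLength) {
    if (appendLength >= 0 && ambiguousByteCount > 0) {
      str.append(new String(delimiter, 0, ambiguousByteCount,
          StandardCharsets.UTF_8));
      ambiguousByteCount = 0;
    }
    if (appendLength > 0) {
      str.append(new String(buffer, startPosn, appendLength,
          StandardCharsets.UTF_8));
    }
    return ambiguousByteCount;
  }

  public static void main(String[] args) {
    byte[] delimiter = "ab".getBytes(StandardCharsets.UTF_8);
    // Second buffer fill in the example holds only the real delimiter "ab".
    byte[] buffer = "ab".getBytes(StandardCharsets.UTF_8);

    StringBuilder quoted = new StringBuilder("123456789");
    StringBuilder hoisted = new StringBuilder("123456789");
    appendAsQuoted(quoted, delimiter, 1, buffer, 0, 0);
    appendHoisted(hoisted, delimiter, 1, buffer, 0, 0);

    System.out.println(quoted);  // 123456789
    System.out.println(hoisted); // 123456789a
  }
}
```

The {{appendLength >= 0}} guard in the variant is a design choice of this sketch; whether the real fix needs further bookkeeping is not covered here.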
> The second issue can occur with both custom and default delimiters and is 
> caused by the code in {{UncompressedSplitLineReader#readLine}}, which may 
> report wrong size information in some corner cases. The cause is 
> {{unusedBytes}} in the following code:
> {code}
> bytesRead += unusedBytes;
> unusedBytes = bufferSize - getBufferPosn();
> bytesRead -= unusedBytes;
> {code}
> If the last read returned fewer bytes ({{bufferLength}}) than {{bufferSize}}, 
> the previous {{unusedBytes}} is wrong: it should be {{bufferLength}} - 
> {{bufferPosn}} instead of {{bufferSize}} - {{bufferPosn}}, so the reported 
> value is too large.
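
The size of the error follows directly from the two formulas. The sketch below is plain arithmetic, not Hadoop code: the class and method names are made up for illustration, and the values (bufferSize 10, a short read of 5 bytes) are taken from the issue's own example. It shows that the over-count is bufferSize - bufferLength for every buffer position.

```java
public class UnusedBytesSketch {

  // Formula as quoted in the issue: assumes the buffer was filled completely.
  static int unusedAssumingFullBuffer(int bufferSize, int bufferPosn) {
    return bufferSize - bufferPosn;
  }

  // Formula the issue says should be used after a short read.
  static int unusedFromBufferLength(int bufferLength, int bufferPosn) {
    return bufferLength - bufferPosn;
  }

  // How many bytes too many the quoted formula reports.
  static int overcount(int bufferSize, int bufferLength, int bufferPosn) {
    return unusedAssumingFullBuffer(bufferSize, bufferPosn)
        - unusedFromBufferLength(bufferLength, bufferPosn);
  }

  public static void main(String[] args) {
    int bufferSize = 10;
    int bufferLength = 5; // the short read from the issue's example
    for (int posn = 0; posn <= bufferLength; posn++) {
      // The over-count is 5 at every position.
      System.out.println("bufferPosn=" + posn
          + " overcount=" + overcount(bufferSize, bufferLength, posn));
    }
  }
}
```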
> For example, with input "1234567890ab12ab345", custom delimiter "ab", 
> bufferSize 10 and two splits (first splitLength 15, second splitLength 4), 
> the current code gives the following result:
> First record: Key:0 Value:"1234567890"
> Second record: Key:12 Value:"12"
> Third record: Key:21 Value:"345"
> The key for the third record is wrong: it should be 16 instead of 21. This 
> is due to the wrong {{unusedBytes}}. {{fillBuffer}} read 10 bytes the first 
> time but only 5 bytes the second time, which is 5 bytes less than 
> bufferSize. That is why the key we get is 5 bytes larger than the correct 
> one.
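
The expected keys in this example can be checked independently of the reader. The sketch below is a naive reference, not LineRecordReader: it ignores buffers and splits, uses a hypothetical class name, and assumes ASCII input so that character offsets equal byte offsets. It simply cuts the input at each delimiter and records where each record starts.

```java
import java.util.ArrayList;
import java.util.List;

public class ExpectedKeysSketch {

  // Returns "offset:value" for each delimiter-separated record.
  static List<String> records(String input, String delimiter) {
    List<String> out = new ArrayList<>();
    int start = 0;
    while (start < input.length()) {
      int end = input.indexOf(delimiter, start);
      if (end < 0) {
        end = input.length(); // last record has no trailing delimiter
      }
      out.add(start + ":" + input.substring(start, end));
      start = end + delimiter.length();
    }
    return out;
  }

  public static void main(String[] args) {
    // Prints [0:1234567890, 12:12, 16:345]
    System.out.println(records("1234567890ab12ab345", "ab"));
  }
}
```

The third record starts at offset 16, which matches the key the issue says should be reported.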



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-11-23 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-6481:
--
Fix Version/s: 2.6.3

I committed this to branch-2.6.

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> 
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: MAPREDUCE-6481.000.patch
>
>


[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-09-18 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-6481:
--
Fix Version/s: (was: 2.8.0)
   2.7.2

I committed this to branch-2.7 as well.

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> 
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.7.2
>
> Attachments: MAPREDUCE-6481.000.patch
>
>


[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-09-17 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated MAPREDUCE-6481:
-
Status: Patch Available  (was: Open)

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> 
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: MAPREDUCE-6481.000.patch
>
>


[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-09-17 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated MAPREDUCE-6481:
-
Attachment: MAPREDUCE-6481.000.patch

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> 
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: MAPREDUCE-6481.000.patch
>
>


[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-09-17 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-6481:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Thanks [~zxu]!  I committed this to trunk and branch-2.

It would be nice to get this fixed for 2.7.2, but the patch doesn't apply.  
Could you provide a patch for branch-2.7 as well?

> LineRecordReader may give incomplete record and wrong position/key 
> information for uncompressed input sometimes.
> 
>
> Key: MAPREDUCE-6481
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6481.000.patch
>
>


[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-09-16 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated MAPREDUCE-6481:
-
Description: 

[jira] [Updated] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.

2015-09-16 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated MAPREDUCE-6481:
-
Description: 