ConfX created MAPREDUCE-7444:
--------------------------------

             Summary: LineReader improperly sets AdditionalRecord, causing 
extra line read
                 Key: MAPREDUCE-7444
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7444
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: ConfX
         Attachments: reproduce.sh

h2. What happened:

When {{io.file.buffer.size}} is set to certain exact values, the LineReader 
sometimes mistakenly reads an extra line after the end of a split.

Note that this bug is highly critical: given knowledge of the buffer size 
setting and control over the file being read, a malicious party can craft a 
file that makes the {{while}} loop run forever.
h2. Buggy code:

Consider reading a record file with multiple lines. In MapReduce, this is done 
using 
[org.apache.hadoop.mapred.LineRecordReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java#L249],
 which calls 
[org.apache.hadoop.util.LineReader.readCustomLine|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L268]
 if you specify a custom delimiter (i.e., anything other than the default '\n' or "\r\n").

Internally, if the file is compressed, reading is implemented by 
[org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java].
 Notice that in its 
[fillBuffer|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java#L128]
 function, which refills the buffer, when {{inDelimiter}} is {{true}} and EOF 
has not yet been reached, the additional-record switch is set to {{true}} 
(assuming the delimiter is not CRLF).
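That flag handling can be sketched in a few lines (a simplified model, not the actual Hadoop source; the field and method names mirror the linked file, but the stream and buffer handling is omitted):
{noformat}
// Simplified model of the flag decision in
// CompressedSplitLineReader.fillBuffer() -- NOT the actual Hadoop source.
public class FillBufferSketch {
    private boolean needAdditionalRecord = false;
    private final boolean usingCRLF; // true only for the "\r\n" delimiter

    public FillBufferSketch(boolean usingCRLF) {
        this.usingCRLF = usingCRLF;
    }

    // inDelimiter: the previous buffer ended partway through the delimiter.
    // bytesRead: result of the underlying read; <= 0 means EOF was reached.
    public void onRefill(boolean inDelimiter, int bytesRead) {
        if (inDelimiter && bytesRead > 0 && !usingCRLF) {
            // Mid-delimiter and not at EOF: assume the split ends inside a
            // record, so flag that one additional record must be read.
            needAdditionalRecord = true;
        }
    }

    public boolean needAdditionalRecordAfterSplit() {
        return needAdditionalRecord;
    }
}
{noformat}
In this model, a refill that happens mid-delimiter with more data available always raises the flag; the bug described below is that nothing later lowers it.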

Now, we sometimes want to read the file in separate splits (or up to a 
specific position). Consider the case where the end of the split falls on a 
delimiter; this is where things get interesting. As a simple illustration, 
suppose the delimiter is 10 bytes long and the buffer size is chosen so that 
the second-to-last buffer ends exactly in the middle of the delimiter. In the 
final iteration of the loop in {{{}readCustomLine{}}}, the buffer is refilled first:
{noformat}
...
if (bufferPosn >= bufferLength) {
    startPosn = bufferPosn = 0;
    bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
...{noformat}
Note that at this point {{needAdditionalRecord}} is set to {{true}} in 
{{{}CompressedSplitLineReader{}}}.
Next, after reading and matching the second half of the final delimiter, the 
function calculates the number of bytes read:
{noformat}
int readLength = bufferPosn - startPosn;
bytesConsumed += readLength;
int appendLength = readLength - delPosn;
...
if (appendLength >= 0 && ambiguousByteCount > 0) {
...
    unsetNeedAdditionalRecordAfterSplit();
}{noformat}
Note that here {{readLength}} is 5 (half the delimiter length), so 
{{appendLength}} is {{{}5-10=-5{}}}. The {{if}} condition is therefore false 
and {{needAdditionalRecord}} is not unset.
In fact, {{needAdditionalRecord}} is never unset at any point before 
{{readCustomLine}} returns.
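The arithmetic can be checked with a small standalone example (hypothetical numbers matching the 10-byte-delimiter scenario above; the variable names mirror those in {{{}readCustomLine{}}}):
{noformat}
// Worked example: a 10-byte delimiter split so that its first 5 bytes ended
// the previous buffer (the "ambiguous" bytes) and its last 5 bytes open the
// final buffer.
public class AppendLengthDemo {
    public static void main(String[] args) {
        int startPosn = 0;
        int bufferPosn = 5;         // consumed the delimiter's second half
        int delPosn = 10;           // full delimiter matched
        int ambiguousByteCount = 5; // first half carried over from last buffer

        int readLength = bufferPosn - startPosn; // 5
        int appendLength = readLength - delPosn; // 5 - 10 = -5

        // Guard that would call unsetNeedAdditionalRecordAfterSplit():
        boolean unsetsFlag = appendLength >= 0 && ambiguousByteCount > 0;
        System.out.println("appendLength=" + appendLength
                + " unsetsFlag=" + unsetsFlag); // appendLength=-5 unsetsFlag=false
    }
}
{noformat}
Since {{appendLength}} is negative, the branch that would clear the flag is skipped.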
However, it should be set to {{{}false{}}}; otherwise, the next time the user 
calls {{next}}, the reader reads at least one extra line, because the 
{{while}} condition in {{{}next{}}}:
{noformat}
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) 
{{noformat}
is true: EOF has not been reached and {{needAdditionalRecord}} is still 
{{{}true{}}}.
This can lead to severe problems if, after every split, the start of the final 
buffer falls inside the delimiter.
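The effect on the record count can be modeled in a few lines (a toy model, not Hadoop code; the numbers are chosen to mirror the 60-vs-61 mismatch in the stack trace below):
{noformat}
// Toy model of the next() loop: records are returned while the position is
// inside the split OR the needAdditionalRecord flag is still set.
public class ExtraRecordModel {
    static int countRecords(long end, long recordLen, boolean flagStuck) {
        long pos = 0;
        int count = 0;
        boolean needAdditionalRecord = flagStuck;
        while (pos <= end || needAdditionalRecord) {
            if (pos > end) {
                // This iteration ran only because of the stuck flag:
                // it is the spurious extra record.
                needAdditionalRecord = false;
            }
            count++;
            pos += recordLen;
        }
        return count;
    }

    public static void main(String[] args) {
        // 60 records of 10 bytes filling a split that ends at offset 599:
        System.out.println(countRecords(599, 10, false)); // 60 (correct)
        System.out.println(countRecords(599, 10, true));  // 61 (extra record)
    }
}
{noformat}
With the flag correctly cleared the split yields 60 records; with the flag stuck at {{true}} it yields 61, exactly the failure the test below reports.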
h2. How to reproduce:

(1) Set {{io.file.buffer.size}} to {{188}}
(2) Run test: 
{{org.apache.hadoop.mapred.TestLineRecordReader#testBzipWithMultibyteDelimiter}}

For an easy reproduction please run the {{reproduce.sh}} included in the 
attachment.
h2. StackTrace:
{noformat}
java.lang.AssertionError: Unexpected number of records in split expected:<60> 
but was:<61>
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.failNotEquals(Assert.java:835)
    at org.junit.Assert.assertEquals(Assert.java:647)
    at 
org.apache.hadoop.mapred.TestLineRecordReader.testSplitRecordsForFile(TestLineRecordReader.java:110)
    at 
org.apache.hadoop.mapred.TestLineRecordReader.testBzipWithMultibyteDelimiter(TestLineRecordReader.java:684){noformat}
We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
