ConfX created MAPREDUCE-7444:
--------------------------------
Summary: LineReader improperly sets AdditionalRecord, causing
extra line read
Key: MAPREDUCE-7444
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7444
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: ConfX
Attachments: reproduce.sh
h2. What happened:
After setting {{io.file.buffer.size}} to certain precise values, the LineReader
sometimes mistakenly reads an extra line after the split.
Note that this bug is potentially severe: a malicious party with knowledge of
the buffer size setting and control over the file being read can craft a file
that makes the {{while}} loop run forever.
h2. Buggy code:
Consider reading a record file with multiple lines. In MapReduce, this is done
using
[org.apache.hadoop.mapred.LineRecordReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java#L249],
which then calls
[org.apache.hadoop.util.LineReader.readCustomLine|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L268]
when you specify your own delimiter (anything other than '\n' or "\r\n").
Internally, if the file is compressed, reading is implemented by
[org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java].
Notice that in its
[fillBuffer|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java#L128]
function, which refills the buffer, the switch for an additional record is set
to {{true}} whenever {{inDelimiter}} is {{true}} and EOF has not been reached
(assuming the delimiter is not CRLF).
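The decision just described can be sketched as follows. This is a simplified
paraphrase for illustration, not the actual Hadoop source: the field and
method names mirror the real ones, but the body is reduced to the one branch
that matters here.
{noformat}
// Simplified sketch of the fillBuffer decision described above
// (a paraphrase for illustration, not the actual Hadoop source).
public class FillBufferSketch {
    static boolean needAdditionalRecord = false;

    // inDelimiter: the previous buffer ended in the middle of a delimiter.
    // bytesRead <= 0 means EOF was reached on this refill.
    static int fillBuffer(int bytesRead, boolean inDelimiter) {
        if (inDelimiter && bytesRead > 0) {
            // Not at EOF while mid-delimiter: flag an additional record
            // (assuming a custom, non-CRLF delimiter).
            needAdditionalRecord = true;
        }
        return bytesRead;
    }

    public static void main(String[] args) {
        fillBuffer(128, true); // refill succeeded while mid-delimiter
        System.out.println(needAdditionalRecord); // prints "true"
    }
}
{noformat}
Once set this way, nothing on the path described below ever clears the flag.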
Now, we sometimes want to read the file in multiple splits (or only up to a
specific position). Consider the case where the end of the split is a
delimiter; this is where things get interesting. As an easy example, suppose
the delimiter is 10 bytes long and the buffer size is chosen so that the
second-to-last buffer ends exactly in the middle of the delimiter. In the
final round of the loop in {{readCustomLine}}, the buffer is refilled first:
{noformat}
...
if (bufferPosn >= bufferLength) {
  startPosn = bufferPosn = 0;
  bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
...{noformat}
Note that {{needAdditionalRecord}} would be set to {{true}} in
{{CompressedSplitLineReader}}.
Next, after reading and comparing the second half of the final delimiter, the
function calculates the number of bytes read:
{noformat}
int readLength = bufferPosn - startPosn;
bytesConsumed += readLength;
int appendLength = readLength - delPosn;
...
if (appendLength >= 0 && ambiguousByteCount > 0) {
  ...
  unsetNeedAdditionalRecordAfterSplit();
}{noformat}
Note that here {{readLength}} would be 5 (half the delimiter length) and
{{appendLength}} would be {{5-10=-5}}. Thus, the {{if}} condition is false and
{{needAdditionalRecord}} is not unset. In fact, the {{needAdditionalRecord}}
switch is never unset anywhere before {{readCustomLine}} returns.
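Plugging the numbers from the example above into that computation makes the
failure concrete. This is a standalone sketch of the arithmetic with the
variable names from {{readCustomLine}}, not the Hadoop code itself:
{noformat}
// Worked example: a 10-byte delimiter split 5/5 across the buffer
// boundary. Variable names mirror readCustomLine; values are from
// the example in this report.
public class AppendLengthSketch {
    // Returns true iff the guard that clears needAdditionalRecord fires.
    static boolean wouldClearFlag(int readLength, int delPosn,
                                  int ambiguousByteCount) {
        int appendLength = readLength - delPosn;
        return appendLength >= 0 && ambiguousByteCount > 0;
    }

    public static void main(String[] args) {
        int startPosn = 0;   // reading restarts at the top of the new buffer
        int bufferPosn = 5;  // only the delimiter's second half is here
        int delPosn = 10;    // the full 10-byte delimiter has been matched
        int ambiguousByteCount = 5; // first half, carried from the old buffer

        int readLength = bufferPosn - startPosn; // 5
        int appendLength = readLength - delPosn; // 5 - 10 = -5
        System.out.println(readLength + " " + appendLength + " "
            + wouldClearFlag(readLength, delPosn, ambiguousByteCount));
        // prints "5 -5 false" -- needAdditionalRecord stays set
    }
}
{noformat}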
However, it should be set to {{false}}; otherwise, the next time the user
calls {{next}}, the reader reads at least one extra line, because the
{{while}} condition in {{next}}:
{noformat}
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit())
{noformat}
is true: EOF has not been reached and {{needAdditionalRecord}} is set to
{{true}}.
This can lead to severe problems if, after every split, the start of the
final buffer falls inside the delimiter.
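The extra read can be seen by evaluating that loop condition with the stuck
flag. Again, this is only a sketch with illustrative numbers, not the actual
reader:
{noformat}
// Sketch of the while condition in next() with the flag stuck on:
// even though the file position is already past the split end, the
// condition is still true and another line is read.
public class NextConditionSketch {
    static boolean readsAnotherLine(long filePosition, long end,
                                    boolean needAdditionalRecordAfterSplit) {
        return filePosition <= end || needAdditionalRecordAfterSplit;
    }

    public static void main(String[] args) {
        long filePosition = 200; // already past...
        long end = 150;          // ...the end of this split
        boolean flagStuck = true; // never unset by readCustomLine
        System.out.println(readsAnotherLine(filePosition, end, flagStuck));
        // prints "true" -- an extra line is read
    }
}
{noformat}
With the flag correctly cleared, the same position/end pair would end the
loop, which is exactly what the missing {{unsetNeedAdditionalRecordAfterSplit()}}
call should ensure.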
h2. How to reproduce:
(1) Set {{io.file.buffer.size}} to {{188}}
(2) Run test:
{{org.apache.hadoop.mapred.TestLineRecordReader#testBzipWithMultibyteDelimiter}}
For an easy reproduction please run the {{reproduce.sh}} included in the
attachment.
h2. StackTrace:
{noformat}
java.lang.AssertionError: Unexpected number of records in split expected:<60>
but was:<61>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at
org.apache.hadoop.mapred.TestLineRecordReader.testSplitRecordsForFile(TestLineRecordReader.java:110)
at
org.apache.hadoop.mapred.TestLineRecordReader.testBzipWithMultibyteDelimiter(TestLineRecordReader.java:684){noformat}
We are happy to provide a patch if this issue is confirmed.