[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046138#comment-14046138 ] Jason Lowe commented on HADOOP-9867:

Actually I agree with Rushabh that there are at least two somewhat different problems here.

The original problem reported in this JIRA has to do with records being dropped with uncompressed inputs. We should fix that issue so we don't drop data when using an uncompressed input. I'm assuming Rushabh's patch solves that issue, but I haven't looked at it in detail just yet.

There's a second issue related to mistaken record delimiter recognition, where a subsequent split reader can accidentally think it found a delimiter when in fact the real record delimiter is somewhere else. If the subsequent split reader sees 'xxxxyz' at the beginning of its split then it will toss out the first record (i.e.: everything up to and including the first 'xxx') and then read 'xyz' as the next record. That may or may not be the correct behavior, because with that kind of delimiter and data the correct behavior depends upon the _previous_ split's data. If the previous split ended with 'abc' then the behavior was correct and there are two records in the stream: 'abc' and 'xyz'. If the previous split ended with 'abcx' then the behavior is incorrect: the records should be 'abc' and 'xxyz', but the second split reader will report an 'xyz' record that shouldn't exist.

To solve that problem, either a split reader would have to examine the prior split's data to distinguish this case, or the split reader would have to realize it's an ambiguous situation and leave the record processing to the previous split reader to handle. The former can be very expensive if the prior split is compressed, as it potentially has to unpack the entire split. This can also get very tricky, and a reader may need to read more than one other split to resolve it.
For example, if the data stream is 'ax..xxbxx..xcxx' then a reader may have to scan far down into subsequent splits, since only it knows where the true record boundaries are. Simply tacking an extra character onto the beginning of that input changes where the record boundaries are, and the record contents, even in the last split of the input. Solving this requires a different high-level algorithm for split processing than what we have today (i.e.: throw away the first record and go), so I believe that's something better left to a followup JIRA. It'd be nice to solve the dropped-record problem for scenarios where we don't have to worry about mistaken record delimiter recognition in the data, as that's an incremental improvement from where we are today. I'll try to get some time to review the latest patch and provide comments soon.

org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
--
Key: HADOOP-9867
URL: https://issues.apache.org/jira/browse/HADOOP-9867
Project: Hadoop Common
Issue Type: Bug
Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Assignee: Vinayakumar B
Priority: Critical
Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch

Having defined a record delimiter of multiple bytes in a new InputFileFormat sometimes has the effect of skipping records from the input. This happens when the input splits are split off just after a record separator. The starting point for the next split would be non-zero and skipFirstLine would be true. A seek into the file is done to start - 1 and the text until the first record delimiter is ignored (due to the presumption that this record is already handled by the previous map task). Since the record delimiter is multibyte, the seek only got the last byte of the delimiter into scope and it's not recognized as a full delimiter, so the text is skipped until the next delimiter (ignoring a full record!)
-- This message was sent by Atlassian JIRA (v6.2#6252)
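[Editor's note] The ambiguity Jason describes can be illustrated with a small toy sketch (not Hadoop code): with a self-overlapping delimiter such as 'xxx', a single extra byte ahead of a split boundary shifts every subsequent record boundary, and a reader starting mid-stream cannot tell the two streams apart.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not Hadoop code) of the ambiguity described above:
// greedy left-to-right record scanning with the delimiter "xxx".
public class DelimiterAmbiguity {
    public static List<String> records(String data, String delim) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int i = data.indexOf(delim, pos);
            if (i < 0) {                   // no more delimiters: rest is the last record
                out.add(data.substring(pos));
                return out;
            }
            out.add(data.substring(pos, i));
            pos = i + delim.length();      // skip over the delimiter
        }
    }

    public static void main(String[] args) {
        // Previous split ended with "abc": the records are abc / xyz.
        System.out.println(records("abc" + "xxxxyz", "xxx"));   // [abc, xyz]
        // Previous split ended with "abcx" (one extra 'x'): now the
        // records are abc / xxyz, even though the bytes after the split
        // boundary look identical to the reader of the second split.
        System.out.println(records("abcx" + "xxxxyz", "xxx"));  // [abc, xxyz]
    }
}
```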
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044499#comment-14044499 ] Vinayakumar B commented on HADOOP-9867: --- Thanks [~shahrs87] for trying out the patch. I got a test failure when the input string specified in your test is as follows, with separator 'xxx' and split length 46.
{code}
String inputData = "abcxxxdefxxxghixxx" + "jklxxxmnoxxxpqrxxxstuxxxvw yz";
{code}
Can you check again?
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044681#comment-14044681 ] Rushabh S Shah commented on HADOOP-9867: Hey Vinayakumar, thanks for checking out the patch and providing valuable feedback. I did run into this test case while solving this jira. I am going to file another jira for this specific test case (and a couple more that I came across), since the test case you mentioned is not in the scope of this jira.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044691#comment-14044691 ] Vinayakumar B commented on HADOOP-9867: --- I feel this case is also related to this jira. Refer to the example given by Jason in one of the comments above.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043559#comment-14043559 ] Hadoop QA commented on HADOOP-9867: ---
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652422/HADOOP-9867.patch against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4168//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4168//console
This message is automatically generated.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913017#comment-13913017 ] Vinayakumar B commented on HADOOP-9867: --- Hi Jason, I was trying to implement the solution you proposed, but I ran into some issues. If you know the exact changes, could you please provide a patch? Thanks
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844428#comment-13844428 ] Jason Lowe commented on HADOOP-9867: Thanks for updating the patch, Vinay. Comments:

* I don't think LineReader is the best place to put split-specific code. Its sole purpose is to read lines from an input stream regardless of split boundaries, and there are users of this class that are not necessarily processing splits. That's why I created SplitLineReader in MapReduce, and I believe this logic is better placed there.

* I don't think we want to change Math.max(maxBytesToConsume(pos), maxLineLength) to Math.min(maxBytesToConsume(pos), maxLineLength). We need to be able to read a record past the end of the split when the record crosses the split boundary, and I think this change could allow a truncated record to be returned for an uncompressed input stream. E.g.: fillBuffer happens to return data only up to the end of the split, the record is incomplete (no delimiter found), but maxBytesToConsume keeps us from filling the buffer with more data, so a truncated record is returned.

I think a more straightforward approach would be to have SplitLineReader be aware of the end of the split and track it in fillBuffer() much like CompressedSplitLineReader does. The fillBuffer callback already indicates whether we're mid-delimiter or not, so we can simply check whether fillBuffer is being called after the split has ended while we're mid-delimiter. In that case we need an additional record.
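[Editor's note] The mid-delimiter condition Jason describes can be sketched with a simplified, self-contained model (assumed names, not the actual SplitLineReader/fillBuffer API): the reader of a split owes one extra record exactly when some occurrence of the delimiter starts before the split end but finishes at or past it.

```java
// Simplified model (not Hadoop's SplitLineReader): decide whether a reader
// whose split ends at byte offset `end` is mid-delimiter at the boundary,
// i.e. some occurrence of the delimiter starts before `end` but finishes at
// or after it. In that case this split's reader must consume one more record.
public class StraddleCheck {
    public static boolean delimiterStraddlesEnd(String stream, String delim, int end) {
        // Only delimiter occurrences starting in the last delim.length()-1
        // bytes before `end` can straddle the boundary.
        for (int i = Math.max(0, end - delim.length() + 1); i < end; i++) {
            if (stream.startsWith(delim, i)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "abcxxxdef";  // the delimiter "xxx" occupies offsets 3..5
        System.out.println(delimiterStraddlesEnd(s, "xxx", 3)); // false: delimiter starts at the boundary
        System.out.println(delimiterStraddlesEnd(s, "xxx", 4)); // true: split ends inside the delimiter
        System.out.println(delimiterStraddlesEnd(s, "xxx", 6)); // false: delimiter ends exactly at the boundary
    }
}
```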
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843911#comment-13843911 ] Hadoop QA commented on HADOOP-9867: ---
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617969/HADOOP-9867.patch against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3350//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3350//console
This message is automatically generated.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827567#comment-13827567 ] Hadoop QA commented on HADOOP-9867: ---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614864/HADOOP-9867.patch against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient: org.apache.hadoop.mapred.TestJobCleanup
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3302//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3302//console
This message is automatically generated.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827795#comment-13827795 ] Jason Lowe commented on HADOOP-9867: Thanks for the patch, Vinay. I think this approach can work when the input is uncompressed; however, I don't think it will work for block-compressed inputs. Block codecs often report the file position as being the start of the codec block, and it then teleports to the byte position of the next block once the first byte of the next block is consumed. See HADOOP-9622 for a similar issue with the default delimiter and how it's being addressed. Also, getFilePosition() for a compressed input returns a compressed stream offset, so if we try to do math on that with an uncompressed delimiter length we're mixing different units. Since LineRecordReader::getFilePosition() can mean different things for different inputs, I think a better approach would be to change LineReader (not LineRecordReader) so the reported file position for multi-byte custom delimiters is the file position after the record but not including its delimiter. Either that, or wait for HADOOP-9622 to be committed and update the SplitLineReader interface from the HADOOP-9622 patch so the uncompressed input reader would indicate an additional record needs to be read if the split ends mid-delimiter.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828453#comment-13828453 ] Vinay commented on HADOOP-9867: --- Thanks Jason, I prefer waiting for HADOOP-9622 to be committed. Meanwhile I will try to update SplitLineReader offline.
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826729#comment-13826729 ] Jason Lowe commented on HADOOP-9867: Ran across this JIRA while discussing the intricacies of HADOOP-9622. There's a relatively straightforward testcase that demonstrates the issue. With the following plaintext input

{code:title=customdeliminput.txt}
abcxxx
defxxx
ghixxx
jklxxx
mnoxxx
pqrxxx
stuxxx
vw xxx
xyzxxx
{code}

run a wordcount job like this:

{noformat}
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount -Dmapreduce.input.fileinputformat.split.maxsize=33 -Dtextinputformat.record.delimiter=xxx customdeliminput.txt wcout
{noformat}

and we can see that one of the records was dropped due to incorrect split processing (the 'pqr' record is missing):

{noformat}
$ hadoop fs -cat wcout/part-r-0
abc 1
def 1
ghi 1
jkl 1
mno 1
stu 1
vw 1
xyz 1
{noformat}

I don't think rewinding the seek position by the delimiter length is correct in all cases. I believe that will lead to duplicate records rather than dropped records (e.g.: the split ends exactly when a delimiter ends, and both splits end up processing the record after that delimiter). Instead we can get correct behavior by treating any split that falls in the middle of a multibyte custom delimiter as if the delimiter ended exactly at the end of the split, i.e.: the consumer of the prior split is responsible for processing the divided delimiter and the subsequent record. The consumer of the next split then tosses the first record up to the first full delimiter as usual (i.e.: including the partial delimiter at the beginning of the split) and proceeds to process any subsequent records. That way we get neither dropped nor duplicate records. I think one way of accomplishing this is to have the LineReader for multibyte custom delimiters report the current position as the end of the record data *without* the delimiter bytes.
Then any record that ends exactly at the end of the split, or whose delimiter straddles the split boundary, will cause the prior split's reader to consume the extra record necessary.
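[Editor's note] The position-reporting rule Jason proposes can be modeled with a toy reader (an assumed simplification, not the real LineRecordReader): report the position after each record as the end of the record data excluding the delimiter, discard through the first full delimiter when start != 0, and keep reading while the reported position is still inside the split. Under that rule each record lands in exactly one split, wherever the boundary falls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the proposed rule (not the real LineRecordReader): the
// reported position after a record is the end of its data, excluding the
// delimiter bytes. A record whose delimiter straddles the split boundary
// is therefore consumed by the prior split, while the next split's reader
// discards everything up to and including its first full delimiter.
public class SplitRule {
    public static List<String> readSplit(String stream, String delim, int start, int end) {
        List<String> out = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            int i = stream.indexOf(delim, start);   // discard through the first full delimiter
            if (i < 0) return out;
            pos = i + delim.length();
        }
        int reported = pos;
        while (reported < end && pos < stream.length()) {
            int i = stream.indexOf(delim, pos);
            if (i < 0) {                            // last record, no trailing delimiter
                out.add(stream.substring(pos));
                pos = reported = stream.length();
            } else {
                out.add(stream.substring(pos, i));
                reported = i;                       // position excludes the delimiter
                pos = i + delim.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "abcxxxdefxxxghixxxjklxxxmno";
        List<String> expected = Arrays.asList("abc", "def", "ghi", "jkl", "mno");
        // Wherever the boundary falls, the two splits together yield each
        // record exactly once: no drops, no duplicates.
        for (int cut = 1; cut < s.length(); cut++) {
            List<String> all = new ArrayList<>(readSplit(s, "xxx", 0, cut));
            all.addAll(readSplit(s, "xxx", cut, s.length()));
            if (!all.equals(expected)) throw new AssertionError("cut=" + cut + " -> " + all);
        }
        System.out.println("all split points OK");
    }
}
```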
[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
[ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738145#comment-13738145 ] Kris Geusebroek commented on HADOOP-9867: - I created a fix by adding the following code:

{code}
} else {
  if (start != 0) {
    skipFirstLine = true;
+   for (int i = 0; i < recordDelimiter.length; i++) {
      --start;
+   }
    fileIn.seek(start);
  }
{code}

Currently I'm testing this with a custom created subclass of LineRecordReader. If testing is OK, I'm willing to create a patch file if needed.
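[Editor's note] Jason's concern above about this rewind approach can be shown with a toy model (an assumed simplification, not actual LineRecordReader code): when a split boundary falls exactly at the end of a delimiter, both the prior split's reader and the rewound reader see that delimiter, so both emit the record after it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not actual LineRecordReader code) of why unconditionally
// rewinding the split start by the delimiter length can duplicate a record:
// when a boundary falls exactly at the end of a delimiter, the prior split's
// reader and the rewound reader both emit the record that follows it.
public class RewindDuplicates {
    public static List<String> readSplit(String stream, String delim, int start, int end) {
        List<String> out = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            pos = Math.max(0, start - delim.length()); // the proposed rewind
            int i = stream.indexOf(delim, pos);        // then toss the "first" record
            if (i < 0) return out;
            pos = i + delim.length();
        }
        // Here the position includes the delimiter bytes, as in the original
        // code; a reader keeps going while it has not passed the split end.
        while (pos <= end && pos < stream.length()) {
            int i = stream.indexOf(delim, pos);
            if (i < 0) {
                out.add(stream.substring(pos));
                pos = stream.length();
            } else {
                out.add(stream.substring(pos, i));
                pos = i + delim.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "abcxxxdefxxxghi";
        // The boundary at offset 6 is exactly the end of the first delimiter.
        System.out.println(readSplit(s, "xxx", 0, 6));   // [abc, def]
        System.out.println(readSplit(s, "xxx", 6, 15));  // [def, ghi] -- 'def' is duplicated
    }
}
```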