Dustin Cote created MAPREDUCE-6549:
--------------------------------------

             Summary: multibyte delimiters with LineRecordReader cause duplicate records
                 Key: MAPREDUCE-6549
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.7.2
            Reporter: Dustin Cote
            Assignee: Dustin Cote

LineRecordReader currently produces duplicate records when a multibyte delimiter is used with certain split sizes. Test scenarios:

1) input string: "abc+++def++ghi++"
   delimiter string: "+++"
   test passes with all sizes of the split
2) input string: "abc++def+++ghi++"
   delimiter string: "+++"
   test fails with a split size of 4
3) input string: "abc+++def++ghi++"
   delimiter string: "++"
   test fails with a split size of 5
4) input string: "abc+++defg++hij++"
   delimiter string: "++"
   test fails with a split size of 4
5) input string: "abc++def+++ghi++"
   delimiter string: "++"
   test fails with a split size of 9
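
To make the scenarios easier to inspect, here is a minimal standalone sketch. It does not use Hadoop and does not reproduce the bug. It only shows how an input falls into fixed-size pieces and which records a plain left-to-right scan for the delimiter would yield over the whole input. The class SplitSketch and its methods are hypothetical names introduced for illustration, and the sketch assumes ASCII input so that character offsets equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
class SplitSketch {

    // Records obtained by cutting the input at each literal occurrence of
    // the delimiter, scanning left to right. Assumed here as the reference
    // result against which per-split output can be compared.
    static List<String> expectedRecords(String input, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = input.indexOf(delimiter, start)) >= 0) {
            records.add(input.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < input.length()) {
            // Trailing data with no closing delimiter is still a record.
            records.add(input.substring(start));
        }
        return records;
    }

    // Fixed-size pieces of the input, standing in for input splits.
    static List<String> chunks(String input, int splitSize) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < input.length(); i += splitSize) {
            parts.add(input.substring(i, Math.min(input.length(), i + splitSize)));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Scenario 2: delimiter "+++", split size 4.
        String input = "abc++def+++ghi++";
        System.out.println(chunks(input, 4));              // [abc+, +def, +++g, hi++]
        System.out.println(expectedRecords(input, "+++")); // [abc++def, ghi++]
    }
}
```

In scenario 2 the first piece boundary falls in the middle of the "++" run that follows "abc". That run is a partial match of the "+++" delimiter, so this may be the kind of boundary a fix needs to handle.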