[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597824#comment-14597824
 ] 

Hudson commented on MAPREDUCE-5948:
---

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2183 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2183/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java


 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: MAPREDUCE-5948
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5948
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Assignee: Akira AJISAKA
Priority: Critical
 Fix For: 2.8.0

 Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, 
 HADOOP-9867.patch, MAPREDUCE-5948.002.patch, MAPREDUCE-5948.003.patch


 Defining a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when an input split is cut off just after a record separator. 
 The starting point for the next split is then non-zero and skipFirstLine is 
 true. A seek into the file is done to start - 1 and the text up to the first 
 record delimiter is ignored (on the presumption that this record was already 
 handled by the previous map task). Since the record delimiter is multibyte, 
 the seek only brings the last byte of the delimiter into scope, so it is not 
 recognized as a full delimiter. The text is therefore skipped until the next 
 delimiter, dropping a full record!
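
To make the failure mode concrete, here is a minimal, hypothetical reproduction
sketch (the class name, the "+|" delimiter, and the split placement are invented
for illustration; this is not one of the attached patches or the project's test
code). It writes a file whose records end with a two-byte delimiter, places the
split boundary immediately after the first delimiter, and reads both splits with
the new-API LineRecordReader; the mapred reader named in the summary is patched
alongside it. On an affected release the first record of the second split can be
dropped, while a fixed release reads all three records.

{code:java}
import java.io.File;
import java.io.FileWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class MultibyteDelimiterRepro {
  public static void main(String[] args) throws Exception {
    String delimiter = "+|";
    String data = "record1" + delimiter + "record2" + delimiter + "record3" + delimiter;

    // Write the sample input to a local temporary file.
    File file = File.createTempFile("multibyte-delim", ".txt");
    file.deleteOnExit();
    try (FileWriter writer = new FileWriter(file)) {
      writer.write(data);
    }

    Configuration conf = new Configuration();
    Path path = new Path(file.toURI());
    long fileLength = data.getBytes("UTF-8").length;
    // End the first split exactly after "record1" plus its delimiter (9 bytes),
    // which is the boundary placement described above.
    long firstSplitLength = ("record1" + delimiter).getBytes("UTF-8").length;
    FileSplit[] splits = {
        new FileSplit(path, 0, firstSplitLength, null),
        new FileSplit(path, firstSplitLength, fileLength - firstSplitLength, null)
    };

    int records = 0;
    for (FileSplit split : splits) {
      LineRecordReader reader = new LineRecordReader(delimiter.getBytes("UTF-8"));
      reader.initialize(split, new TaskAttemptContextImpl(conf, new TaskAttemptID()));
      while (reader.nextKeyValue()) {
        System.out.println("record: " + reader.getCurrentValue());
        records++;
      }
      reader.close();
    }
    // A fixed release prints 3 here; an affected one can print 2.
    System.out.println("total records read: " + records);
  }
}
{code}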



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597782#comment-14597782
 ] 

Hudson commented on MAPREDUCE-5948:
---

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #235 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/235/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java




[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597721#comment-14597721
 ] 

Hudson commented on MAPREDUCE-5948:
---

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #226 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/226/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java




[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597707#comment-14597707
 ] 

Hudson commented on MAPREDUCE-5948:
---

FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/CHANGES.txt




[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597533#comment-14597533
 ] 

Hudson commented on MAPREDUCE-5948:
---

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/237/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java




[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597511#comment-14597511
 ] 

Hudson commented on MAPREDUCE-5948:
---

FAILURE: Integrated in Hadoop-Yarn-trunk #967 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/967/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java




[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596736#comment-14596736
 ] 

Hudson commented on MAPREDUCE-5948:
---

FAILURE: Integrated in Hadoop-trunk-Commit #8048 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8048/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle 
multibyte record delimiters well. Contributed by Vinayakumar B, Rushabh Shah, 
and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/CHANGES.txt




[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593600#comment-14593600
 ] 

Jason Lowe commented on MAPREDUCE-5948:
---

Thanks for taking this up, Akira.  Will try to review the patch shortly.

[~Markovich] could you provide more details on the duplicate records with bz2?  
A similar problem was reported in MAPREDUCE-6299 but there are no details to 
work with.  Note that the latest patch will not address any issues with bz2, as 
it only fixes the handling of duplicate records with uncompressed input.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593729#comment-14593729
 ] 

Jason Lowe commented on MAPREDUCE-5948:
---

+1 for the latest patch.  This should resolve the dropped/duplicate problems 
with uncompressed input.  We can tackle the reported duplicate records for bz2 
in MAPREDUCE-6299.

Will commit this early next week if there are no objections.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580875#comment-14580875
 ] 

Hadoop QA commented on MAPREDUCE-5948:
--

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 35s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   7m 32s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 37s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 46s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 15s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  23m 18s | Tests passed in 
hadoop-common. |
| {color:green}+1{color} | mapreduce tests |   1m 43s | Tests passed in 
hadoop-mapreduce-client-core. |
| | |  67m 24s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12738857/MAPREDUCE-5948.003.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c7729ef |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5789/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5789/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5789/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5789/console |


This message was automatically generated.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579825#comment-14579825
 ] 

Hadoop QA commented on MAPREDUCE-5948:
--

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 24s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   7m 47s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 46s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 52s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 39s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 21s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  24m 27s | Tests passed in 
hadoop-common. |
| {color:green}+1{color} | mapreduce tests |   1m 44s | Tests passed in 
hadoop-mapreduce-client-core. |
| | |  70m  2s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12738701/MAPREDUCE-5948.002.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3c2397c |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5787/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5787/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5787/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5787/console |


This message was automatically generated.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-03 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570468#comment-14570468
 ] 

Akira AJISAKA commented on MAPREDUCE-5948:
--

Hi [~vinayrpet], are you willing to take over this issue? If you are not 
interested in the work, I'd like to rebase the patch and address the review 
comments.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-03 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570486#comment-14570486
 ] 

Vinayakumar B commented on MAPREDUCE-5948:
--

Sure, Go ahead :)



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-06-02 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569101#comment-14569101
 ] 

Rushabh S Shah commented on MAPREDUCE-5948:
---

Sorry, this fell off my radar too.
I don't have enough cycles to work on this right now.
We can move this to the next release.
Or, if someone is interested in working on this, I am more than happy to let 
him/her take it over.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-04-24 Thread Rivkin Andrey (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510783#comment-14510783
 ] 

Rivkin Andrey commented on MAPREDUCE-5948:
--

Today I ran into this problem. In bz2 files, a multibyte record delimiter leads 
to duplicate records; in uncompressed files it leads to dropped records. 
Hadoop version 2.5.0.
Also, we can't set hex values, for example \x01, in the Hadoop configuration 
(textinputformat.record.delimiter).
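
As a side note on that last point: the textinputformat.record.delimiter value is
taken as a literal string, so an escape sequence such as \x01 written into an XML
configuration file is not decoded. Below is a tiny illustrative sketch (the class
name is hypothetical) of the usual workaround: set the delimiter from driver code,
where the Java compiler resolves the escape before Hadoop ever sees it.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class DelimiterConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The compiler resolves the escape in the string literal below to a single
    // SOH (0x01) character, so the reader receives a one-byte delimiter rather
    // than the four literal characters of an escape sequence.
    conf.set("textinputformat.record.delimiter", "\u0001");
    System.out.println("delimiter length in chars: "
        + conf.get("textinputformat.record.delimiter").length()); // prints 1
  }
}
{code}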



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2015-04-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510791#comment-14510791
 ] 

Hadoop QA commented on MAPREDUCE-5948:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12652422/HADOOP-9867.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c8d7290 |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5443/console |


This message was automatically generated.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046456#comment-14046456
 ] 

Hadoop QA commented on MAPREDUCE-5948:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12652422/HADOOP-9867.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4695//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4695//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046477#comment-14046477
 ] 

Jason Lowe commented on MAPREDUCE-5948:
---

Moved this to MAPREDUCE since that's where the change should go.  Thanks for the 
patch, Rushabh.  Some comments on the patch:

* LineReader: I'm not sure changing LineReader's byteConsumed semantics will be 
OK.  LineReader is used by things besides MapReduce, so changing the return 
result for some corner cases may be problematic for other downstream projects.  
Rather than rely on the bytes-consumed result from LineReader, it seems we could 
track this using the existing totalBytesRead value.  We simply need to track 
how many bytes we have read from the input stream before we try to read the next 
line.  If we've read past the end of the split (i.e. bytes read > split length), 
then we can set the finished flag.  (A rough sketch of this bookkeeping follows 
after this list.)
* UncompressedSplitLineReader: Given how complicated split processing can be 
and all the corner cases, it'd be nice to have some comments explaining what's 
going on, similar to what's in CompressedSplitLineReader.  Otherwise many will 
wonder why we're going out of our way to stop right at the split boundary when 
we fill the buffer, etc.
* bufferPosition is not really a position in the buffer but a position within 
the split, and in that sense it's similar to totalBytesRead.  I'm not sure we 
can completely combine them, but it'd be nice if their names were more specific 
about what they're tracking (one is the total bytes read from the input stream 
and the other is the total bytes consumed by the line reader).
* {{(inDelimiter && actualBytesRead > 0)}} can just be {{(inDelimiter)}}, 
because we already know actualBytesRead > 0 at that point due to the previous 
condition check.
* The test doesn't clean up the temporary files it creates when it's done.  
(Actually, I just noticed the tests are creating files in the wrong directory; 
it should be something like TestLineRecordReader.class.getName() rather than 
TestTextInputFormat.)
* Nit: whitespace between {{if}} and {{(}}, and lines formatted to 80 columns, 
to follow the coding standard.
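
A rough stand-alone sketch of the bookkeeping suggested in the first point above
(the class is hypothetical and is not the committed UncompressedSplitLineReader):
count every byte pulled from the raw stream and flag the reader as finished once
that count passes the end of the split.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of the suggested bookkeeping, not Hadoop code: the
// reader keeps a running count of bytes taken from the underlying stream and a
// finished flag that flips once the count passes the split length.
public class SplitBoundaryTracker {
  private final long splitLength;
  private long totalBytesRead = 0;
  private boolean finished = false;

  public SplitBoundaryTracker(long splitLength) {
    this.splitLength = splitLength;
  }

  /** Fills the buffer from the stream and updates the running byte count. */
  public int fill(InputStream in, byte[] buffer) throws IOException {
    int bytesRead = in.read(buffer, 0, buffer.length);
    if (bytesRead > 0) {
      totalBytesRead += bytesRead;
    }
    // Reading past the end of the split means the next record belongs to the
    // reader of the following split, so this reader should stop after the
    // record it is currently assembling.
    if (totalBytesRead > splitLength) {
      finished = true;
    }
    return bytesRead;
  }

  public boolean isFinished() {
    return finished;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "record1+|record2+|record3+|".getBytes("UTF-8");
    // Pretend the split ends right after "record1+|" (9 bytes).
    SplitBoundaryTracker tracker = new SplitBoundaryTracker(9);
    InputStream in = new ByteArrayInputStream(data);
    byte[] buffer = new byte[16];
    while (tracker.fill(in, buffer) > 0) {
      System.out.println("finished: " + tracker.isFinished());
    }
  }
}
{code}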

