[jira] [Commented] (HADOOP-13064) LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader
[ https://issues.apache.org/jira/browse/HADOOP-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264777#comment-15264777 ]

Andrew Ash commented on HADOOP-13064:
-------------------------------------

[~jellis] those two do look pretty related -- were you testing with version 2.7.1 by chance? Can you check whether your test passes in 2.7.2, which contains fixes for both of those tickets?

> LineReader reports incorrect number of bytes read, resulting in correctness
> issues using LineRecordReader
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13064
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13064
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Joe Ellis
>            Priority: Critical
>         Attachments: LineReaderTest.java
>
> The specific issue we were seeing with LineReader is that when we pass in
> '\r\n' as the line delimiter, the number of bytes it claims to have read
> is less than what it actually read. We narrowed this down to happening only
> when the delimiter is split across the internal buffer boundary: if
> fillBuffer fills with "row\r" and the next call fills with "\n", then the
> number of bytes reported is 4 rather than 5.
> This causes correctness issues in LineRecordReader, because if this
> off-by-one issue is hit enough times while reading a split, the reader
> will continue to read records past its split boundary, and records will
> appear to come from multiple splits.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
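The buffer-boundary accounting the reporter describes can be sketched in plain Java. This is a minimal illustrative reader, not Hadoop's actual LineReader; the class and method names below are hypothetical. It demonstrates the invariant the ticket asks for: the returned byte count must include delimiter bytes even when the delimiter straddles two buffer fills.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal sketch (NOT Hadoop's LineReader) of reading records with a
// multi-byte delimiter through a small internal buffer. The key point
// is that `consumed` is incremented for every byte pulled from the
// stream, regardless of which buffer fill it came from, so a "\r\n"
// split as "row\r" / "\n" still reports 5 bytes, not 4.
public class DelimitedReader {
    private final InputStream in;
    private final byte[] buffer;
    private int bufferLength = 0; // bytes currently in the buffer
    private int bufferPos = 0;    // index of the next byte to consume

    public DelimitedReader(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    // Reads one record terminated by `delimiter` into `out`; returns the
    // total number of bytes consumed, INCLUDING the delimiter bytes.
    public int readLine(StringBuilder out, byte[] delimiter) throws IOException {
        int consumed = 0; // every byte pulled from the stream
        int matched = 0;  // how many delimiter bytes have matched so far
        while (true) {
            if (bufferPos >= bufferLength) {
                // Refill; a partial delimiter match survives across fills.
                bufferLength = in.read(buffer);
                bufferPos = 0;
                if (bufferLength <= 0) {
                    return consumed; // EOF
                }
            }
            byte b = buffer[bufferPos++];
            consumed++; // counted no matter which fill the byte came from
            if (b == delimiter[matched]) {
                matched++;
                if (matched == delimiter.length) {
                    return consumed; // count includes the full delimiter
                }
            } else {
                // False start: flush the partially matched prefix as data.
                for (int i = 0; i < matched; i++) {
                    out.append((char) delimiter[i]);
                }
                matched = 0;
                if (b == delimiter[0]) {
                    matched = 1;
                } else {
                    out.append((char) b);
                }
            }
        }
    }
}
```

With a 4-byte buffer and input "row\r\n…", the first fill ends exactly on the "\r", reproducing the boundary case from the report while still returning 5.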
[jira] [Commented] (HADOOP-10614) CBZip2InputStream is not threadsafe
[ https://issues.apache.org/jira/browse/HADOOP-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057742#comment-14057742 ]

Andrew Ash commented on HADOOP-10614:
-------------------------------------

Cool, thanks!

> CBZip2InputStream is not threadsafe
> -----------------------------------
>
>                 Key: HADOOP-10614
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10614
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 1.2.1, 2.2.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>             Fix For: 1.3.0, 2.5.0
>         Attachments: bzip2-2.diff, bzip2.diff
>
> Hadoop uses CBZip2InputStream to decode bzip2 files. However, the
> implementation is not threadsafe. This is not really a problem for Hadoop
> MapReduce, because Hadoop runs each task in a separate JVM. But for other
> libraries that use multithreading on top of Hadoop's InputFormat, e.g.
> Spark, it causes exceptions like the following:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 6
>         org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:729)
>         org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:795)
>         org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
>         org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
>         org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
>         org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:428)
>         java.io.InputStream.read(InputStream.java:101)
>         org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
>         org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
>         org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
>         org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
>         org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
>         org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
>         org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>         org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35)
>         scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1000)
>         org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>         org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>         org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
>         org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         org.apache.spark.scheduler.Task.run(Task.scala:51)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:724)
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
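As a hedged sketch of a consumer-side workaround (the attached bzip2.diff patches fix CBZip2InputStream itself; nothing in the ticket prescribes this wrapper), a multithreaded caller that must share one non-threadsafe stream instance can serialize all reads on a single lock. Note the caveat in the comments: this protects only per-instance state of the one wrapped stream, so it would not help if the races go through static fields, which is why the real fix belongs in the class.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative workaround, not the committed Hadoop fix: serialize every
// read on one shared, non-threadsafe InputStream instance behind a lock.
// Caveat: this guards only the wrapped instance's own mutable state; it
// cannot repair races through static fields inside the stream class, and
// interleaved reads from multiple threads still see the data split
// arbitrarily between them.
public class SynchronizedInputStream extends FilterInputStream {
    private final Object lock = new Object();

    public SynchronizedInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        synchronized (lock) {
            return in.read();
        }
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        synchronized (lock) {
            return in.read(b, off, len);
        }
    }

    @Override
    public void close() throws IOException {
        synchronized (lock) {
            in.close();
        }
    }
}
```

For the Spark scenario in the trace, the cleaner design is still one decompressor stream per task thread plus the per-instance fix from the attached patches; the wrapper is only a stopgap for code that cannot change the stream class.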
[jira] [Commented] (HADOOP-10614) CBZip2InputStream is not threadsafe
[ https://issues.apache.org/jira/browse/HADOOP-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055398#comment-14055398 ]

Andrew Ash commented on HADOOP-10614:
-------------------------------------

I can't quite tell -- are these Hudson failures a problem?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (HADOOP-10614) CBZip2InputStream is not threadsafe
[ https://issues.apache.org/jira/browse/HADOOP-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002747#comment-14002747 ]

Andrew Ash commented on HADOOP-10614:
-------------------------------------

Thanks Sandy and Xiangrui for fixing this issue!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (HADOOP-6842) hadoop fs -text does not give a useful text representation of MapWritable objects
[ https://issues.apache.org/jira/browse/HADOOP-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626986#comment-13626986 ]

Andrew Ash commented on HADOOP-6842:
------------------------------------

Looks like nothing ever came of this? I'd appreciate a nicer toString() on MapWritable too.

> hadoop fs -text does not give a useful text representation of MapWritable
> objects
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-6842
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6842
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 0.20.0
>            Reporter: Steven Wong
>
> If a sequence file contains MapWritable objects, running hadoop fs -text
> on the file prints the following for each MapWritable:
> org.apache.hadoop.io.MapWritable@4f8235ed
> To be more useful, it should print out the contents of the map instead.
> This can be done by adding a toString method to MapWritable, i.e.
> something like:
> {code}
> public String toString() {
>     return new TreeMap<Writable, Writable>(instance).toString();
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
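The proposed toString can be illustrated without a Hadoop dependency using a stand-in class (PrintableMap and its String keys below are hypothetical; real MapWritable entries are Writable, and copying them into a TreeMap would additionally require the keys to be comparable, e.g. WritableComparable). The point is the same as in the ticket: copying the backing map into a TreeMap replaces the useless identity string with sorted, readable contents.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the proposed MapWritable.toString(): without an
// override, printing falls back to Object.toString() and yields something
// like "org.apache.hadoop.io.MapWritable@4f8235ed". Copying the backing
// map into a TreeMap prints the entries in sorted key order instead.
public class PrintableMap {
    // `instance` mirrors the backing-map field name used in the ticket's
    // snippet; the String/String types are a simplification for this sketch.
    private final Map<String, String> instance = new HashMap<>();

    public void put(String key, String value) {
        instance.put(key, value);
    }

    @Override
    public String toString() {
        // TreeMap inherits AbstractMap.toString(), producing "{k1=v1, k2=v2}"
        // with keys in sorted order.
        return new TreeMap<>(instance).toString();
    }
}
```

Sorting through TreeMap also makes the output deterministic, which matters for tools like `hadoop fs -text` whose output may be diffed or grepped.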