[jira] [Commented] (HADOOP-13064) LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader

2016-04-29 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264777#comment-15264777
 ] 

Andrew Ash commented on HADOOP-13064:
-

[~jellis] those two do look pretty related -- were you testing with version 
2.7.1 by chance?  Can you check if your test passes in 2.7.2 which contains 
fixes for both those tickets?

> LineReader reports incorrect number of bytes read resulting in correctness 
> issues using LineRecordReader
> 
>
> Key: HADOOP-13064
> URL: https://issues.apache.org/jira/browse/HADOOP-13064
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Joe Ellis
>Priority: Critical
> Attachments: LineReaderTest.java
>
>
> The specific issue we were seeing with LineReader is that when we pass in 
> '\r\n' as the line delimiter, the number of bytes that it claims to have read 
> is less than what it actually read. We narrowed this down to only happening 
> when the delimiter is split across the internal buffer boundary, so if 
> fillBuffer fills with "row\r" and the next call fills with "\n" then the 
> number of bytes reported would be 4 rather than 5.
> This results in correctness issues in LineRecordReader because if this 
> off-by-one issue is seen enough times when reading a split, the reader will 
> continue past its split boundary, resulting in records appearing to come 
> from multiple splits.
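The off-by-one described above can be reproduced in miniature without Hadoop. Below is an illustrative reader (not Hadoop's actual LineReader) that scans for a multi-byte delimiter using a deliberately tiny 4-byte internal buffer, so "row\r" lands in one fill and "\n" in the next; a correct implementation must still report 5 bytes consumed, not 4:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch, NOT Hadoop's LineReader: reads one record terminated
// by a multi-byte delimiter through a small internal buffer, and returns the
// total bytes consumed including the delimiter, even when the delimiter
// straddles a buffer refill.
public class DelimReader {
    static int readRecord(InputStream in, byte[] delim, StringBuilder out)
            throws IOException {
        byte[] buf = new byte[4];   // tiny buffer to force the split
        int matched = 0;            // delimiter bytes matched so far
        int consumed = 0;           // counts every byte as it is processed
        int len;
        while ((len = in.read(buf)) > 0) {
            for (int i = 0; i < len; i++) {
                consumed++;
                if (buf[i] == delim[matched]) {
                    if (++matched == delim.length) return consumed;
                } else {
                    // false start: flush the partial delimiter into the
                    // record, then re-test the current byte against delim[0]
                    for (int j = 0; j < matched; j++) out.append((char) delim[j]);
                    matched = 0;
                    if (buf[i] == delim[0]) matched = 1;
                    else out.append((char) buf[i]);
                }
            }
        }
        return consumed;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("row\r\nnext".getBytes("UTF-8"));
        StringBuilder record = new StringBuilder();
        int n = readRecord(in, "\r\n".getBytes("UTF-8"), record);
        // the 4-byte buffer puts "row\r" in the first fill and "\n" in the
        // second; a correct reader still reports 5 bytes consumed, not 4
        System.out.println(record + " " + n);
    }
}
```

The buggy behavior arises when consumed bytes are recomputed from offsets within the current buffer, which forgets delimiter bytes matched in the previous fill; counting each byte as it is processed, as the sketch does, avoids that.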



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-10614) CBZip2InputStream is not threadsafe

2014-07-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057742#comment-14057742
 ] 

Andrew Ash commented on HADOOP-10614:
-

Cool thanks

 CBZip2InputStream is not threadsafe
 ---

 Key: HADOOP-10614
 URL: https://issues.apache.org/jira/browse/HADOOP-10614
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 1.2.1, 2.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.0, 2.5.0

 Attachments: bzip2-2.diff, bzip2.diff


 Hadoop uses CBZip2InputStream to decode bzip2 files. However, the 
 implementation is not threadsafe. This is not really a problem for Hadoop 
 MapReduce because Hadoop runs each task in a separate JVM. But for other 
 libraries that utilize multithreading and use Hadoop's InputFormat, e.g., 
 Spark, it will cause exceptions like the following:
 {code}
 java.lang.ArrayIndexOutOfBoundsException: 6 
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:729)
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:795)
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
  
 org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
  
 org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:428)
  java.io.InputStream.read(InputStream.java:101) 
 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205) 
 org.apache.hadoop.util.LineReader.readLine(LineReader.java:169) 
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176) 
 org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43) 
 org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198) 
 org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181) 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35)
  scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1000) 
 org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847) 
 org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847) 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
  
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1077)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) 
 org.apache.spark.scheduler.Task.run(Task.scala:51) 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:724)
 {code}
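For context on the failure mode: when a stream object with unsynchronized mutable state is read from several threads, interleaved updates to its internal indices produce exactly this kind of ArrayIndexOutOfBoundsException. A generic workaround, when a single non-threadsafe stream must be shared, is to serialize all reads behind one lock. This is only a sketch of that general pattern, not the fix that shipped for this ticket (which changed CBZip2InputStream itself):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Generic workaround sketch: serialize all reads on a non-threadsafe stream
// behind a single lock. Trades throughput for safety; it is NOT the actual
// HADOOP-10614 patch, which addressed the shared state inside the codec.
public class SynchronizedInputStream extends FilterInputStream {
    private final Object lock = new Object();

    public SynchronizedInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        synchronized (lock) { return in.read(); }
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        synchronized (lock) { return in.read(b, off, len); }
    }

    public static void main(String[] args) throws IOException {
        InputStream s = new SynchronizedInputStream(
                new ByteArrayInputStream("block".getBytes("UTF-8")));
        byte[] out = new byte[5];
        int n = s.read(out, 0, 5);
        System.out.println(n + " " + new String(out, "UTF-8"));
    }
}
```

Note that a per-object lock only helps if all sharing goes through the wrapper; the cleaner fix, as done here, is to make each stream instance self-contained so every thread can simply own its own stream.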



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HADOOP-10614) CBZip2InputStream is not threadsafe

2014-07-08 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055398#comment-14055398
 ] 

Andrew Ash commented on HADOOP-10614:
-

I can't quite tell -- are these Hudson failures a problem?






[jira] [Commented] (HADOOP-10614) CBZip2InputStream is not threadsafe

2014-05-19 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002747#comment-14002747
 ] 

Andrew Ash commented on HADOOP-10614:
-

Thanks Sandy and Xiangrui for fixing this issue!






[jira] [Commented] (HADOOP-6842) hadoop fs -text does not give a useful text representation of MapWritable objects

2013-04-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626986#comment-13626986
 ] 

Andrew Ash commented on HADOOP-6842:


Looks like nothing ever came of this? I'd appreciate a nicer toString() on 
MapWritable too.

 hadoop fs -text does not give a useful text representation of MapWritable 
 objects
 ---

 Key: HADOOP-6842
 URL: https://issues.apache.org/jira/browse/HADOOP-6842
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 0.20.0
Reporter: Steven Wong

 If a sequence file contains MapWritable objects, running hadoop fs -text on 
 the file prints the following for each MapWritable:
 org.apache.hadoop.io.MapWritable@4f8235ed
 To be more useful, it should print out the contents of the map instead. This 
 can be done by adding a toString method to MapWritable, i.e. something like:
 {code}
 public String toString() {
   return (new TreeMap<Writable, Writable>(instance)).toString();
 }
 {code}
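The idea of the suggestion above can be sketched with plain java.util types, since MapWritable and Writable need Hadoop on the classpath. The `OpaqueMap` wrapper below is a hypothetical stand-in for a map-like class whose default toString() prints only `ClassName@hashcode`; the fix delegates to a sorted copy of the contents:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in for the proposed MapWritable.toString(): print the
// map's contents (sorted for stable output) instead of the default
// Object.toString() "ClassName@hashcode" form.
public class ContentsToString {
    // Hypothetical wrapper mimicking a map type with no useful toString
    static class OpaqueMap {
        final Map<String, String> instance = new LinkedHashMap<>();

        @Override public String toString() {
            // the proposed fix: delegate to a sorted copy of the contents
            return new TreeMap<>(instance).toString();
        }
    }

    public static void main(String[] args) {
        OpaqueMap m = new OpaqueMap();
        m.instance.put("b", "2");
        m.instance.put("a", "1");
        System.out.println(m);   // contents, not an @hashcode
    }
}
```

One caveat with the original snippet: TreeMap requires comparable keys, and Writable does not implement Comparable, so the real patch would need a Comparator (or a WritableComparator) for the sort to work.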

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira