Re: Reading from .bz2 files with Spark

2014-05-19 Thread Andrew Ash
Hi Xiangrui, many thanks to you and Sandy for fixing this issue! On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, I submitted a patch and verified it solves the problem. You can download the patch from https://issues.apache.org/jira/browse/HADOOP-10614 .

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andrew, Could you try varying the minPartitions parameter? For example: val r = sc.textFile("/user/aa/myfile.bz2", 4).count val r = sc.textFile("/user/aa/myfile.bz2", 8).count Best, Xiangrui On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng men...@gmail.com wrote: Which Hadoop version did you
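The suggestion above, varying minPartitions and re-running the count, could be scripted roughly as follows. This is a minimal sketch and not code from the thread: the object name Bz2PartitionCheck and the partition values 1 and 2 are illustrative assumptions, the path is the one quoted in the message, and the master URL is assumed to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object; the name is not from the thread.
object Bz2PartitionCheck {
  def main(args: Array[String]): Unit = {
    // Path quoted in the thread; pass a different .bz2 path as the first argument.
    val path = if (args.nonEmpty) args(0) else "/user/aa/myfile.bz2"

    // Assumes the master URL is provided externally (e.g. by spark-submit).
    val sc = new SparkContext(new SparkConf().setAppName("Bz2PartitionCheck"))
    try {
      // 4 and 8 come from the message; 1 and 2 are added here for comparison.
      for (minPartitions <- Seq(1, 2, 4, 8)) {
        val n = sc.textFile(path, minPartitions).count()
        println(s"minPartitions=$minPartitions count=$n")
      }
    } finally {
      sc.stop()
    }
  }
}
```

If the counts differ between runs, or some runs fail while others succeed, that would be consistent with the problem depending on how the file is split and read in parallel.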

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Andre Bois-Crettez
We never saw your exception when reading bzip2 files with Spark. But when we wrongly compiled Spark against an older version of Hadoop (which was the default in Spark), we ended up with sequential reading of bzip2 files, not taking advantage of block splits to work in parallel. Once we compiled Spark with

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Andrew Ash
Hi Xiangrui, // FYI I'm getting your emails late due to the Apache mailing list outage. I'm using CDH4.4.0, which I think uses the MapReduce v2 API. The .jars are named like this: hadoop-hdfs-2.0.0-cdh4.4.0.jar I'm also glad you were able to reproduce! Please paste a link to the Hadoop bug you

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andrew, I verified that this is due to thread safety. I changed SPARK_WORKER_CORES to 1 in spark-env.sh, so there is only 1 thread per worker. Then I can load the file without any problem with different values of minPartitions. I will submit a JIRA to both Spark and Hadoop. Best, Xiangrui On

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andrew, This is the JIRA I created: https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully someone wants to work on it. Best, Xiangrui On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andre, I could reproduce the bug with Hadoop 2.2.0. Some older version

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Xiangrui Meng
Hi Andre, I could reproduce the bug with Hadoop 2.2.0. Some older versions of Hadoop do not support splittable compression, so you ended up with sequential reads. It is easy to reproduce the bug with the following setup: 1) Workers are configured with multiple cores. 2) BZip2 files are big enough

Re: Reading from .bz2 files with Spark

2014-05-13 Thread Xiangrui Meng
Which Hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to the bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in