Hi Xiangrui, many thanks to you and Sandy for fixing this issue!
On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Andrew,
I submitted a patch and verified it solves the problem. You can
download the patch from
https://issues.apache.org/jira/browse/HADOOP-10614 .
Hi Andrew,
Could you try varying the minPartitions parameter? For example:
val r = sc.textFile("/user/aa/myfile.bz2", 4).count
val r = sc.textFile("/user/aa/myfile.bz2", 8).count
Best,
Xiangrui
On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng men...@gmail.com wrote:
Which Hadoop version did you use?
We never saw your exception when reading bzip2 files with Spark.
But when we mistakenly compiled Spark against an older version of Hadoop
(which was the default in Spark), we ended up with sequential reads of the
bzip2 file, not taking advantage of block splits to read in parallel.
Once we compiled Spark with
Hi Xiangrui,
// FYI I'm getting your emails late due to the Apache mailing list outage
I'm using CDH4.4.0, which I think uses the MapReduce v2 API. The .jars are
named like this: hadoop-hdfs-2.0.0-cdh4.4.0.jar
I'm also glad you were able to reproduce! Please paste a link to the
Hadoop bug you filed.
Hi Andrew,
I verified that this is due to a thread-safety issue. I changed
SPARK_WORKER_CORES to 1 in spark-env.sh, so there is only 1 thread per
worker. Then I can load the file without any problem with different
values of minPartitions. I will submit a JIRA to both Spark and
Hadoop.
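
The workaround described above can be sketched as a config fragment (an untested sketch: it assumes a standalone deployment where each worker reads conf/spark-env.sh):

```shell
# conf/spark-env.sh -- workaround sketch, not a fix:
# cap each worker at one core so only a single task (thread) decompresses
# the bzip2 file at a time, sidestepping the thread-safety bug.
export SPARK_WORKER_CORES=1
```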
Best,
Xiangrui
Hi Andrew,
This is the JIRA I created:
https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully
someone wants to work on it.
Best,
Xiangrui
On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Andrew,
I could reproduce the bug with Hadoop 2.2.0. Some older versions of
Hadoop do not support splittable compression, so you ended up with
sequential reads. It is easy to reproduce the bug with the following
setup:
1) Workers are configured with multiple cores.
2) The bzip2 files are big enough to be split across multiple tasks.
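
The setup above can be sketched in a few lines (a minimal, untested sketch: it assumes a running SparkContext `sc`, workers configured with several cores, and a hypothetical large file at /user/aa/myfile.bz2):

```scala
// Repro sketch: read a large bzip2 file with several minPartitions values.
// With multiple cores per worker, concurrent tasks hit the shared,
// non-thread-safe bzip2 decompressor and the job fails; with
// SPARK_WORKER_CORES=1 the same counts succeed.
for (n <- Seq(4, 8, 16)) {
  val count = sc.textFile("/user/aa/myfile.bz2", n).count()
  println(s"minPartitions = $n -> $count lines")
}
```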
Which Hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to the bzip2
format. -Xiangrui
On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote:
Hi all,
Is anyone reading and writing to .bz2 files stored in HDFS?