You probably want to update this line as follows:

    lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)
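
As a rough illustration of why the repartition matters: a gzip stream can only be decoded from its beginning, so a reader cannot start at an arbitrary offset in the file, and a single .gz file ends up in a single partition. The sketch below uses only the Python standard library, not Spark; the helper name can_decompress_from and the sample payload are made up for the illustration.

```python
import gzip
import zlib


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start of the input.
        zlib.decompressobj(wbits=31).decompress(data[offset:])
        return True
    except zlib.error:
        return False


# Hypothetical sample data standing in for the lines of a large text file.
payload = b"".join(b"line %d\n" % i for i in range(100000))
compressed = gzip.compress(payload)

# Decoding from the first byte works; decoding from the midpoint does not,
# because there is no gzip header (and no decoder state) at that offset.
from_start = can_decompress_from(compressed, 0)
from_middle = can_decompress_from(compressed, len(compressed) // 2)
print(from_start, from_middle)
```

Since the whole file lands in one partition, calling repartition right after textFile is what spreads the later stages across the cluster.
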
For more details on why, see this answer
<http://stackoverflow.com/a/27631722/877069>.

Nick

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier <mps....@gmail.com> wrote:

> > 1. I don't think textFile is capable of unpacking a .gz file. You need to
> > use hadoopFile or newAPIHadoop file for this.
>
> Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do
> is compute splits on gz files, so if you have a single file, you'll have a
> single partition.
>
> Processing 30 GB of gzipped data should not take that long, at least with
> the Scala API. Python not sure, especially under 1.2.1.