You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer
<http://stackoverflow.com/a/27631722/877069>.
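
In short, a gzip stream can only be decoded from its beginning, so a single .gz file cannot be divided among tasks at read time and ends up as one partition until you repartition it. Below is a minimal sketch of that property using only the Python standard library. It is an illustration, not how Spark or Hadoop read files, and the sample data and the try_decompress helper are made up for the example.

```python
import gzip
import zlib

# Build a small gzipped payload in memory (a stand-in for file.gz).
raw = b"".join(b"line %d\n" % i for i in range(10000))
compressed = gzip.compress(raw)


def try_decompress(chunk):
    """Return the bytes decoded from chunk, or None if it is not a gzip stream."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start of the data.
        return zlib.decompressobj(31).decompress(chunk)
    except zlib.error:
        return None


mid = len(compressed) // 2

# Reading from the start works, even when the stream is cut short.
from_start = try_decompress(compressed[:mid])

# Starting in the middle fails: there is no header to begin decoding from.
from_middle = try_decompress(compressed[mid:])

print("decoded from start:", len(from_start), "bytes")
print("decoded from middle:", from_middle)
```

A reader that begins at an arbitrary byte offset has nothing to decode from, which is why the data has to be read by one task first and spread out afterwards.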

Nick

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier <mps....@gmail.com> wrote:

> 1. I don't think textFile is capable of unpacking a .gz file. You need to
> use hadoopFile or newAPIHadoopFile for this.
>
>
> Sorry, that’s incorrect: textFile works fine on .gz files. What it can’t do
> is compute splits on .gz files, so if you have a single file, you’ll have a
> single partition.
>
> Processing 30 GB of gzipped data should not take that long, at least with
> the Scala API. I'm not sure about Python, especially under 1.2.1.
>
>
