Re: Processing of text file in large gzip archive

Nicholas Chammas Mon, 16 Mar 2015 07:47:57 -0700

You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)


For more details on why, see this answer
<http://stackoverflow.com/a/27631722/877069>.
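
To illustrate the underlying reason (a gzip stream can only be decoded from its beginning, so a single .gz file arrives as one partition until you repartition it), here is a minimal sketch using only the Python standard library. The sample data and the helper name try_decompress_from are made up for illustration; this is not Spark code and not how Spark itself reads the file.

```python
import gzip
import zlib

# Illustrative in-memory payload (hypothetical data, not from this thread).
text = "".join(f"line {i}\n" for i in range(10000)).encode()
blob = gzip.compress(text)


def try_decompress_from(offset):
    """Return True if the gzip bytes can be decoded starting at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No valid gzip header or usable deflate state at this position.
        return False


# Reading from the first byte works; reading from the middle does not,
# which is why a reader cannot be handed an arbitrary slice of the file.
from_start = try_decompress_from(0)
from_middle = try_decompress_from(len(blob) // 2)
print("from start:", from_start, "from middle:", from_middle)
```

Because no reader can start partway through the file, the whole archive has to be decompressed by one task first; the repartition call above then spreads the resulting lines across the cluster for the later stages.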

Nick


On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier <mps....@gmail.com> wrote:

> 1. I don't think textFile is capable of unpacking a .gz file. You need to
> use hadoopFile or newAPIHadoopFile for this.
>
>
> Sorry, that's incorrect: textFile works fine on .gz files. What it can't do
> is compute splits on .gz files, so if you have a single file, you'll have a
> single partition.
>
> Processing 30 GB of gzipped data should not take that long, at least with
> the Scala API. I'm not sure about Python, especially under 1.2.1.
>
>
