Re: Processing of text file in large gzip archive

Marius Soutier Mon, 16 Mar 2015 03:50:40 -0700

> 1. I don't think textFile is capable of unpacking a .gz file. You need to use 
> hadoopFile or newAPIHadoop file for this.


Sorry, that’s incorrect: textFile works fine on .gz files. What it can’t do is 
compute splits on gzip files, so if you have a single file, you'll have a single 
partition.
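
To illustrate why splits can't be computed, here is a minimal sketch using only the Python standard library (no Spark involved). The helper name try_decompress_from and the in-memory sample data are made up for this example. It shows that a gzip stream decodes from its start but not from an arbitrary offset in the middle, which is why a reader can't begin partway through the file.

```python
import gzip
import zlib

# Build a small gzip payload in memory as a stand-in for a large .gz file.
raw = b"".join(f"record {i}\n".encode() for i in range(10000))
compressed = gzip.compress(raw)


def try_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip stream can be decoded starting at `offset`.

    Hypothetical helper for this illustration only.
    """
    try:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        zlib.decompressobj(wbits=31).decompress(data[offset:])
        return True
    except zlib.error:
        return False


# Decoding from byte 0 works; decoding from the midpoint does not,
# because there is no gzip header there to start from.
from_start = try_decompress_from(compressed, 0)
from_middle = try_decompress_from(compressed, len(compressed) // 2)
print(from_start, from_middle)
```

Since a reader can only start at the beginning of the file, the whole file ends up being read by one task.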

Processing 30 GB of gzipped data should not take that long, at least with the 
Scala API. I'm not sure about Python, especially under Spark 1.2.1.
