1. I don't think textFile is capable of unpacking a .gz file. You need to use
hadoopFile or newAPIHadoopFile for this.
Sorry, that's incorrect: textFile works fine on .gz files. What it can't do is
compute splits on gzipped files, since gzip is not a splittable compression
format, so if you have a single file, you'll get a single partition.
You probably want to update this line as follows:
lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)
For more details on why, see this answer:
http://stackoverflow.com/a/27631722/877069
Nick
On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:
1. I don't think textFile is capable of unpacking a .gz file. You need to
use hadoopFile or newAPIHadoopFile for this.
2. Instead of map, do a mapPartitions
3. You need to open the driver UI and see what's really taking time. If
that is running on a remote machine and you are not able to access