Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
> 1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoopFile for this.

Sorry, that's incorrect: textFile works fine on .gz files. What it can't do is compute splits on gzipped files, so if you have a single file, you'll have a single partition.
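A minimal standard-library sketch of why splits can't be computed on a gzip file: a gzip (DEFLATE) stream must be decompressed sequentially from its first byte, and there is no index that lets a reader start at an arbitrary offset, so Spark has to hand the whole file to one partition.

```python
import gzip

def can_decompress_from(blob, offset):
    """Return True if the gzip stream decompresses starting at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except Exception:  # BadGzipFile / zlib.error for a mid-stream start
        return False

data = "\n".join("line %d" % i for i in range(1000)).encode()
blob = gzip.compress(data)

print(can_decompress_from(blob, 0))    # → True: reading from the start works
print(can_decompress_from(blob, 100))  # → False: no valid header mid-stream
```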

Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows:

    lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer: http://stackoverflow.com/a/27631722/877069

Nick

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:
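A local sketch of what the repartition accomplishes: the single partition produced by the gzip read is redistributed into N partitions so downstream stages can run in parallel. The round-robin split below is only an illustration of the redistribution; Spark's actual repartition does a hash-based shuffle.

```python
def round_robin_repartition(records, num_partitions):
    """Spread one list of records across `num_partitions` buckets."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

single_partition = ["line %d" % i for i in range(10)]
parts = round_robin_repartition(single_partition, 3)
print([len(p) for p in parts])  # → [4, 3, 3]
```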

Re: Processing of text file in large gzip archive

2015-03-16 Thread Akhil Das
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoopFile for this.
2. Instead of map, do a mapPartitions.
3. You need to open the driver UI and see what's really taking time. If that is running on a remote machine and you are not able to access
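A hedged sketch of point 2: with mapPartitions, per-partition setup (here, compiling a regex; it could equally be opening a database connection) runs once per partition instead of once per record as it effectively would with map. The record format and the `parse_partition` name are made up for illustration.

```python
import re

def parse_partition(lines):
    """Parse an iterator of lines, doing one-time setup per partition."""
    pattern = re.compile(r"(\w+)=(\d+)")  # compiled once per partition
    for line in lines:
        m = pattern.search(line)
        if m:
            yield (m.group(1), int(m.group(2)))

# In Spark this would be applied as:
#   parsed = sc.textFile("file.gz").mapPartitions(parse_partition)
# Locally the same generator works on any iterable of lines:
print(list(parse_partition(["count=3", "noise", "size=42"])))
# → [('count', 3), ('size', 42)]
```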