Yep. I figured that out. I uncompressed the file and it looks much faster now. Thanks.
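
For anyone finding this thread later, here is a minimal sketch of why a single .gz file reads so slowly. It uses only the Python standard library, not Spark, and the sample data and the helper name `can_decompress_from` are made up for illustration. A gzip stream can only be decoded from its beginning, so a reader handed the middle of the file cannot start there. Plain text can be cut at any newline. In Spark this means `sc.textFile` on a .gz file gives a single partition, so one core does all the decompression. If uncompressing on HDFS is not an option, calling `repartition` on the RDD right after loading should at least spread the later stages across the cluster.

```python
import gzip
import zlib

# Hypothetical stand-in for the real file: many short text records.
raw = b"".join(f"record {i}\n".encode() for i in range(100_000))
compressed = gzip.compress(raw)


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding can begin at the given byte offset."""
    try:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        zlib.decompressobj(wbits=31).decompress(data[offset:])
        return True
    except zlib.error:
        return False


# A reader that starts at byte 0 can decode the stream.
start_ok = can_decompress_from(compressed, 0)

# A reader handed the second half of the file finds no gzip header
# there, so it cannot start decoding.
mid_ok = can_decompress_from(compressed, len(compressed) // 2)

# Uncompressed text is different: jump to any offset, skip ahead to
# the next newline, and start reading whole records from there.
mid = len(raw) // 2
next_line_start = raw.index(b"\n", mid) + 1
second_half_lines = raw[next_line_start:].splitlines()

print("decode from start:", start_ok)
print("decode from middle:", mid_ok)
print("records readable from the middle of the plain text:", len(second_half_lines))
```

That is the same trade-off I saw: the uncompressed file is bigger on disk, but every HDFS block can be read by a different task.
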

On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> .gz files are not splittable and hence harder to process. The easiest fix is
> to move to a splittable compression format like lzo and break the file into
> multiple blocks to be read and used in subsequent processing.
>
> On 11 May 2014 09:01, "Soumya Simanta" <soumya.sima...@gmail.com> wrote:
>
>> I have a Spark cluster with 3 worker nodes.
>>
>> - *Workers:* 3
>> - *Cores:* 48 Total, 48 Used
>> - *Memory:* 469.8 GB Total, 72.0 GB Used
>>
>> I want to process a single compressed file (*.gz) on HDFS. The file is
>> 1.5GB compressed and 11GB uncompressed.
>> When I try to read the compressed file from HDFS, it takes a while (4-5
>> minutes) to load it into an RDD. If I use the .cache operation it takes even
>> longer. Is there a way to make loading the RDD from HDFS faster?
>>
>> Thanks
>> -Soumya