I have a Spark cluster with 3 worker nodes:

- *Workers:* 3
- *Cores:* 48 total, 48 used
- *Memory:* 469.8 GB total, 72.0 GB used

I want to process a single gzip-compressed file (*.gz) stored on HDFS. The file is 1.5 GB compressed and 11 GB uncompressed. When I read the compressed file from HDFS it takes a while (4-5 minutes) to load it into an RDD, and if I use the .cache() operation it takes even longer. Is there a way to make loading the RDD from HDFS faster?

Thanks,
-Soumya
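P.S. For reference, this is roughly how I'm loading the file (a minimal PySpark sketch; the HDFS path and app name below are placeholders, not my real values):

```python
from pyspark import SparkContext

sc = SparkContext(appName="gz-load")  # app name is a placeholder

# Read the gzip-compressed file from HDFS. Note: since gzip is not a
# splittable format, Spark reads the whole file in a single task.
rdd = sc.textFile("hdfs:///data/input.gz")  # placeholder path

rdd.cache()        # marks the RDD for caching; materialized on first action
print(rdd.count()) # first action forces the full decompress-and-read
```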