Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

Soumya Simanta Sun, 11 May 2014 07:18:06 -0700

Yep, I figured that out. I uncompressed the file and it loads much faster
now. Thanks.
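
For anyone finding this thread later, here is a minimal sketch of why a .gz file cannot be split. It is not Spark code: it uses only the Python standard library, and the sample records and the helper name can_decode_from are made up for illustration. It shows that a gzip stream can be decoded from its first byte but not from an arbitrary offset, whereas plain text can be picked up at the next newline after any offset.

```python
import gzip
import zlib

# Hypothetical stand-in for the real file: many newline-terminated records.
records = b"".join(b"record %d\n" % i for i in range(100_000))
compressed = gzip.compress(records)


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this position, or a broken deflate stream.
        return False


# Decoding from the start works; decoding from the middle does not.
whole_ok = can_decode_from(compressed, 0)
middle_ok = can_decode_from(compressed, len(compressed) // 2)

# Uncompressed text can be cut anywhere: skip to the next newline and read on.
mid = len(records) // 2
start = records.index(b"\n", mid) + 1
tail_records = records[start:].splitlines()

print("gzip from start:", whole_ok)
print("gzip from middle:", middle_ok)
print("text records readable from the midpoint:", len(tail_records))
```

This is why a single .gz file ends up being read by one task no matter how many cores the cluster has, while the uncompressed file can be divided among many tasks at block boundaries.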

On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi <[email protected]> wrote:

> .gz files are not splittable, so Spark cannot read one in parallel and it is
> slower to process. The easiest fix is to move to a splittable compression
> format like lzo and break the file into multiple blocks that can be read and
> processed in parallel.
> On 11 May 2014 09:01, "Soumya Simanta" <[email protected]> wrote:
>
>>
>>
>> I have a Spark cluster with 3 worker nodes.
>>
>>
>>    - *Workers:* 3
>>    - *Cores:* 48 Total, 48 Used
>>    - *Memory:* 469.8 GB Total, 72.0 GB Used
>>
>> I want to process a single compressed file (*.gz) on HDFS. The file is
>> 1.5GB compressed and 11GB uncompressed.
>> When I try to read the compressed file from HDFS, it takes a while (4-5
>> minutes) to load it into an RDD. If I use the .cache operation, it takes even
>> longer. Is there a way to make loading the RDD from HDFS faster?
>>
>> Thanks
>>  -Soumya
>>
>>
>>
