Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-15 Thread Soumya Simanta
I have a Spark cluster with 3 worker nodes.


   - *Workers:* 3
   - *Cores:* 48 Total, 48 Used
   - *Memory:* 469.8 GB Total, 72.0 GB Used

I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB
compressed and 11GB uncompressed.
When I try to read the compressed file from HDFS, it takes a while (4-5
minutes) to load it into an RDD. If I use the .cache operation, it takes even
longer. Is there a way to make loading the RDD from HDFS faster?

Thanks
-Soumya
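For reference, a minimal sketch of this setup in spark-shell (where sc is
predefined); the HDFS path is hypothetical:

// A gzip file is not splittable, so the whole 11GB is decompressed
// by a single task on a single core, regardless of cluster size.
val lines = sc.textFile("hdfs:///data/big-file.gz")
println(lines.partitions.length)   // expect 1 partition for one .gz file

lines.cache()    // caching happens inside that same single task
lines.count()    // first action materializes (and caches) the RDD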


Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
Yep, I figured that out. I uncompressed the file and it loads much faster
now. Thanks.



On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi wrote:

> .gz files are not splittable and are therefore harder to process in
> parallel. The easiest fix is to move to a splittable compression format
> such as LZO, so the file can be broken into multiple blocks that are read
> and processed in parallel.


Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Mayur Rustagi
.gz files are not splittable and are therefore harder to process in
parallel. The easiest fix is to move to a splittable compression format such
as LZO, so the file can be broken into multiple blocks that are read and
processed in parallel.
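If switching codecs is not an option, a partial workaround (a sketch for
spark-shell, with a hypothetical path and partition count): the single gzip
is still decompressed by one task, but repartitioning right after the read
spreads the lines across the cluster so that subsequent stages can use all
48 cores.

// The .gz is read and decompressed by one task (1 partition) ...
val raw = sc.textFile("hdfs:///data/big-file.gz")
// ... then shuffled into many partitions for the downstream work.
val spread = raw.repartition(96)
spread.cache()
spread.count()   // later stages on `spread` run with 96-way parallelism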


Is there a way to load a large file from HDFS faster into Spark

2014-05-10 Thread Soumya Simanta
I have a Spark cluster with 3 worker nodes.


   - *Workers:* 3
   - *Cores:* 48 Total, 48 Used
   - *Memory:* 469.8 GB Total, 72.0 GB Used

I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB
compressed and 11GB uncompressed.
When I try to read the compressed file from HDFS, it takes a while (4-5
minutes) to load it into an RDD. If I use the .cache operation, it takes even
longer. Is there a way to make loading the RDD from HDFS faster?

Thanks
-Soumya
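On the ".cache takes even longer" point, a hedged sketch for spark-shell
(hypothetical path): the first action both decompresses and stores the data,
and caching 11GB of text as deserialized String objects is memory-hungry; a
serialized storage level keeps the cached footprint smaller at some CPU cost.

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/big-file.gz")
// Store serialized bytes instead of deserialized String objects.
lines.persist(StorageLevel.MEMORY_ONLY_SER)
lines.count()   // pays the decompression + caching cost once, up front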