Hi Mich,

Yes, you are right. We were getting gz files, and that is causing the issue 
since gzip is not splittable. I will change the input to bzip2 or another 
splittable format and try running it again today. 
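
Not tested yet, but below is a minimal sketch (Scala) of the check I plan to 
run. The paths are hypothetical; the idea is that the gzip input should land 
in a single partition while a bzip2 copy of the same data should give several:

import org.apache.spark.sql.SparkSession

object CompressionPartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compression-partition-check")
      .getOrCreate()

    // Hypothetical input paths; substitute the real files.
    val gzDf  = spark.read.option("header", "true").csv("/data/input/file.csv.gz")
    val bz2Df = spark.read.option("header", "true").csv("/data/input/file.csv.bz2")

    // gzip is not splittable, so this is normally 1; bzip2 can be split.
    println(s"gz partitions:  ${gzDf.rdd.getNumPartitions}")
    println(s"bz2 partitions: ${bz2Df.rdd.getNumPartitions}")

    spark.stop()
  }
}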

Thanks,
Asmath

Sent from my iPhone

> On Mar 25, 2021, at 6:51 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> 
> Hi Asmath,
> 
> Have you actually managed to run this single file? As brought up a few times 
> already, Spark will pull the whole of the GZ file into a single partition, so 
> you can get an out-of-memory error.
> 
> HTH
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
>> On Wed, 24 Mar 2021 at 01:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
>> wrote:
>> Hi,
>> 
>> I have a 10 GB file that should be loaded into a Spark dataframe. The file 
>> is a CSV with a header, and we were using rdd.zipWithIndex to get the column 
>> names and convert to Avro accordingly. 
>> 
>> I am assuming this is why it takes so long: only one executor runs, and we 
>> never achieve parallelism. Is there an easy way to achieve parallelism after 
>> filtering out the header? 
>> 
>> I am also interested in a solution that removes the header from the file so 
>> that I can supply my own schema. That way I can split the files.
>> 
>> rdd.partitions is always 1 for this, even after repartitioning the dataframe 
>> after zipWithIndex. Any help on this topic would be appreciated.
>> 
>> Thanks,
>> Asmath
>> 
>> Sent from my iPhone
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
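
For reference, here is a rough sketch of the direction I am planning to take 
instead of rdd.zipWithIndex for the header: let the CSV reader drop the header 
and supply my own schema, then repartition before writing Avro. The column 
names, paths and partition count below are placeholders, and it assumes the 
spark-avro package is on the classpath:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvWithOwnSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-with-own-schema")
      .getOrCreate()

    // Placeholder schema; the real file has different columns.
    val schema = StructType(Seq(
      StructField("id",   StringType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))

    val df = spark.read
      .option("header", "true")   // the reader drops the header row itself
      .schema(schema)             // explicit schema, no inference, no zipWithIndex
      .csv("/data/input/file.csv.bz2")
      .repartition(200)           // spread the work across executors

    // Requires the external spark-avro package (org.apache.spark:spark-avro).
    df.write.format("avro").save("/data/output/file_avro")

    spark.stop()
  }
}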
