Hi Asmath,

Have you actually managed to run this against the file? As has been
brought up a few times on this list already, gzip is not a splittable
format, so Spark will read the whole of the GZ file into a single
partition handled by a single task, which can lead to an out-of-memory
error.
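
A minimal sketch of the usual check and workaround, assuming a
hypothetical path /data/input.csv.gz and an arbitrary target of 200
partitions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("gzip-demo").getOrCreate()

    // gzip is not splittable, so this DataFrame starts out as one partition
    val df = spark.read
      .option("header", "true")
      .csv("/data/input.csv.gz") // hypothetical path

    println(df.rdd.getNumPartitions) // typically prints 1 for a .gz file

    // Repartition right after the read so downstream stages run in
    // parallel; the initial decompression is still single-threaded.
    val parallel = df.repartition(200)

Note that the repartition only helps the stages after the shuffle; if the
file is too big to decompress in one task, the real fix is to store it
uncompressed or in a splittable format such as bzip2. A sketch for the
header/schema part of your question follows the quoted message below.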

HTH


View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 24 Mar 2021 at 01:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
wrote:

> Hi,
>
> I have a 10 GB file that should be loaded into a Spark dataframe. The
> file is a CSV with a header, and we were using rdd.zipWithIndex to get
> the column names and convert to Avro accordingly.
>
> I am assuming this is why it takes a long time: only one executor runs
> and it never achieves parallelism. Is there an easy way to achieve
> parallelism after filtering out the header?
>
> I am also interested in a solution that removes the header from the
> file so that I can supply my own schema. That way I can split the files.
>
> rdd.partitions.length is always 1 for this, even after repartitioning
> the dataframe after zipWithIndex. Any help on this topic, please.
>
> Thanks,
> Asmath
>
> Sent from my iPhone
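PS. On the header/schema point in your message: with the DataFrame reader
there is no need for rdd.zipWithIndex at all. A minimal sketch, assuming
hypothetical column names and paths (writing Avro requires the spark-avro
package on the classpath):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

    // Hypothetical schema; replace with the file's real columns.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("amount", DoubleType, nullable = true)
    ))

    // header=true makes Spark skip the first line, and the explicit
    // schema avoids a second pass over the data for inference.
    val df = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/data/input.csv") // hypothetical path

    df.write.format("avro").save("/data/output_avro") // hypothetical path

If the file stays uncompressed, Spark will split a 10 GB CSV into many
partitions on its own; a partition count stuck at 1 usually points back
to the gzip issue above.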
