Hi Mich,

Yes, you are right. We were getting .gz files, and that is causing the issue. I will change it to bzip2 or another splittable format and try running it again today; a sketch of the interim workaround is below.
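Until the files are recompressed, here is roughly what I am planning. This is a sketch only: the paths, the partition count, and the spark-avro package coordinates are placeholders rather than our real setup.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

// A .gz file is not splittable, so Spark reads it as a single partition.
val df = spark.read
  .option("header", "true")            // let Spark drop the header line itself
  .csv("/data/input/big_file.csv.gz")  // placeholder path

println(df.rdd.getNumPartitions)       // prints 1 for gzip input

// An explicit repartition after the single-threaded read at least
// parallelises the downstream conversion:
val parallel = df.repartition(200)     // 200 is a guess; tune to the cluster

// needs the spark-avro package on the classpath, e.g.
// --packages org.apache.spark:spark-avro_2.12:3.1.1
parallel.write.format("avro").save("/data/output/big_file_avro")

The read itself is still a single task either way; only the work after the repartition runs in parallel, which is why switching to a splittable codec is the real fix.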
Thanks,
Asmath

Sent from my iPhone

> On Mar 25, 2021, at 6:51 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi Asmath,
>
> Have you actually managed to run this single file? Because Spark (as brought up a few times already) will pull the whole of the GZ file into a single partition in the driver, and you can get an out-of-memory error.
>
> HTH
>
>> On Wed, 24 Mar 2021 at 01:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>> Hi,
>>
>> I have a 10 GB file that should be loaded into a Spark dataframe. The file is CSV with a header, and we were using rdd.zipWithIndex to get the column names and convert to Avro accordingly.
>>
>> I assume this is why it takes so long: only one executor runs, and it never achieves parallelism. Is there an easy way to achieve parallelism after filtering out the header?
>>
>> I am also interested in a solution that removes the header from the file so that I can supply my own schema. That way I can split the files.
>>
>> rdd.partitions is always 1 for this, even after repartitioning the dataframe after zipWithIndex. Any help on this topic, please.
>>
>> Thanks,
>> Asmath
>>
>> Sent from my iPhone
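PS: on the header question in the original mail above, a user-supplied schema plus the header option should remove the need for zipWithIndex entirely. A sketch, with invented column names and types:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("own-schema").getOrCreate()

// Invented schema: replace with the real column names and types.
val schema = StructType(Seq(
  StructField("id",     LongType,   nullable = false),
  StructField("name",   StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

// header=true makes Spark discard the header line; supplying the schema
// avoids the extra pass that inferSchema would need. With a splittable
// input this read comes back with many partitions, so nothing is forced
// through a single task.
val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("/data/input/big_file.csv")   // placeholder path

If you have to stay at the RDD level, mapPartitionsWithIndex with a drop of the first line in partition 0 is the usual header trick that avoids zipWithIndex's extra pass over the data.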