Hi Mohammed, I think the reason only one executor is running with a single partition is that you have a single file that may be read/loaded into memory as one piece.
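The DataFrame reader splits a plain-text CSV across partitions on its own, so the zipWithIndex pass isn't needed. A minimal Scala sketch of a header-aware read plus Avro write (the paths, column names, and schema below are placeholder assumptions, and the spark-avro package must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

// Supplying a schema avoids an extra pass over the 10 GB file to infer types.
// Column names here are placeholders -- replace with the real header names.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("value", DoubleType)
))

val df = spark.read
  .option("header", "true")   // Spark drops the header line itself
  .schema(schema)
  .csv("/path/to/input.csv")  // placeholder input path

// Optionally force more partitions if the source still reads as one.
df.repartition(64)
  .write
  .format("avro")             // requires the spark-avro package
  .save("/path/to/output")    // placeholder output path
```

With `header` set to `true` there is no header row left to filter out, and an uncompressed text file is split across executors automatically.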
In order to achieve better parallelism, I'd suggest splitting the CSV file. Another question: why are you using an RDD at all? Just:

spark.read.option("header", true).csv(...).select(...).write.format("avro").save(...)

> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>
> Hi,
>
> I have a 10 GB file that should be loaded into a Spark dataframe. This file is
> CSV with a header, and we were using rdd.zipWithIndex to get the column names
> and convert to Avro accordingly.
>
> I am assuming this is why it takes a long time and only one executor runs,
> never achieving parallelism. Is there an easy way to achieve parallelism after
> filtering out the header?
>
> I am also interested in a solution that can remove the header from the file so
> I can give my own schema. This way I can split the files.
>
> rdd.partitions is always 1 for this, even after repartitioning the dataframe
> after zipWithIndex. Any help on this topic, please.
>
> Thanks,
> Asmath
>
> Sent from my iPhone
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> ---------------------------------------------------------------------