Hi Mohammed, I think the reason only one executor is running with a single partition is that you have a single file that may be read/loaded into memory as one piece.
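The DataFrame reader splits a plain-text CSV across partitions on its own, so the zipWithIndex pass isn't needed. A minimal Scala sketch of a header-aware read plus Avro write (the paths, column names, and schema below are placeholder assumptions, and the spark-avro package must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

// Supplying a schema avoids an extra pass over the 10 GB file to infer types.
// Column names here are placeholders -- replace with the real header names.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("value", DoubleType)
))

val df = spark.read
  .option("header", "true")   // Spark drops the header line itself
  .schema(schema)
  .csv("/path/to/input.csv")  // placeholder input path

// Optionally force more partitions if the source still reads as one.
df.repartition(64)
  .write
  .format("avro")             // requires the spark-avro package
  .save("/path/to/output")    // placeholder output path
```

With `header` set to `true` there is no header row left to filter out, and an uncompressed text file is split across executors automatically.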
In order to achieve better parallelism, I'd suggest splitting the CSV file. Another question: why are you using an RDD at all? Just:

spark.read.option("header", true).csv(...).select(...).write.format("avro").save(...)

> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>
> Hi,
>
> I have a 10 GB file that should be loaded into a Spark dataframe. This file is
> CSV with a header, and we were using rdd.zipWithIndex to get the column names
> and convert to Avro accordingly.
>
> I am assuming this is why it takes a long time and only one executor runs,
> never achieving parallelism. Is there an easy way to achieve parallelism after
> filtering out the header?
>
> I am also interested in a solution that can remove the header from the file so
> I can give my own schema. This way I can split the files.
>
> rdd.partitions is always 1 for this, even after repartitioning the dataframe
> after zipWithIndex. Any help on this topic, please.
>
> Thanks,
> Asmath
>
> Sent from my iPhone
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> ---------------------------------------------------------------------