So Spark by default doesn't split the large 10 GB file when loaded?

Sent from my iPhone
> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) <yur...@gmail.com> wrote:
>
> Hi, Mohammed
>
> I think the reason only one executor is running, with a single
> partition, is that you have a single file that is being read/loaded
> into memory in one piece.
>
> To achieve better parallelism, I'd suggest splitting the csv file.
>
> Another question: why are you using an RDD at all? Just:
>
> spark.read.option("header", true).csv(...).select(...).write.format("avro").save(...)
>
>> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a 10 GB file that should be loaded into a Spark dataframe. The
>> file is csv with a header, and we were using rdd.zipWithIndex to get the
>> column names and convert to avro accordingly.
>>
>> I am assuming this is why it takes so long: only one executor runs and
>> we never achieve parallelism. Is there an easy way to achieve
>> parallelism after filtering out the header?
>>
>> I am also interested in a solution that removes the header from the
>> file so I can supply my own schema. That way I can split the files.
>>
>> rdd.partitions is always 1 for this, even after repartitioning the
>> dataframe after zipWithIndex. Any help on this topic please.
>>
>> Thanks,
>> Asmath
>>
>> Sent from my iPhone
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
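For the header-removal idea Asmath asks about, here is a minimal plain-Python sketch (outside Spark, names and the in-memory buffers are illustrative): it streams the file line by line, so a 10 GB input never has to fit in memory, and leaves a headerless data file that can be split into chunks and loaded with an explicit schema.

```python
import io


def strip_header(src, header_out, data_out):
    """Copy the first line of `src` to `header_out` and the rest to `data_out`.

    Streams line by line; on a real file you'd pass open file handles
    (e.g. open("big.csv") and open("data.csv", "w") -- hypothetical names).
    Returns the header line so the caller can build a schema from it.
    """
    header = src.readline()
    header_out.write(header)
    for line in src:
        data_out.write(line)
    return header


if __name__ == "__main__":
    # Small in-memory stand-in for the 10 GB csv.
    sample = io.StringIO("id,name\n1,alice\n2,bob\n")
    header_buf, data_buf = io.StringIO(), io.StringIO()
    strip_header(sample, header_buf, data_buf)
    print(header_buf.getvalue().strip())  # id,name
```

The headerless output can then be split with a tool like `split` and read back with `spark.read.schema(...).csv(...)`, giving Spark multiple files (and hence multiple partitions) to work with.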