Hi Mohammed,
I think the reason only one executor is running with a single partition is
that you have a single file, which is being read/loaded as one unit.

To achieve better parallelism, I'd suggest splitting the CSV file.
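
The splitting suggested above could be sketched roughly like this (a minimal
shell sketch; the file names and the chunk size are illustrative only — a real
10 GB file would use something like `-l 1000000`):

```shell
# Split a large CSV into chunks while keeping the header in each part,
# so every chunk can be read with the same schema.
# For the demo we generate a tiny input.csv; replace with the real file.
printf 'a,b\n1,2\n3,4\n5,6\n' > input.csv

head -n 1 input.csv > header.csv           # save the header once
tail -n +2 input.csv | split -l 2 - part_  # split only the data rows
for f in part_*; do
  cat header.csv "$f" > "chunk_${f}.csv"   # prepend the header to each chunk
  rm "$f"
done
ls chunk_part_*.csv
```

Each resulting chunk is a valid standalone CSV with the header, so a glob path
can then be read in parallel.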

Another question: why are you using an RDD at all? Just use
spark.read.option("header",
true).csv(...).select(....).write.format("avro").save(...)


> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I have a 10 GB file that should be loaded into a Spark DataFrame. The file is
> a CSV with a header, and we were using rdd.zipWithIndex to get the column
> names and convert to Avro accordingly.
> 
> I am assuming this is why it takes a long time: only one executor runs and it
> never achieves parallelism. Is there an easy way to achieve parallelism after
> filtering out the header?
> 
> I am also interested in a solution that can remove the header from the file
> so that I can supply my own schema. That way I can split the files.
> 
> rdd.partitions is always 1 for this, even after repartitioning the DataFrame
> after zipWithIndex. Any help on this topic, please.
> 
> Thanks,
> Asmath
> 
> Sent from my iPhone
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

