So Spark by default doesn't split the large 10 GB file when loaded?

> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) <yur...@gmail.com> wrote:
> 
> Hi Mohammed,
> I think the reason that only one executor is running with a single
> partition is that you have a single file, which might be read/loaded into
> memory as one unit.
> 
> To achieve better parallelism, I'd suggest splitting the CSV file.
> 
> Another question: why are you using an RDD at all?
> Just spark.read.format("csv").option("header", true).load(...).select(...).write.format("avro").save(...)
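> 
> As a minimal sketch (paths and column names are placeholders, and it
> assumes the spark-avro package is on the classpath, e.g. via
> --packages org.apache.spark:spark-avro_2.12:3.1.1):
> 
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()
> 
> spark.read
>   .format("csv")
>   .option("header", "true")    // first line supplies the column names
>   .load("/path/to/input.csv")  // placeholder path
>   .select("col1", "col2")      // placeholder column names
>   .write
>   .format("avro")
>   .save("/path/to/output")     // placeholder path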
> 
> 
>> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
>> wrote:
>> 
>> Hi,
>> 
>> I have a 10 GB file that should be loaded into a Spark DataFrame. The file
>> is a CSV with a header, and we were using rdd.zipWithIndex to get the
>> column names and convert to Avro accordingly.
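>> 
>> Roughly like this (a sketch; the path and variable names are placeholders):
>> 
>> val raw = spark.sparkContext.textFile("/path/to/input.csv") // placeholder
>> val header = raw.first()                 // first line holds column names
>> val data = raw.zipWithIndex()
>>   .filter { case (_, idx) => idx > 0 }   // drop the header row by index
>>   .map { case (line, _) => line }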
>> 
>> I am assuming this is why it takes so long: only one executor runs and the
>> job never achieves parallelism. Is there an easy way to achieve
>> parallelism after filtering out the header?
>> 
>> I am also interested in a solution that removes the header from the file
>> so that I can supply my own schema. This way I can split the files.
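>> 
>> What I have in mind, as a sketch (column names and types are hypothetical):
>> 
>> import org.apache.spark.sql.types._
>> 
>> val schema = StructType(Seq(
>>   StructField("id", LongType),    // hypothetical columns
>>   StructField("name", StringType)
>> ))
>> 
>> val df = spark.read
>>   .schema(schema)            // supply our own schema instead of inferring
>>   .option("header", "true")  // header line is skipped, not used
>>   .csv("/path/to/input.csv") // placeholder path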
>> 
>> rdd.partitions is always 1 here, even after repartitioning the DataFrame
>> after zipWithIndex. Any help on this topic would be appreciated.
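>> 
>> The check I am running, roughly (variable names are placeholders):
>> 
>> val repartitioned = df.repartition(100)     // explicit repartition
>> println(repartitioned.rdd.getNumPartitions) // still reports 1 for me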
>> 
>> Thanks,
>> Asmath
>> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
