I'm not a Spark core developer and don't want to confuse you, but it seems
logical to me that just reading from a single file (no matter what format the
file is in) gives no parallelism unless you repartition by some column right
after the csv load. But you're saying you've already tried repartition with
no luck...
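
Roughly what I mean, as an untested sketch (assuming a spark-shell session
where spark is already defined; the path and partition count are placeholders,
not taken from your job):

    val df = spark.read
      .option("header", "true")
      .csv("/path/to/big_file.csv")             // the single 10 GB csv

    // force a shuffle into more partitions; optionally key by a column,
    // e.g. df.repartition(200, df("some_key"))
    val repartitioned = df.repartition(200)

    println(repartitioned.rdd.getNumPartitions) // should now report 200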


> On 24 Mar 2021, at 03:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
> wrote:
> 
> So Spark by default doesn't split the large 10 GB file when it's loaded?
> 
> Sent from my iPhone
> 
>> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (‫יורי אולייניקוב‬‎) 
>> <yur...@gmail.com> wrote:
>> 
>> Hi, Mohammed 
>> I think the reason that only one executor is running with a single
>> partition is that you have a single file, which might be read/loaded into
>> memory as a whole.
>> 
>> In order to achieve better parallelism, I'd suggest splitting the csv file.
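>> 
>> A one-off way to do the split with Spark itself, as a rough sketch (paths
>> and the file count are only examples, not taken from your setup):
>> 
>>     spark.read.option("header", "true").csv("/path/to/big_file.csv")
>>       .repartition(64)                                // 64 smaller files
>>       .write.option("header", "true").csv("/path/to/split_csv/")
>> 
>>     // later reads of the directory start with many input partitions
>>     val df = spark.read.option("header", "true").csv("/path/to/split_csv/")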
>> 
>> Another question: why are you using the rdd API at all?
>> Just spark.read.option("header",
>> true).format("csv").load(...).select(....).write.format("avro").save(...)
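>> 
>> Spelled out a bit more (the column names and paths here are placeholders,
>> and the avro write assumes the spark-avro package is on the classpath):
>> 
>>     val df = spark.read
>>       .option("header", "true")       // first row becomes the column names
>>       .csv("/path/to/big_file.csv")
>> 
>>     df.select("col_a", "col_b")       // whichever columns you actually need
>>       .write
>>       .format("avro")
>>       .save("/path/to/avro_output")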
>> 
>> 
>>>> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
>>>> wrote:
>>> 
>>> Hi,
>>> 
>>> I have a 10 GB file that should be loaded into a Spark dataframe. The file
>>> is csv with a header, and we were using rdd.zipWithIndex to get the column
>>> names and convert to avro accordingly.
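>>> 
>>> Roughly what our current code does (a simplified sketch, column and path
>>> names are made up):
>>> 
>>>     val lines = spark.sparkContext.textFile("/path/to/big_file.csv")
>>>     val columnNames = lines.first().split(",")          // header row
>>>     val data = lines.zipWithIndex()
>>>       .filter { case (_, idx) => idx > 0 }              // drop the header
>>>       .map { case (line, _) => line.split(",") }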
>>> 
>>> I am assuming this is why it is taking a long time: only one executor runs
>>> and parallelism is never achieved. Is there an easy way to achieve
>>> parallelism after filtering out the header?
>>> 
>>> I am also interested in a solution that removes the header from the file
>>> and lets me give my own schema. This way I can split the files.
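>>> 
>>> Something along these lines is what I have in mind (the schema fields and
>>> path below are just an illustration, not our real layout):
>>> 
>>>     import org.apache.spark.sql.types._
>>> 
>>>     val mySchema = StructType(Seq(
>>>       StructField("id", LongType),
>>>       StructField("name", StringType)))
>>> 
>>>     // header=true skips the header row; schema(...) lets me give my own types
>>>     val df = spark.read
>>>       .option("header", "true")
>>>       .schema(mySchema)
>>>       .csv("/path/to/big_file.csv")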
>>> 
>>> rdd.partitions is always 1 for this, even after repartitioning the
>>> dataframe after zipWithIndex. Any help on this topic, please.
>>> 
>>> Thanks,
>>> Asmath
>>> 
>>> Sent from my iPhone

