Re: Repartition not working on a csv file

2018-07-01 Thread Abdeali Kothari
I prefer not to do a .cache() due to memory limits, but I did try a persist() with DISK_ONLY: I did the repartition(), followed by a .count(), followed by a persist() with DISK_ONLY. That didn't change the number of tasks either. …

Re: Repartition not working on a csv file

2018-07-01 Thread Alexander Czech
You could try to force a repartition right at that point by producing a cached version of the DF with .cache(), if memory allows it. …

Re: Repartition not working on a csv file

2018-06-30 Thread Abdeali Kothari
I've tried that too - it doesn't work. It does a repartition, but not right after the broadcast join - it does a lot more processing and does the repartition right before I do my next sort-merge join (stage 12 I described above). As the heavy processing is before the sort-merge join, it still doesn't help …

Re: Repartition not working on a csv file

2018-06-30 Thread yujhe.li
Abdeali Kothari wrote:
> My entire CSV is less than 20KB. Somewhere in between, I do a broadcast
> join with 3500 records in another file. After the broadcast join I have a
> lot of processing to do. Overall, the time to process a single record goes
> up to 5 mins on 1 executor.
>
> I'm trying …

Re: Repartition not working on a csv file

2018-06-30 Thread Abdeali Kothari
My entire CSV is less than 20KB. Somewhere in between, I do a broadcast join with 3500 records in another file. After the broadcast join I have a lot of processing to do. Overall, the time to process a single record goes up to 5 mins on 1 executor. I'm trying to increase the partitions that my da…

Re: Repartition not working on a csv file

2018-06-30 Thread yujhe.li
Abdeali Kothari wrote:
> I am using Spark 2.3.0 and trying to read a CSV file which has 500 records.
> When I try to read it, Spark says that it has two stages: 10, 11 and then
> they join into stage 12.

What's your CSV size per file? I think the Spark optimizer may put many files into one task when …

Repartition not working on a csv file

2018-06-18 Thread Abdeali Kothari
I am using Spark 2.3.0 and trying to read a CSV file which has 500 records. When I try to read it, Spark says that it has two stages: 10, 11, and then they join into stage 12. This makes sense and is what I would expect, as I have 30 map-based UDFs after which I do a join, and run another 10 UDFs a…