Re: Job is not able to perform Broadcast Join

2020-10-06 Thread David Edwards
After adding the sequential ids you might need a repartition? I've found before that when using a monotonically increasing id the df goes to a single partition. It usually becomes clear in the Spark UI though.

On Tue, 6 Oct 2020, 20:38 Sachit Murarka wrote:
> Yes, even I tried the same first. Then I moved
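[A minimal PySpark sketch of the symptom and fix described above. One common way a sequential id collapses a df to one partition is row_number over an unpartitioned window; the column names and the partition count of 200 are illustrative assumptions, not from the thread.]

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("seq-id-repartition").getOrCreate()

    df = spark.range(1_000_000).withColumnRenamed("id", "value")

    # A row_number over an unpartitioned window pulls every row into
    # one partition (Spark also logs a "No Partition Defined" warning).
    w = Window.orderBy("value")
    df_with_id = df.withColumn("seq_id", F.row_number().over(w))
    print(df_with_id.rdd.getNumPartitions())  # typically 1 here

    # Repartition afterwards to restore parallelism; 200 is arbitrary.
    df_with_id = df_with_id.repartition(200)
    print(df_with_id.rdd.getNumPartitions())  # 200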

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
you say "do the count as the final step", are you referring to > getting the counts of the individual data frames, or from the already > outputted parquet? > > Thanks and I appreciate your reply > > On Thu, Feb 13, 2020 at 4:15 PM David Edwards > wrote: > >> Hi Ashley,

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, I'm not an expert, but I think this is because Spark does lazy execution and doesn't actually perform any actions until you do some kind of write, count or other operation on the dataframe. If you remove the count steps it will work out a more efficient execution plan, reducing the number
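[A small PySpark sketch of the lazy-execution point above: transformations only build a plan, each count() is an action that triggers a full job, and deferring to a single final action lets Spark optimize the whole plan. The column names and path are made up for illustration.]

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-execution").getOrCreate()

    df = spark.range(1_000_000)

    # Transformations are lazy: nothing executes yet.
    filtered = df.filter(F.col("id") % 2 == 0)
    doubled = filtered.withColumn("double", F.col("id") * 2)

    # Each count() is an action that runs a full job over the lineage
    # above it; interleaving counts forces Spark to redo that work.
    # print(filtered.count())  # extra job if uncommented

    # A single action at the end executes one optimized plan.
    doubled.write.mode("overwrite").parquet("/tmp/doubled.parquet")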