Re: Job is not able to perform Broadcast Join

2020-10-06 Thread David Edwards
After adding the sequential ids you might need a repartition? I've found before that after using monotonically_increasing_id() the df ends up in a single partition. It usually becomes clear in the Spark UI, though.
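
A minimal sketch of the repartition step David describes, assuming PySpark; the DataFrame, column names, and target partition count are illustrative, not from the thread:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("payload", F.rand())

# Tag rows with ids, then check how the data is laid out.
df_with_id = df.withColumn("seq_id", F.monotonically_increasing_id())
print(df_with_id.rdd.getNumPartitions())  # also visible in the Spark UI

# Spread the rows back out before the next expensive operation.
df_repartitioned = df_with_id.repartition(200)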

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
Yes, even I tried the same first. Then I moved to the join method because shuffle spill was happening, since row_number() without a partition runs on a single task. Instead of processing the entire dataframe on a single task, I have broken it down into df1 and df2 and am joining them, because df2 is very small.
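
One hedged reading of this split-and-join idea, with entirely hypothetical names (df1, df2, key): the global row_number() is confined to the small df2 and the result is joined back onto the large df1:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(10_000_000).withColumn("key", F.col("id") % 1000)  # large side
df2 = df1.select("key").distinct()                                   # much smaller side

# A global ordering is affordable here because df2 has comparatively few rows.
w = Window.orderBy("key")
df2_with_id = df2.withColumn("seq_id", F.row_number().over(w))

# Attach the generated ids back onto the large DataFrame.
result = df1.join(df2_with_id, on="key", how="left")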

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Eve Liao
Try to avoid broadcast. I thought this could be helpful: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6
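
For reference, one common way to get sequential ids without a global window (the topic of the linked article) is RDD zipWithIndex(); this sketch is illustrative and not necessarily the article's exact code:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

# zipWithIndex() pairs each row with a contiguous 0-based index
# without forcing everything through a single window task.
with_index = (
    df.rdd
      .zipWithIndex()
      .map(lambda pair: Row(seq_id=pair[1], **pair[0].asDict()))
      .toDF()
)
with_index.show()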

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
Thanks, Eve, for the response. Yes, I know we can use broadcast for smaller datasets. I increased the threshold (4 GB) for the same, but it still did not work, and df3 is somewhat greater than 2 GB. Trying by removing broadcast as well; the job has been running for an hour now. Will let you know. Thanks, Sachit
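
For context, the threshold mentioned here is typically the spark.sql.autoBroadcastJoinThreshold setting (in bytes); the snippet below only illustrates how a 4 GB value would be set and is not the job's actual configuration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("broadcast-threshold-example")
    # Value is in bytes; setting it to -1 disables automatic broadcast joins.
    .config("spark.sql.autoBroadcastJoinThreshold", str(4 * 1024 * 1024 * 1024))
    .getOrCreate()
)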

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread Eve Liao
How many rows does df3 have? Broadcast joins are a great way to append data stored in relatively *small* single-source-of-truth data files to large DataFrames. DataFrames up to 2 GB can be broadcast, so a data file with tens or even hundreds of thousands of rows is a broadcast candidate. Your
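
A minimal sketch of the pattern described here, with hypothetical DataFrame and column names: the small "source of truth" side is explicitly broadcast into the join so the large side is not shuffled:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.range(10_000_000).withColumnRenamed("id", "key")
small_df = spark.createDataFrame([(0, "a"), (1, "b")], ["key", "label"])

# broadcast() ships small_df to every executor; sensible only while
# small_df comfortably fits in executor memory.
joined = large_df.join(broadcast(small_df), on="key", how="left")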

Job is not able to perform Broadcast Join

2020-10-06 Thread Sachit Murarka
Hello Users, I am facing an issue in a Spark job where I am doing row_number() without a partition by clause because I need to add sequentially increasing IDs. But to avoid the large spill I am not doing row_number() over the complete data frame. Instead I am applying monotonically_increasing_id on
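
A short sketch of the two id strategies being contrasted, on a made-up DataFrame: the window version has no partitionBy, so Spark moves every row to one task to order it globally, which is the spill being described; monotonically_increasing_id() stays distributed but its ids are not consecutive:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("payload", F.rand())

# (1) Global row_number(): truly sequential ids, but a single-task window.
w = Window.orderBy("id")
with_row_number = df.withColumn("seq_id", F.row_number().over(w))

# (2) monotonically_increasing_id(): runs per partition, no shuffle,
#     but ids are unique and increasing rather than consecutive.
with_mono_id = df.withColumn("mono_id", F.monotonically_increasing_id())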