Re: [pyspark 2.4+] BucketBy SortBy doesn't retain sort order

2020-03-03 Thread Rishi Shah
Hi All,

Just checking in to see if anyone has any advice on this.

Thanks,
Rishi

On Mon, Mar 2, 2020 at 9:21 PM Rishi Shah wrote:
> Hi All,
>
> I have 2 large tables (~1TB), I used the following to save both the
> tables. Then when I try to join both tables with join_column, it still does
>

[pyspark 2.4+] BucketBy SortBy doesn't retain sort order

2020-03-02 Thread Rishi Shah
Hi All,

I have 2 large tables (~1TB), and I used the following to save both of them. When I then try to join the two tables on join_column, Spark still does a shuffle and sort before the join. Could someone please help?

df.repartition(2000).write.bucketBy(1,
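One possible explanation for the remaining sort, offered as a sketch rather than a diagnosis: in a bucketed write each writer task emits its own file per bucket, and as far as I understand Spark 2.4 only trusts the bucket sort order when every bucket consists of a single file. The Spark-free model below illustrates why. The helper names (write_bucketed, scan_bucket, is_sorted), the sample data, and the use of 4 writer tasks are all hypothetical; only join_column and the single bucket come from the post above.

```python
from collections import defaultdict


def write_bucketed(partitions, num_buckets, key):
    """Toy model of a bucketed, sorted write.

    Each input partition stands for one writer task. A task sorts its own
    rows per bucket and emits one file per bucket it touches.
    """
    files = defaultdict(list)  # bucket id -> list of files (sorted row lists)
    for rows in partitions:
        per_bucket = defaultdict(list)
        for row in rows:
            per_bucket[hash(row[key]) % num_buckets].append(row)
        for bucket, bucket_rows in per_bucket.items():
            files[bucket].append(sorted(bucket_rows, key=lambda r: r[key]))
    return files


def scan_bucket(files, bucket):
    """Read a bucket by concatenating its files one after another."""
    return [row for f in files[bucket] for row in f]


def is_sorted(rows, key):
    """True if rows are in non-decreasing order of key."""
    return all(a[key] <= b[key] for a, b in zip(rows, rows[1:]))


# Hypothetical data: 20 integer join keys spread round-robin over 4 tasks.
rows = [{"join_column": k} for k in range(20)]
round_robin = [rows[i::4] for i in range(4)]

# Four tasks writing into one bucket produce four individually sorted files.
many_files = write_bucketed(round_robin, num_buckets=1, key="join_column")
# A single task writing into one bucket produces one sorted file.
one_file = write_bucketed([rows], num_buckets=1, key="join_column")

many_sorted = is_sorted(scan_bucket(many_files, 0), "join_column")
one_sorted = is_sorted(scan_bucket(one_file, 0), "join_column")

print(len(many_files[0]), many_sorted)
print(len(one_file[0]), one_sorted)
```

Each file is sorted on its own, but a bucket read back from several files is not, so a sort before the join is still needed in that case. If this is what is happening here, arranging for one file per bucket at write time (for example by repartitioning on join_column so that each bucket is written by a single task) would be the thing to try; whether that is practical at this data size is a separate question.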