Re: Question about bucketing and custom partitioners

2022-04-11 Thread Gourav Sengupta
Hi, have you checked the skew settings in Spark 3.2? I am also not quite sure why you need a custom partitioner. While RDDs still remain a valid option, you should try to explore the more recent ways of thinking about and framing better solutions in Spark. Regards, Gourav Sengupta
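For reference, a minimal sketch of the Spark 3.2 skew settings presumably being alluded to here, assuming adaptive query execution (AQE) skew-join handling is what is meant; the values shown are the documented defaults.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: AQE skew-join settings in Spark 3.2. Both flags default
// to true in 3.2; they are set explicitly here for visibility.
val spark = SparkSession.builder()
  .appName("aqe-skew-join")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  // A partition is treated as skewed when it is at least
  // skewedPartitionFactor times the median partition size AND larger
  // than the absolute byte threshold below (both shown at their defaults).
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
  .getOrCreate()
```

With these in place, Spark splits skewed partitions at join time automatically, which is often enough to make a hand-rolled partitioner unnecessary.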

Re: Question about bucketing and custom partitioners

2022-04-11 Thread Lalwani, Jayesh
You can partition and bucket a DataFrame by any column, and you can create a column using an expression. You can add a partition_id column to your DataFrame and partition/bucket by that column.
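A hypothetical sketch of that suggestion: derive a partition_id column from an expression (here a hash of the join key modulo a bucket count, but any deterministic expression works) and bucket the write by it. The table and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hash, lit, pmod}

val spark = SparkSession.builder().appName("bucket-by-expression").getOrCreate()
val df = spark.table("sales") // assumed input table

// Derive the bucketing column from an arbitrary expression.
val withId = df.withColumn("partition_id", pmod(hash(df("customer_id")), lit(64)))

withId.write
  .mode("overwrite")
  .bucketBy(64, "partition_id")  // bucket by the derived column
  .sortBy("partition_id")
  .saveAsTable("sales_bucketed") // bucketBy only works with saveAsTable
```

Because the expression is yours, this effectively lets you emulate a custom partitioner on top of the DataFrame API's built-in bucketing.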

Re: Question about bucketing and custom partitioners

2022-04-11 Thread ayan guha
IMHO you should ask this on the dev mailing list for a better response and suggestions.

Question about bucketing and custom partitioners

2022-04-11 Thread David Diebold
Hello, I have a few questions related to bucketing and custom partitioning in the DataFrame API. I am considering bucketing to get a shuffle-free join on one side in incremental jobs, but there is one thing that I'm not happy with. Data is likely to grow and skew over time. At some point, I would need to
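A minimal sketch of the pattern in question, with illustrative table and column names: write the large, stable table bucketed on the join key once, then join each incremental batch against it. With spark.sql.sources.bucketing.enabled (true by default), Spark can read the bucketed side without shuffling it, so only the incremental side is shuffled.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("one-side-shuffle-free-join").getOrCreate()

// One-time (or occasional) write of the big table, bucketed on the join key.
spark.table("events_raw").write
  .mode("overwrite")
  .bucketBy(128, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")

// Each incremental run: only the small delta needs to be shuffled to
// align with the 128 existing buckets.
val incremental = spark.table("events_increment") // the daily delta
val joined = incremental.join(spark.table("events_bucketed"), Seq("user_id"))
```

The limitation driving the question remains: the bucket count is fixed at write time, so if the data grows or skews, the bucketed table has to be fully rewritten with a new bucket count.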