Hi,
Have you checked the skew settings in Spark 3.2?
I am also not quite sure why you need a custom partitioner. While the RDD API
is still a valid option, it is worth exploring the more recent DataFrame and
SQL approaches in Spark before reaching for it.
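For example, something along these lines turns on the adaptive skew-join
handling; this is only a sketch, and the factor/threshold shown are just the
defaults, so tune them for your data:

import org.apache.spark.sql.SparkSession

// Illustrative only: AQE skew-join settings as of Spark 3.2 (AQE itself is on by default).
val spark = SparkSession.builder()
  .appName("skew-join-settings")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
  .getOrCreate()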
Regards,
Gourav Sengupta
On Mon, Apr 11, 2022 at 4:47
You can partition and bucket a DataFrame by any column, and you can create a
column using an expression. So you can add a partition_id column to your
DataFrame and partition/bucket by that column.
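A rough sketch of that idea (the column and table names here are made up, and
note that bucketBy only works together with saveAsTable):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
import spark.implicits._

// Toy input; replace with your real DataFrame.
val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("user_id", "payload")

// Derive a partition_id column from an ordinary expression...
val withPartitionId = df.withColumn("partition_id", pmod(hash(col("user_id")), lit(64)))

// ...and bucket/sort on it when writing out the table.
withPartitionId.write
  .format("parquet")
  .bucketBy(64, "partition_id")
  .sortBy("partition_id")
  .saveAsTable("events_bucketed")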
From: David Diebold
Date: Monday, April 11, 2022 at 11:48 AM
To: "user @spark"
Subject: [EXTERNAL]
IMHO you should ask this on the dev mailing list for better responses and suggestions.
On Tue, 12 Apr 2022 at 1:47 am, David Diebold
wrote:
> Hello,
>
> I have a few questions related to bucketing and custom partitioning in
> the DataFrame API.
>
> I am considering bucketing to perform one-side shuffle-free
Hello,
I have a few questions related to bucketing and custom partitioning in the
DataFrame API.
I am considering bucketing to perform one-side shuffle-free joins in
incremental jobs, but there is one thing that I'm not happy with.
Data is likely to grow/skew over time. At some point, I would need to