Question about bucketing and custom partitioners

David Diebold Mon, 11 Apr 2022 08:47:07 -0700

Hello,

I have a few questions related to bucketing and custom partitioning in
the DataFrame API.


I am considering bucketing to get a join that is shuffle-free on one side
in incremental jobs, but there is one thing that I'm not happy with.
The data is likely to grow or become skewed over time. At some point, I
would need to change the number of buckets, which would trigger a shuffle.

Instead, I would like to use a custom partitioner that would replace the
shuffle with a narrow transformation.
That was feasible with the RDD developer API. For example, I could use a
partitioning scheme such as:
partition_id = (nb_partitions-1) * (hash(column) - Int.MinValue) /
(Int.MaxValue - Int.MinValue)
When I multiply the number of partitions by 2, each new partition depends
on only one partition of the parent (=> narrow transformation).
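
To make the idea concrete, here is a minimal sketch in plain Python (not
Spark code) of a range-on-hash scheme of this kind. It is a variant of the
formula above, not the same expression: it multiplies by nb_partitions and
divides by 2**32, an assumption made so that ids fall in 0..nb_partitions-1
and the slices nest exactly when the partition count doubles. The names
partition_id and parent_partition are hypothetical.

```python
INT_MIN = -2**31      # smallest signed 32-bit hash value
HASH_RANGE = 2**32    # number of distinct signed 32-bit hash values


def partition_id(hash_value: int, nb_partitions: int) -> int:
    """Assign a signed 32-bit hash to one of nb_partitions contiguous,
    equally sized slices of the hash range."""
    offset = hash_value - INT_MIN          # shifted into 0 .. 2**32 - 1
    return nb_partitions * offset // HASH_RANGE


def parent_partition(child_id: int) -> int:
    """Partition that held this data before the partition count was doubled."""
    return child_id // 2


# Doubling from 4 to 8 partitions: every hash lands in a child partition
# whose single parent is the partition it occupied before the split.
for h in (INT_MIN, -123456789, -1, 0, 1, 987654321, 2**31 - 1):
    assert parent_partition(partition_id(h, 8)) == partition_id(h, 4)
```

With this variant, child partitions 2p and 2p+1 are carved out of parent
partition p only, which is the one-to-one dependency a narrow
transformation needs.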

So, here are my questions:

1/ Is it possible to use a custom partitioner when saving a DataFrame with
bucketing?
2/ Still with the DataFrame API, is it possible to apply a custom
partitioner to a DataFrame?
    Is it possible to repartition the DataFrame with a narrow
transformation, as could be done with an RDD?
    Is there some sort of DataFrame developer API? Do you have any
pointers on this?

Thanks!
David
