Hello, I have a few questions about bucketing and custom partitioning in the DataFrame API.
I am considering bucketing to get a one-side shuffle-free join in incremental jobs, but one thing bothers me. The data is likely to grow or become skewed over time, so at some point I would need to change the number of buckets, which would trigger a shuffle. Instead, I would like to use a custom partitioner that replaces the shuffle with a narrow transformation. This was feasible with the RDD developer API.

For example, I could use a partitioning scheme like this:

    partition_id = (nb_partitions - 1) * (hash(column) - Int.MinValue) / (Int.MaxValue - Int.MinValue)

When I multiply the number of partitions by 2, each new partition depends on only one parent partition (=> narrow transformation).

So, here are my questions:

1/ Is it possible to use a custom partitioner when saving a DataFrame with bucketing?

2/ Still with the DataFrame API, is it possible to apply a custom partitioner to a DataFrame? Is it possible to repartition the DataFrame with a narrow transformation, as could be done with an RDD? Is there some sort of DataFrame developer API? Do you have any pointers on this?

Thanks!
David
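
P.S. To make the partitioning scheme concrete, here is a minimal sketch in plain Scala (no Spark dependency). It is a slight variant of the formula in the post, not the exact expression: it scales by nb_partitions over 2^32 instead of (nb_partitions - 1) over (Int.MaxValue - Int.MinValue), an assumption made so that the partition boundaries nest exactly when the partition count doubles. The names NestedHashPartitioning and partitionId are only illustrative. With the RDD API, the same function could presumably back the getPartition method of an org.apache.spark.Partitioner subclass.

```scala
object NestedHashPartitioning {
  // Number of distinct 32-bit hash values: 2^32.
  private val HashSpace: Long = 1L << 32

  /** Maps a 32-bit hash to a partition in [0, nbPartitions) by cutting
    * the hash range into nbPartitions equal, contiguous slices.
    */
  def partitionId(hash: Int, nbPartitions: Int): Int = {
    require(nbPartitions > 0, "nbPartitions must be positive")
    // Shift the hash into [0, 2^32); Long arithmetic avoids Int overflow.
    val offset: Long = hash.toLong - Int.MinValue.toLong
    ((offset * nbPartitions) / HashSpace).toInt
  }

  def main(args: Array[String]): Unit = {
    val n = 8
    val sampleHashes = Seq(Int.MinValue, -123456789, -1, 0, 42, 987654321, Int.MaxValue)
    for (h <- sampleHashes) {
      val parent = partitionId(h, n)
      val child = partitionId(h, 2 * n)
      // After doubling, child partition c only receives rows from parent c / 2,
      // which is what makes the repartitioning a narrow dependency.
      assert(child / 2 == parent)
      println(s"hash=$h parent=$parent child=$child")
    }
  }
}
```

Because each slice is split into exactly two halves when the count doubles, new partition c reads only from old partition c / 2, so no data has to move between unrelated partitions.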