stuhood commented on code in PR #22207:
URL: https://github.com/apache/datafusion/pull/22207#discussion_r3250968025
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -600,6 +600,11 @@ impl BatchPartitioner {
num_input_partitions,
))
}
+ Partitioning::Expr(_) => {
+ not_impl_err!(
+ "Expression partitioning is not supported by
RepartitionExec"
+ )
+ }
Review Comment:
> My intent wasn't for `ExprPartitioning` to be efficient execution format
for physically repartitioning rows. I was thinking of this as partitioning for
sources/plans that already have known partitioning and declare it to preserve
in the plan to unlock optimizations.
That works for the first join, but not for followup joins. For example:
If you have a 3 table join, the first join will be able to use an equality
match on range partitioning to say: no re-partitioning needed at all because
the two tables are partitioned the same way! Great.
But its very likely that the second join _does_ need to re-partition _one_
of its inputs (assuming different join keys between the two joins): the output
of join one needs to be re-partitioned to match the third table. Now,
technically you can just repartition both sides (i.e. switch to hash or
something). But if you instead re-partition to _match_ the third table, then
you might be able to significantly cut down on data movement.
----
So, yes: I think that it is important to be able to efficiently re-partition
by this strategy. If we don't have concrete use-cases for generic expression
partitioning, then it would not be my first choice here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]