stuhood commented on issue #21207:
URL: https://github.com/apache/datafusion/issues/21207#issuecomment-4163924852
## Notes:
```
Gene:
data is partitioned on key ranges at the data level BUT, then Hash is
declared
DataSourceExec output partitions = Hash(join_key)
"no sense of Range partitioning in DataFusion: part of the problem"
the DataSourceExec already claims Hash, so doesn't get a Hash Repartition
but the dynamic filter creation in HashJoin is assuming
hash-repartitioning
essentially: bakes in logic about the Hash type by pushing down a
case expression
Gene:
additionally, physical expression is maybe the wrong place to be
optimizing this
Trino treats dynamic filters as their own structure, rather than being
derived from a PhysicalExpr
Lía: consider making partitioning a trait
Andrew: Are both sides partitioned?
Gene:
would like to be able to compute the filter in a way that aligns the
source and destination partitions
Hash partitioning is pretty overloaded
currently forced to claim Hash atop Range data
Andrew: Add range partitioning to the enum?
Gene: had attempted it in the past, but there was concern with adding range
partitioning to the enum
Jayant also needs to add more logic to Dynamic Filters
Jay
working on serializing dynamic filters
downcasting for serialize
Andrew:
extend the idea of partitioning to support more partitioning?
Gene:
Make dynamic filters its own separate type
note:
partitioning is completely equivalent on both sides of the join, but
that information is being lost
Jay:
filter needs to be pushed down to the datasource
it's been assuming that there is a repartition
Gene:
the logic inside the HashJoin _assumes_ that there is a particular type
of partitioning
matches on it with a case Hash
Nga:
currently has to identify which partition, but it's a bit messy
Andrew:
is there a missing case in the dynamic filter for an _exact match_?
Gene:
tried to implement this: adding a generic way to bind a runtime
expression to a PhysicalExpr
Alessandro:
abusing Hash even though actually Range
have to compensate for this gap with adhoc handling
should we introduce Range
Geoffrey
if we were using regular/official Hash partitioning (leaving in
re-partitioning), would the right thing happen?
equality
Alessandro
not the _default_ Hash
Gene:
it's on the user to assure that the Hash enum is equal
if both sides are lying, then actually ok
Lía:
we use a hash, because usually you have a repartition
it's true, that maybe we should use
Andrew:
sounds like we should add a specific equality match for "equal
partitioning"
Nga:
if equal partitions
Geoffrey:
other cases which can be treated as equal as well
Stu:
can the `enum` be enriched to ensure that `Eq` is meaningful? all Hash
parameters accounted for, etc.
```
## Conclusions
Conclusions from my perspective:
* Seems like it should be possible to add explicit handling for "both sides
are claiming equal partitioning" case in HashJoin.
* Adding Range partitioning to the enum would be good, but maybe not
necessary to solve this partitioning problem.
* Should have another meeting about the dynamic filter serialization problem:
* @jayshrivastava will bring a proposal.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]