stuhood commented on issue #22395:
URL: https://github.com/apache/datafusion/issues/22395#issuecomment-4618822694
@gene-bordegaray :
1. today's discussion (thanks!)
2. the design doc that I've been working on
3. reviewing #22657
...have highlighted one of the tradeoffs of the current definition of Range
partitioning that I wanted to call out:
----
As defined, `Partitioning::Range` contains no overlapping partitions. This is
aligned with `Partitioning::Hash` (I can't really imagine what overlap would
mean there). But allowing overlapping _output_ partitioning from a
`TableProvider`, while continuing to require that _input_ partitioning be total
and non-overlapping (is this an `enum Distribution`...?), would have some really
interesting properties.
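To pin down what "total and non-overlapping" means here, a minimal sketch in plain Rust (std only). `KeyRange` and `is_total_and_disjoint` are hypothetical names for illustration, not DataFusion types, and the sketch assumes a single `i64` key with half-open ranges:

```rust
/// Hypothetical half-open key range `[lo, hi)`; `None` means unbounded on that side.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct KeyRange {
    pub lo: Option<i64>,
    pub hi: Option<i64>,
}

/// Returns true when `ranges` (given in ascending order) cover the whole key
/// space with no gaps and no overlaps.
pub fn is_total_and_disjoint(ranges: &[KeyRange]) -> bool {
    let (Some(first), Some(last)) = (ranges.first(), ranges.last()) else {
        return false; // no partitions at all cannot be total
    };
    // Total: the first range is unbounded below, the last unbounded above.
    if first.lo.is_some() || last.hi.is_some() {
        return false;
    }
    // Every bounded range must be non-empty.
    let well_formed = ranges.iter().all(|r| match (r.lo, r.hi) {
        (Some(lo), Some(hi)) => lo < hi,
        _ => true,
    });
    // Disjoint and gap-free: each range starts exactly where the previous one ends.
    let contiguous = ranges.windows(2).all(|w| match (w[0].hi, w[1].lo) {
        (Some(hi), Some(lo)) => hi == lo,
        _ => false,
    });
    well_formed && contiguous
}
```

Under this reading, an overlapping _output_ partitioning is simply one for which this check fails because some range starts before the previous one ends.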
In order to produce a reasonable number of Range partitions _which match the
other side of the join_, the underlying data will fall into one of two cases:
1. `DISJOINT` - the data must already be over-partitioned into a series of
   disjoint ranges, in which case the creator of the `TableProvider` would
   merge/aggregate a few of the underlying files/partitions into the declared
   Ranges (e.g. go from 1000 underlying partitions to 16 declared
   `Partitioning::Range` partitions)
2. `NOT-DISJOINT` - the underlying files/partitions overlap, in which case
   the `TableProvider` will need to:
   1. dynamically choose partition boundaries (using statistics about the
      amount of overlap, size, etc)
   2. bucket the files into those partitions
   3. push down or execute range filters for cases where a file spans a
      partition boundary (to guarantee that data is emitted in exactly one
      partition)
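
The three `NOT-DISJOINT` steps can be sketched roughly as follows, in plain Rust with std only. Everything here is an assumption for illustration: `FileStats`, `KeyRange`, `Scan`, `choose_split_points` and `plan_partitions` are hypothetical names, the key is a single `i64`, and the boundary heuristic (split at a file's minimum once enough rows have been seen) is just one simple choice, not what any real implementation does:

```rust
/// Hypothetical per-file statistics: inclusive min/max of the key, plus a row count.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub name: String,
    pub min: i64,
    pub max: i64,
    pub rows: u64,
}

/// Hypothetical half-open key range `[lo, hi)`; `None` means unbounded.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct KeyRange {
    pub lo: Option<i64>,
    pub hi: Option<i64>,
}

/// One file read within one partition; `filter` is set only when the file
/// spans a partition boundary and must be clipped to the partition's range.
#[derive(Debug, Clone, PartialEq)]
pub struct Scan {
    pub file: String,
    pub filter: Option<KeyRange>,
}

/// Step 1: choose up to `n - 1` split points from the statistics, aiming for
/// roughly equal row counts per partition.
pub fn choose_split_points(files: &[FileStats], n: usize) -> Vec<i64> {
    let total: u64 = files.iter().map(|f| f.rows).sum();
    if n < 2 || total == 0 {
        return Vec::new();
    }
    let mut sorted: Vec<&FileStats> = files.iter().collect();
    sorted.sort_by_key(|f| f.min);
    let mut splits: Vec<i64> = Vec::new();
    let mut seen = 0u64;
    for f in sorted {
        // Rows we want to have passed before placing the next split.
        let target = total * (splits.len() as u64 + 1) / n as u64;
        if splits.len() + 1 < n && seen >= target && splits.last() != Some(&f.min) {
            splits.push(f.min);
        }
        seen += f.rows;
    }
    splits
}

/// Steps 2 and 3: bucket files into the partitions they overlap, attaching a
/// range filter wherever a file is not fully contained in the partition.
pub fn plan_partitions(files: &[FileStats], splits: &[i64]) -> Vec<Vec<Scan>> {
    (0..=splits.len())
        .map(|i| {
            let range = KeyRange {
                lo: if i == 0 { None } else { Some(splits[i - 1]) },
                hi: splits.get(i).copied(),
            };
            files
                .iter()
                .filter(|f| {
                    range.lo.map_or(true, |lo| f.max >= lo)
                        && range.hi.map_or(true, |hi| f.min < hi)
                })
                .map(|f| {
                    let contained = range.lo.map_or(true, |lo| f.min >= lo)
                        && range.hi.map_or(true, |hi| f.max < hi);
                    Scan {
                        file: f.name.clone(),
                        filter: if contained { None } else { Some(range) },
                    }
                })
                .collect()
        })
        .collect()
}
```

With four equally sized files covering `[0, 40]`, `[30, 70]`, `[60, 100]` and `[90, 120]` and `n = 2`, this heuristic places a single split at 60, so only the `[30, 70]` file is read in both partitions, each time with a range filter.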
----
All of this is fine, and we're prepared to do it in our implementation. But
if `Partitioning::Range` supported overlapping output partitions, things would
look a little bit different:
In that case, a wrapper (?) around the `TableProvider` might instead
implement the steps from the `NOT-DISJOINT` case above: it would take the
underlying overlapping partitions, consume their statistics (generically),
push down static per-file filters based on the chosen boundaries, and then
coalesce their outputs.
The benefit of this would be one upstream implementation of this range
filtering... and possibly also some amount of sharing of the coordination
required _between_ the two sides of a join, which we have been planning to
implement with an optimizer rule inserted at the appropriate spot to inspect
both sides of the join and align their `Partitioning::Range` declarations
based on their shared statistics.
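
As a very rough sketch of what "aligning" the two sides could mean, assuming each side can report the sorted split points it is able to honor without extra filtering (as in the `DISJOINT` case): keep only the split points both sides share, then thin them down to the desired partition count. `align_split_points` is a hypothetical helper in plain Rust, not an existing optimizer rule or DataFusion API:

```rust
use std::cmp::Ordering;

/// Returns split points that appear on both sides, thinned to at most
/// `max_partitions - 1` entries. Both inputs are assumed to be sorted
/// ascending and deduplicated.
pub fn align_split_points(left: &[i64], right: &[i64], max_partitions: usize) -> Vec<i64> {
    // Two-pointer intersection of the sorted inputs.
    let (mut i, mut j) = (0, 0);
    let mut common = Vec::new();
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                common.push(left[i]);
                i += 1;
                j += 1;
            }
        }
    }
    // `n` partitions need `n - 1` split points.
    let want = max_partitions.saturating_sub(1);
    if common.len() <= want {
        return common;
    }
    // Too many shared boundaries: keep `want` of them, evenly spaced by index.
    (1..=want)
        .map(|k| common[k * common.len() / (want + 1)])
        .collect()
}
```

Spacing by index ignores row counts; a real rule would presumably weight the choice by the shared statistics, and in the `NOT-DISJOINT` case it would not be limited to boundaries both sides already have, since range filters can enforce any boundary.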