gene-bordegaray commented on PR #18919: URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3589996704
> * Anything `KeyPartitioned` is `Hash` partitioned (but the opposite is not true) ==> is this correct?

Yes, key partitioning guarantees that each distinct value of the key is fully contained within a single partition, which is essentially a stronger form of hash partitioning. Another thing to note is that key partitioning can currently only originate at the file scan level, whereas hash partitioning of course has a repartitioning operator.

> * `KeyPartitioned` means each key is in a distinct partition ==> is this correct?

Yes, I believe you have the right idea, but to be precise, `KeyPartitioned` in theory allows multiple keys in a single partition, as long as each key is fully contained in that partition.

> * If the above is correct (if it's not, my reasoning does not hold and you can ignore the rest of this comment), I am not sure how this applies to high cardinality keys, for example `date_bin(timestamp, 15m)` or id hash ranges (say you have a million files, each one having a distinct range). I imagine we'd want to be able to group multiple "keys" into the same processing partition, to avoid having thousands of partitions. My understanding is that DataFusion partitions will add overhead if there are too many (subsequent repartitions, coalesce, merge sort), but I may be mistaken.

Yes, this is a noted limitation of the original design; I added the comment "best with moderate partition counts (10-100 partitions)." in the config. It stems from splitting distinct keys into their own partitions for now, which I did to keep the first iteration relatively simple since the PR is already large. A great follow-up would be to merge groups down to `target_partitions` when size allows. That would still keep every key fully contained within a single partition while allowing higher cardinality.

> * Once we group `KeyPartitioned` partitions together, they become `Hash` partitions. ==> is this correct?

It depends on what "group" means. If we simply merge key-partitioned data into fewer partitions, no: it is still key partitioned, since each key is still fully in one place. If we repartition or shuffle the data, we lose key partitioning and fall back to hash.

> * So in practice, it appears to me that we'll almost always need to resort to `Hash` partitions.

For this first PR, yes, but I don't think this is a one-PR-fixes-everything scenario. I think this comes down to how intentional the user is. Key-partitioned data is rarer than, say, hash-partitioned data, but it is powerful enough for people to consider it, and the use cases will increase as follow-up issues are resolved: higher cardinality, propagation through joins, etc.

> * What we'd loose compared to `KeyPartition` is the `SortExec` elision when aggregating then sorting by the partition key, but I'd argue that if you had one group per partition, then probably the sorting is cheap enough. ==> Do we lose something else?
> (This point is not challenging the PR as a whole but just an implementation choice.)

We would also lose the elision of the repartition between the partial and final aggregations, which is the main overhead we are trying to avoid:

**BEFORE:** DataSourceExec -> Aggregate Partial (gby: a) -> Repartition Hash(a) -> Aggregate Final (gby: a)
**AFTER:** DataSourceExec -> Aggregate FinalPartitioned (gby: a)

In some cases we also eliminate single-threaded bottlenecks caused by SPMs between the aggregations:

**BEFORE:** DataSourceExec -> Aggregate Partial (gby: a) -> SPM -> Aggregate Final (gby: a) - single-threaded!
**AFTER:** DataSourceExec -> Aggregate FinalPartitioned (gby: a) -> SPM
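For anyone following along, running `EXPLAIN` on a grouped query is an easy way to see whether a `RepartitionExec` (or SPM) sits between the partial and final aggregation. A minimal sketch below; the table, path, and column names are made up, and it exercises the default `Hash` path rather than the new `KeyPartitioned` one:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical partitioned parquet data; any table with a group-by column works.
    ctx.register_parquet("t", "data/t/", ParquetReadOptions::default())
        .await?;

    // The physical plan shows whether a Repartition / SPM is inserted
    // between the partial and final aggregation.
    ctx.sql("EXPLAIN SELECT a, count(*) FROM t GROUP BY a")
        .await?
        .show()
        .await?;

    Ok(())
}
```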
> So my current understanding is: `KeyPartitioned` is indeed different from `Hash` (a specific case carrying more information) but the ratio complexity / added value is not obvious. The reason we'd not take full advantage of `KeyPartitioned` may be that DF partitions are actually bound to processing units (~threads), and maybe there would be value in separating the notion of processing thread and the notion of data partition, where you could have N processing unit per partitions (with partial repartitions), or N partitions per thread. But this sounds like a completely different topic and I don't know how much it makes sense.

I am in favor of keeping `Hash` and `KeyPartitioned` separate, as I see them as two distinct methods of partitioning. I am also not sure that folding more information into `Hash` partitioning would reduce complexity rather than just add more indirection. I do like the idea of merging file groups to support higher cardinality; that was my main concern with this v1 (as noted in the comments), but I chose to refrain due to complexity.

> Sorry for the wall of text, I am mostly trying to wrap my head around this, please correct anything I missed in here.

Do not apologize, these are a lot of the internal debates I was / am having, and I am glad to talk through the trade-offs. Let me know what you think 😄
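P.S. To make the "merge groups down to `target_partitions`" follow-up a bit more concrete, here is a rough sketch of the kind of grouping I have in mind. The types are purely hypothetical (not the structures in this PR); the point is only that whole key groups are moved, never split, so key containment is preserved:

```rust
/// A group of files whose rows all share the same partition-key value(s).
/// Purely illustrative; not the file-group structures used in this PR.
struct KeyGroup {
    files: Vec<String>,
    total_bytes: u64,
}

/// Merge key groups into at most `target_partitions` output partitions.
/// Whole groups are always kept together, so every key value still lands
/// in exactly one output partition (key partitioning is preserved).
fn merge_key_groups(mut groups: Vec<KeyGroup>, target_partitions: usize) -> Vec<Vec<KeyGroup>> {
    // Largest groups first, then greedily assign each group to the
    // currently smallest partition to keep sizes roughly balanced.
    groups.sort_by_key(|g| std::cmp::Reverse(g.total_bytes));
    let n = target_partitions.max(1).min(groups.len().max(1));
    let mut partitions: Vec<Vec<KeyGroup>> = (0..n).map(|_| Vec::new()).collect();
    let mut sizes = vec![0u64; n];
    for g in groups {
        let idx = sizes
            .iter()
            .enumerate()
            .min_by_key(|(_, s)| **s)
            .map(|(i, _)| i)
            .expect("n >= 1");
        sizes[idx] += g.total_bytes;
        partitions[idx].push(g);
    }
    partitions
}
```

A greedy largest-first assignment like this keeps partition sizes roughly balanced without ever splitting a key across partitions, which is all the optimizer needs in order to keep treating the output as key partitioned.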
