gene-bordegaray commented on PR #18919: URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3598206877
hey @fmonjalet thank you for the thoughtful response. Here are some of my thoughts following this. Let me know what you think.

> When the partitioning is by one field, `KeyPartitioned` and `Hash` are the same (correct me if wrong)

This is correct, but with a caveat. Say data is partitioned on column "a"; this is not necessarily the same for hash vs key partitioning. Imagine we hash partition with the hash function `Hash(a) = a % 3`: this puts the values 1, 4, and 7 in the same partition. Now say we partition the same data but use key partitioning. The values 1, 4, and 7 will initially be put into separate groups and thus different partitions; then, with the follow-up work that merges groups to handle high cardinality (as described in the comments above), they could end up in the same partition or in different ones.

With the key partitioned approach it is important to note that we can initially separate by the key values and then merge whole groups based on size to handle high cardinality without breaking the key partitioning guarantee (a small sketch of this follows below). If we tried to merge hash partitions or reshuffle based on size, we would no longer truly be hash partitioned by our hash function. I do not know what the implications of this are throughout DataFusion, but nevertheless it would not be true hash partitioning.

Another important difference between hash and key partitioning is their place of origin. Key partitioning can only originate from the data scan itself, making it an explicit guarantee. Hash partitioning has an associated repartitioning operator which can be introduced anywhere in the plan, which makes hash an implicit guarantee. This is just another difference I wanted to point out about the two partitioning styles.
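To make the merge point concrete, here is a minimal standalone sketch (hypothetical code, not this PR's implementation) of why whole key groups can be merged by size without breaking the co-location guarantee, while hash partition assignments are fixed by the hash function:

```rust
use std::collections::HashMap;

/// Hash partitioning: a row's partition is fixed by the hash function, so
/// partitions cannot be merged by size without the result no longer being
/// "hash partitioned". With Hash(a) = a % 3, the values 1, 4, 7 are all
/// forced into partition 1.
fn hash_partition(a: i64, num_partitions: i64) -> i64 {
    a.rem_euclid(num_partitions)
}

/// Key partitioning: one group per distinct key value, then whole groups are
/// merged by size. The guarantee "all rows with the same key share a
/// partition" survives any merge of whole groups.
fn key_partition(rows: &[i64], target_partitions: usize) -> Vec<Vec<i64>> {
    // 1. One group per distinct key value (1, 4, 7 start out separate here).
    let mut by_key: HashMap<i64, Vec<i64>> = HashMap::new();
    for &a in rows {
        by_key.entry(a).or_default().push(a);
    }
    // 2. Greedily place each whole group into the currently smallest
    //    partition; group boundaries are respected, never split.
    let mut groups: Vec<Vec<i64>> = by_key.into_values().collect();
    groups.sort_by_key(|g| std::cmp::Reverse(g.len()));
    let mut partitions: Vec<Vec<i64>> = vec![Vec::new(); target_partitions];
    for group in groups {
        let smallest = partitions
            .iter_mut()
            .min_by_key(|p| p.len())
            .expect("target_partitions must be > 0");
        smallest.extend(group);
    }
    partitions
}

fn main() {
    let rows = [1, 4, 7, 2, 2, 2, 5];
    for a in rows {
        // 1, 4, 7 all print partition 1: the hash function decides, not size.
        println!("value {a} -> hash partition {}", hash_partition(a, 3));
    }
    // 1, 4, 7 may or may not be merged together, purely based on group size.
    println!("key partitions: {:?}", key_partition(&rows, 3));
}
```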
> But what I understand is that the difference starts when you have two fields: `KeyPartitioned` is hierarchical. Taking a "practical" distribution: say you have an "order" data set that tracks orders of customers to providers. Files are partitioned by `(customer_id, provider_id)`, conceptually organized as follows:
>
> ```
> customer_id_hash_0/
>     provider_id_hash_1.parquet
>     provider_id_hash_2.parquet
>     provider_id_hash_3.parquet
>     provider_id_hash_4.parquet
> customer_id_hash_1/
>     provider_id_hash_1.parquet
>     provider_id_hash_2.parquet
>     provider_id_hash_3.parquet
>     provider_id_hash_4.parquet
> [... etc ...]
> ```
>
> Technically this layout satisfies all the following partitionings:
>
> * `KeyPartitioned(hash(customer_id), hash(provider_id))`
> * `KeyPartitioned(customer_id, provider_id)` (simpler expression of the above)
> * `Hash(customer_id, provider_id)`
> * `Hash(customer_id)`

Just a comment to ensure we are on the same page: this layout has the capability to satisfy all of these partitioning types. If you key or hash partition on `customer_id`, then you are guaranteeing properties about the location of your data: a single partition will contain all rows for a distinct `customer_id`. The difference is in how this is actually achieved and what it allows us to do. Key partitioning explicitly states how our data is partitioned and does not use a hash function, allowing us to do things like merge partitions without breaking guarantees. Hash is of course different, as described above.

> Now you want to compute:
>
> ```sql
> SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
> ```
>
> * If the data partitioning is declared as `Hash(customer_id, provider_id)`, then you'd have to insert a repartition, because according to the partitioning, `(customer0, provider0)` and `(customer0, provider1)` may be in different partitions.

Yes, spot on.

> * If the data partitioning is `KeyPartitioned(customer_id, provider_id)`, you can reuse the existing partitioning: `(customer0, provider0)` and `(customer0, provider1)` are in the same partition, this is what `KeyPartitioned` guarantees.

This is not true for the current implementation, but it could become true as an option in follow-up work. I noted that yes, if our data is set up in a hierarchical format we could implicitly key partition by a superset of the data when it benefits parallelization, but I decided not to implement that in this first PR, since it would require some heuristic or an additional user option (I don't know if that is too many knobs for the user to be turning). Due to the complexity and ambiguity of the implementation I decided against it. It is a good thing to point out though: yes, this would be another differentiating factor between hash and key partitioning.

> * `Hash(customer_id)` also works, but we may lack the mechanism to ask the data source to say it satisfies this partitioning, and we lose information that can be useful.

This goes hand-in-hand with my last statement. Say your data is actually organized hierarchically by `customer_id, provider_id`, and we run the same query as your example. Using hash partitioning, you declare the data partitioned by `Hash(customer_id, provider_id)`. When the optimizer needs to do an aggregation on the group by clause `customer_id`, it only knows that we are hash partitioned on `customer_id, provider_id`, with no knowledge that the underlying data is actually also partitioned by just `customer_id`, so it inserts a repartition. Now, using key partitioning on the same columns, `KeyPartitioned(customer_id, provider_id)`, plus some heuristic or option that determines whether it is worth satisfying a requirement via a superset of the declared partitioning, the optimizer can recognize that with the hierarchical organization, all rows with the same `customer_id` must also be in the same partition. Thus a repartition is avoided.
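To illustrate the kind of check I have in mind, here is a simplified sketch with hypothetical types (this is not DataFusion's actual `Partitioning` enum, nor this PR's API). The point is that a hierarchical declaration keeps the key-prefix information that a flat `Hash(customer_id, provider_id)` throws away:

```rust
/// Hypothetical, simplified partitioning declarations for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DeclaredPartitioning {
    /// Files are organized hierarchically by these keys, outermost first, so
    /// any non-empty prefix of the key list is also a valid key partitioning.
    KeyPartitioned(Vec<String>),
    /// Rows were routed by hash(keys) % n; only the exact key set is usable.
    Hash(Vec<String>),
}

/// Can the scan provide a partitioning that co-locates all rows sharing
/// `group_keys`, without inserting a shuffle?
fn can_satisfy_grouping(decl: &DeclaredPartitioning, group_keys: &[String]) -> bool {
    match decl {
        // Hierarchical: if some prefix of the declared keys is covered by the
        // group keys, the scan can "coarsen" itself to that prefix, and every
        // group lands fully inside one partition.
        DeclaredPartitioning::KeyPartitioned(keys) => (1..=keys.len())
            .any(|n| keys[..n].iter().all(|k| group_keys.contains(k))),
        // Hash: co-location is only known when the grouping covers the exact
        // hashed key set.
        DeclaredPartitioning::Hash(keys) => {
            keys.iter().all(|k| group_keys.contains(k))
        }
    }
}

fn main() {
    let cols = vec!["customer_id".to_string(), "provider_id".to_string()];
    let key = DeclaredPartitioning::KeyPartitioned(cols.clone());
    let hash = DeclaredPartitioning::Hash(cols);
    let group = vec!["customer_id".to_string()];
    assert!(can_satisfy_grouping(&key, &group));   // prefix match: no shuffle
    assert!(!can_satisfy_grouping(&hash, &group)); // must repartition
    println!("key partitioning avoids the shuffle; hash does not");
}
```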
> The following query (a bit artificial, sorry):
>
> ```sql
> WITH max_order_per_provider AS (
>     SELECT customer_id, provider_id, MAX(amount) AS max_amount FROM orders GROUP BY customer_id, provider_id
> )
> SELECT customer_id, MIN(max_amount) as min_max FROM max_order_per_provider GROUP BY customer_id
> ```
>
> * Can work with 0 repartitions with `KeyPartitioned(customer_id, provider_id)` (if the partitioning is propagated properly through the plan)
> * `Hash(customer_id)` could also work. Knowing the data source is `KeyPartitioned` mostly gives us the information that allows us to say it satisfies `Hash(customer_id)`.

Yes, this is correct, taking into account my earlier statements about implementing the heuristic or option to partition by a key-partition superset when beneficial.

> From there, I see `KeyPartitioned` as a device to avoid propagating partitionings from the top. In the latest example, partitioning by `customer_id` along the entire subplan would be ideal, but I don't think we have a mechanism to propagate this information in the plan. We currently cannot ask the leaf node "could you provide `Hash(customer_id)` instead of `Hash(customer_id, provider_id)`".

Yes, this is the main idea; I am just seeing `KeyPartitioned` as the mechanism for this behavior. The reason is that although it holds similar properties to hash partitioning, the two are based on fundamentally different concepts. My two biggest concerns here are:

1. `KeyPartitioned` originates solely from the file scan, while hash partitioning can be introduced anywhere in the plan. It carries a deeper meaning in the plan, namely that the data was explicitly partitioned in some hierarchical fashion, and I could see this coming into use for future optimizations.
2. Although key and hash partitioning guarantee similar things with regard to data location, they achieve this in different ways: a hash function vs file-level partitions.

> I am now wondering about whether we should have `KeyPartitioned`, or a mechanism to propagate ideal partitioning down the stack (e.g. `Hash(a, b)` can be changed to `Hash(a)` to avoid a repartition later on). Gene, does this capture your thoughts? Do you see cases where `KeyPartitioned` adds value in between execution nodes that is not captured here?

This approach could work, but it seems like we may be trying to stretch the functionality of hash partitioning too far, turning it into something it was not designed to do. I do think an implementation that checks whether a plan would benefit from hash repartitioning on a superset could achieve similar results, but having a distinct type of partitioning would be a clearer implementation. A rule like this would get pretty tricky, as you would have to take into account all partitioning requirements throughout the plan when determining whether you would rather partition by a superset hash. With the key partitioning approach, the fact that your data is partitioned this way is apparent from the data source, and operators can more naturally decide what to do with that property.

With that said, I can see an optimization rule like this being beneficial regardless of whether we move forward with the key partitioned or hash partitioned approach. In another scenario, say that earlier in the plan we are forced to repartition our data by `(customer_id, provider_id)`, and further down the line we have an operator that requires us to be partitioned by just `customer_id`. Rather than inserting a repartition there, the rule you suggest could change the original repartition to `customer_id`, which implicitly satisfies the `(customer_id, provider_id)` requirement and is then maintained through the operators to satisfy the `customer_id` requirement without an additional shuffle.
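A tiny sketch of that rewrite idea (a hypothetical helper, not an actual DataFusion rule): if an earlier repartition hashes on key set `R1` and a later operator requires hashing on `R2 ⊆ R1`, the earlier repartition can be coarsened to `Hash(R2)`, since rows sharing all `R1` values also share all `R2` values and therefore stay co-located for both requirements:

```rust
/// If the later requirement's keys are a subset of the earlier repartition's
/// keys, return the coarsened key set to use for the earlier repartition;
/// otherwise keep both repartitions as-is.
fn coarsen_repartition(
    earlier_keys: &[&str],
    later_keys: &[&str],
) -> Option<Vec<String>> {
    let is_subset = later_keys.iter().all(|k| earlier_keys.contains(k));
    if is_subset && !later_keys.is_empty() {
        Some(later_keys.iter().map(|k| k.to_string()).collect())
    } else {
        None // no common coarsening exists
    }
}

fn main() {
    // Earlier: RepartitionExec on Hash(customer_id, provider_id)
    // Later:   a requirement for Hash(customer_id)
    let rewritten = coarsen_repartition(
        &["customer_id", "provider_id"],
        &["customer_id"],
    );
    // One shuffle on customer_id now serves both requirements.
    assert_eq!(rewritten, Some(vec!["customer_id".to_string()]));
}
```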
