gene-bordegaray commented on PR #18919: URL: https://github.com/apache/datafusion/pull/18919#issuecomment-3598206877
hey @fmonjalet thank you for the thoughtful response. Here are some of my thoughts following this. Let me know what you think.

> When the partitioning is by one field, `KeyPartitioned` and `Hash` are the same (correct me if wrong)

This is correct, but with a caveat. Say data is partitioned on column "a"; this is not necessarily the same for hash vs key partitioning. Imagine we hash partition with the hash function `Hash(a) = a % 3`: this puts the values 1, 4, and 7 in the same partition. Now say we partition the same data but use key partitioning. The values 1, 4, and 7 will initially be put into separate groups and thus different partitions; then, with the follow-up work that merges groups to handle high cardinality (as described in the comments above), they could end up in the same partition or in different ones.

With the key partitioned approach it is important to note that we can initially separate by the key values and then merge whole groups based on size to handle high cardinality without breaking the key partitioning guarantee (a small sketch of this follows below). If we tried to merge hash partitions or reshuffle based on size, we would no longer truly be hash partitioned by our hash function. I do not know what the implications of this are throughout DataFusion, but nevertheless it would not be true hash partitioning.

Another important difference between hash and key partitioning is their place of origin. Key partitioning can only originate from the data scan itself, making it an explicit guarantee. Hash partitioning has an associated repartitioning operator which can be introduced anywhere in the plan, which makes hash an implicit guarantee. This is just another difference I wanted to point out about the two partitioning styles.
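To make the merge point concrete, here is a minimal standalone sketch (hypothetical code, not this PR's implementation) of why whole key groups can be merged by size without breaking the co-location guarantee, while hash partition assignments are fixed by the hash function:

```rust
use std::collections::HashMap;

/// Hash partitioning: a row's partition is fixed by the hash function, so
/// partitions cannot be merged by size without the result no longer being
/// "hash partitioned". With Hash(a) = a % 3, the values 1, 4, 7 are all
/// forced into partition 1.
fn hash_partition(a: i64, num_partitions: i64) -> i64 {
    a.rem_euclid(num_partitions)
}

/// Key partitioning: one group per distinct key value, then whole groups are
/// merged by size. The guarantee "all rows with the same key share a
/// partition" survives any merge of whole groups.
fn key_partition(rows: &[i64], target_partitions: usize) -> Vec<Vec<i64>> {
    // 1. One group per distinct key value (1, 4, 7 start out separate here).
    let mut by_key: HashMap<i64, Vec<i64>> = HashMap::new();
    for &a in rows {
        by_key.entry(a).or_default().push(a);
    }
    // 2. Greedily place each whole group into the currently smallest
    //    partition; group boundaries are respected, never split.
    let mut groups: Vec<Vec<i64>> = by_key.into_values().collect();
    groups.sort_by_key(|g| std::cmp::Reverse(g.len()));
    let mut partitions: Vec<Vec<i64>> = vec![Vec::new(); target_partitions];
    for group in groups {
        let smallest = partitions
            .iter_mut()
            .min_by_key(|p| p.len())
            .expect("target_partitions must be > 0");
        smallest.extend(group);
    }
    partitions
}

fn main() {
    let rows = [1, 4, 7, 2, 2, 2, 5];
    for a in rows {
        // 1, 4, 7 all print partition 1: the hash function decides, not size.
        println!("value {a} -> hash partition {}", hash_partition(a, 3));
    }
    // 1, 4, 7 may or may not be merged together, purely based on group size.
    println!("key partitions: {:?}", key_partition(&rows, 3));
}
```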
> But what I understand is that the difference starts when you have two fields: `KeyPartitioned` is hierarchical. Taking a "practical" distribution: say you have an "order" data set that tracks orders of customers to providers. Files are partitioned by `(customer_id, provider_id)`, conceptually organized as follows:
>
> ```
> customer_id_hash_0/
>     provider_id_hash_1.parquet
>     provider_id_hash_2.parquet
>     provider_id_hash_3.parquet
>     provider_id_hash_4.parquet
> customer_id_hash_1/
>     provider_id_hash_1.parquet
>     provider_id_hash_2.parquet
>     provider_id_hash_3.parquet
>     provider_id_hash_4.parquet
> [... etc ...]
> ```
>
> Technically this layout satisfies all the following partitionings:
>
> * `KeyPartitioned(hash(customer_id), hash(provider_id))`
> * `KeyPartitioned(customer_id, provider_id)` (simpler expression of the above)
> * `Hash(customer_id, provider_id)`
> * `Hash(customer_id)`

Just a comment to ensure we are on the same page: this layout has the capability to satisfy all of these partitioning types. If you key or hash partition on `customer_id`, then you are guaranteeing properties about the location of your data: a single partition will contain all rows for a distinct `customer_id`. The difference is in how this is actually achieved and what it allows us to do. Key partitioning explicitly states how our data is partitioned and does not use a hash function, allowing us to do things like merge partitions without breaking guarantees. Hash is of course different, as described above.

> Now you want to compute:
>
> ```sql
> SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
> ```
>
> * If the data partitioning is declared as `Hash(customer_id, provider_id)`, then you'd have to insert a repartition, because according to the partitioning, `(customer0, provider0)` and `(customer0, provider1)` may be in different partitions.

Yes, spot on.

> * If the data partitioning is `KeyPartitioned(customer_id, provider_id)`, you can reuse the existing partitioning: `(customer0, provider0)` and `(customer0, provider1)` are in the same partition, this is what `KeyPartitioned` guarantees.

This is not true for the current implementation, but it could become true as an option in follow-up work. I noted that yes, if our data is set up in a hierarchical format we could implicitly key partition by a superset of the data when it benefits parallelization, but I decided not to implement that in this first PR, since it would require some heuristic or an additional user option (I don't know if that is too many knobs for the user to be turning). Due to the complexity and ambiguity of the implementation I decided against it. It is a good thing to point out though: yes, this would be another differentiating factor between hash and key partitioning.

> * `Hash(customer_id)` also works, but we may lack the mechanism to ask the data source to say it satisfies this partitioning, and we lose information that can be useful.

This goes hand-in-hand with my last statement. Say your data is actually organized hierarchically by `customer_id, provider_id`, and we run the same query as your example. Using hash partitioning, you declare the data partitioned by `Hash(customer_id, provider_id)`. When the optimizer needs to do an aggregation on the group by clause `customer_id`, it only knows that we are hash partitioned on `customer_id, provider_id`, with no knowledge that the underlying data is actually also partitioned by just `customer_id`, so it inserts a repartition. Now, using key partitioning on the same columns, `KeyPartitioned(customer_id, provider_id)`, plus some heuristic or option that determines whether it is worth satisfying a requirement via a superset of the declared partitioning, the optimizer can recognize that with the hierarchical organization, all rows with the same `customer_id` must also be in the same partition. Thus a repartition is avoided.
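To illustrate the kind of check I have in mind, here is a simplified sketch with hypothetical types (this is not DataFusion's actual `Partitioning` enum, nor this PR's API). The point is that a hierarchical declaration keeps the key-prefix information that a flat `Hash(customer_id, provider_id)` throws away:

```rust
/// Hypothetical, simplified partitioning declarations for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DeclaredPartitioning {
    /// Files are organized hierarchically by these keys, outermost first, so
    /// any non-empty prefix of the key list is also a valid key partitioning.
    KeyPartitioned(Vec<String>),
    /// Rows were routed by hash(keys) % n; only the exact key set is usable.
    Hash(Vec<String>),
}

/// Can the scan provide a partitioning that co-locates all rows sharing
/// `group_keys`, without inserting a shuffle?
fn can_satisfy_grouping(decl: &DeclaredPartitioning, group_keys: &[String]) -> bool {
    match decl {
        // Hierarchical: if some prefix of the declared keys is covered by the
        // group keys, the scan can "coarsen" itself to that prefix, and every
        // group lands fully inside one partition.
        DeclaredPartitioning::KeyPartitioned(keys) => (1..=keys.len())
            .any(|n| keys[..n].iter().all(|k| group_keys.contains(k))),
        // Hash: co-location is only known when the grouping covers the exact
        // hashed key set.
        DeclaredPartitioning::Hash(keys) => {
            keys.iter().all(|k| group_keys.contains(k))
        }
    }
}

fn main() {
    let cols = vec!["customer_id".to_string(), "provider_id".to_string()];
    let key = DeclaredPartitioning::KeyPartitioned(cols.clone());
    let hash = DeclaredPartitioning::Hash(cols);
    let group = vec!["customer_id".to_string()];
    assert!(can_satisfy_grouping(&key, &group));   // prefix match: no shuffle
    assert!(!can_satisfy_grouping(&hash, &group)); // must repartition
    println!("key partitioning avoids the shuffle; hash does not");
}
```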
> The following query (a bit artificial, sorry):
>
> ```sql
> WITH max_order_per_provider AS (
>     SELECT customer_id, provider_id, MAX(amount) AS max_amount FROM orders GROUP BY customer_id, provider_id
> )
> SELECT customer_id, MIN(max_amount) as min_max FROM max_order_per_provider GROUP BY customer_id
> ```
>
> * Can work with 0 repartitions with `KeyPartitioned(customer_id, provider_id)` (if the partitioning is propagated properly through the plan)
> * `Hash(customer_id)` could also work. Knowing the data source is `KeyPartitioned` mostly gives us the information that allows us to say it satisfies `Hash(customer_id)`.

Yes, this is correct, taking into account my earlier statements about implementing the heuristic or option to partition by a key-partition superset when beneficial.

> From there, I see `KeyPartitioned` as a device to avoid propagating partitionings from the top. In the latest example, partitioning by `customer_id` along the entire subplan would be ideal, but I don't think we have a mechanism to propagate this information in the plan. We currently cannot ask the leaf node "could you provide `Hash(customer_id)` instead of `Hash(customer_id, provider_id)`".

Yes, this is the main idea; I am just seeing `KeyPartitioned` as the mechanism for this behavior. The reason is that although it holds similar properties to hash partitioning, the two are based on fundamentally different concepts. My two biggest concerns here are:

1. `KeyPartitioned` originates solely from the file scan, while hash partitioning can be introduced anywhere in the plan. It carries a deeper meaning in the plan, namely that the data was explicitly partitioned in some hierarchical fashion, and I could see this coming into use for future optimizations.
2. Although key and hash partitioning guarantee similar things with regard to data location, they achieve this in different ways: a hash function vs file-level partitions.

> I am now wondering about whether we should have `KeyPartitioned`, or a mechanism to propagate ideal partitioning down the stack (e.g. `Hash(a, b)` can be changed to `Hash(a)` to avoid a repartition later on). Gene, does this capture your thoughts? Do you see cases where `KeyPartitioned` adds value in between execution nodes that is not captured here?

This approach could work, but it seems like we may be trying to stretch the functionality of hash partitioning too far, turning it into something it was not designed to do. I do think an implementation that checks whether a plan would benefit from hash repartitioning on a superset could achieve similar results, but having a distinct type of partitioning would be a clearer implementation. A rule like this would get pretty tricky, as you would have to take into account all partitioning requirements throughout the plan when determining whether you would rather partition by a superset hash. With the key partitioning approach, the fact that your data is partitioned this way is apparent from the data source, and operators can more naturally decide what to do with that property.

With that said, I can see an optimization rule like this being beneficial regardless of whether we move forward with the key partitioned or hash partitioned approach. In another scenario, say that earlier in the plan we are forced to repartition our data by `(customer_id, provider_id)`, and further down the line we have an operator that requires us to be partitioned by just `customer_id`. Rather than inserting a repartition there, the rule you suggest could change the original repartition to `customer_id`, which implicitly satisfies the `(customer_id, provider_id)` requirement and is then maintained through the operators to satisfy the `customer_id` requirement without an additional shuffle.
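A tiny sketch of that rewrite idea (a hypothetical helper, not an actual DataFusion rule): if an earlier repartition hashes on key set `R1` and a later operator requires hashing on `R2 ⊆ R1`, the earlier repartition can be coarsened to `Hash(R2)`, since rows sharing all `R1` values also share all `R2` values and therefore stay co-located for both requirements:

```rust
/// If the later requirement's keys are a subset of the earlier repartition's
/// keys, return the coarsened key set to use for the earlier repartition;
/// otherwise keep both repartitions as-is.
fn coarsen_repartition(
    earlier_keys: &[&str],
    later_keys: &[&str],
) -> Option<Vec<String>> {
    let is_subset = later_keys.iter().all(|k| earlier_keys.contains(k));
    if is_subset && !later_keys.is_empty() {
        Some(later_keys.iter().map(|k| k.to_string()).collect())
    } else {
        None // no common coarsening exists
    }
}

fn main() {
    // Earlier: RepartitionExec on Hash(customer_id, provider_id)
    // Later:   a requirement for Hash(customer_id)
    let rewritten = coarsen_repartition(
        &["customer_id", "provider_id"],
        &["customer_id"],
    );
    // One shuffle on customer_id now serves both requirements.
    assert_eq!(rewritten, Some(vec!["customer_id".to_string()]));
}
```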
