alamb commented on issue #19269: URL: https://github.com/apache/datafusion/issues/19269#issuecomment-3646314096
I agree with the assessment that data that is partitioned on `Hash(a)` is also partitioned on `Hash(a,b)`, as well as with your assessment of the potential skew problem.

I think skew is likely only a problem in these scenarios:
1. Very low cardinality data (where there are fewer distinct values than partitions)
2. Single value skew (where one of the values occurs far more often than the others)

> An option that can be a boolean that turns this behavior on or off. This is simple to implement and low risk.

I think starting with a flag to force repartitioning would be a good start.

> We can read the amount of rows in parquet metadata to make a heuristic of when this behavior should turn on or not. Thus, if we are already distributed well we won't repartition, otherwise insert the repartition. This comes with the complexity of making a good heuristic and more code added.

I agree with the tradeoffs. There are already some other, similar heuristics in the configuration settings, where they have to trade off "keep the existing sort" against "repartition to get maximum parallelism": https://datafusion.apache.org/user-guide/configs.html

For example

datafusion.optimizer.prefer_existing_sort | false
-- | --

In keeping with DataFusion's extensible nature, I wonder if we should make some sort of trait that encapsulates these heuristics (not for this issue, but for a follow-on PR)

```rust
pub trait OptimizerHeuristics {
  // Should the optimizer repartition?
  fn should_repartition(&self, input: &dyn ExecutionPlan, ...) ...
}
```

🤔
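To make the two options concrete (a plain flag versus a row-count heuristic), here is a rough, dependency-free sketch. Everything in it is hypothetical and not a DataFusion API: the names `RepartitionHeuristic`, `ForceRepartition`, `SkewThreshold`, and `max_ratio` are made up for illustration, and the trait takes per-partition row counts (as might be read from Parquet metadata) instead of an execution plan so that it compiles on its own.

```rust
/// Hypothetical, simplified stand-in for the trait floated above. It takes
/// per-partition row counts rather than an execution plan, purely to keep
/// the sketch free of dependencies.
pub trait RepartitionHeuristic {
    fn should_repartition(&self, rows_per_partition: &[u64]) -> bool;
}

/// Option 1: a plain on/off flag that ignores the data entirely.
pub struct ForceRepartition(pub bool);

impl RepartitionHeuristic for ForceRepartition {
    fn should_repartition(&self, _rows_per_partition: &[u64]) -> bool {
        self.0
    }
}

/// Option 2: repartition only when the row counts look skewed.
pub struct SkewThreshold {
    /// Largest tolerated ratio of the biggest partition to the mean
    /// partition size (an assumed tuning knob).
    pub max_ratio: f64,
}

impl RepartitionHeuristic for SkewThreshold {
    fn should_repartition(&self, rows_per_partition: &[u64]) -> bool {
        let n = rows_per_partition.len();
        let total: u64 = rows_per_partition.iter().sum();
        // Nothing to balance with fewer than two partitions or no rows.
        if n < 2 || total == 0 {
            return false;
        }
        // Scenario 1: empty partitions hint at fewer distinct values
        // than partitions (very low cardinality).
        if rows_per_partition.iter().any(|&rows| rows == 0) {
            return true;
        }
        // Scenario 2: one partition far above the mean hints at
        // single value skew.
        let max = rows_per_partition.iter().copied().max().unwrap_or(0) as f64;
        let mean = total as f64 / n as f64;
        max / mean > self.max_ratio
    }
}
```

With `SkewThreshold { max_ratio: 2.0 }`, evenly sized partitions such as `[100, 110, 90, 100]` would be left alone, while `[970, 10, 10, 10]` or `[500, 500, 0, 0]` would trigger a repartition. A real implementation would need to decide where the counts come from and what threshold is reasonable.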
