alamb commented on issue #19269:
URL: https://github.com/apache/datafusion/issues/19269#issuecomment-3646314096

   I agree with the assessment that data that is partitioned on  `Hash(a)` is 
also partitioned on `Hash(a,b)` as well as your assessment of the potential 
skew problem. 
   
   I think skew is likely only a problem either for  these scenarios
   1. Very low cardinality data (where there are fewer distinct values than 
partitions)
   2. Single value skew (where one of the values occurs far more often than the 
others)
   
   > An option that can be a boolean that turns this behavior on or off. This 
is simple to implement and low risk.
   
   I think starting with a flag to force repartitioning would be a good start. 
   
   > We can read the amount of rows in parquet metadata to make a heuristic of 
when this behavior should turn on or not. Thus, if we are already distributed 
well we wont repartition, otherwise insert the repartition. This comes with the 
complexity of make a good heuristic and more code added.
   
   I agree with the tradeoffs. 
   
   
   There are already some other heuristics in the configuration settings that 
are similar, where they have to trade off "keep the existing sort" or 
repartition to get maximum parallelism"
   
   https://datafusion.apache.org/user-guide/configs.html
   
   For example
   
   datafusion.optimizer.prefer_existing_sort | false
   -- | --
   
   In keeping with DataFusion's extensible nature, I wonder if we should make 
some sort of trait that encapsulates these heuristics (not this issue, for a 
follow on PR)
   
   ```rust
   pub trait OptimizerHeuristics {
     // should the optimizer repartition 
     fn should_repartition(&self, input: &ExecutionPlan, ...) 
   ...
   }
   ```
   
   🤔 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to