[GitHub] [arrow] drusso commented on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

GitBox Tue, 22 Sep 2020 21:14:12 -0700


drusso commented on pull request #8222:
URL: https://github.com/apache/arrow/pull/8222#issuecomment-696689994



   Thanks for the review/feedback all!
   
   @jorgecarleitao: 
   
   > it may be worth take a look at #8172 , where we are trying to improve how 
to declare and run aggregate expressions. Many parts of this PR are in conflict 
with that one (at API level, all the implementation here is valid).
   
   I will get the changes here updated and integrated with #8172. It looks like 
they've already landed 👍 
   
   > Out of curiosity: is the DISTINCT something that is applicable to an 
aggregate expression in general, or is something that is only applicable to a 
subset of aggregations (such as count, sum)?
   
   To my knowledge `DISTINCT` is applicable to all aggregations. Though there 
are some cases where `DISTINCT` won't have any effect on the result, for 
example the results of `MIN(col)` and `MIN(DISTINCT col)` are always the same. 
This is something the logical (or physical) planner can optimize for – there's 
no need to do the work to compute the distinct set in those cases. 
   
   (Also note that `DISTINCT` is applicable outside the context of 
aggregations, for example `SELECT DISTINCT c1, c2 FROM t1`.)
   
   @alamb:
   
   > One potential downside of the approach in this PR is that it will likely 
struggle when there are a large number of distinct values per group (as the 
FnvHashSet will be huge) as well as needing special support for each datatype.
   
   Agreed. I'm happy to continue iterating and improving on the work here. 
   
   > Another approach we could consider is to use a HashAggregateExpr operation 
as a pre-filter to remove duplicates.
   
   I had this thought as well, however – and correct me if I'm wrong – it 
doesn't generalize to a scenario like:
   
   ```
   SELECT c1, COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM t1 GROUP BY c1
   ```
   
   For the `COUNT(DISTINCT c2)` expression we would want distinct pairs of 
`(c1, c2)`; for the `COUNT(DISTINCT c3)` expression we would want distinct 
pairs of `(c1, c3)`. But we can't do both simultaneously. The existing 
implementation handles that scenario. Perhaps there's some ideas here we can 
use for further optimizations, though. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] drusso commented on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

Reply via email to