Ted-Jiang edited a comment on pull request #1841:
URL: 
https://github.com/apache/arrow-datafusion/pull/1841#issuecomment-1042631149


   > * I wonder if/how this gets things closer to being able to do distinct on 
compressed data (in DF's case on dictionary encoded columns). The problem (as I 
understand it) is that there is no guarantee that Arrow dictionaries have the 
same encoded representation for a value across batches, or even in the same 
record batch (if I remember how dictionary concatenation currently works in 
Arrow).
   
   `There is no guarantee that Arrow dictionaries have the same encoded 
representation for a value across batches`: yes, that is correct.
   For non-integer columns, we plan to maintain a global dictionary that 
encodes each value (e.g. a string) as a 32-bit integer, to accelerate count 
distinct.
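
   A rough sketch of the global-dictionary idea, not the actual implementation 
in this PR. `GlobalDictionary`, `encode_batch` and `count_distinct_codes` are 
hypothetical names made up for illustration, and a plain `HashSet` stands in 
for whatever distinct-counting structure is used on the 32-bit codes:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical global dictionary: every distinct string gets one stable
/// u32 code that is shared by all batches, unlike per-batch Arrow
/// dictionaries whose codes may differ from batch to batch.
#[derive(Default)]
pub struct GlobalDictionary {
    codes: HashMap<String, u32>,
    next_code: u32,
}

impl GlobalDictionary {
    /// Return the code for `value`, assigning the next free code the
    /// first time the value is seen.
    pub fn encode(&mut self, value: &str) -> u32 {
        if let Some(&code) = self.codes.get(value) {
            return code;
        }
        let code = self.next_code;
        // Panic instead of silently wrapping if the code space runs out.
        self.next_code = self
            .next_code
            .checked_add(1)
            .expect("dictionary exceeds the u32 code space");
        self.codes.insert(value.to_owned(), code);
        code
    }

    /// Encode one batch of values using the shared code space.
    pub fn encode_batch(&mut self, batch: &[&str]) -> Vec<u32> {
        batch.iter().map(|v| self.encode(v)).collect()
    }
}

/// Count distinct values across batches by looking only at the codes.
pub fn count_distinct_codes(batches: &[Vec<u32>]) -> usize {
    let mut seen: HashSet<u32> = HashSet::new();
    for batch in batches {
        seen.extend(batch.iter().copied());
    }
    seen.len()
}
```

   Because the same string always maps to the same code, the codes from 
different batches can be compared directly without looking at the strings 
again.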
   
   > * Would this work on 64-bit columns if they could first be casted to 
32-bit? That is, assuming the contents of the 64-bit column actually fit as 
32-bit unsigned integers?
   
   IMO, the cast would discard the upper 32 bits of each value, so the result 
would be incorrect.
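
   A small illustration of the concern, assuming a plain truncating `as u32` 
cast (the function names here are hypothetical, not DataFusion APIs). Two 
64-bit values that differ only in their upper 32 bits collapse to the same 
32-bit value, so the distinct count comes out too low; a checked narrowing 
would instead have to reject such input:

```rust
use std::collections::HashSet;

/// Count distinct values after a truncating cast to u32.
/// Values that differ only in the upper 32 bits are merged.
pub fn count_distinct_truncating(values: &[u64]) -> usize {
    values
        .iter()
        .map(|&v| v as u32) // keeps only the lower 32 bits
        .collect::<HashSet<u32>>()
        .len()
}

/// Count distinct values after a checked narrowing to u32.
/// Returns None if any value does not fit in 32 bits.
pub fn count_distinct_checked(values: &[u64]) -> Option<usize> {
    let mut seen: HashSet<u32> = HashSet::new();
    for &v in values {
        if v > u64::from(u32::MAX) {
            return None;
        }
        seen.insert(v as u32); // lossless here: v fits in 32 bits
    }
    Some(seen.len())
}
```

   For example, `1` and `(1 << 32) | 1` are distinct as 64-bit values but both 
become `1` after the truncating cast.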


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

