Ted-Jiang edited a comment on pull request #1841: URL: https://github.com/apache/arrow-datafusion/pull/1841#issuecomment-1042631149
> * I wonder if/how this gets things closer to being able to do distinct on compressed data (in DF's case on dictionary encoded columns). The problem (as I understand it) is that there is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches, or even in the same record batch (if I remember how dictionary concatenation currently works in Arrow).

`There is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches`: yes, that is correct. For non-integer columns, we plan to maintain a global dictionary that encodes string columns into 32-bit integers to accelerate count distinct.

> * Would this work on 64-bit columns if they could first be casted to 32-bit? That is, assuming the contents of the 64-bit column actually fit as 32-bit unsigned integers?

IMO, the cast would drop the upper 32 bits, so the result would be incorrect.
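To make the two answers above concrete, here is a rough sketch in plain Rust. It is not the code from this PR and does not use the Arrow or DataFusion APIs: `GlobalDict`, `count_distinct`, and `count_distinct_truncated` are hypothetical names, and a `HashSet<u32>` stands in for whatever 32-bit structure the real implementation uses.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical global dictionary: assigns each distinct string one stable
/// u32 code that is shared across all batches.
#[derive(Default)]
struct GlobalDict {
    codes: HashMap<String, u32>,
}

impl GlobalDict {
    /// Return the existing code for `value`, or assign the next free code.
    fn encode(&mut self, value: &str) -> u32 {
        if let Some(&code) = self.codes.get(value) {
            return code;
        }
        let code = self.codes.len() as u32;
        self.codes.insert(value.to_owned(), code);
        code
    }
}

/// Count distinct strings across batches by collecting their global codes.
/// Because the codes are global, equal strings in different batches collide.
fn count_distinct(batches: &[Vec<&str>]) -> usize {
    let mut dict = GlobalDict::default();
    let mut seen: HashSet<u32> = HashSet::new();
    for batch in batches {
        for value in batch {
            seen.insert(dict.encode(value));
        }
    }
    seen.len()
}

/// Count distinct u64 values after keeping only the low 32 bits.
/// Values that differ only in the upper 32 bits are merged, so the
/// count can come out too low.
fn count_distinct_truncated(values: &[u64]) -> usize {
    values
        .iter()
        .map(|&v| v as u32) // `as u32` discards the upper 32 bits
        .collect::<HashSet<u32>>()
        .len()
}
```

With the batches `["a", "b"]` and `["b", "c", "a"]`, `count_distinct` returns 3, because the codes stay consistent from one batch to the next. With the values `1` and `1 + 2^32`, `count_distinct_truncated` returns 1 although there are 2 distinct values, which is the incorrect result the 64-bit answer refers to.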