[GitHub] [arrow-datafusion] wjones127 commented on pull request #5545: Support arbitrary user defined partition column in `ListingTable` (rather than assuming they are always Dictionary encoded)

via GitHub Thu, 27 Apr 2023 20:07:40 -0700


wjones127 commented on PR #5545:
URL: 
https://github.com/apache/arrow-datafusion/pull/5545#issuecomment-1526917997


   Got here as a downstream user who is affected by this change. Thinking 
through this, I wouldn't totally write off dictionary encoding integers as 
useless, since there still are benefits to dictionary arrays besides space 
savings. They essentially mark columns as having low cardinality and provide 
the set of unique values. Any scalar compute functions run on these columns can 
be applied to the dictionary while leaving the indices buffer untouched. That 
is an easy way to achieve what I would expect out of a "smart" compute 
engine: when projecting partition columns, project the distinct values rather 
than the expanded/materialized array. It's possible DataFusion already handles 
this in a smart way I'm unaware of though.
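
   To make the point about the indices buffer concrete, here is a minimal sketch in plain Python. `DictColumn` and `map_values` are hypothetical names made up for illustration; they are not Arrow or DataFusion APIs, and nulls are ignored. It only shows that a scalar function needs to run once per distinct value, not once per row.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DictColumn:
    """Hypothetical stand-in for a dictionary-encoded array (not an Arrow API)."""

    values: List[int]   # the distinct values, e.g. one per partition
    indices: List[int]  # one entry per row, pointing into `values`

    def map_values(self, f: Callable[[int], int]) -> "DictColumn":
        # Apply `f` once per distinct value; the indices list is reused as-is.
        return DictColumn([f(v) for v in self.values], self.indices)

    def materialize(self) -> List[int]:
        # Expand to one value per row (what a non-dictionary column would hold).
        return [self.values[i] for i in self.indices]


# Five rows, but only two distinct partition values.
col = DictColumn(values=[2021, 2022], indices=[0, 0, 0, 1, 1])
shifted = col.map_values(lambda year: year + 1)
```

   Here `f` runs twice rather than five times, and `shifted.indices` is the same list object as `col.indices`.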
   
   I'd also note that the ideal partition column types are probably [run-end 
encoded arrays](https://github.com/apache/arrow-rs/issues/3520) (`RunArray`), 
once they are implemented.
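
   As a rough illustration of why run-end encoding suits partition columns, here is a sketch in plain Python. `RunEndColumn` is a hypothetical name, not the arrow-rs `RunArray` API, and the layout is assumed to be one value per run plus the exclusive, cumulative row index at which each run ends.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class RunEndColumn:
    """Hypothetical stand-in for a run-end encoded array (not the arrow-rs API)."""

    values: List[int]    # one value per run
    run_ends: List[int]  # exclusive, cumulative end row of each run

    def __len__(self) -> int:
        return self.run_ends[-1] if self.run_ends else 0

    def get(self, row: int) -> int:
        if not 0 <= row < len(self):
            raise IndexError(row)
        # The first run whose end is greater than `row` contains that row.
        return self.values[bisect_right(self.run_ends, row)]


# A partition column for a single file: one run covers every row.
year = RunEndColumn(values=[2023], run_ends=[1_000_000])

# Two files concatenated: rows 0..2 are 2021, rows 3..4 are 2022.
both = RunEndColumn(values=[2021, 2022], run_ends=[3, 5])
```

   A partition value is constant within a file, so the column for a file of any length needs only one value and one run end.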


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

