alamb commented on PR #9956: URL: https://github.com/apache/arrow-rs/pull/9956#issuecomment-4769412416
One thing @adriangb, @zhuqi-lucas, and I have noticed in DataFusion is that getting heuristics to work well is very challenging. For example, cutoff values often vary from architecture to architecture (e.g. is 32 contiguous 1s good, or should it be 64?).

One thing we have been exploring is a more dynamic approach: switching the predicate evaluation strategy at points where the decoder naturally has to re-create some state, such as between row groups, like in this PR:
- https://github.com/apache/arrow-rs/pull/10158

It seems as if you have taken a similar approach in this PR:

> Adds an adaptive post-filter cost model for row groups

(Caveat: I have not had a chance to read this one carefully, and for that I apologize.)

I think we had been planning to put more of the adaptivity at a higher level (DataFusion specifically), as it has more information about things like statistics and cross-file predicate selectivity. I wonder if you have thought about where these auto-adaptive decisions would best be made.

I do think the APIs you have outlined allow for both automatic decisions and manual overrides (e.g. DataFusion could override the decisions made automatically), which is interesting.
