notashes commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3910339618
I was going through Q24/26 and have a strong suspicion it may be related to the way arrow builds predicate caches. Unlike some other queries where the problems seem to be the cost of evaluating row_filter outweighs the pushdown benefits - in this case `bytes_scanned` actually goes up when pushdown filters is enabled (for q24 it went from `214 MB -> 310 MB` for me. Similarly `scanning_until_data` goes up from `~300 ms -> ~900 ms`. And with pushdown on you also get `predicate_cache_inner_records=24.51 M`. ```sql SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "EventTime" LIMIT 10; ``` That's the query btw and you end up with `SearchPhrase` in both filter and output column! https://github.com/apache/arrow-rs/blob/442e1b8d952f5f15cc0922165e56a8f42bd1e716/parquet/src/arrow/push_decoder/reader_builder/mod.rs#L648 And I think it comes across two problems here - [with_predicate](https://github.com/apache/arrow-rs/blob/442e1b8d952f5f15cc0922165e56a8f42bd1e716/parquet/src/arrow/arrow_reader/read_plan.rs#L153) forces the entire row group to evaluate and for cached columns we tend to call `expand_to_batch_boundaries()` on row_selection. And `SearchPhrase` probably adding an Utf8View column adds further overhead. I don't think it's similar to what #19639 addresses - the filter here is actually selective? (14% passing through) Not really sure what the right general fix looks like - but wanted to share! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
