notashes commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3910339618

   I was going through Q24/26 and have a strong suspicion this may be related to 
the way arrow builds predicate caches. Unlike some other queries, where the 
problem seems to be that the cost of evaluating the row_filter outweighs the 
pushdown benefit, in this case `bytes_scanned` actually goes up when pushdown 
filtering is enabled (for q24 it went from `214 MB -> 310 MB` for me).
   
   Similarly, `scanning_until_data` goes up from `~300 ms -> ~900 ms`, and with 
pushdown on you also get `predicate_cache_inner_records=24.51 M`. 
   
   ```sql
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY 
"EventTime" LIMIT 10;
   ```
   That's the query, btw, and note that `SearchPhrase` ends up as both a filter 
column and an output column! 
   
   The predicate-cache path I was looking at:
   
https://github.com/apache/arrow-rs/blob/442e1b8d952f5f15cc0922165e56a8f42bd1e716/parquet/src/arrow/push_decoder/reader_builder/mod.rs#L648
 
   
   And I think two problems come together here - 
[with_predicate](https://github.com/apache/arrow-rs/blob/442e1b8d952f5f15cc0922165e56a8f42bd1e716/parquet/src/arrow/arrow_reader/read_plan.rs#L153)
 forces the entire row group to be evaluated, and for cached columns we tend to 
call `expand_to_batch_boundaries()` on the row selection. On top of that, 
`SearchPhrase` being a Utf8View column probably adds further overhead. 
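   
   A minimal toy sketch of why the expansion hurts when matches are scattered 
(my own illustration, not the arrow-rs code; `expanded_row_count` and the 8192 
batch size are assumptions): every batch containing at least one selected row 
gets decoded and cached in full.
   
   ```rust
   // Toy model (NOT the arrow-rs implementation): expanding a row selection
   // to batch boundaries means every batch that contains at least one
   // selected row is decoded and cached in its entirety.
   fn expanded_row_count(selected_rows: &[usize], batch_size: usize) -> usize {
       // Collect the set of distinct batches touched by any selected row.
       let mut batches: Vec<usize> =
           selected_rows.iter().map(|r| r / batch_size).collect();
       batches.sort_unstable();
       batches.dedup();
       // Each touched batch is decoded in full.
       batches.len() * batch_size
   }

   fn main() {
       let batch_size = 8192; // assumed read batch size
       // A ~14%-selective filter with matches spread evenly (about one in
       // seven rows) touches essentially every batch, so expansion decodes
       // the whole row group anyway.
       let selected: Vec<usize> = (0..1_000_000).step_by(7).collect();
       println!(
           "selected {} rows -> {} rows decoded/cached after expansion",
           selected.len(),
           expanded_row_count(&selected, batch_size)
       );
   }
   ```
   
   With matches that evenly spread, the expanded selection covers every batch, 
so the cache holds roughly 7x the rows the filter actually selected.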
   
   I don't think this is the same thing #19639 addresses - the filter here is 
actually selective (~14% of rows pass). 
   
   Not really sure what the right general fix looks like - but wanted to share! 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

