alamb opened a new issue, #17575: URL: https://github.com/apache/datafusion/issues/17575
DataFusion 56.1.0 includes a new predicate cache - https://github.com/apache/arrow-rs/pull/7850 We tried hard to include a switch to disable the cache to prevent regressions, but apparently it doesn't always work in all cases. @nuno-faria reports: > I found a potential performance regression with `parquet 56.1.0`. Now more data pages will be fetched if their size is less than the execution batch size. For example: ```rust use datafusion::error::Result; use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext}; #[tokio::main] async fn main() -> Result<()> { let config = SessionConfig::new().with_target_partitions(1); let ctx = SessionContext::new_with_config(config); ctx.sql("set datafusion.execution.parquet.pushdown_filters = true") .await? .collect() .await?; ctx.sql( " copy ( select i as k from generate_series(1, 1000000) as t(i) order by k ) to 't.parquet' options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000, WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);", ) .await? .collect() .await?; ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new()) .await?; ctx.sql("explain analyze select k from t where k = 123456") .await? .show() .await?; Ok(()) } ``` With `parquet 56.0.0`: ``` metrics=[..., bytes_scanned=1273, ...] # some debug info showing that a single page is retrieved total=1273 ranges=[132974..134247] ``` With `parquet 56.1.0`: ``` metrics=[..., bytes_scanned=9929, ...] # some debug info showing that multiple pages are retrieved total=9929 ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728, 129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329] ``` I think this is a consequence of https://github.com/apache/arrow-rs/pull/7850, more specifically https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445. _Originally posted by @nuno-faria in https://github.com/apache/datafusion/issues/17275#issuecomment-3266643038_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
