alamb opened a new issue, #17575:
URL: https://github.com/apache/datafusion/issues/17575

   `parquet` 56.1.0 (arrow-rs) includes a new predicate cache:
   - https://github.com/apache/arrow-rs/pull/7850
   
   We tried hard to include a switch that disables the cache to prevent 
regressions, but apparently it does not work in all cases. 
   
   @nuno-faria reports:
   
   > I found a potential performance regression with `parquet 56.1.0`. Now more 
data pages will be fetched if their size is less than the execution batch size. 
For example:
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let config = SessionConfig::new().with_target_partitions(1);
       let ctx = SessionContext::new_with_config(config);
       ctx.sql("set datafusion.execution.parquet.pushdown_filters = true")
           .await?
           .collect()
           .await?;
   
       ctx.sql(
           "
           copy (
               select i as k
               from generate_series(1, 1000000) as t(i)
               order by k
           ) to 't.parquet'
               options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000,
                        WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);",
       )
       .await?
       .collect()
       .await?;
   
       ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new())
           .await?;
   
       ctx.sql("explain analyze select k from t where k = 123456")
           .await?
           .show()
           .await?;
   
       Ok(())
   }
   ```
   
   With `parquet 56.0.0`:
   ```
   metrics=[..., bytes_scanned=1273, ...]
   
   # some debug info showing that a single page is retrieved
   total=1273
   ranges=[132974..134247]
   ```
   
   With `parquet 56.1.0`:
   ```
   metrics=[..., bytes_scanned=9929, ...]
   
   # some debug info showing that multiple pages are retrieved
   total=9929
   ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728, 
129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329]
   ```
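
As a quick sanity check on the two debug dumps above, the byte ranges can be summed to confirm that they match the reported `bytes_scanned` values. This is only a sketch: the helper names (`total_bytes`, `ranges_56_0`, `ranges_56_1`) are made up for illustration, and the ranges are copied verbatim from the output above.

```rust
use std::ops::Range;

/// Total number of bytes covered by a list of half-open byte ranges.
fn total_bytes(ranges: &[Range<u64>]) -> u64 {
    ranges.iter().map(|r| r.end - r.start).sum()
}

/// Ranges reported with `parquet 56.0.0` (copied from the debug output).
fn ranges_56_0() -> Vec<Range<u64>> {
    vec![132974..134247]
}

/// Ranges reported with `parquet 56.1.0` (copied from the debug output).
fn ranges_56_1() -> Vec<Range<u64>> {
    vec![
        125400..126482, 126482..127564, 127564..128646,
        128646..129728, 129728..130810, 130810..131892,
        131892..132974, 132974..134247, 134247..135329,
    ]
}

fn main() {
    for (version, ranges) in [("56.0.0", ranges_56_0()), ("56.1.0", ranges_56_1())] {
        println!(
            "parquet {version}: {} page range(s), {} bytes",
            ranges.len(),
            total_bytes(&ranges)
        );
    }
}
```

The sums come out to 1273 and 9929 bytes, so the regression is roughly a 7.8x increase in bytes fetched for this query. The page fetched by 56.0.0 (`132974..134247`) is the eighth of the nine contiguous ranges fetched by 56.1.0.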
   
   I think this is a consequence of 
https://github.com/apache/arrow-rs/pull/7850, more specifically 
https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445.
   
   _Originally posted by @nuno-faria in 
https://github.com/apache/datafusion/issues/17275#issuecomment-3266643038_
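
One possible reading of the numbers above, offered as a simplified model and not as the actual arrow-rs code: if the reader rounds the selected rows outward to batch boundaries so that whole batches can be cached, a single matching row pulls in every page that overlaps its batch. The sketch below assumes the default `datafusion.execution.batch_size` of 8192, row groups of exactly 100000 rows, and pages of exactly 1000 rows (as requested by the writer options in the reproducer). The helpers `expand_to_batches` and `pages_touched` are hypothetical names used only for illustration.

```rust
use std::ops::Range;

/// Round a selected row range outward to multiples of `batch_size`,
/// clamped to the number of rows in the row group.
/// Hypothetical helper: a simplified model, not the arrow-rs implementation.
fn expand_to_batches(sel: Range<usize>, batch_size: usize, total_rows: usize) -> Range<usize> {
    let start = (sel.start / batch_size) * batch_size;
    let end = (sel.end.div_ceil(batch_size) * batch_size).min(total_rows);
    start..end
}

/// Indices of the pages (each holding `page_rows` rows) that overlap `rows`.
fn pages_touched(rows: Range<usize>, page_rows: usize) -> Range<usize> {
    (rows.start / page_rows)..rows.end.div_ceil(page_rows)
}

fn main() {
    // Assumed layout: 100000 rows per row group, 1000 rows per page, batch size 8192.
    let (row_group_rows, page_rows, batch_size) = (100_000, 1_000, 8_192);

    // k = 123456 lives in the second row group (k = 100001..=200000),
    // so its offset inside that row group is 23455.
    let row = 123_456 - 100_001;
    let exact = row..row + 1;

    let expanded = expand_to_batches(exact.clone(), batch_size, row_group_rows);
    let exact_pages = pages_touched(exact, page_rows);
    let expanded_pages = pages_touched(expanded.clone(), page_rows);

    println!("exact selection touches {} page(s): {:?}", exact_pages.len(), exact_pages);
    println!(
        "expanded rows {:?} touch {} page(s): {:?}",
        expanded,
        expanded_pages.len(),
        expanded_pages
    );
}
```

Under these assumptions the exact selection touches one page (index 23), while the batch-aligned selection covers rows `16384..24576` and touches nine pages (indices 16 through 24), with the matching page in eighth position. That lines up with the one range versus nine ranges in the debug output, though it does not prove that this is the mechanism at work.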
               


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

