nuno-faria commented on PR #17275:
URL: https://github.com/apache/datafusion/pull/17275#issuecomment-3266643038
I found a potential performance regression with `parquet 56.1.0`: more data
pages are now read when the pages are smaller than the execution batch
size. For example:
```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Single partition so the scan metrics come from one file reader.
    let config = SessionConfig::new().with_target_partitions(1);
    let ctx = SessionContext::new_with_config(config);

    // Enable filter pushdown so the predicate is evaluated during the Parquet scan.
    ctx.sql("set datafusion.execution.parquet.pushdown_filters = true")
        .await?
        .collect()
        .await?;

    // Write 1,000,000 sorted rows: 100,000-row row groups, 1,000-row data pages,
    // no dictionary encoding.
    ctx.sql(
        "
        copy (
            select i as k
            from generate_series(1, 1000000) as t(i)
            order by k
        ) to 't.parquet'
        options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000,
                 WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);",
    )
    .await?
    .collect()
    .await?;

    ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new())
        .await?;

    // Point lookup; compare `bytes_scanned` in the reported metrics.
    ctx.sql("explain analyze select k from t where k = 123456")
        .await?
        .show()
        .await?;

    Ok(())
}
```
With `parquet 56.0.0`:
```
metrics=[..., bytes_scanned=1273, ...]
# some debug info showing that a single page is retrieved
total=1273
ranges=[132974..134247]
```
With `parquet 56.1.0`:
```
metrics=[..., bytes_scanned=9929, ...]
# some debug info showing that multiple pages are retrieved
total=9929
ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728,
129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329]
```
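As a sanity check on the two outputs above, here is a minimal sketch that sums the byte ranges copied from the debug output and compares them with the reported `bytes_scanned` values. The helper names (`total_bytes`, `ranges_56_0`, `ranges_56_1`) are made up for illustration and are not part of DataFusion or `parquet`.

```rust
use std::ops::Range;

/// Sum the lengths of a list of half-open byte ranges.
fn total_bytes(ranges: &[Range<u64>]) -> u64 {
    ranges.iter().map(|r| r.end - r.start).sum()
}

/// Range reported with `parquet 56.0.0` (copied from the output above).
fn ranges_56_0() -> Vec<Range<u64>> {
    vec![132974..134247]
}

/// Ranges reported with `parquet 56.1.0` (copied from the output above).
fn ranges_56_1() -> Vec<Range<u64>> {
    vec![
        125400..126482,
        126482..127564,
        127564..128646,
        128646..129728,
        129728..130810,
        130810..131892,
        131892..132974,
        132974..134247,
        134247..135329,
    ]
}

fn main() {
    let old = ranges_56_0();
    let new = ranges_56_1();
    // Prints "56.0.0: 1 range(s), 1273 bytes"
    println!("56.0.0: {} range(s), {} bytes", old.len(), total_bytes(&old));
    // Prints "56.1.0: 9 range(s), 9929 bytes"
    println!("56.1.0: {} range(s), {} bytes", new.len(), total_bytes(&new));
}
```

The totals match the reported `bytes_scanned` values. The single range read with `56.0.0` (`132974..134247`) also appears in the `56.1.0` list, so the extra 8656 bytes come from the eight adjacent ranges.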
I think this is a consequence of
https://github.com/apache/arrow-rs/pull/7850, more specifically
https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]