adriangb opened a new pull request, #21987: URL: https://github.com/apache/datafusion/pull/21987
> ⚠️ **This PR is not meant to be merged as-is.** It exists to make the code changes from a wide-schema-perf investigation visible from [#21968](https://github.com/apache/datafusion/issues/21968) — a more careful breakout into landable PRs (and an upstream of the arrow-rs companion changes) is the next step. Posting as a draft so the diff is browseable. Companion arrow-rs draft: apache/arrow-rs#TBD (link added in a follow-up comment once it exists).

## What this branch does

Speeds up parquet reads on schemas with hundreds to thousands of columns where the query touches only a handful. On a 1024-col × 256-file synthetic dataset, the warm wide-vs-narrow ratio drops from ~2× to ~1.7× and cold from ~30× to ~22×. With `collect_statistics=false`, cold drops to ~3.5×.

Changes:

- `statistics_from_parquet_metadata`: was O(N²) per file because each iteration called `StatisticsConverter::try_new`, which scanned all parquet leaves. Precompute the logical→parquet leaf indices once and use a new low-overhead `from_arrow_field` constructor (added on the arrow-rs side). O(N²) per file → O(N) per file.
- `apply_file_schema_type_coercions`: short-circuit before building the full lookup `HashMap` when nothing transforms; return `None` when no field actually changed (it previously returned `Some(<identical schema>)` in that case, forcing a wasted `ArrowReaderMetadata` rebuild per file).
- `DefaultFilesMetadataCache`: store `memory_size` next to each entry — no more per-put / per-evict structural walks.
- `CachedParquetMetaData`: `OnceLock<ArrowReaderMetadata>` so warm cache hits become an Arc clone (~4 ns) instead of re-walking the parquet schema (~190 µs at 1024 cols), plus a single-slot `Mutex<Option<(supplied_schema_ptr, ArrowReaderMetadata)>>` for the post-coercion build.
- `CachedParquetFileReader::get_arrow_reader_metadata`: implements the new arrow-rs trait method, returning fully built `ArrowReaderMetadata` from cache for both the base and post-coercion configurations.
- `prepare_filters`: made async so the coercion rebuild also routes through the cache-aware reader.
- New `wide_schema_microbench` covering `try_new` vs cached clone, `apply_coercions` no-op, `PruningPredicate::try_new`, and `StatisticsConverter` `try_new` vs `from_arrow_field`.

Full investigation log: see `report.md` on the branch.

## Status

- All targeted tests pass (`-p datafusion-datasource-parquet`, `-p datafusion-execution`, arrow-rs parquet schema/arrow_reader).
- The 16 `row_filter` / `row_group_filter` failures in the parquet datasource crate are environmental (they need the `parquet-testing` submodule) and reproduce on `main`.
- This depends on the arrow-rs branch with the companion changes (see the PR link above); the workspace's `patch.crates-io` points at it.
