adriangb opened a new pull request, #21987:
URL: https://github.com/apache/datafusion/pull/21987

   > ⚠️ **This PR is not meant to be merged as-is.** It exists to make the
   > code changes from a wide-schema performance investigation visible from
   > [#21968](https://github.com/apache/datafusion/issues/21968); a more
   > careful breakout into landable PRs (and an upstream of the arrow-rs
   > companion changes) is the next step. Posting as a draft so the diff
   > is browsable.
   
   Companion arrow-rs draft: apache/arrow-rs#TBD (link added in a follow-up
   comment once it exists).
   
   ## What this branch does
   
   Speeds up parquet reads on schemas with hundreds to thousands of
   columns when a query touches only a handful of them. On a 1024-column ×
   256-file synthetic dataset, the warm-cache wide-to-narrow runtime ratio
   drops from ~2× to ~1.7× and the cold-cache ratio from ~30× to ~22×;
   with `collect_statistics=false`, cold drops further to ~3.5×.
   
   Changes:
   
   - `statistics_from_parquet_metadata`: was O(N²) per file because each
     iteration called `StatisticsConverter::try_new`, which rescanned all
     parquet leaves. Now the logical→parquet leaf indices are precomputed
     once per file and passed to a new low-overhead `from_arrow_field`
     constructor (added on the arrow-rs side), bringing the cost down to
     O(N) per file; see the first sketch after this list.
   - `apply_file_schema_type_coercions`: short-circuits before building
     the full lookup HashMap when nothing transforms, and returns `None`
     when no field actually changed. Previously it returned
     `Some(<identical schema>)` in that case, forcing a wasted
     `ArrowReaderMetadata` rebuild per file; see the `Option`-contract
     sketch after this list.
   - `DefaultFilesMetadataCache`: stores `memory_size` alongside each
     entry, eliminating the per-put / per-evict structural walks
     (sketched after this list).
   - `CachedParquetMetaData`: holds a `OnceLock<ArrowReaderMetadata>` so
     warm cache hits become an Arc clone (~4 ns) instead of re-walking
     the parquet schema (~190 µs at 1024 columns), plus a single-slot
     `Mutex<Option<(supplied_schema_ptr, ArrowReaderMetadata)>>` for the
     post-coercion build; see the last sketch after this list.
   - `CachedParquetFileReader::get_arrow_reader_metadata`: implements
     the new arrow-rs trait method, returning fully built
     `ArrowReaderMetadata` from cache for both the base and post-coercion
     configurations. `prepare_filters` is made async so the coercion
     rebuild also routes through the cache-aware reader.
   - New `wide_schema_microbench` covering `try_new` vs. cached clone,
     `apply_coercions` no-op, `PruningPredicate::try_new`, and
     `StatisticsConverter::try_new` vs. `from_arrow_field`.
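
   A minimal sketch of the leaf-index precompute, using plain strings in
   place of the real Arrow/parquet schema types; the helper name and the
   shape of the mapping are illustrative, not the DataFusion API:

   ```rust
   use std::collections::HashMap;

   /// One pass over the parquet leaf columns builds a name -> leaf-index map.
   fn leaf_indices(parquet_leaf_names: &[String]) -> HashMap<&str, usize> {
       parquet_leaf_names
           .iter()
           .enumerate()
           .map(|(idx, name)| (name.as_str(), idx))
           .collect()
   }

   fn main() {
       let leaves: Vec<String> = (0..1024).map(|i| format!("col_{i}")).collect();
       let index = leaf_indices(&leaves);

       // Each logical column now resolves with a constant-time lookup
       // instead of a rescan of all 1024 leaves, so the per-file cost is
       // O(N) rather than O(N²).
       for col in ["col_3", "col_900", "not_in_file"] {
           match index.get(col) {
               Some(leaf) => println!("{col} -> parquet leaf {leaf}"),
               None => println!("{col}: no matching parquet leaf"),
           }
       }
   }
   ```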
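
   The `apply_file_schema_type_coercions` change boils down to an `Option`
   contract. Below is a simplified stand-in with a toy coercion rule and
   `(name, type)` string pairs instead of Arrow fields; none of these names
   are the real DataFusion API:

   ```rust
   /// (name, type) stand-in for an Arrow Field.
   type FieldDef = (String, String);

   /// Toy rule: only utf8 file columns read as large_utf8 need rewriting.
   fn coerced(table_type: &str, file_type: &str) -> Option<String> {
       (table_type == "large_utf8" && file_type == "utf8").then(|| "large_utf8".into())
   }

   /// Returns None when no field changes, so the caller can keep its cached
   /// ArrowReaderMetadata instead of rebuilding it for every file.
   /// (Assumes table and file fields are aligned by position, for brevity.)
   fn apply_type_coercions(table: &[FieldDef], file: &[FieldDef]) -> Option<Vec<FieldDef>> {
       // Short-circuit before building any lookup structure: in the common
       // wide-schema case nothing transforms and we bail out immediately.
       let changed = table
           .iter()
           .zip(file)
           .any(|(t, f)| coerced(&t.1, &f.1).is_some());
       if !changed {
           return None;
       }
       Some(
           table
               .iter()
               .zip(file)
               .map(|(t, f)| {
                   let ty = coerced(&t.1, &f.1).unwrap_or_else(|| f.1.clone());
                   (f.0.clone(), ty)
               })
               .collect(),
       )
   }
   ```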
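
   For the cache-accounting change, a hypothetical minimal shape: the size
   is measured once when an entry is created, so put/evict bookkeeping is
   O(1) arithmetic instead of a structural walk (type and field names are
   invented for illustration):

   ```rust
   use std::collections::HashMap;

   struct Entry<V> {
       value: V,
       memory_size: usize, // measured once, at insert time
   }

   struct FilesMetadataCache<V> {
       entries: HashMap<String, Entry<V>>,
       total_size: usize,
   }

   impl<V> FilesMetadataCache<V> {
       fn put(&mut self, key: String, value: V, memory_size: usize) {
           // No structural walk on put: just record the precomputed size...
           if let Some(old) = self.entries.insert(key, Entry { value, memory_size }) {
               self.total_size -= old.memory_size;
           }
           self.total_size += memory_size;
       }

       fn evict(&mut self, key: &str) -> Option<V> {
           // ...and eviction is a single subtraction of the stored size.
           self.entries.remove(key).map(|old| {
               self.total_size -= old.memory_size;
               old.value
           })
       }
   }
   ```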
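
   Finally, a sketch of the `CachedParquetMetaData` lazy-build pattern,
   with a cheap Arc-backed stand-in for `ArrowReaderMetadata`; the field
   and method names here are assumptions, not the actual code:

   ```rust
   use std::sync::{Arc, Mutex, OnceLock};

   // Stand-in for arrow-rs's ArrowReaderMetadata, which is Arc-backed and
   // therefore cheap to clone; that is what makes a warm hit ~4 ns.
   #[derive(Clone)]
   struct ReaderMetadata(Arc<str>);

   struct CachedParquetMetaData {
       // Built at most once per cache entry; later readers just clone it.
       base: OnceLock<ReaderMetadata>,
       // Single slot for the post-coercion build, keyed by the pointer
       // identity of the supplied schema.
       post_coercion: Mutex<Option<(usize, ReaderMetadata)>>,
   }

   impl CachedParquetMetaData {
       fn base_metadata(&self) -> ReaderMetadata {
           self.base
               .get_or_init(|| {
                   // The expensive schema walk (~190 µs at 1024 columns)
                   // runs only on the first hit...
                   ReaderMetadata(Arc::from("walked base schema"))
               })
               .clone() // ...every later hit is an Arc clone.
       }

       fn post_coercion_metadata(&self, schema_ptr: usize) -> ReaderMetadata {
           let mut slot = self.post_coercion.lock().unwrap();
           match &*slot {
               Some((ptr, meta)) if *ptr == schema_ptr => meta.clone(),
               _ => {
                   let meta = ReaderMetadata(Arc::from("rebuilt with coerced schema"));
                   *slot = Some((schema_ptr, meta.clone()));
                   meta
               }
           }
       }
   }
   ```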
   
   Full investigation log: see `report.md` on the branch.
   
   ## Status
   
   - All targeted tests pass (`-p datafusion-datasource-parquet`,
     `-p datafusion-execution`, arrow-rs parquet schema/arrow_reader).
   - The 16 `row_filter` / `row_group_filter` failures in the parquet
     datasource crate are environmental (they need the `parquet-testing`
     submodule) and reproduce on `main`.
   - This branch depends on the arrow-rs branch with the companion
     changes (see the PR link above); the workspace's `patch.crates-io`
     points at it.

