adriangb opened a new pull request, #9882:
URL: https://github.com/apache/arrow-rs/pull/9882

   > ⚠️ **This PR is not meant to be merged as-is.** It exists to make the
   > arrow-rs side of a wide-schema-perf investigation visible from
   > [apache/datafusion#21968](https://github.com/apache/datafusion/issues/21968).
   > A more careful breakout into landable PRs is the next step. Posting
   > as a draft so the diff is browseable.
   
   Companion DataFusion draft: https://github.com/apache/datafusion/pull/21987
   
   ## What this branch does
   
   Adds primitives that let downstream readers (DataFusion in particular)
   avoid per-file O(N_columns) walks when opening many parquet files with
   wide schemas.
   
   - `SchemaDescriptor::root_to_first_leaf`: precomputed root-field →
     first-leaf-column map at construction; new `root_first_leaf_index`
     accessor. `parquet_column` becomes O(1) instead of an O(N) scan
     over `columns()`. The TODO comment that called this out is removed.
   - `StatisticsConverter::from_arrow_field`: low-overhead constructor
     taking an already-resolved `(field, parquet_leaf_index)` pair, so
     callers building many converters against the same schemas (per-file
     statistics gathering) skip the redundant arrow + parquet name
     lookups inside `try_new`.
   - `ArrowReaderMetadata::from_field_levels`: packages a precomputed
     `(metadata, schema, FieldLevels)` triple directly. Together with
     `parquet_to_arrow_schema_and_field_levels` (a new public helper
     that produces both the schema and the field levels in one walk),
     callers can cache the per-file arrow view and reuse it across
     reader builds for the same metadata.
   - `AsyncFileReader::get_arrow_reader_metadata`: new trait method with
     a default impl that delegates to `try_new` after `get_metadata`.
     `load_async` now goes through it so reader implementations that
     cache derived state (e.g. DataFusion's metadata cache) can
     short-circuit the per-leaf walk.
   - Public accessors on `ArrowReaderOptions` (`supplied_schema()` /
     `skip_arrow_metadata()` / `virtual_columns()`) so callers can
     decide whether their cached arrow view applies given the options.
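
   To illustrate the first bullet, here is a minimal std-only sketch of the
   idea behind a precomputed root-field → first-leaf map. It is not the code
   in this branch: `ToySchema`, `first_leaf_by_scan`, and `first_leaf_cached`
   are hypothetical names, and the model assumes every root field has at
   least one leaf column and that leaves are stored in schema order.

   ```rust
   /// Hypothetical stand-in for a schema descriptor (not the parquet crate's type).
   struct ToySchema {
       /// Leaf index -> index of the root field that owns it, in schema order.
       leaf_to_root: Vec<usize>,
       /// Root field index -> index of its first leaf, built once in `new`.
       root_to_first_leaf: Vec<usize>,
   }

   impl ToySchema {
       /// `leaves_per_root[i]` is the number of leaf columns under root field `i`.
       /// Assumption: every entry is at least 1.
       fn new(leaves_per_root: &[usize]) -> Self {
           let mut leaf_to_root = Vec::new();
           let mut root_to_first_leaf = Vec::with_capacity(leaves_per_root.len());
           for (root, &n) in leaves_per_root.iter().enumerate() {
               // The next leaf to be pushed is this root's first leaf.
               root_to_first_leaf.push(leaf_to_root.len());
               leaf_to_root.extend(std::iter::repeat(root).take(n));
           }
           Self { leaf_to_root, root_to_first_leaf }
       }

       /// Baseline: O(N_leaves) scan for the first leaf owned by `root`.
       fn first_leaf_by_scan(&self, root: usize) -> Option<usize> {
           self.leaf_to_root.iter().position(|&r| r == root)
       }

       /// Cached: O(1) lookup in the table built at construction.
       fn first_leaf_cached(&self, root: usize) -> Option<usize> {
           self.root_to_first_leaf.get(root).copied()
       }
   }
   ```

   For `ToySchema::new(&[1, 3, 2])`, both lookups return leaf 4 for root
   field 2; the cached variant pays one extra `usize` per root field in
   memory to avoid the scan.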
   
   ## Status
   
   - `cargo test -p parquet --features arrow --lib`: all 1086 pass
     (had to bump 4 `memory_size` test fixtures by 40 bytes for the
     `root_to_first_leaf` cache).
   - Microbench numbers from the DataFusion branch:
     `ArrowReaderMetadata::try_new` is ~190 ns/col linear (190 µs at
     1024 cols); the cache-aware `from_field_levels` + clone path is
     ~4 ns flat (~43000× faster at 1024 cols).
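
   The gap between the two paths can be pictured with a small std-only
   sketch, assuming the cached view is shared behind an `Arc` so that
   handing it out is a reference-count bump rather than a per-column walk.
   All names here (`FileMetadata`, `ArrowView`, `MetadataSource`,
   `PlainReader`, `CachingReader`) are hypothetical stand-ins, not the
   arrow-rs or DataFusion types.

   ```rust
   use std::sync::Arc;

   /// Hypothetical stand-in for parsed file metadata.
   struct FileMetadata {
       num_columns: usize,
   }

   /// Hypothetical stand-in for the derived per-file arrow view.
   #[derive(Clone)]
   struct ArrowView {
       fields: Arc<Vec<String>>,
   }

   impl ArrowView {
       /// O(N_columns) walk: builds one entry per column.
       fn derive(meta: &FileMetadata) -> Self {
           let fields = (0..meta.num_columns)
               .map(|i| format!("col_{}", i))
               .collect();
           Self { fields: Arc::new(fields) }
       }
   }

   /// Trait with a default method that re-derives the view on every call;
   /// implementors that cache derived state can override it.
   trait MetadataSource {
       fn metadata(&self) -> Arc<FileMetadata>;

       fn arrow_view(&self) -> ArrowView {
           ArrowView::derive(&self.metadata())
       }
   }

   /// Uses the default: every call pays the full walk.
   struct PlainReader {
       meta: Arc<FileMetadata>,
   }

   impl MetadataSource for PlainReader {
       fn metadata(&self) -> Arc<FileMetadata> {
           Arc::clone(&self.meta)
       }
   }

   /// Derives the view once, then hands out cheap clones.
   struct CachingReader {
       meta: Arc<FileMetadata>,
       view: ArrowView,
   }

   impl CachingReader {
       fn new(meta: Arc<FileMetadata>) -> Self {
           let view = ArrowView::derive(&meta);
           Self { meta, view }
       }
   }

   impl MetadataSource for CachingReader {
       fn metadata(&self) -> Arc<FileMetadata> {
           Arc::clone(&self.meta)
       }

       /// Override: cloning only bumps the `Arc` reference count.
       fn arrow_view(&self) -> ArrowView {
           self.view.clone()
       }
   }
   ```

   In this model the cost of `CachingReader::arrow_view` does not depend on
   `num_columns`, which is the shape the "flat" measurement above describes;
   the sketch makes no claim about the actual timings.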

