adriangb opened a new pull request, #9882: URL: https://github.com/apache/arrow-rs/pull/9882
> ⚠️ **This PR is not meant to be merged as-is.** It exists to make the arrow-rs side of a wide-schema-perf investigation visible from [apache/datafusion#21968](https://github.com/apache/datafusion/issues/21968); a more careful breakout into landable PRs is the next step. Posting as a draft so the diff is browseable. Companion DataFusion draft: https://github.com/apache/datafusion/pull/21987

## What this branch does

Adds primitives that let downstream readers (DataFusion in particular) avoid per-file O(N_columns) walks when opening many parquet files with wide schemas.

- `SchemaDescriptor::root_to_first_leaf`: a root-field → first-leaf-column map precomputed at construction, exposed via a new `root_first_leaf_index` accessor. `parquet_column` becomes O(1) instead of an O(N) scan over `columns()`. The TODO comment that called this out is removed.
- `StatisticsConverter::from_arrow_field`: a low-overhead constructor taking an already-resolved `(field, parquet_leaf_index)` pair, so callers building many converters against the same schemas (per-file statistics gathering) skip the redundant arrow + parquet name lookups inside `try_new`.
- `ArrowReaderMetadata::from_field_levels`: packages a precomputed `(metadata, schema, FieldLevels)` triple directly. Together with `parquet_to_arrow_schema_and_field_levels` (a new public helper that produces both in one walk), callers can cache the per-file arrow view and reuse it across reader builds for the same metadata.
- `AsyncFileReader::get_arrow_reader_metadata`: a new trait method with a default impl that delegates to `try_new` after `get_metadata`. `load_async` now goes through it, so reader implementations that cache derived state (e.g. DataFusion's metadata cache) can short-circuit the per-leaf walk.
- Public accessors on `ArrowReaderOptions` (`supplied_schema()` / `skip_arrow_metadata()` / `virtual_columns()`) so callers can decide whether their cached arrow view applies given the options.
## Status

- `cargo test -p parquet --features arrow --lib`: all 1086 tests pass (four `memory_size` test fixtures had to be bumped by 40 bytes for the `root_to_first_leaf` cache).
- Microbench numbers from the DataFusion branch: `ArrowReaderMetadata::try_new` is linear at ~190 ns/col (190 µs at 1024 cols); the cache-aware `from_field_levels` + clone path is ~4 ns flat (~43000× faster at 1024 cols).
