jecsand838 opened a new issue, #8923: URL: https://github.com/apache/arrow-rs/issues/8923
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

As work has progressed integrating `arrow-avro` into Apache DataFusion’s Avro datasource, we've come [across complications and potential blockers](https://github.com/apache/datafusion/pull/17861#issuecomment-3508537578) related to projection pushdown. Currently the `arrow-avro` `ReaderBuilder` handles projection via the [schema resolution](https://avro.apache.org/docs/1.11.1/specification/#schema-resolution) process; however, this forces callers to implement complex and redundant logic for what should be a seamless and simple operation. This projection logic involves deriving projected reader schemas from writer schemas/`SchemaRef`s and aligning schema metadata. Additional complexity falls on the caller if they need to do a one-off projection against a pre-defined reader schema stored externally for other schema resolution needs: in that scenario they'd need to pre-process this reader schema to drop fields while preserving the other intended schema resolution paths (promotion, renaming/reordering, etc.).

While the current `arrow-avro` implementation technically supports projection indirectly and the [DataFusion Avro Datasource](https://github.com/apache/datafusion/issues/14097) work shouldn't be blocked, all [potential solutions](https://github.com/apache/datafusion/pull/17861#issuecomment-3577323411) are unnecessarily complex for the caller and not ideal.

**Describe the solution you'd like**

I’d like `arrow-avro`’s `ReaderBuilder` to gain an explicit projection API that:

1. **Adds a projection method on `ReaderBuilder`**

   Something along the lines of:

   ```rust
   impl ReaderBuilder {
       /// Set the column projection for reading.
       ///
       /// `projection` is a list of zero-based top-level field indices
       /// (matching the Arrow/Avro field order).
       pub fn with_projection(self, projection: Vec<usize>) -> Self {
           // ...
       }
   }
   ```

   The semantics would mirror the CSV `ReaderBuilder::with_projection` API (which already uses a `Vec<usize>` of column indices).

2. **If no reader schema is provided, derive a reader schema from the writer schema using the projection.**

   When `reader_schema` is `None` and `projection` is `Some(indices)`, `ReaderBuilder::build` / `ReaderBuilder::build_decoder` would:

   - (`ReaderBuilder::build`): Read the OCF header and obtain the writer schema (as it already does).
   - (`ReaderBuilder::build_decoder`): Obtain the writer schemas from the provided writer schema registry.
   - Construct an Avro reader schema that:
     * (OCF) Uses the same root record name, namespace, and other Avro‑specific metadata from the writer schema via the existing `AVRO_NAME_METADATA_KEY`, `AVRO_NAMESPACE_METADATA_KEY`, etc.
     * (SOE) Does the same as OCF, with the added caveat of being compatible with each writer schema.
     * Keeps only the top-level fields indicated by the projection indices, in the projected order.
   - Use that derived Avro schema internally as `reader_schema` when constructing the `RecordDecoder` / `Reader`.

   This keeps all Avro-specific details anchored to the real writer schema from the OCF header (or the writer schemas from the writer schema registry) while still allowing efficient columnar projection of Avro-encoded data.

3. **If a reader schema *is* provided, prune it to match the projection.**

   When `reader_schema` is `Some(schema)` and `projection` is `Some(indices)`, `ReaderBuilder::build` / `ReaderBuilder::build_decoder` would:

   - Parse the provided reader schema into the internal Avro `Schema` representation.
   - Prune its top-level record fields so that only those referenced by the projection remain (preserving existing Avro metadata for those fields).
   - Use the pruned schema for decoding.

   This lets callers continue to specify a reader schema for evolution / type overrides, while still applying an additional column projection in a single place.
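To make the intended semantics of points 2 and 3 concrete, here is a minimal toy model of the pruning step. These structs are stand-ins, not `arrow-avro`'s internal `Schema` types: the sketch only illustrates that the root record name/namespace are preserved verbatim while the projected fields are kept in projection order.

```rust
/// Toy stand-in for a top-level Avro record field (not arrow-avro's type).
#[derive(Clone, Debug)]
struct Field {
    name: String,
    ty: String,
}

/// Toy stand-in for a root record schema. `name`/`namespace` model the
/// values carried by AVRO_NAME_METADATA_KEY / AVRO_NAMESPACE_METADATA_KEY.
#[derive(Clone, Debug)]
struct RecordSchema {
    name: String,
    namespace: Option<String>,
    fields: Vec<Field>,
}

/// Derive a projected reader schema: the root naming is copied unchanged,
/// and only the fields at `projection` indices survive, in that order.
fn project_schema(writer: &RecordSchema, projection: &[usize]) -> RecordSchema {
    RecordSchema {
        name: writer.name.clone(),
        namespace: writer.namespace.clone(),
        fields: projection.iter().map(|&i| writer.fields[i].clone()).collect(),
    }
}

fn main() {
    let writer = RecordSchema {
        name: "topLevelRecord".to_string(),
        namespace: Some("com.example".to_string()),
        fields: vec![
            Field { name: "id".to_string(), ty: "long".to_string() },
            Field { name: "name".to_string(), ty: "string".to_string() },
            Field { name: "email".to_string(), ty: "string".to_string() },
        ],
    };
    // Project columns 2 and 0, i.e. (email, id) in that order.
    let reader = project_schema(&writer, &[2, 0]);
    assert_eq!(reader.name, "topLevelRecord");
    let names: Vec<&str> = reader.fields.iter().map(|f| f.name.as_str()).collect();
    assert_eq!(names, vec!["email", "id"]);
    println!("{:?}", names);
}
```

The real implementation would of course operate on the parsed Avro `Schema` and also carry per-field metadata through, but the invariant is the same: naming information never comes from the projection, only from the writer (or supplied reader) schema.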
The projection is always interpreted relative to the Arrow/Avro top-level field ordering (consistent with other Arrow readers). The OCF and SOE paths would essentially be the same.

4. **Scope and compatibility**

   - If `projection` is `None`, behavior remains exactly as today, preserving backward compatibility.
   - The goal is to centralize Avro‑aware projection logic in `arrow-avro`, so downstream projects (like DataFusion) don’t need to reimplement Avro schema editing or risk diverging from `arrow-avro`’s resolution rules.

**Describe alternatives you've considered**

1. **Constructing a projected Avro reader schema in DataFusion itself.**

   DataFusion could try to:
   * Persist the original Avro schema JSON alongside the Arrow schema,
   * Apply projection to that JSON in its own code (i.e., pruning fields), and
   * Pass the result to `ReaderBuilder::with_reader_schema`.

   However, this duplicates logic that already conceptually belongs to `arrow-avro` (schema resolution + projection). It would be easy for a downstream implementation to diverge subtly from `arrow-avro`’s behavior, especially around name resolution and metadata like `avro.name` and `avro.namespace`.

2. **Reapplying Avro metadata (record name/namespace) after using `AvroSchema::try_from(ArrowSchema)`.**

   Another option is to let DataFusion continue converting its projected Arrow schema back into an `AvroSchema`, then manually patch in the correct root record name and namespace via metadata constants like `AVRO_NAME_METADATA_KEY` and `AVRO_NAMESPACE_METADATA_KEY`. This is fragile for a couple of reasons:
   * It requires DataFusion to either track and re-inject the original writer schema’s naming information or reconstruct it heuristically.
   * It still doesn’t leverage any internal `arrow-avro` machinery for resolving writer vs reader schemas when projection is involved.

3. **Not using projection at the Avro level at all.**

   DataFusion could simply read all fields from OCF into Arrow and then project at the Arrow layer.
   While simple, this is undesirable for:
   * Wide OCF schemas where only a small subset of columns is needed.
   * Workloads where I/O and decode costs are a bottleneck; projection pushdown is a key optimization that other Arrow readers (e.g., CSV, Parquet) already expose via `ReaderBuilder`.

Overall, these alternatives either duplicate `arrow-avro` logic downstream or fail to provide the performance characteristics that DataFusion and similar engines need. While one of these solutions may be acceptable for the short term (until the next `arrow-rs` major release), we want to ensure `arrow-avro` provides a long-term and optimized path forward.

**Additional context**

* This feature is directly motivated by the ongoing effort to switch DataFusion’s Avro datasource over to the upstream `arrow-avro` reader. That PR currently needs to juggle preserving the original Avro root record name with applying projection, and the discussion there explicitly calls out the need for a `ReaderBuilder::with_projection`-style API on `arrow-avro`.
* The broader effort to adopt `arrow-avro` in DataFusion is tracked under "Use arrow-avro for performance and improved type support" (`apache/datafusion#14097`).
* Other Arrow Rust readers such as `arrow_csv::reader::ReaderBuilder` already expose a `with_projection(Vec<usize>)` method, so adding an analogous method to `arrow_avro::reader::ReaderBuilder` would make the API more consistent across formats and easier to adopt in query engines that already rely on projection pushdown for CSV and Parquet.
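For concreteness, the metadata re-injection that alternative 2 above would require of the caller can be sketched as follows. This is a hypothetical helper, not `arrow-avro` API; the literal `"avro.name"` / `"avro.namespace"` strings stand in for the values behind the metadata key constants mentioned in the issue, and the fragility is visible in the signature: the caller must already have tracked the writer's root naming to call it at all.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of alternative 2: after converting a projected Arrow
/// schema back into an Avro schema, patch the original writer's root record
/// name and namespace back into the schema-level metadata map.
fn reinject_root_naming(
    mut metadata: HashMap<String, String>,
    writer_name: &str,
    writer_namespace: &str,
) -> HashMap<String, String> {
    // These keys mirror what AVRO_NAME_METADATA_KEY / AVRO_NAMESPACE_METADATA_KEY
    // refer to conceptually; the caller is responsible for supplying values
    // that match the writer schema, which is exactly what makes this fragile.
    metadata.insert("avro.name".to_string(), writer_name.to_string());
    metadata.insert("avro.namespace".to_string(), writer_namespace.to_string());
    metadata
}

fn main() {
    let patched = reinject_root_naming(HashMap::new(), "topLevelRecord", "com.example");
    assert_eq!(patched.get("avro.name").map(String::as_str), Some("topLevelRecord"));
    assert_eq!(patched.get("avro.namespace").map(String::as_str), Some("com.example"));
    println!("{} metadata entries patched", patched.len());
}
```

A `with_projection` API inside `arrow-avro` would make this kind of caller-side patching unnecessary, since the builder would carry the writer's naming through the derivation itself.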
