jecsand838 opened a new issue, #8923:
URL: https://github.com/apache/arrow-rs/issues/8923

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   As work has progressed integrating `arrow-avro` into Apache DataFusion’s 
Avro datasource, we've come [across complications and potential 
blockers](https://github.com/apache/datafusion/pull/17861#issuecomment-3508537578)
 related to projection pushdown.
   
   Currently the `arrow-avro` `ReaderBuilder` handles projection via the [schema resolution](https://avro.apache.org/docs/1.11.1/specification/#schema-resolution) process. However, this forces callers to implement complex, redundant logic for what should be a seamless and simple operation: deriving projection reader schemas from writer schemas/`SchemaRef`s and aligning schema metadata. Additional complexity falls on the caller when they need to apply a one-off projection against a pre-defined reader schema stored externally for other schema resolution needs. In that scenario they would need to pre-process this reader schema to drop fields while preserving the other intended schema resolution paths (promotion, renaming/reordering, etc.).
   
   While the current `arrow-avro` implementation technically supports projection indirectly, and the [DataFusion Avro Datasource](https://github.com/apache/datafusion/issues/14097) work shouldn't be blocked, all [potential solutions](https://github.com/apache/datafusion/pull/17861#issuecomment-3577323411) are unnecessarily complex for the caller.
   
   **Describe the solution you'd like**
   
   I’d like `arrow-avro`’s `ReaderBuilder` to gain an explicit projection API 
that:
   
   1. **Adds a projection method on `ReaderBuilder`**
      Something along the lines of:
      ```rust
      impl ReaderBuilder {
          /// Set the column projection for reading.
          ///
          /// `projection` is a list of zero-based top-level field indices
          /// (matching the Arrow/Avro field order).
          pub fn with_projection(self, projection: Vec<usize>) -> Self { ... }
      }
      ```
      The semantics would mirror the CSV `ReaderBuilder::with_projection` API 
(which already uses a `Vec<usize>` of column indices).
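      As a self-contained illustration of the intended builder semantics (the `ReaderBuilder` below is a minimal hypothetical stand-in for this sketch, not the actual `arrow_avro::reader::ReaderBuilder`):
      ```rust
      /// Hypothetical stand-in used only to illustrate the proposed semantics.
      #[derive(Default, Debug)]
      struct ReaderBuilder {
          projection: Option<Vec<usize>>,
      }

      impl ReaderBuilder {
          /// Set the column projection: zero-based top-level field indices,
          /// in the order the columns should appear in the output.
          /// `None` (the default) preserves today's read-everything behavior.
          fn with_projection(mut self, projection: Vec<usize>) -> Self {
              self.projection = Some(projection);
              self
          }
      }

      fn main() {
          // Read only columns 2 and 0, in that order.
          let builder = ReaderBuilder::default().with_projection(vec![2, 0]);
          assert_eq!(builder.projection, Some(vec![2, 0]));
      }
      ```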
   2. **If no reader schema is provided, derive a reader schema from the writer 
schema using the projection.**
      When `reader_schema` is `None` and `projection` is `Some(indices)`, 
`ReaderBuilder::build` / `ReaderBuilder::build_decoder` would:
      - (`ReaderBuilder::build`): Read the OCF header and obtain the writer 
schema (as it already does).
      - (`ReaderBuilder::build_decoder`): Obtain the writer schemas from the 
provided writer schema registry.
      - Construct an Avro reader schema that:
          * (OCF) Uses the same root record name, namespace, and other 
Avro‑specific metadata from the writer schema via the existing 
`AVRO_NAME_METADATA_KEY`, `AVRO_NAMESPACE_METADATA_KEY`, etc.
          * (SOE) Similar to OCF, with the added caveat that the derived schema must be compatible with each writer schema.
          * Keeps only the top-level fields indicated by the projection 
indices, in the projected order.
      - Use that derived Avro schema internally as `reader_schema` when 
constructing the `RecordDecoder` / `Reader`.
      
       This keeps all Avro-specific details anchored to the real writer schema 
from the OCF header or writer schemas from the writer schema registry while 
still allowing efficient columnar projection for Avro encoded data.
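      The derivation step above can be sketched as follows (the types here are simplified hypothetical stand-ins for `arrow-avro`'s internal Avro schema representation; only the field-selection logic is the point): keep the root record's name and namespace from the writer schema, and select only the projected top-level fields, in projected order.
      ```rust
      // Hypothetical, simplified model of an Avro record schema for this sketch.
      #[derive(Clone, Debug, PartialEq)]
      struct AvroField {
          name: String,
          ty: String, // simplified: the field's Avro type as a string
      }

      #[derive(Clone, Debug, PartialEq)]
      struct AvroRecordSchema {
          name: String,              // carried over (cf. AVRO_NAME_METADATA_KEY)
          namespace: Option<String>, // carried over (cf. AVRO_NAMESPACE_METADATA_KEY)
          fields: Vec<AvroField>,
      }

      /// Derive a projected reader schema from the writer schema: preserve the
      /// root record's name/namespace, keep only the top-level fields at the
      /// given indices, in the projected order.
      fn derive_reader_schema(writer: &AvroRecordSchema, projection: &[usize]) -> AvroRecordSchema {
          AvroRecordSchema {
              name: writer.name.clone(),
              namespace: writer.namespace.clone(),
              fields: projection
                  .iter()
                  .map(|&i| writer.fields[i].clone())
                  .collect(),
          }
      }

      fn main() {
          let writer = AvroRecordSchema {
              name: "topLevelRecord".to_string(),
              namespace: Some("org.example".to_string()),
              fields: vec![
                  AvroField { name: "a".into(), ty: "long".into() },
                  AvroField { name: "b".into(), ty: "string".into() },
                  AvroField { name: "c".into(), ty: "double".into() },
              ],
          };
          // Project columns 2 and 0, in that order.
          let reader = derive_reader_schema(&writer, &[2, 0]);
          assert_eq!(reader.name, "topLevelRecord");
          let names: Vec<_> = reader.fields.iter().map(|f| f.name.as_str()).collect();
          assert_eq!(names, vec!["c", "a"]);
      }
      ```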
   3. **If a reader schema *is* provided, prune it to match the projection.**
      When `reader_schema` is `Some(schema)` and `projection` is 
`Some(indices)`, `ReaderBuilder::build` / `ReaderBuilder::build_decoder` would:
      - Parse the provided reader schema into the internal Avro `Schema` 
representation.
      - Prune its top-level record fields so that only those referenced by the 
projection remain (preserving existing Avro metadata for those fields).
      - Use the pruned schema for decoding.
      
      This lets callers continue to specify a reader schema for evolution / type overrides, while still applying an additional column projection in a single place. The projection is always interpreted relative to the Arrow/Avro top-level field ordering (consistent with other Arrow readers). The OCF and SOE paths would behave essentially the same.
   4. **Scope and compatibility**
      - If `projection` is `None`, behavior remains exactly as today, 
preserving backward compatibility.
      - The goal is to centralize Avro‑aware projection logic in `arrow-avro`, 
so downstream projects (like DataFusion) don’t need to reimplement Avro schema 
editing or risk diverging from `arrow-avro`’s resolution rules.
   
   **Describe alternatives you've considered**
   
   1. **Constructing a projected Avro reader schema in DataFusion itself.**
    
       DataFusion could try to:
       * Persist the original Avro schema JSON alongside the Arrow schema,
       * Apply projection to that JSON in its own code (i.e., pruning fields), 
and
       * Pass the result to `ReaderBuilder::with_reader_schema`.
   
       However, this duplicates logic that already conceptually belongs to 
`arrow-avro` (schema resolution + projection). It would be easy for a 
downstream implementation to diverge subtly from `arrow-avro`’s behavior, 
especially around name resolution and metadata like `avro.name` and 
`avro.namespace`.
   
   2. **Reapplying Avro metadata (record name/namespace) after using 
`AvroSchema::try_from(ArrowSchema)`.**
   
       Another option is to let DataFusion continue converting its projected 
Arrow schema back into an `AvroSchema`, then manually patch in the correct root 
record name and namespace via metadata constants like `AVRO_NAME_METADATA_KEY` 
and `AVRO_NAMESPACE_METADATA_KEY`.
   
       This is fragile for a couple of reasons:
       * It requires DataFusion to either track and re-inject the original 
writer schema’s naming information or reconstruct it heuristically.
       * It still doesn’t leverage any internal `arrow-avro` machinery for 
resolving writer vs reader schemas when projection is involved.
   
   3. **Not using projection at the Avro level at all.**
    
       DataFusion could simply read all fields from OCF into Arrow and then 
project at the Arrow layer. While simple, this is undesirable for:
       * Wide OCF schemas where only a small subset of columns is needed.
       * Workloads where I/O and decode costs are a bottleneck; projection pushdown is a key optimization that other Arrow readers (e.g., CSV, Parquet) already expose via `ReaderBuilder`.
   
   Overall, these alternatives either duplicate `arrow-avro` logic downstream or fail to provide the performance characteristics that DataFusion and similar engines need. While one of these solutions may be acceptable in the short term (until the next `arrow-rs` major release), we want to ensure `arrow-avro` provides a long-term, optimized path forward.
   
   **Additional context**
   
   * This feature is directly motivated by the ongoing effort to switch DataFusion’s Avro datasource over to the upstream `arrow-avro` reader. That PR currently has to juggle preserving the original Avro root record name with applying projection, and the discussion there explicitly calls out the need for a `ReaderBuilder::with_projection`-style API on `arrow-avro`.
   * The broader effort to adopt `arrow-avro` in DataFusion is tracked under 
"Use arrow-avro for performance and improved type support" 
(`apache/datafusion#14097`).
   * Other Arrow Rust readers such as `arrow_csv::reader::ReaderBuilder` 
already expose a `with_projection(Vec<usize>)` method, so adding an analogous 
method to `arrow_avro::reader::ReaderBuilder` would make the API more 
consistent across formats and easier to adopt in query engines that already 
rely on projection pushdown for CSV and Parquet.

