[PR] Refactor arrow-avro writer to introduce unified `RecordEncoder` and s… [arrow-rs]

via GitHub Tue, 02 Sep 2025 21:18:14 -0700


jecsand838 opened a new pull request, #8274:
URL: https://github.com/apache/arrow-rs/pull/8274


   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/4886
   
   # Rationale for this change
   
   This refactor streamlines the `arrow-avro` writer by introducing a single, 
schema‑driven `RecordEncoder` that plans writes up front and encodes rows using 
consistent, explicit rules for nullability and type dispatch. It reduces 
duplication in nested/struct/list handling, makes the order of Avro union 
branches (null‑first vs null‑second) an explicit choice, and aligns header 
schema generation with value encoding. 
   
   This should improve correctness (especially for nested optionals), make 
behavior easier to reason about, and pave the way for future optimizations. 
   
   # What changes are included in this PR?
   
   **High‑level:**
   
   * Introduces a unified, schema‑driven `RecordEncoder` with a builder that 
walks the Avro record in Avro order and maps each field to its Arrow column, 
producing a reusable write plan. The encoder covers scalars, strings/binaries, 
and nested types (structs, (large) lists, maps).
   * Applies a single model of **nullability** throughout encoding, including 
nested sites (list items, fixed‑size list items, map values), and uses explicit 
union‑branch indices according to the chosen order.
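   
The planning step in the first bullet can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `plan_top_level` and `reorder_row` are hypothetical names, matching fields to columns by name is an assumption, and the real `FieldPlan` tree also covers nested fields and per-site nullability.

```rust
/// Hypothetical sketch: resolve each Avro field (in Avro order) to the index
/// of the Arrow column with the same name. Matching by name is an assumption
/// made for this illustration.
fn plan_top_level(avro_fields: &[&str], arrow_columns: &[&str]) -> Result<Vec<usize>, String> {
    avro_fields
        .iter()
        .map(|name| {
            arrow_columns
                .iter()
                .position(|column| column == name)
                .ok_or_else(|| format!("Avro field '{name}' has no matching Arrow column"))
        })
        .collect()
}

/// Reuse the plan for every row: emit the row's values in Avro field order.
fn reorder_row<T: Clone>(plan: &[usize], row: &[T]) -> Vec<T> {
    plan.iter().map(|&index| row[index].clone()).collect()
}
```

The plan is computed once per schema pair and then applied per row, so the name lookup does not have to be repeated while encoding.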
   
   **API and implementation details:**
   
   * **Writer / encoder refactor**
   
     * Replaces the previous per‑column/child encoding paths with a 
**`FieldPlan`** tree (variants for `Scalar`, `Struct { … }`, and `List { … }`) 
and per‑site `nullability` carried from the Avro schema.
     * Adds encoder variants for `LargeBinary`, `Utf8`, `Utf8Large`, `List`, 
`LargeList`, and `Struct`. 
     * Encodes union branch indices with `write_optional_index`, which writes 
the zigzag‑encoded branch index (`0x00` for branch 0, `0x02` for branch 1) 
according to the Null‑First/Null‑Second order, replacing the old branch write.
   
   * **Schema generation & metadata**
   
     * Moves the **`Nullability`** enum to `schema.rs` and threads it through 
schema generation and writer logic.
     * Adds `AvroSchema::from_arrow_with_options(schema, Option<Nullability>)` 
to either reuse embedded Avro JSON or build new Avro JSON that **honors the 
requested null‑union order at all nullable sites**. 
     * Adds `extend_with_passthrough_metadata` so Arrow schema metadata is 
copied into Avro JSON while skipping Avro‑reserved and internal Arrow keys.
     * Introduces helpers like `wrap_nullable` and 
`arrow_field_to_avro_with_order` to apply ordering consistently for arrays, 
fixed‑size lists, maps, structs, and unions. 
    
   * **Format and glue**
   
     * Simplifies `writer/format.rs` by removing the `EncoderOptions` plumbing 
from the OCF format; `write_long` remains exported for header writing.
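   
The union-branch rule mentioned for `write_optional_index` can be illustrated with a minimal sketch. The names below (`NullOrder`, `zigzag_varint`, `write_branch`, `encode_optional_long`) are hypothetical stand-ins, not the crate's API; only the Avro binary rules themselves (zigzag varint longs, a branch index before the union value, zero bytes for `null`) come from the Avro specification.

```rust
/// Hypothetical stand-in for the PR's `Nullability` enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NullOrder {
    NullFirst,  // union is ["null", T]
    NullSecond, // union is [T, "null"]
}

/// Avro `long` encoding: zigzag first, then a base-128 varint.
fn zigzag_varint(value: i64, out: &mut Vec<u8>) {
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    while z >= 0x80 {
        out.push((z as u8 & 0x7F) | 0x80); // low 7 bits plus continuation bit
        z >>= 7;
    }
    out.push(z as u8);
}

/// Write the union branch index for a nullable site.
/// Branch 0 encodes as 0x00 and branch 1 as 0x02.
fn write_branch(is_null: bool, order: NullOrder, out: &mut Vec<u8>) {
    let null_index: i64 = match order {
        NullOrder::NullFirst => 0,
        NullOrder::NullSecond => 1,
    };
    let index = if is_null { null_index } else { 1 - null_index };
    zigzag_varint(index, out);
}

/// Encode an optional long: branch index first, then the value if present.
/// A null value contributes no bytes after the branch index.
fn encode_optional_long(value: Option<i64>, order: NullOrder, out: &mut Vec<u8>) {
    write_branch(value.is_none(), order, out);
    if let Some(v) = value {
        zigzag_varint(v, out);
    }
}
```

With this sketch, `Some(1)` encodes as `02 02` under null-first and `00 02` under null-second, which is why the header schema and the value encoder have to agree on the order.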
   
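On the schema side, honoring the null-union order at every nullable site might look like the sketch below. It builds Avro JSON as plain strings to stay dependency-free; `wrap_nullable_sketch` and `nullable_array_of` are hypothetical helpers written for this illustration and are not the signatures of the PR's `wrap_nullable` or `arrow_field_to_avro_with_order`.

```rust
/// Hypothetical stand-in for the PR's `Nullability` enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NullOrder {
    NullFirst,
    NullSecond,
}

/// Wrap an Avro type (given as JSON text) in a two-branch union with "null",
/// placing "null" according to the requested order.
fn wrap_nullable_sketch(inner_json: &str, order: NullOrder) -> String {
    match order {
        NullOrder::NullFirst => format!("[\"null\",{inner_json}]"),
        NullOrder::NullSecond => format!("[{inner_json},\"null\"]"),
    }
}

/// A nullable array of nullable items: the same order is applied at both
/// the item site and the outer field site.
fn nullable_array_of(items_json: &str, order: NullOrder) -> String {
    let items = wrap_nullable_sketch(items_json, order);
    let array = format!("{{\"type\":\"array\",\"items\":{items}}}");
    wrap_nullable_sketch(&array, order)
}
```

Applying one order recursively keeps the branch indices written by the encoder consistent with the unions declared in the header schema.
   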
   # Are these changes tested?
   
   Yes.
   
   * Adds focused unit tests in `writer/encoder.rs` that verify scalar and 
string/binary encodings (e.g., Binary/LargeBinary, Utf8/LargeUtf8) and validate 
length/branch encoding primitives used by the writer. 
   * Adds round‑trip integration tests in `writer/mod.rs` that validate List 
and Struct decoding.
   * Adjusts existing schema tests (e.g., decimal metadata expectations) to 
align with the new schema/metadata handling. 
   
   # Are there any user-facing changes?
   
   N/A, because `arrow-avro` is not public yet.


