jecsand838 opened a new issue, #9290:
URL: https://github.com/apache/arrow-rs/issues/9290

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   `arrow-avro` currently cannot encode/decode a number of Arrow `DataType`s, 
and some types have schema/encoding mismatches that can lead to incorrect data 
(even when encoding succeeds).
   
   The goal is:
   
   * **No more `ArrowError::NotYetImplemented` (or similar) when 
writing/reading an Arrow `RecordBatch` containing supported Arrow types**, 
excluding **Sparse Unions** (will be handled separately).
   * **When compiled with `feature = "avro_custom_types"`:** Arrow to Avro to 
Arrow should **round-trip the Arrow `DataType`** (including 
width/signedness/time units and relevant metadata) using **Arrow-specific custom 
logical types** that follow the established `arrow.*` pattern.
   * **When compiled without `avro_custom_types`:** Arrow types should be 
encoded to the **closest standard Avro primitive / logical type**, with any 
necessary lossy conversions documented and consistently applied.
   
   **Arrow `DataType` vs `arrow-avro` code paths**
   
   #### 1) Types mapped in `schema.rs` but not actually writable (Writer errors)
   `arrow-avro/src/schema.rs` already generates Avro schema for several Arrow 
types, but `arrow-avro/src/writer/encoder.rs` doesn’t implement the 
corresponding encoders, so writing fails.
   
   Examples:
   * `DataType::Int8`, `Int16`
   * `DataType::UInt8`, `UInt16`, `UInt32`, `UInt64`
   * `DataType::Float16`
   * `DataType::Date64` (**note:** `schema.rs` maps this to Avro `long` with 
the standard logical type `local-timestamp-millis`)
   * `DataType::Time64(TimeUnit::Nanosecond)`
   
   Concrete failures observed in code:
   * `encoder.rs` falls back to:
     * `Err(ArrowError::NotYetImplemented(format!("Avro scalar type not yet 
supported: {other:?}")))`
     for many of the above (e.g. Int8/Int16/UInt*/Float16).
   * `Date64` is explicitly rejected in the writer with:
     * `NotYetImplemented("Avro logical type 'date' is days since epoch (int). 
Arrow Date64 (ms) has no direct Avro logical type; cast to Date32 or to a 
Timestamp.")`
     This is inconsistent with `schema.rs`, which currently emits 
`local-timestamp-millis` for `Date64`.
   * `Time64(Nanosecond)` is explicitly rejected with:
     * `NotYetImplemented("Avro writer does not support time-nanos; cast to 
Time64(Microsecond).")`
   
   This shows up immediately in common scenarios (e.g. encoding a `RecordBatch` 
with an `Int16` column).
   
   #### 2) Schema/encoding mismatches (Writer produces Avro that doesn’t match 
its own schema)
   There are also cases where **the schema generated doesn’t match the encoding 
behavior**:
   
   * **Interval(YearMonth) / Interval(DayTime)**  
     * `schema.rs` maps these to `long` with `arrowIntervalUnit=...`
     * `encoder.rs` encodes **all** Arrow `Interval` variants using an **Avro 
`duration` fixed(12)** writer (`DurationEncoder`)
     * This is **not just a metadata mismatch**: Avro `long` values are zig-zag 
varint encoded, while Avro `fixed(12)` is always 12 raw bytes. Writing 
`fixed(12)` bytes under a `long` schema can produce **invalid / unreadable 
Avro** and corrupt the stream.
   
   * **Time32(Second)** and **Timestamp(Second)**  
     * `schema.rs` writes these as `int`/`long` with `arrowTimeUnit="second"` 
(no standard Avro logicalType)
     * `encoder.rs` converts seconds to **milliseconds** on write 
(`Time32SecondsToMillisEncoder`, `TimestampSecondsToMillisEncoder`)
     * That yields data in millis while the schema/metadata indicate seconds
   
   These mismatches are bigger than “no round-trip” — they can lead to 
incorrect interpretation by readers (including `arrow-avro`).
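   The framing problem can be seen in a small sketch of the wire format. This is not arrow-avro's implementation, just a standalone illustration of Avro's zig-zag + varint encoding for `long`, per the Avro spec; an Avro `duration` is instead always 12 raw bytes:

```rust
/// Zig-zag + varint encoding of an Avro `long` (standalone sketch, per the
/// Avro binary-encoding spec; not arrow-avro's actual encoder).
fn encode_avro_long(value: i64) -> Vec<u8> {
    // Zig-zag: map signed to unsigned so small magnitudes stay small.
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    let mut out = Vec::new();
    // Varint: 7 payload bits per byte; high bit set on all but the last byte.
    while n >= 0x80 {
        out.push((n as u8 & 0x7f) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
    out
}
```

   A `long` value like `-1` encodes to the single byte `0x01`, so a reader that expects a `long` but receives 12 raw `fixed(12)` bytes will treat high bits as varint continuation markers, consume the wrong number of bytes, and lose framing for the rest of the record.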
   
   #### 3) Reader does not currently round-trip most Arrow-specific types (even 
when metadata exists)
   The reader currently only maps a small set of Arrow-specific types via 
`avro_custom_types` (notably Duration* and RunEndEncoded). For many other 
Arrow-specific or width-specific types there is no path back: their Avro 
representations decode to the default Arrow type for the underlying Avro 
primitive:
   
   * Avro `int` always becomes Arrow `Int32` (no path back to 
`Int8`/`Int16`/`UInt8`/`UInt16`)
   * Avro `long` always becomes Arrow `Int64` or one of the timestamp/time 
logical types (no path back to `UInt32`/`UInt64`)
   * There is no Float16 decode path
   
   Additionally, `codec.rs` contains a special-case that treats `Int64` + 
`arrowTimeUnit == "nanosecond"` as `TimestampNanos(false)`. This would conflict 
with `schema.rs`’s current representation for `Time64(Nanosecond)` (which also 
uses `arrowTimeUnit="nanosecond"`), i.e. even if writer support were added, the 
reader-side mapping would likely misinterpret it as a timestamp unless this is 
corrected.
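   One way to remove this ambiguity is to key the reader on distinct `arrow.*` logical-type names rather than a shared `arrowTimeUnit` attribute. A hedged sketch only; the enum and name strings are illustrative, not the current `codec.rs` API:

```rust
/// Illustrative only: distinct logical-type names disambiguate a `long`
/// carrying nanosecond timestamps from one carrying Time64(Nanosecond).
#[derive(Debug, PartialEq)]
enum LongNanoMapping {
    TimestampNanos,
    Time64Nanos,
    PlainInt64,
}

fn classify_long(logical_type: Option<&str>) -> LongNanoMapping {
    match logical_type {
        Some("arrow.timestamp-nanosecond") => LongNanoMapping::TimestampNanos,
        Some("arrow.time64-nanosecond") => LongNanoMapping::Time64Nanos,
        _ => LongNanoMapping::PlainInt64,
    }
}
```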
   
   **Describe the solution you'd like**
   
   Implement missing Arrow to Avro support in a way that is consistent across:
   
   * schema generation (`arrow-avro/src/schema.rs`)
   * encoding (`arrow-avro/src/writer/encoder.rs`)
   * schema to codec mapping (`arrow-avro/src/codec.rs`)
   * decoding/builders (`arrow-avro/src/reader/record.rs`)
   
   ...and that is explicitly aligned with `avro_custom_types`.
   
   **A. Add Writer/Reader support for remaining Arrow primitive 
width/signedness types**
   
   At minimum, remove “type-level” NotYetImplemented errors by adding Writer 
encoders and Reader support for:
   
   * `Int8`, `Int16`
   * `UInt8`, `UInt16`, `UInt32`, `UInt64`
   * `Float16`
   
   Proposed mapping behavior:
   
   | Arrow `DataType` | `avro_custom_types` **OFF** (interop-first) | 
`avro_custom_types` **ON** (round-trip) |
   |---|---|---|
   | Int8 / Int16 | encode as Avro `int` (widen) | encode as Avro `int` with 
`logicalType: "arrow.int8"` / `"arrow.int16"` |
   | UInt8 / UInt16 | encode as Avro `int` (as i32) | Avro `int` with 
`logicalType: "arrow.uint8"` / `"arrow.uint16"` |
   | UInt32 | encode as Avro `long` (as i64) | Avro `long` with `logicalType: 
"arrow.uint32"` |
   | UInt64 | encode as Avro `long` *when value fits i64*, otherwise error (or 
documented fallback) | encode with a round-trippable representation (e.g. Avro 
`fixed(8)` or `bytes`) + `logicalType: "arrow.uint64"` |
   | Float16 | encode as Avro `float` (f32) | round-trip via `logicalType: 
"arrow.float16"` + representation (e.g. store IEEE-754 f16 bits in `fixed(2)` 
or `int`) |
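   For the two cells that need more than plain widening, a sketch of the proposed UInt64 behavior (function names are illustrative, not arrow-avro's API):

```rust
/// avro_custom_types OFF: widen UInt64 to Avro `long` only when it fits i64,
/// otherwise surface a clear error instead of silently wrapping.
fn uint64_to_avro_long(v: u64) -> Result<i64, String> {
    i64::try_from(v).map_err(|_| format!("UInt64 value {v} does not fit Avro long"))
}

/// avro_custom_types ON: a round-trippable fixed(8) big-endian representation
/// carrying the full u64 range alongside `logicalType: "arrow.uint64"`.
fn uint64_to_fixed8(v: u64) -> [u8; 8] {
    v.to_be_bytes()
}

fn fixed8_to_uint64(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}
```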
   
   **B. Complete date/time/timestamp support (and align schema + encoding)**
   
   Implement:
   
   * `Date64` (currently `schema.rs` emits `long` + `local-timestamp-millis`, 
but the writer errors)
   * `Time64(Nanosecond)` (schema exists but writer errors; reader must not 
confuse with timestamps)
   * Align `Time32(Second)` / `Timestamp(Second)` schema + encoding
   
   Proposed behavior:
   
   | Arrow `DataType` | `avro_custom_types` **OFF** | `avro_custom_types` 
**ON** |
   |---|---|---|
   | Date64 | encode as `long` + **`logicalType: "local-timestamp-millis"`** 
(this is what `schema.rs` already emits today; implement writer accordingly) | 
`long` + `logicalType: "arrow.date64"` (ms since epoch, round-trip to Date64) |
   | Time32(Second) | encode as `time-millis` (int) by scaling seconds to 
millis | `int` + `logicalType: "arrow.time32-second"` and store raw seconds |
   | Time64(Nanosecond) | encode as `time-micros` (long) with scaling nanos to 
micros (document/truncate or require divisible by 1000) | `long` + 
`logicalType: "arrow.time64-nanosecond"` and store raw nanos |
   | Timestamp(Second, tz?) | encode as `timestamp-millis` / 
`local-timestamp-millis` with scaling seconds to millis | `long` + 
`logicalType: "arrow.timestamp-second"` (and preserve tz/local-ness as needed) |
   
   Notes:
   * Arrow’s `Date64` spec treats values as “days in milliseconds” (evenly 
divisible by `86_400_000`). `arrow-rs` does not enforce this at runtime, so the 
writer should either document behavior for non-conforming values or optionally 
validate when writing.
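   The scaling and validation rules above can be sketched as follows. Names are illustrative, and the reject-on-truncation policy for nanos is one of the options the table leaves open (the other being documented truncation):

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Time64(Nanosecond) -> time-micros, rejecting values that would truncate.
fn time64_nanos_to_micros(nanos: i64) -> Result<i64, String> {
    if nanos % 1_000 != 0 {
        return Err(format!("{nanos} ns not divisible by 1000; would truncate"));
    }
    Ok(nanos / 1_000)
}

/// Optional Date64 validation: the Arrow spec expects whole days in millis.
fn date64_is_whole_days(ms: i64) -> bool {
    ms % MILLIS_PER_DAY == 0
}
```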
   
   **C. Implement Interval handling and define custom logical types for 
round-trip**
   
   Current `Interval` behavior is not consistent between schema and encoding 
for YearMonth/DayTime, and Avro’s standard `duration` can’t represent Arrow’s 
full interval semantics (negative values; MonthDayNano nanos).
   
   Proposed approach:
   
   * With `avro_custom_types` **ON**: implement custom logical types for each 
Arrow interval unit that can round-trip (including negative and nanos):
     * `arrow.interval-year-month`
     * `arrow.interval-day-time`
     * `arrow.interval-month-day-nano` (should preserve i32 months, i32 days, 
i64 nanos)
   
   * With `avro_custom_types` **OFF**: encode to the closest standard Avro 
representation:
     * Prefer Avro `duration` where feasible (and document constraints: 
unsigned months/days/millis; millis precision)
     * For values not representable (negative, sub-millis nanos), return a 
clear error suggesting enabling `avro_custom_types` or casting
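   For the `arrow.interval-month-day-nano` case, one round-trippable layout is to pack the three components into a `fixed(16)`, preserving sign and nanosecond precision that Avro `duration` cannot carry. The byte layout and names below are a sketch, not a settled design:

```rust
/// Pack Interval(MonthDayNano) into 16 little-endian bytes:
/// i32 months | i32 days | i64 nanos. Negative values survive round-trip.
fn pack_month_day_nano(months: i32, days: i32, nanos: i64) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[0..4].copy_from_slice(&months.to_le_bytes());
    buf[4..8].copy_from_slice(&days.to_le_bytes());
    buf[8..16].copy_from_slice(&nanos.to_le_bytes());
    buf
}

fn unpack_month_day_nano(buf: &[u8; 16]) -> (i32, i32, i64) {
    (
        i32::from_le_bytes(buf[0..4].try_into().unwrap()),
        i32::from_le_bytes(buf[4..8].try_into().unwrap()),
        i64::from_le_bytes(buf[8..16].try_into().unwrap()),
    )
}
```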
   
   **D. Add coverage tests (type-level and round-trip)**
   Add tests ensuring:
   
   1) **Writer** can encode `RecordBatch` for each Arrow type above (no 
`NotYetImplemented`)  
   2) With `avro_custom_types` **ON**: Arrow to Avro to Arrow schema equality 
for these types (excluding Sparse union)  
   3) With `avro_custom_types` **OFF**: Arrow to Avro does not error; Arrow 
type coercions match documented “closest Avro type/logical type” rules  
   4) Ensure schema generation and encoding are consistent (especially for 
Interval and second-based time/timestamp)
   
   **Describe alternatives you've considered**
   
   * Require callers to manually cast unsupported Arrow types (e.g. `Int16 to 
Int32`, `Float16 to Float32`, `Time64(nanos) to Time64(micros)`) prior to 
writing Avro.
   * Always enable `avro_custom_types` and rely on Arrow-specific annotations 
everywhere (simpler for Arrow-only pipelines, less interoperable).
   * For unsigned types, encode as strings or decimals to avoid range issues 
(more interoperable but heavier, and not a strict round-trip without custom 
annotations).
   
   **Additional context**
   
   Relevant code locations (current behavior):
   * `arrow-avro/src/schema.rs` (already maps many of the missing types; also 
stores Arrow metadata like `arrowTimeUnit`, `arrowIntervalUnit`, 
`arrowUnionTypeIds`)
   * `arrow-avro/src/writer/encoder.rs` (missing encoders for 
Int8/Int16/UInt*/Float16/Date64/Time64(nanos); has conversion encoders for 
seconds to millis that currently don’t align with schema; interval encoding 
currently can violate the declared schema for YearMonth/DayTime)
   

