jecsand838 opened a new issue, #9290:
URL: https://github.com/apache/arrow-rs/issues/9290
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
`arrow-avro` currently cannot encode/decode a number of Arrow `DataType`s,
and some types have schema/encoding mismatches that can lead to incorrect data
(even when encoding succeeds).
The goal is:
* **No more `ArrowError::NotYetImplemented` (or similar) when
writing/reading an Arrow `RecordBatch` containing supported Arrow types**,
excluding **Sparse Unions** (will be handled separately).
* **When compiled with `feature = "avro_custom_types"`:** Arrow to Avro to
Arrow should **round-trip the Arrow `DataType`** (including
width/signedness/time units and relevant metadata) using **Arrow-specific
custom logical types** following the established `arrow.*` pattern.
* **When compiled without `avro_custom_types`:** Arrow types should be
encoded to the **closest standard Avro primitive / logical type**, with any
necessary lossy conversions documented and consistently applied.
**Arrow `DataType` vs `arrow-avro` code paths**
#### 1) Types mapped in `schema.rs` but not actually writable (Writer errors)
`arrow-avro/src/schema.rs` already generates Avro schema for several Arrow
types, but `arrow-avro/src/writer/encoder.rs` doesn’t implement the
corresponding encoders, so writing fails.
Examples:
* `DataType::Int8`, `Int16`
* `DataType::UInt8`, `UInt16`, `UInt32`, `UInt64`
* `DataType::Float16`
* `DataType::Date64` (**note:** `schema.rs` maps this to Avro `long` with
the standard logical type `local-timestamp-millis`)
* `DataType::Time64(TimeUnit::Nanosecond)`
Concrete failures observed in code:
* `encoder.rs` falls back to:
* `Err(ArrowError::NotYetImplemented(format!("Avro scalar type not yet
supported: {other:?}")))`
for many of the above (e.g. Int8/Int16/UInt*/Float16).
* `Date64` is explicitly rejected in the writer with:
* `NotYetImplemented("Avro logical type 'date' is days since epoch (int).
Arrow Date64 (ms) has no direct Avro logical type; cast to Date32 or to a
Timestamp.")`
This is inconsistent with `schema.rs`, which currently emits
`local-timestamp-millis` for `Date64`.
* `Time64(Nanosecond)` is explicitly rejected with:
* `NotYetImplemented("Avro writer does not support time-nanos; cast to
Time64(Microsecond).")`
This shows up immediately in common scenarios (e.g. encoding a `RecordBatch`
with an `Int16` column).
#### 2) Schema/encoding mismatches (Writer produces Avro that doesn’t match
its own schema)
There are also cases where **the schema generated doesn’t match the encoding
behavior**:
* **Interval(YearMonth) / Interval(DayTime)**
* `schema.rs` maps these to `long` with `arrowIntervalUnit=...`
* `encoder.rs` encodes **all** Arrow `Interval` variants using an **Avro
`duration` fixed(12)** writer (`DurationEncoder`)
* This is **not just a metadata mismatch**: Avro `long` values are zig-zag
varint encoded, while Avro `fixed(12)` is always 12 raw bytes. Writing
`fixed(12)` bytes under a `long` schema can produce **invalid / unreadable
Avro** and corrupt the stream.
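To make the wire-format conflict concrete, here is a minimal standalone sketch (not `arrow-avro` code) of the two encodings involved. An Avro `long` is zig-zag encoded and then written as a variable-length varint (1 to 10 bytes), while the `duration` logical type is always 12 raw little-endian bytes; a reader expecting one cannot parse the other:

```rust
/// Avro `long` binary encoding: zig-zag, then 7-bit varint (1..=10 bytes).
fn encode_avro_long(v: i64) -> Vec<u8> {
    let mut n = ((v << 1) ^ (v >> 63)) as u64; // zig-zag
    let mut out = Vec::new();
    loop {
        let b = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(b);
            break;
        }
        out.push(b | 0x80); // continuation bit
    }
    out
}

/// Avro `duration` logical type: fixed(12) = three little-endian u32s
/// (months, days, milliseconds) -- always exactly 12 raw bytes.
fn encode_avro_duration(months: u32, days: u32, millis: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(12);
    out.extend_from_slice(&months.to_le_bytes());
    out.extend_from_slice(&days.to_le_bytes());
    out.extend_from_slice(&millis.to_le_bytes());
    out
}

fn main() {
    // A small long takes one byte on the wire...
    assert_eq!(encode_avro_long(1), vec![0x02]);
    assert_eq!(encode_avro_long(-1), vec![0x01]);
    // ...while a duration is always 12 bytes, whatever its value.
    assert_eq!(encode_avro_duration(1, 2, 3).len(), 12);
}
```

So emitting 12 fixed bytes where the schema promises a varint-encoded `long` desynchronizes the byte stream for every subsequent field.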
* **Time32(Second)** and **Timestamp(Second)**
* `schema.rs` writes these as `int`/`long` with `arrowTimeUnit="second"`
(no standard Avro logicalType)
* `encoder.rs` converts seconds to **milliseconds** on write
(`Time32SecondsToMillisEncoder`, `TimestampSecondsToMillisEncoder`)
* That yields data in millis while the schema/metadata indicate seconds
These mismatches are bigger than “no round-trip” — they can lead to
incorrect interpretation by readers (including `arrow-avro`).
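The seconds-to-millis scaling the writer applies today can be sketched as a standalone function (with a checked multiply to guard overflow). Under a schema that still advertises `arrowTimeUnit="second"`, a reader taking the metadata at face value would interpret the scaled value as seconds and be off by a factor of 1000:

```rust
/// Seconds-to-millis scaling as currently applied on write, sketched
/// standalone. A reader trusting `arrowTimeUnit="second"` would misread
/// the result by a factor of 1000.
fn seconds_to_millis(v: i64) -> Option<i64> {
    v.checked_mul(1_000)
}

fn main() {
    assert_eq!(seconds_to_millis(5), Some(5_000));
    assert_eq!(seconds_to_millis(i64::MAX), None); // overflow is guarded
}
```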
#### 3) Reader does not currently round-trip most Arrow-specific types (even
when metadata exists)
The reader currently only maps a small set of Arrow-specific types via
`avro_custom_types` (notably Duration* and RunEndEncoded). Many other
Arrow-specific or Arrow-width-specific types decode to the nearest Avro
primitive:
* Avro `int` always becomes Arrow `Int32` (no path back to
`Int8`/`Int16`/`UInt8`/`UInt16`)
* Avro `long` always becomes Arrow `Int64` or one of the timestamp/time
logical types (no path back to `UInt32`/`UInt64`)
* There is no Float16 decode path
Additionally, `codec.rs` contains a special-case that treats `Int64` +
`arrowTimeUnit == "nanosecond"` as `TimestampNanos(false)`. This would conflict
with `schema.rs`’s current representation for `Time64(Nanosecond)` (which also
uses `arrowTimeUnit="nanosecond"`), i.e. even if writer support were added, the
reader-side mapping would likely misinterpret it as a timestamp unless this is
corrected.
**Describe the solution you'd like**
Implement missing Arrow to Avro support in a way that is consistent across:
* schema generation (`arrow-avro/src/schema.rs`)
* encoding (`arrow-avro/src/writer/encoder.rs`)
* schema to codec mapping (`arrow-avro/src/codec.rs`)
* decoding/builders (`arrow-avro/src/reader/record.rs`)
...and that is explicitly aligned with `avro_custom_types`.
**A. Add Writer/Reader support for remaining Arrow primitive
width/signedness types**
At minimum, remove “type-level” NotYetImplemented errors by adding Writer
encoders and Reader support for:
* `Int8`, `Int16`
* `UInt8`, `UInt16`, `UInt32`, `UInt64`
* `Float16`
Proposed mapping behavior:
| Arrow `DataType` | `avro_custom_types` **OFF** (interop-first) | `avro_custom_types` **ON** (round-trip) |
|---|---|---|
| Int8 / Int16 | encode as Avro `int` (widen) | encode as Avro `int` with `logicalType: "arrow.int8"` / `"arrow.int16"` |
| UInt8 / UInt16 | encode as Avro `int` (as i32) | Avro `int` with `logicalType: "arrow.uint8"` / `"arrow.uint16"` |
| UInt32 | encode as Avro `long` (as i64) | Avro `long` with `logicalType: "arrow.uint32"` |
| UInt64 | encode as Avro `long` *when value fits i64*, otherwise error (or documented fallback) | encode with a round-trippable representation (e.g. Avro `fixed(8)` or `bytes`) + `logicalType: "arrow.uint64"` |
| Float16 | encode as Avro `float` (f32) | round-trip via `logicalType: "arrow.float16"` + representation (e.g. store IEEE-754 f16 bits in `fixed(2)` or `int`) |
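The OFF-column behavior for the integer widths can be sketched as below (function names are illustrative, not the proposed encoder API). Widening small signed/unsigned ints is lossless because Avro `int` and `long` share the same zig-zag varint wire encoding; only `UInt64` needs a fit check:

```rust
/// Zig-zag + 7-bit varint: the Avro binary encoding shared by `int` and `long`.
fn zigzag_varint(v: i64) -> Vec<u8> {
    let mut n = ((v << 1) ^ (v >> 63)) as u64;
    let mut out = Vec::new();
    loop {
        let b = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(b);
            break;
        }
        out.push(b | 0x80);
    }
    out
}

/// Widening an Arrow Int16 into an Avro `int` is lossless on the wire.
/// (Illustrative name, not the proposed encoder API.)
fn encode_int16_as_avro_int(v: i16) -> Vec<u8> {
    zigzag_varint(v as i64)
}

/// UInt64 with custom types OFF: only values that fit in i64 can be encoded
/// as Avro `long`; larger values hit the documented error/fallback.
fn encode_uint64_as_avro_long(v: u64) -> Result<Vec<u8>, String> {
    i64::try_from(v)
        .map(zigzag_varint)
        .map_err(|_| format!("UInt64 value {v} does not fit in Avro long"))
}

fn main() {
    assert_eq!(encode_int16_as_avro_int(-1), vec![0x01]);
    assert!(encode_uint64_as_avro_long(42).is_ok());
    assert!(encode_uint64_as_avro_long(u64::MAX).is_err());
}
```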
**B. Complete date/time/timestamp support (and align schema + encoding)**
Implement:
* `Date64` (currently `schema.rs` emits `long` + `local-timestamp-millis`,
but the writer errors)
* `Time64(Nanosecond)` (schema exists but writer errors; reader must not
confuse with timestamps)
* Align `Time32(Second)` / `Timestamp(Second)` schema + encoding
Proposed behavior:
| Arrow `DataType` | `avro_custom_types` **OFF** | `avro_custom_types` **ON** |
|---|---|---|
| Date64 | encode as `long` + **`logicalType: "local-timestamp-millis"`** (this is what `schema.rs` already emits today; implement the writer accordingly) | `long` + `logicalType: "arrow.date64"` (ms since epoch, round-trips to Date64) |
| Time32(Second) | encode as `time-millis` (int) by scaling seconds to millis | `int` + `logicalType: "arrow.time32-second"`, storing raw seconds |
| Time64(Nanosecond) | encode as `time-micros` (long), scaling nanos to micros (document truncation or require divisibility by 1000) | `long` + `logicalType: "arrow.time64-nanosecond"`, storing raw nanos |
| Timestamp(Second, tz?) | encode as `timestamp-millis` / `local-timestamp-millis`, scaling seconds to millis | `long` + `logicalType: "arrow.timestamp-second"` (and preserve tz/local-ness as needed) |
Notes:
* Arrow’s `Date64` spec treats values as “days in milliseconds” (evenly
divisible by `86_400_000`). `arrow-rs` does not enforce this at runtime, so the
writer should either document behavior for non-conforming values or optionally
validate when writing.
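Two of the checks proposed above can be sketched as standalone helpers (illustrative names, not proposed API): the optional Date64 divisibility validation, and the divisible-by-1000 variant of the Time64(Nanosecond) scaling:

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Optional Date64 validation from the note above: spec-conforming values
/// are a whole number of days expressed in milliseconds.
fn validate_date64(v: i64) -> Result<i64, String> {
    if v % MILLIS_PER_DAY == 0 {
        Ok(v)
    } else {
        Err(format!("Date64 value {v} is not evenly divisible by 86_400_000"))
    }
}

/// Time64(Nanosecond) with custom types OFF: scale to `time-micros`,
/// erroring on values not divisible by 1000 (the alternative is documented
/// truncation).
fn time64_nanos_to_micros(v: i64) -> Result<i64, String> {
    if v % 1_000 == 0 {
        Ok(v / 1_000)
    } else {
        Err(format!("Time64 nanos value {v} is not divisible by 1000"))
    }
}

fn main() {
    assert!(validate_date64(86_400_000).is_ok());
    assert!(validate_date64(1).is_err());
    assert_eq!(time64_nanos_to_micros(2_000), Ok(2));
    assert!(time64_nanos_to_micros(1_500).is_err());
}
```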
**C. Implement Interval handling and define custom logical types for
round-trip**
Current `Interval` behavior is not consistent between schema and encoding
for YearMonth/DayTime, and Avro’s standard `duration` can’t represent Arrow’s
full interval semantics (negative values; MonthDayNano nanos).
Proposed approach:
* With `avro_custom_types` **ON**: implement custom logical types for each
Arrow interval unit that can round-trip (including negative and nanos):
* `arrow.interval-year-month`
* `arrow.interval-day-time`
* `arrow.interval-month-day-nano` (should preserve i32 months, i32 days,
i64 nanos)
* With `avro_custom_types` **OFF**: encode to the closest standard Avro
representation:
* Prefer Avro `duration` where feasible (and document constraints:
unsigned months/days/millis; millis precision)
* For values not representable (negative, sub-millis nanos), return a
clear error suggesting enabling `avro_custom_types` or casting
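The OFF-mode fallback for MonthDayNano can be sketched as follows (a standalone illustration, not the proposed encoder): map to the standard `duration` fixed(12) where the value is representable, and return a clear error for negative components or sub-millisecond nanos:

```rust
/// Hypothetical OFF-mode fallback: Arrow Interval(MonthDayNano) to Avro
/// `duration` fixed(12), erroring where the standard logical type cannot
/// represent the value (negative components; sub-millisecond nanos).
fn month_day_nano_to_duration(
    months: i32,
    days: i32,
    nanos: i64,
) -> Result<[u8; 12], String> {
    if months < 0 || days < 0 || nanos < 0 {
        return Err(
            "negative interval not representable as Avro duration; \
             enable avro_custom_types or cast"
                .to_string(),
        );
    }
    if nanos % 1_000_000 != 0 {
        return Err(
            "sub-millisecond nanos not representable as Avro duration".to_string(),
        );
    }
    let millis = u32::try_from(nanos / 1_000_000).map_err(|e| e.to_string())?;
    // duration = three little-endian u32s: months, days, milliseconds.
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&(months as u32).to_le_bytes());
    out[4..8].copy_from_slice(&(days as u32).to_le_bytes());
    out[8..12].copy_from_slice(&millis.to_le_bytes());
    Ok(out)
}

fn main() {
    assert!(month_day_nano_to_duration(1, 2, 3_000_000).is_ok());
    assert!(month_day_nano_to_duration(-1, 0, 0).is_err()); // negative
    assert!(month_day_nano_to_duration(0, 0, 500).is_err()); // sub-millis
}
```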
**E. Add coverage tests (type-level and round-trip)**
Add tests ensuring:
1) **Writer** can encode `RecordBatch` for each Arrow type above (no
`NotYetImplemented`)
2) With `avro_custom_types` **ON**: Arrow to Avro to Arrow schema equality
for these types (excluding Sparse union)
3) With `avro_custom_types` **OFF**: Arrow to Avro does not error; Arrow
type coercions match documented “closest Avro type/logical type” rules
4) Ensure schema generation and encoding are consistent (especially for
Interval and second-based time/timestamp)
**Describe alternatives you've considered**
* Require callers to manually cast unsupported Arrow types (e.g. `Int16` to
`Int32`, `Float16` to `Float32`, `Time64(Nanosecond)` to `Time64(Microsecond)`)
prior to writing Avro.
* Always enable `avro_custom_types` and rely on Arrow-specific annotations
everywhere (simpler for Arrow-only pipelines, less interoperable).
* For unsigned types, encode as strings or decimals to avoid range issues
(more interoperable but heavier, and not a strict round-trip without custom
annotations).
**Additional context**
Relevant code locations (current behavior):
* `arrow-avro/src/schema.rs` (already maps many of the missing types; also
stores Arrow metadata like `arrowTimeUnit`, `arrowIntervalUnit`,
`arrowUnionTypeIds`)
* `arrow-avro/src/writer/encoder.rs` (missing encoders for
Int8/Int16/UInt*/Float16/Date64/Time64(nanos); has conversion encoders for
seconds to millis that currently don’t align with schema; interval encoding
currently can violate the declared schema for YearMonth/DayTime)