joaquinhuigomez opened a new pull request, #9623: URL: https://github.com/apache/arrow-rs/pull/9623
# Which issue does this PR close?

- Closes #9595

# Rationale for this change

The [IPC specification](https://arrow.apache.org/docs/format/Columnar.html#format-ipc) states:

> An edge-case for interleaved dictionary and record batches occurs when the record batches contain dictionary encoded arrays that are completely null. In this case, the dictionary for the encoded column might appear after the first record batch.

Arrow C++ (v17+) relies on this and does not emit a dictionary batch when all values in a dictionary-encoded column are null. The Rust IPC reader currently fails with `"Cannot find a dictionary batch with dict id: ..."` when reading such streams, which breaks cross-language interop for this edge case.

# What changes are included in this PR?

When the IPC reader encounters a `Dictionary`-typed column whose `dict_id` has no corresponding entry in `dictionaries_by_id`, it now synthesizes an empty values array of the appropriate type (via `new_empty_array`) instead of returning an error. This matches the spec's allowance for omitted dictionary batches on null-only columns.

# Are these changes tested?

Yes. A new test (`test_read_null_dict_without_dictionary_batch`) writes an IPC stream with an all-null dictionary column, strips the dictionary batch message from the raw bytes to simulate the C++ behavior, then verifies that the Rust reader decodes the stream successfully.

# Are there any user-facing changes?

IPC streams produced by C++ (or other implementations) that omit dictionary batches for null-only dictionary columns can now be read without error. Previously these streams caused a `ParseError`.
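The fallback described under "What changes are included in this PR?" can be illustrated with a simplified, std-only model. This is a sketch and not the arrow-rs implementation: `DictColumn` and `resolve_dictionary` are hypothetical names, a `Vec<String>` stands in for the typed values array that the PR creates with `new_empty_array`, and the key bounds check is an added assumption that is not taken from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a decoded dictionary-encoded column:
/// `keys[i]` is `None` for a null slot, otherwise an index into `values`.
#[derive(Debug, PartialEq)]
pub struct DictColumn {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<String>,
}

/// Looks up the dictionary values for `dict_id`. If no dictionary batch
/// was seen for that id, falls back to an empty values vector instead of
/// returning an error (the behavior the PR describes).
pub fn resolve_dictionary(
    dictionaries_by_id: &HashMap<i64, Vec<String>>,
    dict_id: i64,
    keys: Vec<Option<usize>>,
) -> Result<DictColumn, String> {
    let values = match dictionaries_by_id.get(&dict_id) {
        Some(values) => values.clone(),
        // Missing dictionary batch: synthesize an empty dictionary.
        None => Vec::new(),
    };

    // Assumption of this sketch: reject non-null keys that point past the
    // end of the dictionary. With an empty dictionary this means only
    // all-null columns decode successfully.
    if let Some(bad) = keys.iter().flatten().find(|&&k| k >= values.len()) {
        return Err(format!(
            "key {} out of bounds for dictionary {} with {} values",
            bad,
            dict_id,
            values.len()
        ));
    }

    Ok(DictColumn { keys, values })
}
```

In this model, calling `resolve_dictionary` with an unknown `dict_id` and all-`None` keys yields a column with an empty `values` vector, while a non-null key against a missing dictionary still produces an error.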
