atwam opened a new issue, #9716: URL: https://github.com/apache/arrow-rs/issues/9716
**Describe the bug**

Arrow's [spec for variable-size binary](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) layout (used for Binary or Utf8 arrays) states:

> The offsets buffer contains length + 1 signed integers (either 32-bit or 64-bit, depending on the data type), which encode the start position of each slot in the data buffer.

This means that an empty array should have a length-1 offsets buffer containing a single 0. Instead, serializing an empty Utf8/Binary array creates an empty offsets buffer (in [`get_byte_array_buffers`](https://github.com/apache/arrow-rs/blob/711fac88104fc27d89faaa221345bc95686cf1e9/arrow-ipc/src/writer.rs#L1725)).

As a result, an empty (no-rows) IPC file created by arrow-rs is not spec-compliant and may not be readable by some other implementations (for example polars/arrow2).

Note that this behavior applies to Utf8/Binary arrays, but also to List, LargeList, and by extension Map (anything that uses `get_byte_array_buffers` or `get_list_array_buffers`).

**To Reproduce**

- Create a `RecordBatch` with a single empty Utf8 array and serialize it to IPC.
- Try to read that IPC file with a different Arrow library. For example, polars will fail to read the serialized IPC file.

**Expected behavior**

For variable-size layouts, the IPC writer should output a length-1 offsets buffer containing a single `0`, i.e. `[0]`.

**Additional context**

The current version works for round trips because [`get_offsets_from_buffer`](https://github.com/apache/arrow-rs/blob/711fac88104fc27d89faaa221345bc95686cf1e9/arrow-array/src/array/mod.rs#L1030) substitutes `[0]` when it reads an empty offsets buffer. Note that arrow-cpp [seems to take the same approach](https://github.com/apache/arrow/blob/4eca50770f7f2c5938a676f0719fbfc8aae4803c/cpp/src/arrow/ipc/writer.cc#L322), and similarly will fill in an empty offsets buffer with `[0]`.
While this means we can round-trip with arrow-cpp and pyarrow, we don't comply with the spec and can cause issues with other implementations.
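
To make the "length + 1" rule concrete, here is a minimal, std-only Rust sketch. It does not use arrow-rs, and the names `build_offsets` and `normalize_offsets` are hypothetical helpers made up for illustration, not functions from the arrow-rs codebase. It assumes 32-bit offsets and only shows what a spec-compliant offsets buffer looks like for an empty array, plus one possible writer-side fix-up.

```rust
/// Hypothetical helper: build Arrow-style 32-bit offsets for a slice of strings.
/// The result always has `values.len() + 1` entries, so an empty input yields `[0]`.
fn build_offsets(values: &[&str]) -> Vec<i32> {
    let mut offsets = Vec::with_capacity(values.len() + 1);
    let mut end: i32 = 0;
    // The first offset is always 0, even when there are no values.
    offsets.push(end);
    for v in values {
        // Each subsequent offset is the end position of that value in the data buffer.
        end += v.len() as i32;
        offsets.push(end);
    }
    offsets
}

/// Hypothetical writer-side fix-up: if an empty array carries an empty
/// offsets buffer, emit `[0]` instead so the output has length + 1 entries.
fn normalize_offsets(offsets: &[i32]) -> Vec<i32> {
    if offsets.is_empty() {
        vec![0]
    } else {
        offsets.to_vec()
    }
}
```

With these definitions, `build_offsets(&[])` and `normalize_offsets(&[])` both return `[0]`, while `build_offsets(&["ab", "", "cde"])` returns `[0, 2, 2, 5]`. Whether the actual fix belongs in `get_byte_array_buffers` and `get_list_array_buffers` or elsewhere in the writer is left open here.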
