atwam opened a new issue, #9716:
URL: https://github.com/apache/arrow-rs/issues/9716

   **Describe the bug**
   
   Arrow's [spec for variable-size 
binary](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)
 layout (used for binary or Utf8 arrays) states:
   > The offsets buffer contains length + 1 signed integers (either 32-bit or 
64-bit, depending on the data type), which encode the start position of each 
slot in the data buffer.
   
   This means that an empty array should have a length-1 offsets buffer 
containing a single 0.
   Instead, serializing an empty Utf8/Binary array creates an empty offsets buffer 
(in 
[`get_byte_array_buffers`](https://github.com/apache/arrow-rs/blob/711fac88104fc27d89faaa221345bc95686cf1e9/arrow-ipc/src/writer.rs#L1725)).
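
To make the `length + 1` rule concrete, here is a minimal, standard-library-only Rust sketch. It is not arrow-rs code: `build_offsets` is a hypothetical helper that assumes 32-bit offsets and skips overflow checks. It shows that deriving offsets from zero values still produces a single `0`.

```rust
/// Hypothetical helper, not part of arrow-rs: computes the offsets buffer
/// of a variable-size binary layout with 32-bit offsets.
fn build_offsets(values: &[&str]) -> Vec<i32> {
    // The buffer always starts with 0, so it holds `values.len() + 1` entries.
    let mut offsets = Vec::with_capacity(values.len() + 1);
    let mut end: i32 = 0;
    offsets.push(end);
    for v in values {
        // Sketch only: a real writer would guard against i32 overflow here.
        end += v.len() as i32;
        offsets.push(end);
    }
    offsets
}
```

Calling `build_offsets(&[])` returns a one-element buffer, which is the shape the spec implies for an empty array.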
   
   This means that an empty (no-rows) IPC file created by arrow-rs will not be 
spec-compliant, and may not be readable by some other implementations (for 
example polars/arrow2).
   
   Note that this behavior applies not only to Utf8/Binary arrays but also to 
List, LargeList, and by extension Map (anything that uses 
`get_byte_array_buffers` or `get_list_array_buffers`).
   
   **To Reproduce**
   - Create a `RecordBatch` with a single empty Utf8 array and serialize it to IPC.
   - Try to read that IPC file with a different arrow library. For example, 
polars will fail reading the serialized IPC file.
   
   **Expected behavior**
   For variable-size layouts, the IPC writer should output a length-1 offsets 
buffer containing a single `0`, i.e. `[0]`.
   
   **Additional context**
   
   The current version works on round-trip because 
[`get_offsets_from_buffer`](https://github.com/apache/arrow-rs/blob/711fac88104fc27d89faaa221345bc95686cf1e9/arrow-array/src/array/mod.rs#L1030)
 substitutes `[0]` when it encounters an empty offsets buffer.
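
The lenient read path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual arrow-rs implementation: `normalize_offsets` and its signature are hypothetical, and offsets are modeled as a plain `Vec<i32>`.

```rust
/// Hypothetical reader-side helper, not the actual arrow-rs code:
/// treats an empty offsets buffer of a zero-length array as `[0]`.
fn normalize_offsets(len: usize, offsets: Vec<i32>) -> Result<Vec<i32>, String> {
    if len == 0 && offsets.is_empty() {
        // Lenient path: accept the non-compliant empty buffer.
        return Ok(vec![0]);
    }
    if offsets.len() != len + 1 {
        // Anything else must follow the spec's `length + 1` rule.
        return Err(format!(
            "expected {} offsets, found {}",
            len + 1,
            offsets.len()
        ));
    }
    Ok(offsets)
}
```

A stricter reader would omit the first branch and reject the empty buffer outright, which is why a file that round-trips through a lenient reader can still fail elsewhere.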
   
   Note that arrow-cpp [seems to take the same 
approach](https://github.com/apache/arrow/blob/4eca50770f7f2c5938a676f0719fbfc8aae4803c/cpp/src/arrow/ipc/writer.cc#L322),
 and similarly will fill in an empty offsets buffer with `[0]`. While this 
means we can round-trip with arrow-cpp and pyarrow, we don't comply with the 
spec and can cause issues with other implementations.
   
   

