chaubold opened a new issue, #15068: URL: https://github.com/apache/arrow/issues/15068
### Describe the bug, including details regarding any error messages, version, and platform.

Hi, I wanted to create a JIRA ticket, but apparently that is no longer open to the public, so I'm reporting the bug here. Feel free to mirror it in the Apache JIRA.

We have noticed that a table serialized and deserialized with IPC is unusable after reading it back when a column contains an ExtensionType with storage type `pa.null()`. It seems the buffers get mixed up, as if the `null` array is serialized with a buffer even though, according to the IPC format documentation, that should not be the case. See the example below; only `test_null_ext_table` fails. If the storage type is anything other than `pa.null()` it works, even if there are only missing values. Interestingly, `pa.list_(pa.null())` works fine as well.

```python
import tempfile
import unittest

import pyarrow as pa


class MyArrowExtType(pa.ExtensionType):
    def __init__(self, storage_type, extra: str):
        self._extra = extra
        pa.ExtensionType.__init__(self, storage_type, "test.ext_type")

    def __arrow_ext_serialize__(self):
        return self._extra.encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        extra = serialized.decode()
        return MyArrowExtType(storage_type, extra)


pa.register_extension_type(MyArrowExtType(pa.null(), ""))


class PyArrowExtTypeNullIPCTest(unittest.TestCase):
    def test_null_ext_table(self):
        """This test fails"""
        t = pa.Table.from_pydict(
            {
                "ints": [1, 2, 3],
                "strings": ["a", "b", "c"],
            }
        )
        ext_array = pa.ExtensionArray.from_storage(
            # PyArrow 6 already raises an error on this line, but that was
            # fixed in PyArrow 7:
            # https://issues.apache.org/jira/browse/ARROW-14522
            MyArrowExtType(pa.null(), "ext_null"), pa.nulls(3, pa.null())
        )
        t = t.add_column(0, "ext", ext_array)
        self.read_write_ipc(t)

    def test_int_null_ext_table(self):
        """Works"""
        t = pa.Table.from_pydict(
            {
                "ints": [1, 2, 3],
                "strings": ["a", "b", "c"],
            }
        )
        ext_array = pa.ExtensionArray.from_storage(
            MyArrowExtType(pa.int64(), "ext_int"), pa.nulls(3, pa.int64())
        )
        t = t.add_column(0, "ext", ext_array)
        print(t)
        self.read_write_ipc(t)

    def test_list_null_ext_table(self):
        """Works"""
        t = pa.Table.from_pydict(
            {
                "ints": [1, 2, 3],
                "strings": ["a", "b", "c"],
            }
        )
        ext_array = pa.ExtensionArray.from_storage(
            MyArrowExtType(pa.list_(pa.null()), "ext_null"),
            pa.nulls(3, pa.list_(pa.null())),
        )
        t = t.add_column(0, "ext", ext_array)
        self.read_write_ipc(t)

    def read_write_ipc(self, write_t):
        with tempfile.TemporaryFile() as tmpfile:
            with pa.ipc.new_file(tmpfile, write_t.schema) as writer:
                writer.write_table(write_t)
            tmpfile.seek(0)
            with pa.ipc.open_file(tmpfile) as reader:
                read_t = reader.read_all()
            self.assertEqual(write_t, read_t)
            return read_t


if __name__ == "__main__":
    unittest.main()
```

This example fails when comparing the _next_ array in the table, because serialization apparently wrote one buffer too many (a null array should not have any buffers serialized):

```
test_arrow_ext_types_ipc.py:80: in read_write_ipc
    self.assertEqual(write_t, read_t)
E   AssertionError: pyarr[105 chars]ts: [[1,2,3]]
E   strings: [["a","b","c"]] != pyarr[105 chars]ts: [<Invalid array: Buffer #1 too small in ar[186 chars]t 0>]
```

Occurs with PyArrow versions 9 and 10.

### Component(s)

Python