chaubold opened a new issue, #15068:
URL: https://github.com/apache/arrow/issues/15068

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi guys,
   
   I wanted to create a JIRA ticket, but apparently that is not allowed anymore 
for the public, so I'm reporting the bug here. Feel free to mirror it in the 
Apache JIRA.
   
   We have noticed that a table in which a column contains an ExtensionType 
with storage type `pa.null()` is not usable after it has been serialized and 
deserialized with IPC. It seems the buffers get mixed up, as if the `null` 
array were serialized with a buffer even though, according to the IPC format 
documentation, it should not be.
   
   The example below reproduces the error. Only `test_null_ext_table` fails. 
If the storage type is any type other than `pa.null()`, it works, even if all 
values are missing. Interestingly, `pa.list_(pa.null())` works fine as well.
   
   ```python
   import tempfile
   import unittest
   import pyarrow as pa
   
   class MyArrowExtType(pa.ExtensionType):
       def __init__(self, storage_type, extra: str):
           self._extra = extra
           pa.ExtensionType.__init__(self, storage_type, "test.ext_type")
   
       def __arrow_ext_serialize__(self):
           return self._extra.encode()
   
       @classmethod
       def __arrow_ext_deserialize__(cls, storage_type, serialized):
           extra = serialized.decode()
           return MyArrowExtType(storage_type, extra)
   
   
   pa.register_extension_type(MyArrowExtType(pa.null(), ""))
   
   
   class PyArrowExtTypeNullIPCTest(unittest.TestCase):
       def test_null_ext_table(self):
           """ This test fails """
           t = pa.Table.from_pydict(
               {
                   "ints": [1, 2, 3],
                   "strings": ["a", "b", "c"],
               }
           )
           ext_array = pa.ExtensionArray.from_storage(
               # PyArrow 6 already raises an error in this line, but that was
               # fixed in PyArrow 7:
               # https://issues.apache.org/jira/browse/ARROW-14522
               MyArrowExtType(pa.null(), "ext_null"), pa.nulls(3, pa.null())
           )
   
           t = t.add_column(0, "ext", ext_array)
           self.read_write_ipc(t)
   
       def test_int_null_ext_table(self):
           """ works """
           t = pa.Table.from_pydict(
               {
                   "ints": [1, 2, 3],
                   "strings": ["a", "b", "c"],
               }
           )
           ext_array = pa.ExtensionArray.from_storage(
               MyArrowExtType(pa.int64(), "ext_int"), pa.nulls(3, pa.int64())
           )
           t = t.add_column(0, "ext", ext_array)
           print(t)
           self.read_write_ipc(t)
   
       def test_list_null_ext_table(self):
           """ works """
           t = pa.Table.from_pydict(
               {
                   "ints": [1, 2, 3],
                   "strings": ["a", "b", "c"],
               }
           )
           ext_array = pa.ExtensionArray.from_storage(
               MyArrowExtType(pa.list_(pa.null()), "ext_null"),
               pa.nulls(3, pa.list_(pa.null())),
           )
           t = t.add_column(0, "ext", ext_array)
           self.read_write_ipc(t)
   
       def read_write_ipc(self, write_t):
           with tempfile.TemporaryFile() as tmpfile:
               with pa.ipc.new_file(tmpfile, write_t.schema) as writer:
                   writer.write_table(write_t)
   
               tmpfile.seek(0)
   
               with pa.ipc.open_file(tmpfile) as reader:
                   read_t = reader.read_all()
   
           self.assertEqual(write_t, read_t)
           return read_t
   
   
   if __name__ == "__main__":
       unittest.main()
   
   ```
   
   This example fails when comparing the _next_ array in the table (`ints`, 
the column right after the extension column), apparently because serialization 
wrote one buffer too many (a null array should not have any buffer serialized).
   
   ```
   test_arrow_ext_types_ipc.py:80: in read_write_ipc
       self.assertEqual(write_t, read_t)
   E   AssertionError: pyarr[105 chars]ts: [[1,2,3]]
   E   strings: [["a","b","c"]] != pyarr[105 chars]ts: [<Invalid array: Buffer #1 too small in ar[186 chars]t 0>]
   ```
   
   Occurs with PyArrow versions 9 and 10.
   
   ### Component(s)
   
   Python

