Jim Pivarski created ARROW-14522:
------------------------------------

             Summary: Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType
                 Key: ARROW-14522
                 URL: https://issues.apache.org/jira/browse/ARROW-14522
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.0
            Reporter: Jim Pivarski


Here's a corner case: suppose that I have data of type null that can have 
missing values, so the whole array consists of nothing but nulls. In real life, 
this might only happen inside a nested data structure, at some level where an 
untyped data source (e.g. nested Python lists) had no entries, so a type could 
not be determined. We expect to be able to write this data to Parquet and read 
it back, and we can, as long as it doesn't have an ExtensionType.

Here's an example that works, _without_ ExtensionType:
{code:python}
>>> import json
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> 
>>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little")
>>> empty_but_for_nulls = pa.Array.from_buffers(
...     pa.null(), 14, [pa.py_buffer(validbits)], null_count=14
... )
>>> empty_but_for_nulls
<pyarrow.lib.NullArray object at 0x7fb1560bbd00>
14 nulls
>>> 
>>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
pyarrow.Table
: null
----
: [14 nulls]
{code}
And here's a continuation of that example, which fails on reading once the type 
{{pa.null()}} is replaced by {{AnnotatedType(pa.null(), \{"cool": "beans"})}}:
{code:python}
>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
... 
>>> pa.register_extension_type(AnnotatedType(pa.null(), None))
>>> 
>>> empty_but_for_nulls = pa.Array.from_buffers(
...     AnnotatedType(pa.null(), {"cool": "beans"}),
...     14,
...     [pa.py_buffer(validbits)],
...     null_count=14,
... )
>>> empty_but_for_nulls
<pyarrow.lib.ExtensionArray object at 0x7fb14b5e1ca0>
14 nulls
>>> 
>>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet")
>>> pa.parquet.read_table("tmp2.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Array of type extension<my:app<AnnotatedType>> has 14 nulls but no null bitmap
{code}
If "nullable type null" were outside the set of types that should be writable 
to Parquet, then either the non-ExtensionType case would fail too, or the 
failure would occur on writing rather than reading. Neither is the case, so 
I'm quite sure this is a bug.
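
For reference, the two serialization hooks in {{AnnotatedType}} amount to a 
JSON round trip of the annotation. Below is a minimal sketch of that step in 
isolation, using hypothetical helper names and no pyarrow at all. It suggests 
that both annotations used above ({{\{"cool": "beans"}}} and the {{None}} passed 
at registration) survive encoding; it does not by itself establish where the 
read failure comes from.

```python
import json


# Hypothetical stand-ins for AnnotatedType.__arrow_ext_serialize__ and
# AnnotatedType.__arrow_ext_deserialize__, stripped of the pyarrow machinery.
def serialize_annotation(annotation):
    # Same expression as in __arrow_ext_serialize__: JSON text, UTF-8 encoded.
    return json.dumps(annotation).encode()


def deserialize_annotation(serialized):
    # Same expression as in __arrow_ext_deserialize__: decode, then parse JSON.
    return json.loads(serialized.decode())


# Check both annotations that appear in the example above.
for annotation in ({"cool": "beans"}, None):
    serialized = serialize_annotation(annotation)
    assert isinstance(serialized, bytes)
    assert deserialize_annotation(serialized) == annotation
```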



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
