[ https://issues.apache.org/jira/browse/ARROW-14547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche closed ARROW-14547.
-----------------------------------------
    Resolution: Duplicate

> Reading FixedSizeListArray from Parquet with nulls
> --------------------------------------------------
>
>                 Key: ARROW-14547
>                 URL: https://issues.apache.org/jira/browse/ARROW-14547
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.0
>            Reporter: Jim Pivarski
>            Priority: Major
>
> This one is easy to describe: given an array of fixed-sized lists, in which some are null,
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
> >>> a
> <pyarrow.lib.FixedSizeListArray object at 0x7ff801cb2760>
> [
>   [
>     0,
>     1,
>     2,
>     3,
>     4
>   ],
>   null
> ]
> {code}
> you can write them to a Parquet file, but not read them back:
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had size=0
> {code}
> It could be that, at some level, the second list is considered to be empty.
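
The reporter's guess (that the null list is treated as an empty list somewhere on the read path) can be illustrated with a small pure-Python sketch. This is only a model of the hypothesis, not pyarrow's actual reader: the helper names `to_variable_lists` and `to_fixed_size_lists` are hypothetical, and the sketch reports index 1 for this input, whereas the real traceback reports index 2, so it does not reproduce pyarrow's internal indexing.

```python
from typing import List, Optional, Sequence


def to_variable_lists(values: Sequence[Optional[Sequence[int]]]) -> List[List[int]]:
    """Hypothetical lossy step: a null list is stored as an empty list."""
    return [list(v) if v is not None else [] for v in values]


def to_fixed_size_lists(lists: Sequence[Sequence[int]], size: int) -> List[List[int]]:
    """Hypothetical strict step: every list must have exactly `size` items."""
    for index, item in enumerate(lists):
        if len(item) != size:
            raise ValueError(
                f"Expected all lists to be of size={size} "
                f"but index {index} had size={len(item)}"
            )
    return [list(item) for item in lists]


# Same shape as the arrays in the report: one with a null entry, one without.
with_null = [[0, 1, 2, 3, 4], None]
without_null = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# The array without nulls survives the round trip unchanged.
round_tripped = to_fixed_size_lists(to_variable_lists(without_null), 5)

# The array with a null fails the size check, because the null became [].
try:
    to_fixed_size_lists(to_variable_lists(with_null), 5)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Under this assumption the failure comes from losing the distinction between "null" and "empty" before the size check runs, which would also explain why the array without nulls reads back fine.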
> For completeness, this doesn't happen if the fixed-sized lists have no nulls:
> {code:python}
> >>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
> >>> b
> <pyarrow.lib.FixedSizeListArray object at 0x7ff801c1ed60>
> [
>   [
>     0,
>     1,
>     2,
>     3,
>     4
>   ],
>   [
>     5,
>     6,
>     7,
>     8,
>     9
>   ]
> ]
> >>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
> >>> pa.parquet.read_table("tmp2.parquet")
> pyarrow.Table
> : fixed_size_list<item: int64>[5]
>   child 0, item: int64
> ----
> : [[[0,1,2,3,4],[5,6,7,8,9]]]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)