[ https://issues.apache.org/jira/browse/ARROW-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-12336:
------------------------------------------
    Fix Version/s: 4.0.0

> [C++][Python] Empty Int64 array is of wrong size
> ------------------------------------------------
>
>                 Key: ARROW-12336
>                 URL: https://issues.apache.org/jira/browse/ARROW-12336
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>         Environment: macOS 10.15.7
> Arrow version: 3.1.0.dev578
>            Reporter: Thomas Blauth
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Setup:
> Table with Int64 and str columns; generated using the dataset api; filtered
> on str column.
>
> Bug Description:
> Calling {{table.to_pandas()}} fails due to an empty array of the ChunkedArray
> of the Int64 column. This empty array has a size of 4 Byte when using the
> arrow nightly builds and 0 Byte when using arrow 3.0.0.
> Note: The bug does not occur when the table only contains an Int64 column.
>
> Minimal example:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet
> import pyarrow.dataset
> print("Arrow version: " + str(pa.__version__))
> print("---------------")
> # Only Int64 works fine
> df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
> table = pa.table(df)
> path_0 = "./test_0.parquet"
> pa.parquet.write_table(table, path_0)
> schema = pa.parquet.read_schema(path_0)
> ds = pa.dataset.FileSystemDataset.from_paths(
>     paths=[path_0],
>     filesystem=pa.fs.LocalFileSystem(),
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
> )
> table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))
> print("Size of array: " + str(table.column(0).nbytes))
> df = table.to_pandas()
> print("---------------")
> # Int64 and str crashes
> df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
> df = df.astype({"Int_col": "Int64"})
> table = pa.table(df)
> path_1 = "./test_1.parquet"
> pa.parquet.write_table(table, path_1)
> schema = pa.parquet.read_schema(path_1)
> ds = pa.dataset.FileSystemDataset.from_paths(
>     paths=[path_1],
>     filesystem=pa.fs.LocalFileSystem(),
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
> )
> table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))
> print("Size of array: " + str(table.column(0).nbytes))
> df = table.to_pandas()
> {code}
>
> Output :
> {code:bash}
> Arrow version: 3.1.0.dev578
> ---------------
> Size of array: 0
> ---------------
> Size of array: 4
> Traceback (most recent call last):
>   File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
>     df = table.to_pandas()
>   File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
>     return [_reconstruct_block(item, columns, extension_columns)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
>     return [_reconstruct_block(item, columns, extension_columns)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
>     pd_ext_arr = pandas_dtype.__from_arrow__(arr)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
>     data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
>     data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
> ValueError: buffer size must be a multiple of element size
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
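
The ValueError at the bottom of the traceback can be reproduced with NumPy alone, which may help when checking whether a given buffer size is the trigger. The sketch below is an illustration only: `view_buffer` is a hypothetical helper, not part of pandas or pyarrow, and it is not presented as the fix applied for this issue. It assumes that passing an explicit `count` and byte `offset` to `np.frombuffer` is an acceptable way to ignore bytes beyond the array's logical length.

```python
import numpy as np


def view_buffer(raw, dtype, offset, length):
    """Hypothetical helper: view `length` elements of `raw`, skipping `offset` elements.

    Passing an explicit count means np.frombuffer does not require the whole
    buffer to be a multiple of the element size.
    """
    itemsize = np.dtype(dtype).itemsize
    return np.frombuffer(raw, dtype=dtype, count=length, offset=offset * itemsize)


# A 4-byte buffer viewed as int64 (8-byte elements) raises the reported error.
try:
    np.frombuffer(bytes(4), dtype=np.int64)
    message = None
except ValueError as exc:
    message = str(exc)

# With count=0 the same 4-byte buffer yields an empty int64 array.
empty = view_buffer(bytes(4), np.int64, offset=0, length=0)

# Stray bytes after the data are ignored when the count is explicit.
padded = np.array([1, 2, 10], dtype=np.int64).tobytes() + bytes(4)
tail = view_buffer(padded, np.int64, offset=1, length=2)
```

Here `offset` and `length` play the role of `arr.offset` and `len(arr)` from the failing line in `pyarrow_array_to_numpy_and_mask`; whether a 4-byte data buffer for an empty Int64 array is valid in the first place is a separate question for the Arrow side.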