[ https://issues.apache.org/jira/browse/ARROW-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-12336:
------------------------------------------
    Fix Version/s: 4.0.0

> [C++][Python] Empty Int64 array is of wrong size
> ------------------------------------------------
>
>                 Key: ARROW-12336
>                 URL: https://issues.apache.org/jira/browse/ARROW-12336
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>         Environment: macOS 10.15.7
> Arrow version: 3.1.0.dev578
>            Reporter: Thomas Blauth
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Setup:
> A table with an Int64 column and a str column, generated using the dataset API 
> and filtered on the str column.
>  
> Bug Description:
> Calling {{table.to_pandas()}} fails because of an empty array in the ChunkedArray 
> of the Int64 column. This empty array has a size of 4 bytes when using the 
> Arrow nightly builds and 0 bytes when using Arrow 3.0.0.
> Note: The bug does not occur when the table contains only an Int64 column.
>  
> Minimal example:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet
> import pyarrow.dataset
> import pyarrow.fs
> print("Arrow version: " + str(pa.__version__))
> print("---------------")
> # Case 1: a table with only an Int64 column works fine
> df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
> table = pa.table(df)
> path_0 = "./test_0.parquet"
> pa.parquet.write_table(table, path_0)
> schema = pa.parquet.read_schema(path_0)
> ds = pa.dataset.FileSystemDataset.from_paths(
>     paths=[path_0],
>     filesystem=pa.fs.LocalFileSystem(),
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
> )
> # The filter matches no rows, so the resulting table is empty
> table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))
> print("Size of array: " + str(table.column(0).nbytes))
> df = table.to_pandas()
> print("---------------")
> # Case 2: a table with an Int64 and a str column crashes
> df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
> df = df.astype({"Int_col": "Int64"})
> table = pa.table(df)
> path_1 = "./test_1.parquet"
> pa.parquet.write_table(table, path_1)
> schema = pa.parquet.read_schema(path_1)
> ds = pa.dataset.FileSystemDataset.from_paths(
>     paths=[path_1],
>     filesystem=pa.fs.LocalFileSystem(),
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
> )
> # The filter again matches no rows, but the empty Int64 chunk now carries
> # a 4-byte data buffer, and to_pandas() raises ValueError
> table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))
> print("Size of array: " + str(table.column(0).nbytes))
> df = table.to_pandas()
> {code}
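>  
> To see the offending buffer directly, one can inspect the chunks of the 
> filtered column before calling {{to_pandas()}}. A minimal sketch using 
> pyarrow's public buffer APIs (the exact sizes observed will depend on the 
> build, per the output below):
> {code:python}
> # For each chunk of the Int64 column, print its length, its reported nbytes,
> # and the size of every underlying buffer (validity bitmap, data)
> for chunk in table.column(0).chunks:
>     sizes = [buf.size if buf is not None else 0 for buf in chunk.buffers()]
>     print(len(chunk), chunk.nbytes, sizes)
> {code}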
>  
> Output:
> {code:bash}
> Arrow version: 3.1.0.dev578
> ---------------
> Size of array: 0
> ---------------
> Size of array: 4
> Traceback (most recent call last):
>   File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
>     df = table.to_pandas()
>   File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
>     return [_reconstruct_block(item, columns, extension_columns)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
>     return [_reconstruct_block(item, columns, extension_columns)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
>     pd_ext_arr = pandas_dtype.__from_arrow__(arr)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
>     data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
>   File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
>     data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
> ValueError: buffer size must be a multiple of element size
> {code}
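>  
> The final ValueError comes from NumPy: an int64 element is 8 bytes wide, so a 
> 4-byte data buffer cannot be reinterpreted as int64. A minimal sketch 
> reproducing just that last step of the traceback:
> {code:python}
> import numpy as np
> # 4 bytes is not a multiple of the 8-byte int64 element size, so this
> # raises the same "buffer size must be a multiple of element size" error
> np.frombuffer(b"\x00\x00\x00\x00", dtype=np.int64)
> {code}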



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
