Thomas Blauth created ARROW-12336:
-------------------------------------

             Summary: [C++][Python] Empty Int64 array is of wrong size
                 Key: ARROW-12336
                 URL: https://issues.apache.org/jira/browse/ARROW-12336
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
         Environment: macOS 10.15.7
Arrow version: 3.1.0.dev578
            Reporter: Thomas Blauth


Setup:

A table with an Int64 and a str column, generated using the dataset API and 
filtered on the str column.


Bug Description:

Calling {{table.to_pandas()}} fails due to an empty array in the ChunkedArray 
of the Int64 column. This empty array has a size of 4 bytes when using the 
Arrow nightly builds and 0 bytes when using Arrow 3.0.0.

Note: The bug does not occur when the table only contains an Int64 column.
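
To see where the stray bytes live, one can dump the buffer sizes of every chunk of the Int64 column. This is a minimal sketch, assuming the filtered {{table}} from the minimal example below; for an Int64 array, {{buffers()}} returns the validity bitmap followed by the values buffer.
{code:python}
# Sketch: inspect every chunk of the Int64 column of the filtered table.
# An empty chunk should report 0 bytes in its values buffer.
for i, chunk in enumerate(table.column(0).chunks):
    sizes = [buf.size if buf is not None else None for buf in chunk.buffers()]
    print(f"chunk {i}: length={len(chunk)}, nbytes={chunk.nbytes}, buffers={sizes}")
{code}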


Minimal example:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs
import pyarrow.parquet

print("Arrow version: " + str(pa.__version__))
print("---------------")

# Case 1: a table with only an Int64 column works fine
df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
table = pa.table(df)
path_0 = "./test_0.parquet"
pa.parquet.write_table(table, path_0)

schema = pa.parquet.read_schema(path_0)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_0],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))

print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
print("---------------")


# Case 2: a table with Int64 and str columns crashes
df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
df = df.astype({"Int_col": "Int64"})
table = pa.table(df)
path_1 = "./test_1.parquet"
pa.parquet.write_table(table, path_1)

schema = pa.parquet.read_schema(path_1)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_1],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))

print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
{code}

Output:
{code:bash}
Arrow version: 3.1.0.dev578
---------------
Size of array: 0
---------------
Size of array: 4
Traceback (most recent call last):
  File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 756, in 
pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
  File 
"/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
 line 794, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File 
"/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
 line 1135, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File 
"/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
 line 1135, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File 
"/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
 line 753, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File 
"/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py",
 line 117, in __from_arrow__
    data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
  File 
"/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py",
 line 32, in pyarrow_array_to_numpy_and_mask
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + 
len(arr)]
ValueError: buffer size must be a multiple of element size
{code}
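
The last frame points at the root cause: pandas reads the values buffer ({{buflist[1]}}) of the Arrow array with {{np.frombuffer}} at the int64 element size, and a 4-byte buffer is not a multiple of 8. The failure can be reproduced in isolation; the 4-byte buffer below is a stand-in for the one carried by the empty chunk on the nightly builds.
{code:python}
import numpy as np

# np.frombuffer requires the buffer size to be a multiple of the element
# size (8 bytes for int64); a 4-byte buffer triggers the same ValueError.
np.frombuffer(bytes(4), dtype=np.int64)
# ValueError: buffer size must be a multiple of element size
{code}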


