[jira] [Created] (ARROW-12066) [Python] Dataset API seg fault when filtering string column for None

2021-03-23 Thread Thomas Blauth (Jira)
Thomas Blauth created ARROW-12066:
----------------------------------

 Summary: [Python] Dataset API seg fault when filtering string 
column for None
 Key: ARROW-12066
 URL: https://issues.apache.org/jira/browse/ARROW-12066
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
 Environment: macOS 10.15.7
Reporter: Thomas Blauth


Loading a Parquet file via the Dataset API leads to a segmentation fault 
when filtering a string column for None values.

Minimal reproducing example: 
{code:python}
import pyarrow as pa
import pyarrow.dataset
import pyarrow.parquet
import pandas as pd

# Write a single-column Parquet file that contains a null string value.
path = "./test.parquet"
df = pd.DataFrame({"A": ("a", "b", None)})
pa.parquet.write_table(pa.table(df), path)

# Filtering the string column against a null scalar triggers the segfault.
ds = pa.dataset.dataset(path, format="parquet")
filter_expr = pa.dataset.field("A") == pa.dataset.scalar(None)
table = ds.to_table(filter=filter_expr)
{code}
Backtrace:
{code:bash}
(lldb) target create "/usr/local/mambaforge/envs/xxx/bin/python"
Current executable set to '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64).
(lldb) settings set -- target.run-args  "./tmp.py"
(lldb) r
Process 35235 launched: '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64)
Process 35235 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9)
    frame #0: 0x00010314be48 libarrow.300.0.0.dylib`arrow::Status arrow::VisitScalarInline(arrow::Scalar const&, arrow::ScalarHashImpl*) + 104
libarrow.300.0.0.dylib`arrow::VisitScalarInline:
->  0x10314be48 <+104>: cmpb   $0x0, 0x9(%rax)
    0x10314be4c <+108>: je     0x10314c0bc   ; <+732>
    0x10314be52 <+114>: movq   0x10(%rax), %rdi
    0x10314be56 <+118>: movq   0x20(%rax), %rsi
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9)
  * frame #0: 0x00010314be48 libarrow.300.0.0.dylib`arrow::Status arrow::VisitScalarInline(arrow::Scalar const&, arrow::ScalarHashImpl*) + 104
    frame #1: 0x00010314bd4f libarrow.300.0.0.dylib`arrow::ScalarHashImpl::AccumulateHashFrom(arrow::Scalar const&) + 111
    frame #2: 0x000103134bca libarrow.300.0.0.dylib`arrow::Scalar::Hash::hash(arrow::Scalar const&) + 42
    frame #3: 0x000132fa0ea8 libarrow_dataset.300.0.0.dylib`arrow::dataset::Expression::hash() const + 264
    frame #4: 0x000132fc913c libarrow_dataset.300.0.0.dylib`std::__1::__hash_const_iterator*> std::__1::__hash_table, std::__1::allocator >::find(arrow::dataset::Expression const&) const + 28
    frame #5: 0x000132faca9b libarrow_dataset.300.0.0.dylib`arrow::Result arrow::dataset::Modify(arrow::dataset::Expression, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_1 const&, arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*)::$_9 const&) + 123
    frame #6: 0x000132fac623 libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*) + 131
    frame #7: 0x000132fac76d libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression, arrow::compute::ExecContext*) + 461
    frame #8: 0x000132fb00cb libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression, arrow::dataset::Expression const&)::$_10::operator()() const + 75
    frame #9: 0x000132faf6b5 libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression, arrow::dataset::Expression const&) + 517
    frame #10: 0x000132f893f8 libarrow_dataset.300.0.0.dylib`arrow::dataset::Dataset::GetFragments(arrow::dataset::Expression) + 88
    frame #11: 0x000132f8d25c libarrow_dataset.300.0.0.dylib`arrow::dataset::GetFragmentsFromDatasets(std::__1::vector, std::__1::allocator > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr)::operator()(std::__1::shared_ptr) const + 76
    frame #12: 0x000132f8cd6c libarrow_dataset.300.0.0.dylib`arrow::MapIterator, std::__1::allocator > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr), std::__1::shared_ptr, arrow::Iterator > >::Next() + 316
    frame #13: 0x000132f8cb27 libarrow_dataset.300.0.0.dylib`arrow::Result > > arrow::Iterator > >::Next, std::__1::allocator > > const&, arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr), std::__1::shared_ptr, arrow::Iterator > > >(void*) + 39
    frame #14: 0x000132f8dcdb libarrow_dataset.300.0.0.dylib`arrow::Iterator > >::Next() + 43
    frame #15: 0x000132f8d692 libarrow_dataset.300.0.0.dylib`arrow::FlattenIterator >::Next() + 258
    frame #16: 0x000132f8d477 libarrow_dataset.300.0.0.dylib`arrow::Result > arrow::Iterator >::Next > >(void*) + 39
    frame #17: 0x000132f8de0b libarrow_dataset.300.0.0.dylib`arrow::It
{code}
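
Not part of the original report, but a possible workaround sketch: expressing the null check with {{is_valid()}} avoids constructing the null scalar whose hash computation crashes in the backtrace above. This assumes {{~field("A").is_valid()}} matches the intended "A is null" semantics:
{code:python}
import pyarrow as pa
import pyarrow.dataset

# Hypothetical workaround: select rows where "A" is null without
# building a null scalar, by negating the validity expression.
ds = pa.dataset.dataset("./test.parquet", format="parquet")
table = ds.to_table(filter=~pa.dataset.field("A").is_valid())
{code}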

[jira] [Created] (ARROW-12336) [C++][Python] Empty Int64 array is of wrong size

2021-04-12 Thread Thomas Blauth (Jira)
Thomas Blauth created ARROW-12336:
----------------------------------

 Summary: [C++][Python] Empty Int64 array is of wrong size
 Key: ARROW-12336
 URL: https://issues.apache.org/jira/browse/ARROW-12336
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
 Environment: macOS 10.15.7
Arrow version: 3.1.0.dev578
Reporter: Thomas Blauth


Setup:

A table with an Int64 and a str column, generated using the Dataset API and 
filtered on the str column.

 

Bug Description:

Calling {{table.to_pandas()}} fails due to an empty array inside the ChunkedArray 
of the Int64 column. This empty array has a size of 4 bytes when using the Arrow 
nightly builds and 0 bytes when using Arrow 3.0.0.

Note: The bug does not occur when the table contains only an Int64 column.

 

Minimal example:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet
import pyarrow.dataset

print("Arrow version: " + str(pa.__version__))
print("---")

# Only Int64: works fine
df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
table = pa.table(df)
path_0 = "./test_0.parquet"
pa.parquet.write_table(table, path_0)

schema = pa.parquet.read_schema(path_0)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_0],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))

print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
print("---")


# Int64 and str: crashes
df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
df = df.astype({"Int_col": "Int64"})
table = pa.table(df)
path_1 = "./test_1.parquet"
pa.parquet.write_table(table, path_1)

schema = pa.parquet.read_schema(path_1)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_1],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))

print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
{code}
 

Output:
{code:bash}
Arrow version: 3.1.0.dev578
---
Size of array: 0
---
Size of array: 4
Traceback (most recent call last):
  File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
    data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
ValueError: buffer size must be a multiple of element size
{code}
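
The stray 4-byte data buffer can be inspected directly. The following diagnostic sketch (my addition, not part of the report) prints the per-chunk buffer sizes of the Int64 column after the filtering step:
{code:python}
# Each Arrow array exposes its raw buffers (validity bitmap, data, ...),
# so the unexpected 4-byte data buffer of the empty chunk is visible here.
for i, chunk in enumerate(table.column(0).chunks):
    sizes = [buf.size if buf is not None else 0 for buf in chunk.buffers()]
    print(f"chunk {i}: length={len(chunk)}, buffer sizes={sizes}")
{code}
The ValueError itself follows directly from that buffer size: numpy cannot reinterpret a 4-byte buffer as int64, since 4 is not a multiple of the 8-byte element size. A minimal illustration:
{code:python}
import numpy as np

# Reinterpreting a 4-byte buffer as int64 (8 bytes per element) raises
# the same error seen in pyarrow_array_to_numpy_and_mask above.
np.frombuffer(b"\x00" * 4, dtype=np.int64)
# ValueError: buffer size must be a multiple of element size
{code}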


