The data is a TPC-H lineitem table at scale factor 10, written by
pyarrow.parquet.write_table with the parameters compression='snappy',
version='2.6', use_dictionary=True, row_group_size=1024,
data_page_version='2.0'. The key difference here is that row_group_size is
only 1K, which is a relatively low value.
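For reference, a minimal sketch of the write path with the parameters above. The real input is the TPC-H SF=10 lineitem table; the small in-memory table below is only a stand-in so the snippet is self-contained:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in table; the actual data is the TPC-H lineitem table at SF=10.
table = pa.table({
    "l_orderkey": pa.array(range(100_000), type=pa.int64()),
    "l_quantity": pa.array([float(i % 50) for i in range(100_000)]),
})

pq.write_table(
    table,
    "lineitem_1K.parquet",
    compression="snappy",
    version="2.6",
    use_dictionary=True,
    row_group_size=1024,        # only 1K rows per row group -> very many row groups
    data_page_version="2.0",
)
```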
The resulting Parquet file can be read by Spark, but reading it with
ParquetDataset and use_legacy_dataset=False results in a segmentation
fault. Setting use_legacy_dataset=True works fine.
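A minimal sketch of the two read paths (same file name as above):

```python
import pyarrow.parquet as pq

# New (default) dataset implementation: crashes with a segmentation fault.
# pq.ParquetDataset("lineitem_1K.parquet", use_legacy_dataset=False).read()

# Legacy implementation: reads the file without problems.
table = pq.ParquetDataset("lineitem_1K.parquet", use_legacy_dataset=True).read()
```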
I also find that when use_legacy_dataset=True, it is not possible to
pass filters to the API; the call fails with the following error:
Traceback (most recent call last):
  File "scripts/filter_exp.py", line 26, in <module>
    dataset = pq.ParquetDataset('lineitem_1K.parquet',
        filesystem=None, use_legacy_dataset=True,
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1439, in __init__
    self._filter(filters)
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1561, in _filter
    accepts_filter = self._partitions.filter_accepts_partition
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
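For completeness, a sketch of the call that produces this traceback. The filter expression here is hypothetical; any non-None filters value hits the same code path:

```python
import pyarrow.parquet as pq

# Passing filters together with use_legacy_dataset=True raises the
# AttributeError above, because the legacy dataset has no partitions here.
dataset = pq.ParquetDataset(
    "lineitem_1K.parquet",
    filesystem=None,
    use_legacy_dataset=True,
    filters=[("l_quantity", "<", 10)],   # hypothetical filter for illustration
)
```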
I am using pyarrow 7.0.0 on Ubuntu 20.04.