The data is a TPC-H lineitem table at scale factor 10, written by
pyarrow.parquet.write_table with the parameters compression='snappy',
version='2.6', use_dictionary=True, row_group_size=1024,
data_page_version='2.0'. The key difference here is that row_group_size is
only 1K, which is a relatively low value.
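For reference, a minimal sketch of the write path with the parameters above. The real input is the TPC-H SF=10 lineitem table; the small in-memory table below is only a stand-in so the snippet is self-contained:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in table; the actual data is the TPC-H lineitem table at SF=10.
table = pa.table({
    "l_orderkey": pa.array(range(100_000), type=pa.int64()),
    "l_quantity": pa.array([float(i % 50) for i in range(100_000)]),
})

pq.write_table(
    table,
    "lineitem_1K.parquet",
    compression="snappy",
    version="2.6",
    use_dictionary=True,
    row_group_size=1024,        # only 1K rows per row group -> very many row groups
    data_page_version="2.0",
)
```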
The resulting Parquet file can be read by Spark, but reading it with
ParquetDataset and use_legacy_dataset=False results in a segmentation
fault. Setting use_legacy_dataset=True works fine.
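A minimal sketch of the two read paths (same file name as above):

```python
import pyarrow.parquet as pq

# New (default) dataset implementation: crashes with a segmentation fault.
# pq.ParquetDataset("lineitem_1K.parquet", use_legacy_dataset=False).read()

# Legacy implementation: reads the file without problems.
table = pq.ParquetDataset("lineitem_1K.parquet", use_legacy_dataset=True).read()
```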
I also find that when use_legacy_dataset=True, it is not possible to
pass filters to the API; the call fails with the following error:
Traceback (most recent call last):
  File "scripts/filter_exp.py", line 26, in <module>
    dataset = pq.ParquetDataset('lineitem_1K.parquet',
        filesystem=None, use_legacy_dataset=True,
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1439, in __init__
    self._filter(filters)
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1561, in _filter
    accepts_filter = self._partitions.filter_accepts_partition
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
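For completeness, a sketch of the call that produces this traceback. The filter expression here is hypothetical; any non-None filters value hits the same code path:

```python
import pyarrow.parquet as pq

# Passing filters together with use_legacy_dataset=True raises the
# AttributeError above, because the legacy dataset has no partitions here.
dataset = pq.ParquetDataset(
    "lineitem_1K.parquet",
    filesystem=None,
    use_legacy_dataset=True,
    filters=[("l_quantity", "<", 10)],   # hypothetical filter for illustration
)
```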
I am using pyarrow 7.0.0 on Ubuntu 20.04.