Raúl Cumplido created ARROW-16526:
-------------------------------------

             Summary: [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET
                 Key: ARROW-16526
                 URL: https://issues.apache.org/jira/browse/ARROW-16526
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
            Reporter: Raúl Cumplido
             Fix For: 9.0.0

Our current [minimal_build examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build] for Python build with:
{code:java}
-DARROW_PARQUET=ON \
{code}
but without DATASET. This produces the following failure:
{code:java}
_________________________________________________________ test_partitioned_dataset[True] _________________________________________________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path), partition_cols=['one', 'two'])

pyarrow/tests/parquet/test_dataset.py:1544:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """Dataset is currently unstable. APIs subject to change without notice."""

    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like

>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
The failure can be reproduced by running the minimal_build examples:
{code:java}
$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
{code}
or by building Arrow and PyArrow with PARQUET but without DATASET.
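
As a rough illustration of the condition behind this failure, the sketch below checks whether the {{pyarrow._dataset}} extension module named in the traceback can be imported at all. The helper name {{module_available}} and the {{HAS_DATASET}} flag are hypothetical and only for illustration; this is not necessarily how the issue is (or should be) fixed in Arrow.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. pyarrow itself) is missing.
        return False


# pyarrow._dataset is the extension module from the traceback; it is only
# present when Arrow and PyArrow were built with DATASET enabled.
HAS_DATASET = module_available("pyarrow._dataset")

if not HAS_DATASET:
    print("pyarrow._dataset is not available; dataset-backed code paths will fail")
```

In a pytest suite, a flag like this could feed {{pytest.mark.skipif(not HAS_DATASET, reason="requires pyarrow.dataset")}} so that tests depending on the dataset module are skipped rather than failing with {{ModuleNotFoundError}}.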