Raúl Cumplido created ARROW-16526:
-------------------------------------

             Summary: [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET
                 Key: ARROW-16526
                 URL: https://issues.apache.org/jira/browse/ARROW-16526
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
            Reporter: Raúl Cumplido
             Fix For: 9.0.0


Our current [minimal_build examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build] for Python build with:
{code:java}
 -DARROW_PARQUET=ON \
{code}
but without DATASET.

This produces the following failure:
{code:java}
_________________________________________________________ test_partitioned_dataset[True] _________________________________________________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path),
                            partition_cols=['one', 'two'])

pyarrow/tests/parquet/test_dataset.py:1544: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """Dataset is currently unstable. APIs subject to change without notice."""
    
    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
    
>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
This can be reproduced by running the minimal_build examples:
{code:java}
$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
{code}
or by building Arrow and PyArrow with PARQUET but without DATASET.
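
For illustration only, here is a minimal sketch of how a test or caller could detect that the optional dataset extension was not built, rather than hitting the ModuleNotFoundError shown in the traceback. The helper name optional_module_available and the HAS_DATASET flag are hypothetical and are not part of pyarrow; this is not necessarily how the issue should be fixed.

```python
import importlib


def optional_module_available(name):
    """Return True if the module `name` can be imported, False otherwise.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    try:
        importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so this also
        # covers the "No module named 'pyarrow._dataset'" case above.
        return False
    return True


# Hypothetical flag: True only when the dataset extension module was built.
HAS_DATASET = optional_module_available("pyarrow._dataset")
```

A test that depends on the dataset module could then be skipped or take a different code path when HAS_DATASET is False.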



--
This message was sent by Atlassian Jira
(v8.20.7#820007)