[ https://issues.apache.org/jira/browse/ARROW-13922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-13922:
------------------------------------------
    Labels: good-second-issue  (was: )

> ParquetDataset throws error when len(path_or_paths) = 1
> -------------------------------------------------------
>
>                 Key: ARROW-13922
>                 URL: https://issues.apache.org/jira/browse/ARROW-13922
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Ashish Gupta
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: good-second-issue
>
> After updating pyarrow to version 5.0.0, ParquetDataset doesn't accept a list of length 1 for path_or_paths. Is this by design or a bug?
>
> {code:java}
> In [1]: import pyarrow.parquet as pq
>
> In [2]: import pandas as pd
>
> In [3]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>
> In [4]: df.to_parquet('test.parquet', index=False)
>
> In [5]: pq.ParquetDataset('test.parquet', use_legacy_dataset=False).read(use_threads=False).to_pandas()
> Out[5]:
>    A  B
> 0  1  a
> 1  2  b
> 2  3  c
>
> In [6]: pq.ParquetDataset(['test.parquet'], use_legacy_dataset=False).read(use_threads=False).to_pandas()
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> ValueError: cannot construct a FileSource from a path without a FileSystem
> Exception ignored in: 'pyarrow._dataset._make_file_source'
> Traceback (most recent call last):
>   File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 1676, in __init__
>     fragment = parquet_format.make_fragment(single_file, filesystem)
> ValueError: cannot construct a FileSource from a path without a FileSystem
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-6-ed8ec622cb5b> in <module>
> ----> 1 pq.ParquetDataset(['test.parquet'], use_legacy_dataset=False).read(use_threads=False).to_pandas()
>
> /data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py in __new__(cls, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset, pre_buffer, coerce_int96_timestamp_unit)
>    1284
>    1285         if not use_legacy_dataset:
> -> 1286             return _ParquetDatasetV2(
>    1287                 path_or_paths, filesystem=filesystem,
>    1288                 filters=filters,
>
> /data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
>    1677
>    1678         self._dataset = ds.FileSystemDataset(
> -> 1679             [fragment], schema=fragment.physical_schema,
>    1680             format=parquet_format,
>    1681             filesystem=fragment.filesystem
>
> /data/install/anaconda3/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
>
> /data/install/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
>
> /data/install/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Called Open() on an uninitialized FileSource
>
> In [7]: pq.ParquetDataset(['test.parquet', 'test.parquet'], use_legacy_dataset=False).read(use_threads=False).to_pandas()
> Out[7]:
>    A  B
> 0  1  a
> 1  2  b
> 2  3  c
> 3  1  a
> 4  2  b
> 5  3  c
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)