[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265909#comment-17265909 ]
Lance Dacey commented on ARROW-11250:
-------------------------------------

Sure, I can raise an issue there.

{code:java}
fs_pa.get_file_info("dev/test-split")
<FileInfo for 'dev/test-split': type=FileType.NotFound>
{code}

I had to tweak the code you provided a bit to get it to run for the FileSelector:

{code:java}
fs_pa.get_file_info(FileSelector("dev/test-split", recursive=True))
[<FileInfo for 'dev/test-split/row_date=2020-12-31': type=FileType.Directory>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/428445ed3a854cbfb3025389477811a3-0.parquet': type=FileType.File, size=2261024>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/6f9f4f5c5d0e494fbf7420540765afbc-0.parquet': type=FileType.File, size=713840>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/7a29a1eb3f464c9c955e23e05c2e2c28-0.parquet': type=FileType.File, size=627492>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/7bfb6fdd2b404ad88022357c486f17de-0.parquet': type=FileType.File, size=290697>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/c448b5f025f241d9b6f77ed5eead239c-0.parquet': type=FileType.File, size=463202>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/cde0bb54687642e3aace011d7a106947-0.parquet': type=FileType.File, size=713840>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/d5a0ebc05c974158a818f9b216e4093a-0.parquet': type=FileType.File, size=928676>,
 ...
]
{code}

FYI - if I add an ending slash to the path I get type=Directory instead of NotFound:

{code:java}
fs_pa.get_file_info("dev/test-split/")
<FileInfo for 'dev/test-split/': type=FileType.Directory>
{code}

> [Python] Inconsistent behavior calling ds.dataset()
> ---------------------------------------------------
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal                  1.2.5    pyh9f0ad1d_0    conda-forge
> adlfs                 0.5.9    pyhd8ed1ab_0    conda-forge
> apache-airflow        1.10.14  pypi_0          pypi
> azure-common          1.1.24   py_0            conda-forge
> azure-core            1.9.0    pyhd3deb0d_0    conda-forge
> azure-datalake-store  0.0.51   pyh9f0ad1d_0    conda-forge
> azure-identity        1.5.0    pyhd8ed1ab_0    conda-forge
> azure-nspkg           3.0.2    py_0            conda-forge
> azure-storage-blob    12.6.0   pyhd3deb0d_0    conda-forge
> azure-storage-common  2.1.0    py37hc8dfbb8_3  conda-forge
> fsspec                0.8.5    pyhd8ed1ab_0    conda-forge
> jupyterlab_pygments   0.1.2    pyh9f0ad1d_0    conda-forge
> pandas                1.2.0    py37ha9443f7_0
> pyarrow               2.0.0    py37h4935f41_6_cpu  conda-forge
> Reporter: Lance Dacey
> Priority: Minor
> Labels: azureblob, dataset, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a
> dataset which certainly exists on Azure Blob.
>
> {code:java}
> fs = fsspec.filesystem("abfs", account_name=account_name, account_key=account_key)
> {code}
>
> One example of this is reading a dataset in one cell:
>
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> {code}
>
> Then in another cell I try to read the same dataset:
>
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---------------------------------------------------------------------------
> FileNotFoundError                         Traceback (most recent call last)
> <ipython-input-514-bf63585a0c1b> in <module>
> ----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     669     # TODO(kszucs): support InMemoryDataset for a table input
>     670     if _is_path_like(source):
> --> 671         return _filesystem_dataset(source, **kwargs)
>     672     elif isinstance(source, (tuple, list)):
>     673         if all(_is_path_like(elem) for elem in source):
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>     427     else:
> --> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
>     429
>     430     options = FileSystemFactoryOptions(
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
>     402         paths_or_selector = [path]
>     403     else:
> --> 404         raise FileNotFoundError(path)
>     405
>     406     return filesystem, paths_or_selector
>
> FileNotFoundError: dev/test-split
> {code}
>
> If I reset the kernel, it works again.
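A note on the "works after a kernel reset" behavior: fsspec caches filesystem instances (calling `fsspec.filesystem` with the same arguments returns the same object across cells), and many implementations also keep a directory-listing cache, so a stale listing is worth ruling out before anything else. This is a speculative sketch, not a confirmed diagnosis; fsspec's in-memory filesystem stands in for `abfs` here:

```python
# Hedged sketch: fsspec returns a cached, shared instance for identical
# constructor arguments, so a stale directory-listing cache on that shared
# object could explain a path that "disappears" until the kernel restarts.
import fsspec

fs_a = fsspec.filesystem("memory")
fs_b = fsspec.filesystem("memory")
print(fs_a is fs_b)  # True: the same instance is reused between cells

fs_a.pipe_file("/dev/test-split/row_date=2020-12-31/0.parquet", b"PAR1")

# Public hook to drop cached listings (a no-op for backends without a cache);
# worth calling before retrying ds.dataset(...) on abfs.
fs_a.invalidate_cache()
print(fs_a.exists("/dev/test-split/row_date=2020-12-31/0.parquet"))  # True
```

If the `FileNotFoundError` stops reproducing after `fs.invalidate_cache()`, the problem is in the fsspec/adlfs caching layer rather than in pyarrow.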
> It also works if I change the path slightly, like adding a "/" at the end (so basically it just does not work if I read the same dataset twice):
>
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>
> The other strange behavior I have noticed is that if I read a dataset inside of my Jupyter notebook,
>
> {code:java}
> %%time
> dataset = ds.dataset("dev/test-split",
>                      partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"),
>                      filesystem=fs,
>                      exclude_invalid_files=False)
>
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
> Wall time: 2.58 s
> {code}
>
> Now, on the exact same server, when I try to run the same code against the same dataset in Airflow, it takes over 3 minutes (comparing the timestamps in my logs between right before I read the dataset and immediately after the dataset is available to filter):
>
> {code:java}
> [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
> [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
> {code}
>
> This is probably not a pyarrow issue, but what are some potential causes that I can look into? I have one example where it takes 9 seconds to read the dataset in Jupyter, but then 11 *minutes* in Airflow. I don't know what to really investigate - as I mentioned, the Jupyter notebook and Airflow are on the same server, and both are deployed using Docker. Airflow is using the CeleryExecutor.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)