[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265909#comment-17265909 ]
Lance Dacey commented on ARROW-11250:
-------------------------------------

Sure, I can raise an issue there.

{code:java}
fs_pa.get_file_info("dev/test-split")
<FileInfo for 'dev/test-split': type=FileType.NotFound>
{code}

I had to tweak the code you provided a bit to get it to run for the FileSelector:

{code:java}
fs_pa.get_file_info(FileSelector("dev/test-split", recursive=True))
[<FileInfo for 'dev/test-split/row_date=2020-12-31': type=FileType.Directory>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/428445ed3a854cbfb3025389477811a3-0.parquet': type=FileType.File, size=2261024>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/6f9f4f5c5d0e494fbf7420540765afbc-0.parquet': type=FileType.File, size=713840>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/7a29a1eb3f464c9c955e23e05c2e2c28-0.parquet': type=FileType.File, size=627492>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/7bfb6fdd2b404ad88022357c486f17de-0.parquet': type=FileType.File, size=290697>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/c448b5f025f241d9b6f77ed5eead239c-0.parquet': type=FileType.File, size=463202>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/cde0bb54687642e3aace011d7a106947-0.parquet': type=FileType.File, size=713840>,
 <FileInfo for 'dev/test-split/row_date=2020-12-31/d5a0ebc05c974158a818f9b216e4093a-0.parquet': type=FileType.File, size=928676>,
 ...
]
{code}

FYI - if I add an ending slash to the path I get type=Directory instead of NotFound:

{code:java}
fs_pa.get_file_info("dev/test-split/")
<FileInfo for 'dev/test-split/': type=FileType.Directory>
{code}

> [Python] Inconsistent behavior calling ds.dataset()
> ---------------------------------------------------
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal                  1.2.5    pyh9f0ad1d_0    conda-forge
> adlfs                 0.5.9    pyhd8ed1ab_0    conda-forge
> apache-airflow        1.10.14  pypi_0          pypi
> azure-common          1.1.24   py_0            conda-forge
> azure-core            1.9.0    pyhd3deb0d_0    conda-forge
> azure-datalake-store  0.0.51   pyh9f0ad1d_0    conda-forge
> azure-identity        1.5.0    pyhd8ed1ab_0    conda-forge
> azure-nspkg           3.0.2    py_0            conda-forge
> azure-storage-blob    12.6.0   pyhd3deb0d_0    conda-forge
> azure-storage-common  2.1.0    py37hc8dfbb8_3  conda-forge
> fsspec                0.8.5    pyhd8ed1ab_0    conda-forge
> jupyterlab_pygments   0.1.2    pyh9f0ad1d_0    conda-forge
> pandas                1.2.0    py37ha9443f7_0
> pyarrow               2.0.0    py37h4935f41_6_cpu  conda-forge
> Reporter: Lance Dacey
> Priority: Minor
> Labels: azureblob, dataset, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a
> dataset which certainly exists on Azure Blob.
>
> {code:java}
> fs = fsspec.filesystem("abfs", account_name=account_name, account_key=account_key)
> {code}
>
> One example of this is reading a dataset in one cell:
>
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> {code}
>
> Then in another cell I try to read the same dataset:
>
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---------------------------------------------------------------------------
> FileNotFoundError                         Traceback (most recent call last)
> <ipython-input-514-bf63585a0c1b> in <module>
> ----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     669     # TODO(kszucs): support InMemoryDataset for a table input
>     670     if _is_path_like(source):
> --> 671         return _filesystem_dataset(source, **kwargs)
>     672     elif isinstance(source, (tuple, list)):
>     673         if all(_is_path_like(elem) for elem in source):
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>     427     else:
> --> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
>     429
>     430     options = FileSystemFactoryOptions(
>
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
>     402         paths_or_selector = [path]
>     403     else:
> --> 404         raise FileNotFoundError(path)
>     405
>     406     return filesystem, paths_or_selector
>
> FileNotFoundError: dev/test-split
> {code}
>
> If I reset the kernel, it works again.
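A note on the "works after a kernel reset" behavior: fsspec caches filesystem instances (calling `fsspec.filesystem` with the same arguments returns the same object across cells), and many implementations also keep a directory-listing cache, so a stale listing is worth ruling out before anything else. This is a speculative sketch, not a confirmed diagnosis; fsspec's in-memory filesystem stands in for `abfs` here:

```python
# Hedged sketch: fsspec returns a cached, shared instance for identical
# constructor arguments, so a stale directory-listing cache on that shared
# object could explain a path that "disappears" until the kernel restarts.
import fsspec

fs_a = fsspec.filesystem("memory")
fs_b = fsspec.filesystem("memory")
print(fs_a is fs_b)  # True: the same instance is reused between cells

fs_a.pipe_file("/dev/test-split/row_date=2020-12-31/0.parquet", b"PAR1")

# Public hook to drop cached listings (a no-op for backends without a cache);
# worth calling before retrying ds.dataset(...) on abfs.
fs_a.invalidate_cache()
print(fs_a.exists("/dev/test-split/row_date=2020-12-31/0.parquet"))  # True
```

If the `FileNotFoundError` stops reproducing after `fs.invalidate_cache()`, the problem is in the fsspec/adlfs caching layer rather than in pyarrow.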
> It also works if I change the path slightly, like adding a "/" at the end (so basically it just does not work if I read the same dataset twice):
>
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>
> The other strange behavior I have noticed is that if I read a dataset inside of my Jupyter notebook,
>
> {code:java}
> %%time
> dataset = ds.dataset("dev/test-split",
>                      partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"),
>                      filesystem=fs,
>                      exclude_invalid_files=False)
>
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
> Wall time: 2.58 s
> {code}
>
> Now, on the exact same server, when I try to run the same code against the same dataset in Airflow, it takes over 3 minutes (comparing the timestamps in my logs between right before I read the dataset and immediately after the dataset is available to filter):
>
> {code:java}
> [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
> [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
> {code}
>
> This is probably not a pyarrow issue, but what are some potential causes that I can look into? I have one example where it takes 9 seconds to read the dataset in Jupyter, but then 11 *minutes* in Airflow. I don't know what to really investigate - as I mentioned, the Jupyter notebook and Airflow are on the same server, and both are deployed using Docker. Airflow is using the CeleryExecutor.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)