[ https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Bourbeau updated ARROW-18436:
-----------------------------------
    Description:
When attempting to create a new filesystem object from a public dataset in S3, where the object key contains a space, an error is raised.

Here's a minimal reproducer:
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
{code}
which fails with the following traceback:
{code:java}
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
{code}
Note that things work if I use a different dataset that doesn't have a space in the URI, or if I replace the portion of the URI that has a space with a `*` wildcard:
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet")  # works
result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
{code}
The wildcard isn't necessarily equivalent to the original failing URI, but I think it highlights that the space is somehow problematic.

  was:
When attempting to create a new filesystem object from a public dataset in S3, where the object key contains a space, an error is raised.

Here's a minimal reproducer:
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
{code}
which fails with the following traceback:
{code:java}
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
{code}
Note that things work if I use a different dataset that doesn't have a space in the URI, or if I replace the portion of the URI that has a space with a `*` wildcard:
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
{code}
The wildcard isn't necessarily equivalent to the original failing URI, but I think it highlights that the space is somehow problematic.


> `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
> -------------------------------------------------------------
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
> Reporter: James Bourbeau
> Priority: Minor
>
> When attempting to create a new filesystem object from a public dataset in
> S3, where the object key contains a space, an error is raised.
>
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
> {code}
> which fails with the following traceback:
>
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
> {code}
>
> Note that things work if I use a different dataset that doesn't have a space
> in the URI, or if I replace the portion of the URI that has a space with a
> `*` wildcard:
>
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet")  # works
> result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
> {code}
>
> The wildcard isn't necessarily equivalent to the original failing URI, but I
> think it highlights that the space is somehow problematic.
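
A literal space is not a valid character in a URI, which may be why the parser rejects the string above. One possible workaround, not verified in this report, is to percent-encode the path before handing it to `FileSystem.from_uri`. The sketch below uses only the Python standard library; `percent_encode_uri_path` is a hypothetical helper name, and whether `from_uri` accepts the encoded form is an assumption.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_uri_path(uri: str) -> str:
    """Percent-encode the path component of a URI (hypothetical helper).

    Caveat: a path that is already percent-encoded would be encoded a
    second time, because "%" itself is escaped to "%25".
    """
    parts = urlsplit(uri)
    # quote() keeps "/" as-is by default and turns " " into "%20".
    return urlunsplit(parts._replace(path=quote(parts.path)))


raw = "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet"
encoded = percent_encode_uri_path(raw)
print(encoded)

# Assumption, not confirmed by this report: the encoded URI can then be
# passed to pyarrow, e.g. FileSystem.from_uri(encoded).
```

URIs without special characters, such as the `ursa-labs-taxi-data` one above, pass through the helper unchanged.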