[ https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Bourbeau updated ARROW-18436:
-----------------------------------
    Description: 
When attempting to create a new filesystem object from a public dataset in S3 
whose object path (not the bucket name) contains a space, an error is raised.

 

Here's a minimal reproducer:
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
{code}
which fails with the following traceback:

 
{code:java}
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
{code}
 

Note that things work if I use a different dataset that doesn't have a space in 
the URI, or if I replace the portion of the URI that has a space with a `*` 
wildcard:

 
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet")  # works
result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
{code}
 

The wildcard isn't necessarily equivalent to the original failing URI, but I 
think it highlights that the space is somehow problematic.

  was:
When attempting to create a new filesystem object from a public dataset in S3 
whose object path (not the bucket name) contains a space, an error is raised.

 

Here's a minimal reproducer:
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
{code}
which fails with the following traceback:

 
{code:java}
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
{code}
 

Note that things work if I use a different dataset that doesn't have a space in 
the URI, or if I replace the portion of the URI that has a space with a `*` 
wildcard:

 

 
{code:java}
from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
{code}
 

The wildcard isn't necessarily equivalent to the original failing URI, but I 
think it highlights that the space is somehow problematic.


> `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
> -------------------------------------------------------------
>
>                 Key: ARROW-18436
>                 URL: https://issues.apache.org/jira/browse/ARROW-18436
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 10.0.1
>         Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>            Reporter: James Bourbeau
>            Priority: Minor
>
> When attempting to create a new filesystem object from a public dataset in 
> S3 whose object path (not the bucket name) contains a space, an error is 
> raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
> {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
> {code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard:
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet")  # works
> result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is somehow problematic.
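
A raw space is not a valid character in a URI, so one workaround worth trying (not confirmed anywhere in this report) is to percent-encode the path before handing the URI to `FileSystem.from_uri`. The sketch below uses only the Python standard library to build the encoded string. The helper name `percent_encode_uri_path` is hypothetical, and whether pyarrow 10.0.1 accepts and decodes the result is an assumption that would need to be checked against the reported environment.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_uri_path(uri: str) -> str:
    """Percent-encode only the path component of a URI (hypothetical helper)."""
    parts = urlsplit(uri)
    # quote() keeps "/" as-is by default and turns " " into "%20".
    # Caveat: a path that is already percent-encoded would be encoded twice.
    return urlunsplit(parts._replace(path=quote(parts.path)))


uri = "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet"
encoded = percent_encode_uri_path(uri)
print(encoded)

# Assumption, untested here: the encoded URI could then be passed along, e.g.
#   from pyarrow.fs import FileSystem
#   result = FileSystem.from_uri(encoded)
```

The bucket name (`nyc-tlc`) is left untouched; only the object path after it is encoded.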



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
