[ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17045:
-------------------------------
    Summary: [C++] Reject trailing slashes on file path  (was: [C++] GCS 
doesn't drop ending slash for files)

> [C++] Reject trailing slashes on file path
> ------------------------------------------
>
>                 Key: ARROW-17045
>                 URL: https://issues.apache.org/jira/browse/ARROW-17045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 8.0.0
>            Reporter: Will Jones
>            Assignee: Will Jones
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We had several different behaviors when passing in file paths with trailing 
> slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
> slash, and GCS would keep the trailing slash as part of the file name (later 
> creating confusion as the file would be labelled a "directory" in list 
> calls). This PR aligns all three with the LocalFileSystem behavior: a file 
> path with a trailing slash now returns IOError.
> The R filesystem bindings relied on the behavior provided by S3, so they are 
> now modified to trim the trailing slash before passing down to C++.
> Here is an example of the differences in behavior between S3 and GCS:
> {code:python}
> import pyarrow.fs
> from pyarrow.fs import FileSelector
> from datetime import timedelta
> gcs = pyarrow.fs.GcsFileSystem(
>     endpoint_override="localhost:9001",
>     scheme="http",
>     anonymous=True,
>     retry_time_limit=timedelta(seconds=1),
> )
> gcs.create_dir("py_test")
> # Writing to test.txt with and without slash produces a file and a directory!?
> with gcs.open_output_stream("py_test/test.txt") as out_stream:
>     out_stream.write(b"Hello world!")
> with gcs.open_output_stream("py_test/test.txt/") as out_stream:
>     out_stream.write(b"Hello world!")
> gcs.get_file_info(FileSelector("py_test"))
> # [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>,
> #  <FileInfo for 'py_test/test.txt': type=FileType.Directory>]
> s3 = pyarrow.fs.S3FileSystem(
>     access_key="minioadmin",
>     secret_key="minioadmin",
>     scheme="http",
>     endpoint_override="localhost:9000",
>     allow_bucket_creation=True,
>     allow_bucket_deletion=True,
> )
> s3.create_dir("py-test")
> # Writing to test.txt with and without slash writes to same file
> with s3.open_output_stream("py-test/test.txt") as out_stream:
>     out_stream.write(b"Hello world!")
> with s3.open_output_stream("py-test/test.txt/") as out_stream:
>     out_stream.write(b"Hello world!")
> s3.get_file_info(FileSelector("py-test"))
> # [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
> {code}
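
The description says callers that relied on the S3 trimming behavior now have to trim the trailing slash themselves before calling into C++. Below is a minimal pure-Python sketch of that idea. The helper names `strip_trailing_slash` and `check_file_path` are hypothetical and are not part of pyarrow or the R bindings; `check_file_path` only imitates the new "reject" behavior by raising `OSError` as a stand-in for the IOError status.

```python
def strip_trailing_slash(path: str) -> str:
    """Trim trailing slashes from a file path (hypothetical caller-side helper)."""
    stripped = path.rstrip("/")
    # Keep a bare root path such as "/" intact instead of returning "".
    return stripped if stripped else path


def check_file_path(path: str) -> str:
    """Imitate the unified behavior: reject file paths ending in a slash.

    Raises OSError as a stand-in for the IOError the filesystems return.
    """
    if len(path) > 1 and path.endswith("/"):
        raise OSError(f"Not a regular file (trailing slash): '{path}'")
    return path


# A caller that previously relied on S3 trimming the slash would now do:
safe_path = check_file_path(strip_trailing_slash("py-test/test.txt/"))
print(safe_path)
```

Trimming on the caller side keeps the filesystem layer strict while preserving the old convenience for bindings that want it.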



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
