[ 
https://issues.apache.org/jira/browse/ARROW-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-17045:
-------------------------------
    Description: 
We had several different behaviors when passing in file paths with trailing 
slashes: LocalFileSystem would return IOError, S3 would trim off the trailing 
slash, and GCS would keep the trailing slash as part of the file name (later 
creating confusion as the file would be labelled a "directory" in list calls). 
This PR moves them all to the behavior of LocalFileSystem: return IOError.
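
The unified behavior can be illustrated with a minimal pure-Python sketch. This is not the actual C++ implementation: the helper name {{assert_no_trailing_slash}} and the error message are hypothetical, and the sketch assumes the C++ IOError surfaces in Python as OSError.

```python
def assert_no_trailing_slash(path: str) -> str:
    """Return `path` unchanged, or raise OSError if it ends with a slash.

    Hypothetical helper for illustration only; not part of the Arrow API.
    """
    if path.endswith("/"):
        # Mirrors the LocalFileSystem behavior described above: reject the path.
        raise OSError(f"File path cannot end with a trailing slash: {path!r}")
    return path


# A path without a trailing slash passes through untouched.
assert_no_trailing_slash("py_test/test.txt")

# A path with a trailing slash is rejected.
try:
    assert_no_trailing_slash("py_test/test.txt/")
except OSError as exc:
    print(exc)
```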

The R filesystem bindings relied on the behavior provided by S3, so they are 
now modified to trim the trailing slash before passing down to C++.
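
The trimming step can be sketched roughly as follows. The real change lives in the R bindings; this Python version, including the name {{trim_trailing_slash}} and the choice to leave a bare root path alone, is an assumption made for illustration.

```python
def trim_trailing_slash(path: str) -> str:
    """Drop trailing slashes from `path` before handing it to the filesystem.

    Hypothetical helper for illustration only; not part of the Arrow API.
    """
    stripped = path.rstrip("/")
    # If the path consisted only of slashes (e.g. the root "/"), keep it as-is.
    return stripped if stripped else path


print(trim_trailing_slash("py-test/test.txt/"))
print(trim_trailing_slash("py-test/test.txt"))
```

With this normalization applied first, callers that previously relied on S3 trimming the slash keep working even though the C++ layer now rejects such paths.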

Here is an example of the differences in behavior between S3 and GCS:

{code:python}
import pyarrow.fs
from pyarrow.fs import FileSelector
from datetime import timedelta

gcs = pyarrow.fs.GcsFileSystem(
    endpoint_override="localhost:9001",
    scheme="http",
    anonymous=True,
    retry_time_limit=timedelta(seconds=1),
)

gcs.create_dir("py_test")

# Writing to test.txt with and without slash produces a file and a directory!?
with gcs.open_output_stream("py_test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")
with gcs.open_output_stream("py_test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")
gcs.get_file_info(FileSelector("py_test"))
# [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>,
#  <FileInfo for 'py_test/test.txt': type=FileType.Directory>]

s3 = pyarrow.fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
    endpoint_override="localhost:9000",
    allow_bucket_creation=True,
    allow_bucket_deletion=True,
)

s3.create_dir("py-test")

# Writing to test.txt with and without slash writes to same file
with s3.open_output_stream("py-test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")
with s3.open_output_stream("py-test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")
s3.get_file_info(FileSelector("py-test"))
# [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
{code}


  was:
There is inconsistent behavior between GCS and S3 when creating files. I'm not 
yet sure whether this is an implementation difference or a difference between 
MinIO and the GCS testbench.

Example:

{code:python}
import pyarrow.fs
from pyarrow.fs import FileSelector
from datetime import timedelta

gcs = pyarrow.fs.GcsFileSystem(
    endpoint_override="localhost:9001",
    scheme="http",
    anonymous=True,
    retry_time_limit=timedelta(seconds=1),
)

gcs.create_dir("py_test")
with gcs.open_output_stream("py_test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")

with gcs.open_output_stream("py_test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")

gcs.get_file_info(FileSelector("py_test"))
# [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>,
#  <FileInfo for 'py_test/test.txt': type=FileType.Directory>]

s3 = pyarrow.fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
    endpoint_override="localhost:9000",
    allow_bucket_creation=True,
    allow_bucket_deletion=True,
)

s3.create_dir("py-test")
with s3.open_output_stream("py-test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")
with s3.open_output_stream("py-test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")

s3.get_file_info(FileSelector("py-test"))
# [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
{code}


> [C++] GCS doesn't drop ending slash for files
> ---------------------------------------------
>
>                 Key: ARROW-17045
>                 URL: https://issues.apache.org/jira/browse/ARROW-17045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 8.0.0
>            Reporter: Will Jones
>            Assignee: Will Jones
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
