Hello Aldrin,

It's not either/or: the directory marker is created whenever necessary, for example when CreateDir() is called.

Regards

Antoine.


On 15/07/2024 at 19:20, Aldrin wrote:
Thanks Antoine!

Preserving the property across multiple clients (and presumably across 
independent sessions of the same client) is the part that I was missing.

From the link you shared, I saw an AWS page discussing the use of folders in the S3 console [1].
Their approach is to create the marker on folder creation. Instead of adding an `S3Options`
property for avoiding marker creation on delete, what if the time of marker creation changed
from "on delete" to "on creation"? That would seem to align better with tools
like the S3 console and Cyberduck, and would simplify the overall consensus logic that Felipe
mentioned as being a potential pitfall for the 3rd proposed solution (folder creation should
occur far less often than file deletion/move/replace).
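
To make the trade-off concrete, here is a minimal sketch that compares the two marker policies on a toy in-memory store. Everything in it (`ToyS3FileSystem`, `marker_on_delete`, `scenario`) is hypothetical and only models the idea; it is not Arrow's `S3FileSystem`, and it assumes the "on delete" policy re-asserts the parent marker after every delete.

```python
class ToyS3FileSystem:
    """Toy model: a dict of keys plus a counter of PUT requests."""

    def __init__(self, marker_on_delete):
        self.objects = {}
        self.marker_on_delete = marker_on_delete  # hypothetical policy switch
        self.put_requests = 0

    def _put(self, key):
        self.objects[key] = b""
        self.put_requests += 1

    def create_dir(self, path):
        # Both policies write the marker when a directory is created.
        self._put(path.rstrip("/") + "/")

    def write_file(self, path):
        self._put(path)

    def delete_file(self, path):
        del self.objects[path]
        if self.marker_on_delete:
            # Simplification: always re-assert the parent marker after a delete.
            self._put(path.rsplit("/", 1)[0] + "/")

    def dir_exists(self, path):
        prefix = path.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.objects)


def scenario(marker_on_delete, call_create_dir):
    """Write three files under data/, delete them, report (dir exists, PUT count)."""
    fs = ToyS3FileSystem(marker_on_delete)
    if call_create_dir:
        fs.create_dir("data")
    for name in ("a", "b", "c"):
        fs.write_file(f"data/{name}")
    for name in ("a", "b", "c"):
        fs.delete_file(f"data/{name}")
    return fs.dir_exists("data"), fs.put_requests


created_with_delete_markers = scenario(True, True)      # (True, 7)
created_without_delete_markers = scenario(False, True)  # (True, 4)
uploaded_with_delete_markers = scenario(True, False)    # (True, 6)
uploaded_without_delete_markers = scenario(False, False)  # (False, 3)
```

In this toy, writing the marker only on creation saves requests when the directory was created explicitly, but the directory vanishes when the files were uploaded without a prior `create_dir` call (for instance by another tool).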

I'm not sure if this is already an option (I don't know much about the 
S3Filesystem implementation of Arrow) or was an old option that was changed in 
favor of creating the marker on deletion.


[1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html





# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Monday, July 15th, 2024 at 07:59, Antoine Pitrou <anto...@python.org> wrote:

No, because these markers also communicate the information to other
implementations of S3 abstractions.


An example of this is: https://docs.cyberduck.io/protocols/s3/#folders


Regards


Antoine.




On 13/07/2024 at 07:15, Aldrin wrote:


...then I still expect the directory /foo to exist


Right, but if that is the sole purpose of empty directory markers, I'm curious 
if there was an attempt at keeping track of the prefixes/directories locally?




On Friday, July 12th, 2024 at 19:44, Hyunseok Seo hsseo0...@gmail.com wrote:


I wonder why S3 (object storage) operates based on file system semantics.
Python users are usually data scientists. They might not be familiar with
the differences between object storage and file storage. Furthermore, I
think there are a lot of pyarrow users.


Avoiding file by file operations so that internal functions can batch as
much as possible.


Thank you for the detailed explanation. So, are you suggesting that a more
fundamental solution is needed rather than just adding options? I thought
supporting such options would help users who do not want markers, despite
the issues you mentioned. Furthermore, I agree that supporting ObjectStore
is necessary for a more fundamental solution.


Thank you.


On Sat, Jul 13, 2024 at 10:00 AM, Weston Pace weston.p...@gmail.com wrote:


I think my question is still relevant: no matter what semantics
`S3FileSystem` is trying to provide, I'm still not sure how the placeholder
object helps. I assume it's for listing objects, but what else?


If I have a local filesystem and I delete a file /foo/bar then I still
expect the directory /foo to exist.


```
mkdir /foo
touch /foo/bar
rm /foo/bar
ls / # should show /foo
```


In an object store there is no `mkdir` and, even if I remove /foo/bar then
there is no guarantee /foo will exist.
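
The same sequence can be replayed against a flat key/value model of an object store. This is only a sketch: `ToyObjectStore` and its methods are made-up names, and the zero-byte `foo/` key stands in for the empty directory marker under discussion.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat object store: keys map to bytes, nothing else."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data=b""):
        self.objects[key] = data

    def delete(self, key):
        self.objects.pop(key, None)

    def dir_exists(self, prefix):
        # A "directory" is visible only while some key starts with "<prefix>/".
        prefix = prefix.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.objects)


# Without a marker: the prefix disappears together with its last object.
store = ToyObjectStore()
store.put("foo/bar")
store.delete("foo/bar")
without_marker = store.dir_exists("foo")  # False

# With a zero-byte marker object: the prefix survives the delete.
store = ToyObjectStore()
store.put("foo/")
store.put("foo/bar")
store.delete("foo/bar")
with_marker = store.dir_exists("foo")  # True
```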


On Fri, Jul 12, 2024, 2:50 PM Aldrin octalene....@pm.me.invalid wrote:


But I think the issue being addressed 1 is essentially, "`delete_file`
shouldn't create additional files/directories in S3."


I think discussion about the semantics at large is interesting but may be
a digression? Also, I think there are varying degrees of "filesystem
semantics" that are even being discussed (the naming system and
hierarchical inode structure vs atomicity of read/write operations).


I think my question is still relevant: no matter what semantics
`S3FileSystem` is trying to provide, I'm still not sure how the
placeholder object helps. I assume it's for listing objects, but what else?




On Friday, July 12th, 2024 at 14:26, Raphael Taylor-Davies
r.taylordav...@googlemail.com.INVALID wrote:


Many people are familiar with object stores these days. You could create a
new abstraction `ObjectStore` which is very similar to `FileSystem` except
the semantics are object store semantics and not filesystem semantics.


FWIW, in the Arrow Rust ecosystem we only provide an object store
abstraction, and this has served us very well. My 2 cents is that object
store semantics are sufficient, if not superior [1] to filesystem-based
interfaces, for the vast majority of use cases, with the few workloads
that aren't sufficiently served requiring such close integration with
often OS-specific filesystem APIs and behaviours as to make building a
coherent abstraction extremely difficult.


Iceberg also took a similar approach with its File IO abstraction [2].


[1]: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface


On 12/07/2024 22:05, Weston Pace wrote:


The markers are necessary to offer file system semantics on top of object
stores. You will get a ton of subtle bugs otherwise.


Yes, object stores and filesystems are different. If you expect your
filesystem to act like a filesystem then these things need to be done in
order to avoid these bugs.


If an option modifies a filesystem to behave more like an object store
then I don't think it's necessarily a bad thing as long as it isn't the
default. By turning on the option the user is intentionally altering the
behavior and should not have the same expectations.


On the other hand, there is another approach you could take. Many people
are familiar with object stores these days. You could create a new
abstraction `ObjectStore` which is very similar to `FileSystem` except
the semantics are object store semantics and not filesystem semantics. I
believe most of our filesystem classes could implement both `ObjectStore`
and `FileSystem` abstractions without significant code duplication.


This way, if a user wants filesystem semantics, they use a `FileSystem`
and they pay the abstraction cost. If a user is comfortable with
`ObjectStore` semantics they use `ObjectStore` and they don't have to pay
the costs.
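
As a rough illustration of that split, the sketch below lets one toy backend implement both abstractions, with the filesystem calls layered on the object calls. All names here (`ObjectStore`, `FileSystem`, `ToyBackend`, `list_keys`) are hypothetical Python stand-ins, not existing Arrow classes.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Object store semantics: flat keys, no directories."""

    @abstractmethod
    def put(self, key, data=b""): ...

    @abstractmethod
    def delete(self, key): ...

    @abstractmethod
    def list_keys(self, prefix): ...


class FileSystem(ABC):
    """Filesystem semantics: directories exist independently of their files."""

    @abstractmethod
    def create_dir(self, path): ...

    @abstractmethod
    def delete_file(self, path): ...

    @abstractmethod
    def dir_exists(self, path): ...


class ToyBackend(ObjectStore, FileSystem):
    """One in-memory backend offering both views over the same keys."""

    def __init__(self):
        self._objects = {}

    # ObjectStore view: one request per call, no hidden extras.
    def put(self, key, data=b""):
        self._objects[key] = data

    def delete(self, key):
        self._objects.pop(key, None)

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))

    # FileSystem view: built on the object calls, pays for marker upkeep.
    def create_dir(self, path):
        self.put(path.rstrip("/") + "/")

    def delete_file(self, path):
        self.delete(path)
        self.create_dir(path.rsplit("/", 1)[0])

    def dir_exists(self, path):
        return bool(self.list_keys(path.rstrip("/") + "/"))


backend = ToyBackend()
backend.put("foo/bar")
backend.delete("foo/bar")                     # object store call: no marker
keys_after_object_delete = backend.list_keys("foo/")   # []

backend.put("foo/bar")
backend.delete_file("foo/bar")                # filesystem call: marker kept
dir_after_file_delete = backend.dir_exists("foo")      # True
```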


This would be more work than just allowing options to violate FileSystem
guarantees but it would provide a more clear distinction between the two.


On Fri, Jul 12, 2024 at 9:25 AM Aldrin octalene....@pm.me.invalid wrote:


Hello!


This may be naive, but why does the empty directory marker need to exist
on the S3 side at all? If a local directory is created (because filesystem
semantics), then I am not sure why a fake object needs to exist on the
object-store side.




On Friday, July 12th, 2024 at 08:35, Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:


Hi,


The markers are necessary to offer file system semantics on top of object
stores. You will get a ton of subtle bugs otherwise.


If instead of arrow::FileSystem, Arrow offered an arrow::ObjectStore
interface that wraps local filesystems and object stores with object-store
semantics (i.e. no concept of empty directory or atomic directory
deletion), then application developers would have more control of the
actions performed on the object store they are using. Cons would be slower
operations when working with a local filesystem and no concept of
directory.


1. Add an Option: Introduce an option in S3Options to control
whether empty directory markers are created, giving users the
choice.


Then it wouldn't be an honest implementation of arrow::FileSystem for the
reasons listed above.


2. Change Default Behavior: Modify the default behavior to avoid
creating empty directory markers when a file is deleted.


That would bring in the bugs because an arrow::FileSystem instance would
behave differently depending on what is backing it.


3. Smarter Directory Creation: Improve the implementation to check
for other objects in the same path before creating an empty directory
marker.


This might be a problem when more than one client or thread is mutating
the object store through the arrow::FileSystem. You can check now, and by
the time you're done deleting, all the other files you thought existed may
have been deleted as well. This is very likely if clients decide to
implement parallel deletion.


The existing solution of always creating a marker when done is not perfect
either, but less likely to break.
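
The check-then-act hazard can be replayed deterministically. The sketch below is a toy interleaving of two clients, not Arrow code: `siblings`, `run`, and the policy names "smart" and "always" are invented for illustration, and the store is just a Python dict.

```python
def siblings(objects, key):
    """Other keys that share the same parent prefix as `key`."""
    parent = key.rsplit("/", 1)[0] + "/"
    return [k for k in objects if k.startswith(parent) and k != key]


def run(policy):
    """Two clients each delete one of the two files in dir/.

    policy == "smart":  write the marker only if no sibling was seen beforehand.
    policy == "always": write the marker after every delete.
    Returns True if dir/ is still visible at the end.
    """
    objects = {"dir/a": b"", "dir/b": b""}

    # Step 1: both clients check for siblings before either one deletes.
    a_saw_others = bool(siblings(objects, "dir/a"))
    b_saw_others = bool(siblings(objects, "dir/b"))

    # Step 2: both deletes go through.
    del objects["dir/a"]
    del objects["dir/b"]

    # Step 3: each client decides about the marker from its (now stale) check.
    for saw_others in (a_saw_others, b_saw_others):
        if policy == "always" or not saw_others:
            objects["dir/"] = b""

    return any(k.startswith("dir/") for k in objects)


smart_keeps_dir = run("smart")    # False: each client trusted the other's file
always_keeps_dir = run("always")  # True
```

With the "smart" policy each client sees the other's file, skips the marker, and the directory silently disappears once both deletes land.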


## Suggested Workaround


Avoiding file by file operations so that internal functions can batch as
much as possible.


--
Felipe


On Fri, Jul 12, 2024 at 7:22 AM Hyunseok Seo hsseo0...@gmail.com wrote:


Hello, community!


I am currently working on addressing the issue described in "[C++] Add
option to not create parent directory with S3 delete_file". In this
process, I have found it necessary to gather feedback on how best to
resolve this issue. Below is a summary and some questions I have for the
community.


### Background
Currently, the S3FileSystem generates an empty directory marker (by
calling the EnsureParentExists function) when a file is deleted and the
directory becomes empty. This behavior maintains the appearance of the
directory structure. However, there have been issues raised by users
regarding this behavior in issues 1.


### Why Maintain Empty Directory Markers?
From what I understand, object stores like S3 do not have a concept of
directories. The motivation behind maintaining these markers could be to
manage the object store as if it were a traditional file system. If anyone
knows the context behind the implementation of S3FileSystem, it would be
great if you could share it.


### Issues with Marker Creation
Users who have raised concerns about the creation of empty directory
markers cite the following reasons:


- Increase in Unnecessary Requests 2: Creating empty directory markers
leads to additional S3 requests, which can increase costs and affect
performance.
- File System Consistency Issues 1: S3 is designed as an object store, and
creating empty directory markers can break the inherent consistency of
the file system.


### Proposed Solutions
Issue 1 suggests the following approaches:


1. Add an Option: Introduce an option in S3Options to control whether
empty directory markers are created, giving users the choice.
2. Change Default Behavior: Modify the default behavior to avoid creating
empty directory markers when a file is deleted.
3. Smarter Directory Creation: Improve the implementation to check for
other objects in the same path before creating an empty directory marker.


Here is my personal thought (approach 1 + 3):


(approach 1) I believe it would be best to add the Marker as an option
(as some users might not want this enhancement).


(approach 3) When the option is enabled, if there are no files (objects)
in the path (prefix) corresponding to a directory based on the file system
concept, we should maintain the Marker. Otherwise, we should check the
number of files in the same path and avoid calling EnsureParentExists if
there are two or more files.
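

One possible reading of approach 1 + 3, sketched against a plain dict rather than the real S3FileSystem. The `keep_dir_markers` flag and both function bodies are hypothetical; only the name EnsureParentExists comes from the existing implementation, and its real behavior is not modeled here.

```python
def ensure_parent_exists(objects, path):
    """Toy stand-in for EnsureParentExists: write a zero-byte parent marker."""
    objects[path.rsplit("/", 1)[0] + "/"] = b""


def delete_file(objects, path, keep_dir_markers=True):
    """Delete `path`; keep the parent visible only when the option asks for it.

    keep_dir_markers is a hypothetical option (approach 1). The file count
    is taken before the delete and includes the file being deleted
    (approach 3): with two or more files the marker write is skipped.
    """
    parent = path.rsplit("/", 1)[0] + "/"
    files_in_dir = [k for k in objects if k.startswith(parent) and k != parent]
    del objects[path]
    if keep_dir_markers and len(files_in_dir) < 2:
        ensure_parent_exists(objects, path)


# Option enabled: the marker is written only when the last file goes away.
with_option = {"d/a": b"", "d/b": b""}
delete_file(with_option, "d/a")
after_first_delete = sorted(with_option)   # ['d/b']
delete_file(with_option, "d/b")
after_second_delete = sorted(with_option)  # ['d/']

# Option disabled: no marker is ever written on delete.
without_option = {"d/a": b"", "d/b": b""}
delete_file(without_option, "d/a", keep_dir_markers=False)
delete_file(without_option, "d/b", keep_dir_markers=False)
```

Note that the count and the delete are separate steps here, so this sketch is only safe with a single client.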


On the other hand, I also feel that this approach might make the logic
more complicated.


### We Would Like Your Feedback
- What are your thoughts on the creation of empty directory markers?
- Which of the proposed solutions do you prefer?
- Do you have any additional suggestions or comments?


We appreciate your valuable feedback and aim to find the best solution
based on your input.


Thank you.
