And to clarify, by "other clients" I mean "other remote clients on other 
systems concurrently accessing the same data."

I still think that many clients on a single system could use a local filesystem 
to gate directory-based operations more efficiently (since a local filesystem 
is optimized for that very thing and could be mounted in memory instead of on a 
block device).
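
As a rough illustration of that idea (a sketch only; `LocalPrefixRegistry` and
its methods are hypothetical names, not part of Arrow), local clients could
record which "directories" exist by mirroring prefixes under a shared local
root. The root here is a temporary directory; pointing it at a memory-backed
mount is an assumption about deployment, not something the code requires:

```python
import os
import tempfile


class LocalPrefixRegistry:
    """Hypothetical helper: mirrors object-store prefixes as local directories."""

    def __init__(self, root):
        self.root = root

    def _local(self, prefix):
        # Map "bucket/foo/bar" to "<root>/bucket/foo/bar".
        return os.path.join(self.root, *prefix.strip("/").split("/"))

    def create_dir(self, prefix):
        # exist_ok=True tolerates another local process creating the same path.
        os.makedirs(self._local(prefix), exist_ok=True)

    def dir_exists(self, prefix):
        return os.path.isdir(self._local(prefix))

    def delete_dir(self, prefix):
        # os.rmdir raises OSError if child directories are still registered.
        os.rmdir(self._local(prefix))


root = tempfile.mkdtemp()
registry = LocalPrefixRegistry(root)
registry.create_dir("bucket/foo")
```

This only helps processes that share the same machine; remote clients would
still need the marker in the object store.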





# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Monday, July 15th, 2024 at 10:20, Aldrin <octalene....@pm.me.INVALID> wrote:

> Thanks Antoine!
> 

> Preserving the property across multiple clients (and presumably across 
> independent sessions of the same client) is the part that I was missing.
> 

> From the link you shared, I saw an AWS page discussing the use of folders in 
> the S3 console [1]. Their approach is to create the marker on folder 
> creation. Instead of adding an `S3Options` property for avoiding marker 
> creation on delete, what if the time of marker creation changed from "on 
> delete" to "on creation"? That would seem to align better with tools like the 
> S3 console and Cyberduck, and would simplify the overall consensus logic that 
> Felipe mentioned as being a potential pitfall for the 3rd proposed solution 
> (folder creation should occur far less often than file deletion/move/replace).
> 

> I'm not sure if this is already an option (I don't know much about the 
> S3Filesystem implementation of Arrow) or was an old option that was changed 
> in favor of creating the marker on deletion.
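
To make the "on creation" idea concrete, here is a minimal in-memory sketch.
`MarkerOnCreateStore` is a hypothetical stand-in for a bucket and is not how
Arrow's `S3FileSystem` is implemented; the marker is assumed to be an object
whose key ends in "/":

```python
class MarkerOnCreateStore:
    """Hypothetical in-memory stand-in for a bucket; not Arrow's S3FileSystem."""

    def __init__(self):
        self.keys = set()
        self.put_requests = 0

    def _put(self, key):
        self.keys.add(key)
        self.put_requests += 1

    def create_dir(self, path):
        # The marker is written once, when the directory is created.
        self._put(path.rstrip("/") + "/")

    def write_file(self, path):
        self._put(path)

    def delete_file(self, path):
        # No marker bookkeeping here: the marker from create_dir is still present.
        self.keys.discard(path)

    def dir_exists(self, path):
        prefix = path.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.keys)


fs = MarkerOnCreateStore()
fs.create_dir("foo")
fs.write_file("foo/bar")
puts_before_delete = fs.put_requests
fs.delete_file("foo/bar")
```

In this sketch the delete path issues no extra PUT, but a directory that was
never created through `create_dir` (for example, files uploaded by another
tool) would have no marker and would vanish with its last file.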
> 

> 

> [1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
> 

> # ------------------------------
> 

> # Aldrin
> 

> 

> https://github.com/drin/
> 

> https://gitlab.com/octalene
> 

> https://keybase.io/octalene
> 

> 

> On Monday, July 15th, 2024 at 07:59, Antoine Pitrou anto...@python.org wrote:
> 

> > No, because these markers also communicate the information to other
> > implementations of S3 abstractions.
> 

> > An example of this is: https://docs.cyberduck.io/protocols/s3/#folders
> 

> > Regards
> 

> > Antoine.
> 

> > On 13/07/2024 at 07:15, Aldrin wrote:
> 

> > > > ...then I still expect the directory /foo to exist
> 

> > > Right, but if that is the sole purpose of empty directory markers, I'm 
> > > curious if there was an attempt at keeping track of the 
> > > prefixes/directories locally?
> 

> > > # ------------------------------
> 

> > > # Aldrin
> 

> > > https://github.com/drin/
> 

> > > https://gitlab.com/octalene
> 

> > > https://keybase.io/octalene
> 

> > > On Friday, July 12th, 2024 at 19:44, Hyunseok Seo hsseo0...@gmail.com 
> > > wrote:
> 

> > > > I wonder why S3 (object storage) operates based on file system semantics.
> > > > Python users are usually data scientists. They might not be familiar with
> > > > the differences between object storage and file storage. Furthermore, I
> > > > think there are a lot of pyarrow users.
> 

> > > > > Avoiding file by file operations so that internal functions can batch as
> > > > > much as possible.
> 

> > > > Thank you for the detailed explanation. So, are you suggesting that a more
> > > > fundamental solution is needed rather than just adding options? I thought
> > > > supporting such options would help users who do not want markers, despite
> > > > the issues you mentioned. Furthermore, I agree that supporting ObjectStore
> > > > is necessary for a more fundamental solution.
> 

> > > > Thank you.
> 

> > > > On Sat, Jul 13, 2024 at 10:00 AM, Weston Pace weston.p...@gmail.com wrote:
> 

> > > > > > I think my question is still relevant: no matter what semantics
> > > > > > `S3FileSystem` is trying to provide, I'm still not sure how the
> > > > > > placeholder object helps. I assume it's for listing objects, but what else?
> 

> > > > > If I have a local filesystem and I delete a file /foo/bar then I still
> > > > > expect the directory /foo to exist.
> 

> > > > > ```
> > > > > mkdir /foo
> > > > > touch /foo/bar
> > > > > rm /foo/bar
> > > > > ls / # should show /foo
> > > > > ```
> 

> > > > > In an object store there is no `mkdir` and, if I remove /foo/bar, there
> > > > > is no guarantee /foo will still exist.
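
The same point as a small Python sketch, modelling the bucket as a flat set of
keys (`list_dirs` is a hypothetical helper, and the marker is assumed to be a
zero-byte object whose key ends in "/", which is how the S3 console represents
folders):

```python
def list_dirs(keys):
    """Return top-level 'directories' implied by a flat set of object keys."""
    return {key.split("/", 1)[0] for key in keys if "/" in key}


# Without a marker: the "directory" exists only while some key has its prefix.
store = {"foo/bar"}
before = list_dirs(store)
store.discard("foo/bar")
after_plain = list_dirs(store)

# With a marker: an object named "foo/" keeps the prefix listable.
store = {"foo/", "foo/bar"}
store.discard("foo/bar")
after_marker = list_dirs(store)
```

Here `after_plain` is empty while `after_marker` still contains `foo`, which is
the filesystem-like behaviour the marker is meant to preserve.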
> 

> > > > > On Fri, Jul 12, 2024, 2:50 PM Aldrin octalene....@pm.me.invalid wrote:
> 

> > > > > > But I think the issue being addressed [1] is essentially, "`delete_file`
> > > > > > shouldn't create additional files/directories in S3."
> 

> > > > > > I think discussion about the semantics at large is interesting but may be
> > > > > > a digression? Also, I think there are varying degrees of "filesystem
> > > > > > semantics" that are even being discussed (the naming system and
> > > > > > hierarchical inode structure vs atomicity of read/write operations).
> 

> > > > > > I think my question is still relevant: no matter what semantics
> > > > > > `S3FileSystem` is trying to provide, I'm still not sure how the
> > > > > > placeholder object helps. I assume it's for listing objects, but what else?
> 

> > > > > > # ------------------------------
> 

> > > > > > # Aldrin
> 

> > > > > > https://github.com/drin/
> 

> > > > > > https://gitlab.com/octalene
> 

> > > > > > https://keybase.io/octalene
> 

> > > > > > On Friday, July 12th, 2024 at 14:26, Raphael Taylor-Davies
> > > > > > r.taylordav...@googlemail.com.INVALID wrote:
> 

> > > > > > > > Many people are familiar with object stores these days. You could create
> > > > > > > > a new abstraction `ObjectStore` which is very similar to `FileSystem`
> > > > > > > > except the semantics are object store semantics and not filesystem
> > > > > > > > semantics.
> 

> > > > > > > FWIW in the Arrow Rust ecosystem we only provide an object store
> > > > > > > abstraction, and this has served us very well. My 2 cents is that object
> > > > > > > store semantics are sufficient, if not superior [1] to filesystem-based
> > > > > > > interfaces, for the vast majority of use cases, with the few workloads
> > > > > > > that aren't sufficiently served requiring such close integration with
> > > > > > > often OS-specific filesystem APIs and behaviours as to make building a
> > > > > > > coherent abstraction extremely difficult.
> 

> > > > > > > Iceberg also took a similar approach with its File IO abstraction [2].
> 

> > > > > > > [1]: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
> 

> > > > > > > On 12/07/2024 22:05, Weston Pace wrote:
> 

> > > > > > > > > The markers are necessary to offer file system semantics on top of
> > > > > > > > > object stores. You will get a ton of subtle bugs otherwise.
> > > > > > > > > Yes, object stores and filesystems are different. If you expect your
> > > > > > > > > filesystem to act like a filesystem then these things need to be done
> > > > > > > > > in order to avoid these bugs.
> 

> > > > > > > > If an option modifies a filesystem to behave more like an object store
> > > > > > > > then I don't think it's necessarily a bad thing as long as it isn't the
> > > > > > > > default. By turning on the option the user is intentionally altering the
> > > > > > > > behavior and should not have the same expectations.
> 

> > > > > > > > On the other hand, there is another approach you could take. Many people
> > > > > > > > are familiar with object stores these days. You could create a new
> > > > > > > > abstraction `ObjectStore` which is very similar to `FileSystem` except
> > > > > > > > the semantics are object store semantics and not filesystem semantics. I
> > > > > > > > believe most of our filesystem classes could implement both `ObjectStore`
> > > > > > > > and `FileSystem` abstractions without significant code duplication.
> 

> > > > > > > > This way, if a user wants filesystem semantics, they use a `FileSystem`
> > > > > > > > and they pay the abstraction cost. If a user is comfortable with
> > > > > > > > `ObjectStore` semantics they use `ObjectStore` and they don't have to pay
> > > > > > > > the costs.
> 

> > > > > > > > This would be more work than just allowing options to violate FileSystem
> > > > > > > > guarantees but it would provide a more clear distinction between the two.
> 

> > > > > > > > On Fri, Jul 12, 2024 at 9:25 AM Aldrin octalene....@pm.me.invalid wrote:
> 

> > > > > > > > > Hello!
> 

> > > > > > > > > This may be naive, but why does the empty directory marker need to
> > > > > > > > > exist on the S3 side at all? If a local directory is created (because
> > > > > > > > > filesystem semantics), then I am not sure why a fake object needs to
> > > > > > > > > exist on the object-store side.
> 

> > > > > > > > > # ------------------------------
> 

> > > > > > > > > # Aldrin
> 

> > > > > > > > > https://github.com/drin/
> 

> > > > > > > > > https://gitlab.com/octalene
> 

> > > > > > > > > https://keybase.io/octalene
> 

> > > > > > > > > On Friday, July 12th, 2024 at 08:35, Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
> 

> > > > > > > > > > Hi,
> 

> > > > > > > > > > The markers are necessary to offer file system semantics on top of
> > > > > > > > > > object stores. You will get a ton of subtle bugs otherwise.
> 

> > > > > > > > > > If instead of arrow::FileSystem, Arrow offered an arrow::ObjectStore
> > > > > > > > > > interface that wraps local filesystems and object stores with
> > > > > > > > > > object-store semantics (i.e. no concept of empty directory or atomic
> > > > > > > > > > directory deletion), then application developers would have more
> > > > > > > > > > control of the actions performed on the object store they are using.
> > > > > > > > > > Cons would be slower operations when working with a local filesystem
> > > > > > > > > > and no concept of directory.
> 

> > > > > > > > > > > 1. Add an Option: Introduce an option in S3Options to control
> > > > > > > > > > > whether empty directory markers are created, giving users the
> > > > > > > > > > > choice.
> 

> > > > > > > > > > Then it wouldn't be an honest implementation of arrow::FileSystem for
> > > > > > > > > > the reasons listed above.
> 

> > > > > > > > > > > 2. Change Default Behavior: Modify the default behavior to avoid
> > > > > > > > > > > creating empty directory markers when a file is deleted.
> 

> > > > > > > > > > That would bring in the bugs because an arrow::FileSystem instance
> > > > > > > > > > would behave differently depending on what is backing it.
> 

> > > > > > > > > > > 3. Smarter Directory Creation: Improve the implementation to check
> > > > > > > > > > > for other objects in the same path before creating an empty
> > > > > > > > > > > directory marker.
> 

> > > > > > > > > > This might be a problem when more than one client or thread is
> > > > > > > > > > mutating the object store through the arrow::FileSystem. You can
> > > > > > > > > > check now and, once you're done deleting, all the other files you
> > > > > > > > > > thought existed are deleted as well. Very likely if clients decide to
> > > > > > > > > > implement parallel deletion.
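
A deterministic sketch of that interleaving, with the bucket modelled as a
Python set and two hypothetical clients that each check for siblings before
deleting (`siblings` is an illustrative helper, not an Arrow function):

```python
def siblings(store, prefix, key):
    """Other objects under the same prefix, excluding `key` itself."""
    return {k for k in store if k.startswith(prefix) and k != key}


store = {"foo/a", "foo/b"}

# Step 1: both clients check before deleting; each sees the other's file.
a_needs_marker = not siblings(store, "foo/", "foo/a")
b_needs_marker = not siblings(store, "foo/", "foo/b")

# Step 2: both clients delete, each trusting its now-stale check.
store.discard("foo/a")
store.discard("foo/b")
for needs_marker in (a_needs_marker, b_needs_marker):
    if needs_marker:
        store.add("foo/")

dir_survives = any(k.startswith("foo/") for k in store)
```

Neither client writes a marker because each saw the other's file, so the
prefix `foo/` ends up with no objects at all.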
> 

> > > > > > > > > > The existing solution of always creating a marker when done is not
> > > > > > > > > > perfect either, but less likely to break.
> 

> > > > > > > > > > ## Suggested Workaround
> 

> > > > > > > > > > Avoiding file by file operations so that internal functions can batch
> > > > > > > > > > as much as possible.
> 

> > > > > > > > > > --
> > > > > > > > > > Felipe
> 

> > > > > > > > > > On Fri, Jul 12, 2024 at 7:22 AM Hyunseok Seo hsseo0...@gmail.com wrote:
> 

> > > > > > > > > > > Hello, community!
> 

> > > > > > > > > > > I am currently working on addressing the issue described in [C++]
> > > > > > > > > > > Add option to not create parent directory with S3 delete_file. In
> > > > > > > > > > > this process, I have found it necessary to gather feedback on how
> > > > > > > > > > > to best resolve this issue. Below is a summary and some questions I
> > > > > > > > > > > have for the community.
> 

> > > > > > > > > > > ### Background
> > > > > > > > > > > Currently, the S3FileSystem generates an empty directory marker (by
> > > > > > > > > > > calling the EnsureParentExists function) when a file is deleted and
> > > > > > > > > > > the directory becomes empty. This behavior maintains the appearance
> > > > > > > > > > > of the directory structure. However, users have raised concerns
> > > > > > > > > > > about this behavior in issue [1].
> 

> > > > > > > > > > > ### Why Maintain Empty Directory Markers?
> > > > > > > > > > > From what I understand, object stores like S3 do not have a concept
> > > > > > > > > > > of directories. The motivation behind maintaining these markers
> > > > > > > > > > > could be to manage the object store as if it were a traditional
> > > > > > > > > > > file system. If anyone knows the context behind the implementation
> > > > > > > > > > > of S3FileSystem, it would be great if you could share it.
> 

> > > > > > > > > > > ### Issues with Marker Creation
> > > > > > > > > > > Users who have raised concerns about the creation of empty
> > > > > > > > > > > directory markers cite the following reasons:
> 

> > > > > > > > > > > - Increase in Unnecessary Requests [2]: Creating empty directory
> > > > > > > > > > >   markers leads to additional S3 requests, which can increase costs
> > > > > > > > > > >   and affect performance.
> > > > > > > > > > > - File System Consistency Issues [1]: S3 is designed as an object
> > > > > > > > > > >   store, and creating empty directory markers can break the
> > > > > > > > > > >   inherent consistency of the file system.
> 

> > > > > > > > > > > ### Proposed Solutions
> > > > > > > > > > > Issue [1] suggests the following approaches:
> 

> > > > > > > > > > > 1. Add an Option: Introduce an option in S3Options to control
> > > > > > > > > > >    whether empty directory markers are created, giving users the
> > > > > > > > > > >    choice.
> > > > > > > > > > > 2. Change Default Behavior: Modify the default behavior to avoid
> > > > > > > > > > >    creating empty directory markers when a file is deleted.
> > > > > > > > > > > 3. Smarter Directory Creation: Improve the implementation to check
> > > > > > > > > > >    for other objects in the same path before creating an empty
> > > > > > > > > > >    directory marker.
> 

> > > > > > > > > > > Here is my personal thought (approach 1 + 3):
> 

> > > > > > > > > > > (approach 1) I believe it would be best to add the Marker as an
> > > > > > > > > > > option (as some users might not want this enhancement).
> 

> > > > > > > > > > > (approach 3) When the option is enabled, if there are no files
> > > > > > > > > > > (objects) in the path (prefix) corresponding to a directory based
> > > > > > > > > > > on the file system concept, we should maintain the Marker.
> > > > > > > > > > > Otherwise, we should check the number of files in the same path and
> > > > > > > > > > > avoid calling EnsureParentExists if there are two or more files.
> 

> > > > > > > > > > > On the other hand, I also feel that this approach might make the
> > > > > > > > > > > logic more complicated.
> 

> > > > > > > > > > > ### We Would Like Your Feedback
> > > > > > > > > > > - What are your thoughts on the creation of empty directory markers?
> > > > > > > > > > > - Which of the proposed solutions do you prefer?
> > > > > > > > > > > - Do you have any additional suggestions or comments?
> 

> > > > > > > > > > > We appreciate your valuable feedback and aim to find the best
> > > > > > > > > > > solution based on your input.
> 

> > > > > > > > > > > Thank you.

Attachment: publickey - octalene.dev@pm.me - 0x21969656.asc
Description: application/pgp-keys

Attachment: signature.asc
Description: OpenPGP digital signature
