And to clarify, by "other clients" I mean "other remote clients on other systems concurrently accessing the same data."
I still think that many clients on a single system could use a local filesystem
to gate directory-based operations more efficiently (since a local filesystem is
optimized for that very thing and it could be mounted in memory instead of on a
block device).

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Monday, July 15th, 2024 at 10:20, Aldrin <octalene....@pm.me.INVALID> wrote:

> Thanks Antoine!
>
> Preserving the property across multiple clients (and presumably across
> independent sessions of the same client) is the part that I was missing.
>
> From the link you shared, I saw an AWS page discussing the use of folders in
> the S3 console [1]. Their approach is to create the marker on folder
> creation. Instead of adding an `S3Options` property for avoiding marker
> creation on delete, what if it changes the time of marker creation from "on
> delete" to "on creation"? That would seem to align better with tools like the
> S3 console as well as Cyberduck and simplify the overall consensus logic that
> Felipe mentioned as being a potential pitfall for the 3rd proposed solution
> (folder creation should occur far less often than file deletion/move/replace).
>
> I'm not sure if this is already an option (I don't know much about the
> S3Filesystem implementation of Arrow) or was an old option that was changed
> in favor of creating the marker on deletion.
>
> [1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Monday, July 15th, 2024 at 07:59, Antoine Pitrou anto...@python.org wrote:
>
> > No, because these markers also communicate the information to other
> > implementations of S3 abstractions.
> >
> > An example of this is: https://docs.cyberduck.io/protocols/s3/#folders
> >
> > Regards
> >
> > Antoine.
> >
> > On 13/07/2024 at 07:15, Aldrin wrote:
> >
> > > > ...then I still expect the directory /foo to exist
> > >
> > > Right, but if that is the sole purpose of empty directory markers, I'm
> > > curious if there was an attempt at keeping track of the
> > > prefixes/directories locally?
> > >
> > > # ------------------------------
> > > # Aldrin
> > >
> > > https://github.com/drin/
> > > https://gitlab.com/octalene
> > > https://keybase.io/octalene
> > >
> > > On Friday, July 12th, 2024 at 19:44, Hyunseok Seo hsseo0...@gmail.com
> > > wrote:
> > >
> > > > I wonder why S3 (object storage) operates based on file system
> > > > semantics. Python users are usually data scientists. They might not be
> > > > familiar with the differences between object storage and file storage.
> > > > Furthermore, I think there are a lot of pyarrow users.
> > > >
> > > > > Avoiding file by file operations so that internal functions can batch
> > > > > as much as possible.
> > > >
> > > > Thank you for the detailed explanation. So, are you suggesting that a
> > > > more fundamental solution is needed rather than just adding options? I
> > > > thought supporting such options would help users who do not want
> > > > markers, despite the issues you mentioned. Furthermore, I agree that
> > > > supporting ObjectStore is necessary for a more fundamental solution.
> > > >
> > > > Thank you.
> > > >
> > > > On Sat, Jul 13, 2024 at 10:00 AM, Weston Pace weston.p...@gmail.com
> > > > wrote:
> > > >
> > > > > > I think my question is still relevant: no matter what semantics
> > > > > > `S3FileSystem` is trying to provide, I'm still not sure how the
> > > > > > placeholder object helps. I assume it's for listing objects, but
> > > > > > what else?
> > > > >
> > > > > If I have a local filesystem and I delete a file /foo/bar then I still
> > > > > expect the directory /foo to exist.
> > > > >
> > > > > ```
> > > > > mkdir /foo
> > > > > touch /foo/bar
> > > > > rm /foo/bar
> > > > > ls / # should show /foo
> > > > > ```
> > > > >
> > > > > In an object store there is no `mkdir` and, even if I remove /foo/bar
> > > > > then there is no guarantee /foo will exist.
> > > > >
> > > > > On Fri, Jul 12, 2024, 2:50 PM Aldrin octalene....@pm.me.invalid wrote:
> > > > >
> > > > > > But I think the issue being addressed [1] is essentially,
> > > > > > "`delete_file` shouldn't create additional files/directories in S3."
> > > > > >
> > > > > > I think discussion about the semantics at large is interesting but
> > > > > > may be a digression? Also, I think there are varying degrees of
> > > > > > "filesystem semantics" that are even being discussed (the naming
> > > > > > system and hierarchical inode structure vs atomicity of read/write
> > > > > > operations).
> > > > > >
> > > > > > I think my question is still relevant: no matter what semantics
> > > > > > `S3FileSystem` is trying to provide, I'm still not sure how the
> > > > > > placeholder object helps. I assume it's for listing objects, but
> > > > > > what else?
> > > > > >
> > > > > > # ------------------------------
> > > > > > # Aldrin
> > > > > >
> > > > > > https://github.com/drin/
> > > > > > https://gitlab.com/octalene
> > > > > > https://keybase.io/octalene
> > > > > >
> > > > > > On Friday, July 12th, 2024 at 14:26, Raphael Taylor-Davies
> > > > > > r.taylordav...@googlemail.com.INVALID wrote:
> > > > > >
> > > > > > > > Many people are familiar with object stores these days. You
> > > > > > > > could create a new abstraction `ObjectStore` which is very
> > > > > > > > similar to `FileSystem` except the semantics are object store
> > > > > > > > semantics and not filesystem semantics.
> > > > > > >
> > > > > > > FWIW in the Arrow Rust ecosystem we only provide an object store
> > > > > > > abstraction, and this has served us very well.
> > > > > > > My 2 cents is that object store semantics are sufficient, if not
> > > > > > > superior [1], compared to filesystem-based interfaces for the vast
> > > > > > > majority of use cases, with the few workloads that aren't
> > > > > > > sufficiently served requiring such close integration with often
> > > > > > > OS-specific filesystem APIs and behaviours as to make building a
> > > > > > > coherent abstraction extremely difficult.
> > > > > > >
> > > > > > > Iceberg also took a similar approach with its File IO abstraction
> > > > > > > [2].
> > > > > > >
> > > > > > > [1]: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
> > > > > > >
> > > > > > > On 12/07/2024 22:05, Weston Pace wrote:
> > > > > > >
> > > > > > > > > The markers are necessary to offer file system semantics on
> > > > > > > > > top of object stores. You will get a ton of subtle bugs
> > > > > > > > > otherwise.
> > > > > > > > >
> > > > > > > > > Yes, object stores and filesystems are different. If you
> > > > > > > > > expect your filesystem to act like a filesystem then these
> > > > > > > > > things need to be done in order to avoid these bugs.
> > > > > > > >
> > > > > > > > If an option modifies a filesystem to behave more like an object
> > > > > > > > store then I don't think it's necessarily a bad thing as long as
> > > > > > > > it isn't the default. By turning on the option the user is
> > > > > > > > intentionally altering the behavior and should not have the
> > > > > > > > same expectations.
> > > > > > > >
> > > > > > > > On the other hand, there is another approach you could take.
> > > > > > > > Many people are familiar with object stores these days.
> > > > > > > > You could create a new abstraction `ObjectStore` which is very
> > > > > > > > similar to `FileSystem` except the semantics are object store
> > > > > > > > semantics and not filesystem semantics. I believe most of our
> > > > > > > > filesystem classes could implement both `ObjectStore` and
> > > > > > > > `FileSystem` abstractions without significant code duplication.
> > > > > > > >
> > > > > > > > This way, if a user wants filesystem semantics, they use a
> > > > > > > > `FileSystem` and they pay the abstraction cost. If a user is
> > > > > > > > comfortable with `ObjectStore` semantics they use `ObjectStore`
> > > > > > > > and they don't have to pay the costs.
> > > > > > > >
> > > > > > > > This would be more work than just allowing options to violate
> > > > > > > > FileSystem guarantees but it would provide a more clear
> > > > > > > > distinction between the two.
> > > > > > > >
> > > > > > > > On Fri, Jul 12, 2024 at 9:25 AM Aldrin
> > > > > > > > octalene....@pm.me.invalid wrote:
> > > > > > > >
> > > > > > > > > Hello!
> > > > > > > > >
> > > > > > > > > This may be naive, but why does the empty directory marker
> > > > > > > > > need to exist on the S3 side at all? If a local directory is
> > > > > > > > > created (because filesystem semantics), then I am not sure why
> > > > > > > > > a fake object needs to exist on the object-store side.
> > > > > > > > >
> > > > > > > > > # ------------------------------
> > > > > > > > > # Aldrin
> > > > > > > > >
> > > > > > > > > https://github.com/drin/
> > > > > > > > > https://gitlab.com/octalene
> > > > > > > > > https://keybase.io/octalene
> > > > > > > > >
> > > > > > > > > On Friday, July 12th, 2024 at 08:35, Felipe Oliveira Carvalho
> > > > > > > > > <felipe...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > The markers are necessary to offer file system semantics on
> > > > > > > > > > top of object stores. You will get a ton of subtle bugs
> > > > > > > > > > otherwise.
> > > > > > > > > >
> > > > > > > > > > If instead of arrow::FileSystem, Arrow offered an
> > > > > > > > > > arrow::ObjectStore interface that wraps local filesystems
> > > > > > > > > > and object stores with object-store semantics (i.e. no
> > > > > > > > > > concept of empty directory or atomic directory deletion),
> > > > > > > > > > then application developers would have more control of the
> > > > > > > > > > actions performed on the object store they are using. Cons
> > > > > > > > > > would be slower operations when working with a local
> > > > > > > > > > filesystem and no concept of directory.
> > > > > > > > > >
> > > > > > > > > > > 1. Add an Option: Introduce an option in S3Options to
> > > > > > > > > > > control whether empty directory markers are created,
> > > > > > > > > > > giving users the choice.
> > > > > > > > > >
> > > > > > > > > > Then it wouldn't be an honest implementation of
> > > > > > > > > > arrow::FileSystem for the reasons listed above.
> > > > > > > > > >
> > > > > > > > > > > 2. Change Default Behavior: Modify the default behavior to
> > > > > > > > > > > avoid creating empty directory markers when a file is
> > > > > > > > > > > deleted.
> > > > > > > > > >
> > > > > > > > > > That would bring in the bugs because an arrow::FileSystem
> > > > > > > > > > instance would behave differently depending on what is
> > > > > > > > > > backing it.
> > > > > > > > > >
> > > > > > > > > > > 3. Smarter Directory Creation: Improve the implementation
> > > > > > > > > > > to check for other objects in the same path before
> > > > > > > > > > > creating an empty directory marker.
> > > > > > > > > >
> > > > > > > > > > This might be a problem when more than one client or thread
> > > > > > > > > > is mutating the object store through the arrow::FileSystem.
> > > > > > > > > > You can check now, and by the time you're done deleting, all
> > > > > > > > > > the other files you thought existed have been deleted as
> > > > > > > > > > well. This is very likely if clients decide to implement
> > > > > > > > > > parallel deletion.
> > > > > > > > > >
> > > > > > > > > > The existing solution of always creating a marker when done
> > > > > > > > > > is not perfect either, but less likely to break.
> > > > > > > > > >
> > > > > > > > > > ## Suggested Workaround
> > > > > > > > > >
> > > > > > > > > > Avoiding file by file operations so that internal functions
> > > > > > > > > > can batch as much as possible.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Felipe
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 12, 2024 at 7:22 AM Hyunseok Seo
> > > > > > > > > > hsseo0...@gmail.com wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello, community!
> > > > > > > > > > >
> > > > > > > > > > > I am currently working on addressing the issue described
> > > > > > > > > > > in "[C++] Add option to not create parent directory with
> > > > > > > > > > > S3 delete_file". In this process, I have found it
> > > > > > > > > > > necessary to gather feedback on how to best resolve this
> > > > > > > > > > > issue. Below is a summary and some questions I have for
> > > > > > > > > > > the community.
> > > > > > > > > > >
> > > > > > > > > > > ### Background
> > > > > > > > > > >
> > > > > > > > > > > Currently, the S3FileSystem generates an empty directory
> > > > > > > > > > > marker (by calling the EnsureParentExists function) when a
> > > > > > > > > > > file is deleted and the directory becomes empty. This
> > > > > > > > > > > behavior maintains the appearance of the directory
> > > > > > > > > > > structure. However, users have raised concerns about this
> > > > > > > > > > > behavior in issue [1].
> > > > > > > > > > >
> > > > > > > > > > > ### Why Maintain Empty Directory Markers?
> > > > > > > > > > >
> > > > > > > > > > > From what I understand, object stores like S3 do not have
> > > > > > > > > > > a concept of directories. The motivation behind
> > > > > > > > > > > maintaining these markers could be to manage the object
> > > > > > > > > > > store as if it were a traditional file system. If anyone
> > > > > > > > > > > knows the context behind the implementation of
> > > > > > > > > > > S3FileSystem, it would be great if you could share it.
> > > > > > > > > > >
> > > > > > > > > > > ### Issues with Marker Creation
> > > > > > > > > > >
> > > > > > > > > > > Users who have raised concerns about the creation of empty
> > > > > > > > > > > directory markers cite the following reasons:
> > > > > > > > > > >
> > > > > > > > > > > - Increase in Unnecessary Requests [2]: Creating empty
> > > > > > > > > > >   directory markers leads to additional S3 requests, which
> > > > > > > > > > >   can increase costs and affect performance.
> > > > > > > > > > > - File System Consistency Issues [1]: S3 is designed as an
> > > > > > > > > > >   object store, and creating empty directory markers can
> > > > > > > > > > >   break the inherent consistency of the file system.
> > > > > > > > > > >
> > > > > > > > > > > ### Proposed Solutions
> > > > > > > > > > >
> > > > > > > > > > > Issue [1] suggests the following approaches:
> > > > > > > > > > >
> > > > > > > > > > > 1. Add an Option: Introduce an option in S3Options to
> > > > > > > > > > >    control whether empty directory markers are created,
> > > > > > > > > > >    giving users the choice.
> > > > > > > > > > > 2. Change Default Behavior: Modify the default behavior to
> > > > > > > > > > >    avoid creating empty directory markers when a file is
> > > > > > > > > > >    deleted.
> > > > > > > > > > > 3. Smarter Directory Creation: Improve the implementation
> > > > > > > > > > >    to check for other objects in the same path before
> > > > > > > > > > >    creating an empty directory marker.
> > > > > > > > > > >
> > > > > > > > > > > Here is my personal thought (approach 1 + 3):
> > > > > > > > > > >
> > > > > > > > > > > (approach 1) I believe it would be best to add the Marker
> > > > > > > > > > > as an option (as some users might not want this
> > > > > > > > > > > enhancement).
> > > > > > > > > > >
> > > > > > > > > > > (approach 3) When the option is enabled, if there are no
> > > > > > > > > > > files (objects) in the path (prefix) corresponding to a
> > > > > > > > > > > directory based on the file system concept, we should
> > > > > > > > > > > maintain the Marker. Otherwise, we should check the number
> > > > > > > > > > > of files in the same path and avoid calling
> > > > > > > > > > > EnsureParentExists if there are two or more files.
> > > > > > > > > > >
> > > > > > > > > > > On the other hand, I also feel that this approach might
> > > > > > > > > > > make the logic more complicated.
> > > > > > > > > > >
> > > > > > > > > > > ### We Would Like Your Feedback
> > > > > > > > > > >
> > > > > > > > > > > - What are your thoughts on the creation of empty
> > > > > > > > > > >   directory markers?
> > > > > > > > > > > - Which of the proposed solutions do you prefer?
> > > > > > > > > > > - Do you have any additional suggestions or comments?
> > > > > > > > > > >
> > > > > > > > > > > We appreciate your valuable feedback and aim to find the
> > > > > > > > > > > best solution based on your input.
> > > > > > > > > > >
> > > > > > > > > > > Thank you.
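
For anyone skimming the thread, here is a toy sketch of what the empty directory marker buys on a flat key space. It is only an illustration under my own assumptions: `FlatObjectStore`, `dir_exists`, and `delete_file` are hypothetical names, the store is an in-memory set rather than S3, and the marker is written unconditionally on delete. It is not how Arrow's S3FileSystem is implemented.

```python
"""Toy model of empty-directory markers on a flat object store (illustrative only)."""


class FlatObjectStore:
    """Hypothetical in-memory stand-in for an object store: a flat set of keys."""

    def __init__(self):
        self.keys = set()

    def put(self, key):
        self.keys.add(key)

    def delete(self, key):
        self.keys.discard(key)

    def list_prefix(self, prefix):
        # Listing is just a prefix match; there is no real directory object.
        return sorted(k for k in self.keys if k.startswith(prefix))


def dir_exists(store, path):
    """A 'directory' exists only if some key (or a marker) lives under its prefix."""
    return bool(store.list_prefix(path.rstrip("/") + "/"))


def delete_file(store, key, keep_marker):
    """Delete `key`; optionally write a zero-byte marker key for its parent prefix."""
    store.delete(key)
    if keep_marker and "/" in key:
        # Assumption: the marker is the parent prefix itself, e.g. "foo/".
        store.put(key.rsplit("/", 1)[0] + "/")


if __name__ == "__main__":
    store = FlatObjectStore()
    store.put("foo/bar")
    delete_file(store, "foo/bar", keep_marker=False)
    print(dir_exists(store, "foo"))  # the prefix vanished with its last object

    store.put("foo/bar")
    delete_file(store, "foo/bar", keep_marker=True)
    print(dir_exists(store, "foo"))  # the marker key "foo/" keeps it listable
```

Because the marker lives in the store itself, any other client listing the `foo/` prefix sees it too, which is the cross-client property Antoine pointed out and which local bookkeeping cannot provide.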