Just to add a bit more context to the performance test question: what I had in mind was the exists call on (non-existent) directories in a bucket with a lot of objects. One of the SDK contributors commented that this call could be expensive in an object store if implemented wrongly. I would imagine that this is a valid concern, because the concept of directories is not really present in an object store like S3, if I'm not mistaken.
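To make the concern a bit more concrete, here is a minimal sketch of what I mean. It is a toy in-memory model in Python, not the AWS SDK and not Flink code; ToyBucket, dir_exists_naive and dir_exists_prefix are made-up names for illustration only. The assumption is that a bucket is a flat, sorted set of keys and that a "directory" exists only if at least one key starts with its prefix. A listing bounded to a single key under that prefix (S3's listing API accepts a prefix and a maximum key count) should then be enough to answer the question, whereas a scan over everything grows with the bucket size.

```python
import bisect


class ToyBucket:
    """In-memory stand-in for a bucket: a flat, sorted list of object keys."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.keys_examined = 0  # crude cost counter, for comparison only

    def list_prefix(self, prefix, max_keys=None):
        """Return keys that start with prefix, in order, up to max_keys."""
        # Binary search jumps to the first key >= prefix.
        i = bisect.bisect_left(self.keys, prefix)
        result = []
        while i < len(self.keys):
            self.keys_examined += 1
            if not self.keys[i].startswith(prefix):
                break  # sorted order: no later key can match either
            result.append(self.keys[i])
            if max_keys is not None and len(result) >= max_keys:
                break
            i += 1
        return result

    def list_all(self):
        """Return every key; the cost grows with the size of the bucket."""
        self.keys_examined += len(self.keys)
        return list(self.keys)


def dir_exists_naive(bucket, path):
    """Wrong way: fetch all keys, then look for one under the prefix."""
    prefix = path.rstrip("/") + "/"
    return any(key.startswith(prefix) for key in bucket.list_all())


def dir_exists_prefix(bucket, path):
    """Bounded way: ask for at most one key under the prefix."""
    prefix = path.rstrip("/") + "/"
    return len(bucket.list_prefix(prefix, max_keys=1)) > 0


if __name__ == "__main__":
    keys = ["data/part-%05d" % n for n in range(10000)]
    for check in (dir_exists_naive, dir_exists_prefix):
        bucket = ToyBucket(keys)
        found = check(bucket, "checkpoints")
        print(check.__name__, found, bucket.keys_examined)
```

The counter only illustrates the difference in shape between the two approaches; it says nothing about real S3 latencies or request costs, which is exactly what a performance comparison would have to measure.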
On Mon, Oct 9, 2023 at 6:49 PM Matthias Pohl <matthias.p...@aiven.io> wrote:

> I would agree with David's proposal as well.
>
> Would it make sense to come up with some performance comparisons for the
> different S3 implementations in the end? ...just to ensure that we're
> improving things or (at least) don't make things worse. Or is there
> something like that already somewhere?
>
> A bit out of scope:
> We noticed that the FileSystem contract is not well defined. The JavaDoc
> is ambiguous (IMHO) for some operations. For instance, the return value of
> delete [1] is true "if the operation was successful": It's unclear (at
> least to me) what success means here. Is it about the processing (i.e. the
> delete was performed on an existing file) or the outcome (i.e. success is
> reached as well if the file didn't exist in the first place). Removing the
> return type could help to make the contract clearer. In the end, only the
> outcome (i.e. the file doesn't exist anymore) matters in my opinion. A
> similar argument could be applied to mkdirs [2] and rename [3].
>
> That said, I'm not suggesting you adapt the interface as part of your
> work. But it would be good to collect other improvements as part of it. We
> could consider improving the FileSystem interface as part of the 2.0
> efforts as a follow-up.
>
> [1]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L695
> [2]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L706
> [3]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L773
>
> On Tue, Oct 3, 2023 at 6:25 PM Martijn Visser <martijnvis...@apache.org>
> wrote:
>
>> +1 for David's suggestion. We should get away from the current
>> approach with two abstractions and get to one rock solid one.
>>
>> On Mon, Oct 2, 2023 at 11:13 PM David Morávek <d...@apache.org> wrote:
>> >
>> > Hi Maomao,
>> >
>> > I wonder whether it would make sense to take a stab at consolidating the S3
>> > filesystems instead and introduce a native one. The whole Hadoop wrapper
>> > around the S3 client exists for legacy reasons, and it adds complexity and
>> > probably an unnecessary performance penalty.
>> >
>> > If you take a look at the underlying presto implementation, it's actually
>> > not too complex to adapt to Flink interfaces (since you're proposing to
>> > maintain a copy of it anyway).
>> >
>> > Overall, the S3 FS is probably the most used one that we have so this could
>> > be rather high impact. It would also eliminate user confusion when choosing
>> > the implementation to use.
>> >
>> > WDYT?
>> >
>> > Best,
>> > D.
>> >
>> > On Fri, Sep 29, 2023 at 2:41 PM Min, Maomao <mimao...@amazon.com.invalid>
>> > wrote:
>> >
>> > > Hi Flink Dev,
>> > >
>> > > I’m Maomao, a developer from AWS EMR.
>> > >
>> > > Recently, our team is working on adding AWS SDK V2 support for Flink’s S3
>> > > Filesystem. During development, we found out that our work was blocked by
>> > > Presto. This is because that Presto still uses AWS SDK V1 and won’t add
>> > > support for AWS SDK V2 in short term. To unblock, our team proposed several
>> > > options and I’ve created a JIRA issue as here<
>> > > https://issues.apache.org/jira/browse/FLINK-33157>.
>> > >
>> > > Since our team plans to contribute this work back to the community later,
>> > > we’d like to collect feedback from the community about the options we
>> > > proposed in the long term so that the community won’t need to duplicate
>> > > this work in the future.
>> > >
>> > > Best,
>> > > Maomao