Looks like Maomao was left off the previous replies. Adding back 
@Maomao<mailto:mimao...@amazon.com>.

Thanks, everyone, for your responses. We are discussing this within the AWS EMR 
team and will get back to you very soon.

Regards,
Kevin

From: Matthias Pohl <matthias.p...@aiven.io>
Date: Tuesday, October 10, 2023 at 15:35
To: "dev@flink.apache.org" <dev@flink.apache.org>
Cc: "Zhao, Kevin" <kevnz...@amazon.com>, "Josephraj, Prabhu" 
<jopra...@amazon.com>, emr-flink-team <emr-flink-t...@amazon.com>
Subject: RE: [EXTERNAL] Support AWS SDK V2 for Flink's S3 FileSystem


Just to add a bit more context to the performance test question: What I had in 
mind was the exists call on (non-existent) directories in a bucket with a lot 
of objects. A comment from one of the SDK contributors about that call was that 
it could be an expensive call in an object store if implemented incorrectly. I 
would imagine that this could be a valid concern because the concept of 
directories is not really present in an object store like S3, if I'm not 
mistaken?!
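
To make the concern concrete, here is a minimal in-memory sketch; it is not 
the S3 or Flink API. It assumes a bucket behaves like a flat, sorted set of 
object keys, so a "directory" exists only if at least one key starts with its 
prefix. The class and method names (PrefixExistsSketch, 
directoryExistsByFullScan, directoryExistsByPrefixLookup) are hypothetical and 
only illustrate the difference between walking all keys and doing one bounded 
prefix lookup.

```java
import java.util.List;
import java.util.TreeSet;

public class PrefixExistsSketch {
    // Hypothetical stand-in for a bucket: a flat, sorted set of object keys.
    private final TreeSet<String> keys = new TreeSet<>();

    public PrefixExistsSketch(List<String> objectKeys) {
        keys.addAll(objectKeys);
    }

    private static String asPrefix(String dir) {
        return dir.endsWith("/") ? dir : dir + "/";
    }

    /** Naive check: may visit every key; a missing directory is the worst case. */
    public boolean directoryExistsByFullScan(String dir) {
        String prefix = asPrefix(dir);
        for (String key : keys) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Bounded check: inspects only the first key at or after the prefix. */
    public boolean directoryExistsByPrefixLookup(String dir) {
        String prefix = asPrefix(dir);
        // Keys sharing the prefix sort directly after it, so one lookup suffices.
        String candidate = keys.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    public static void main(String[] args) {
        PrefixExistsSketch bucket = new PrefixExistsSketch(
                List.of("data/a.txt", "data/sub/b.txt", "logs/x.log"));
        System.out.println(bucket.directoryExistsByPrefixLookup("data/sub"));
        System.out.println(bucket.directoryExistsByPrefixLookup("missing"));
    }
}
```

Against a real object store, the bounded variant would roughly correspond to 
a list request restricted to the prefix and to a single result, but whether 
any given implementation does that is exactly what a performance test would 
have to show.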

On Mon, Oct 9, 2023 at 6:49 PM Matthias Pohl 
<matthias.p...@aiven.io<mailto:matthias.p...@aiven.io>> wrote:
I would agree with David's proposal as well.

Would it make sense to come up with some performance comparisons for the 
different S3 implementations in the end? ...just to ensure that we're improving 
things or (at least) don't make things worse. Or is there something like that 
already somewhere?

A bit out of scope:
We noticed that the FileSystem contract is not well defined. The JavaDoc is 
ambiguous (IMHO) for some operations. For instance, the return value of delete 
[1] is true "if the operation was successful": It's unclear (at least to me) 
what success means here. Is it about the processing (i.e. the delete was 
performed on an existing file) or the outcome (i.e. success is also reached 
if the file didn't exist in the first place)? Removing the return value could 
help to make the contract clearer. In the end, only the outcome (i.e. the file 
doesn't exist anymore) matters in my opinion. A similar argument could be 
applied to mkdirs [2] and rename [3].
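
The two readings of the delete return value can be sketched side by side. 
This is a hypothetical in-memory illustration, not Flink's FileSystem: the 
class DeleteContractSketch and its method names are made up here, and the 
"file system" is just a set of path strings.

```java
import java.util.HashSet;
import java.util.Set;

public class DeleteContractSketch {
    // Hypothetical stand-in for a file system: the set of existing paths.
    private final Set<String> files = new HashSet<>();

    public void create(String path) {
        files.add(path);
    }

    public boolean exists(String path) {
        return files.contains(path);
    }

    /** "Processing" reading: true only if this call removed an existing file. */
    public boolean deleteReportingProcessing(String path) {
        return files.remove(path);
    }

    /** "Outcome" reading: true whenever the file is absent after the call. */
    public boolean deleteReportingOutcome(String path) {
        files.remove(path);
        return !files.contains(path);
    }

    /** No return value: callers rely only on the postcondition "file is gone". */
    public void deleteIfExists(String path) {
        files.remove(path);
    }

    public static void main(String[] args) {
        DeleteContractSketch fs = new DeleteContractSketch();
        // Deleting a file that never existed: the two readings disagree.
        System.out.println(fs.deleteReportingProcessing("/tmp/never-created"));
        System.out.println(fs.deleteReportingOutcome("/tmp/never-created"));
    }
}
```

For a path that never existed, the first variant reports false and the second 
reports true, although the resulting state is identical. The void variant 
avoids the question by exposing only the postcondition.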

That said, I'm not suggesting you adapt the interface as part of your work. But 
it would be good to collect other improvements as part of it. We could consider 
improving the FileSystem interface as part of the 2.0 efforts as a follow-up.

[1] 
https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L695
[2] 
https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L706
[3] 
https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L773

On Tue, Oct 3, 2023 at 6:25 PM Martijn Visser 
<martijnvis...@apache.org<mailto:martijnvis...@apache.org>> wrote:
+1 for David's suggestion. We should get away from the current
approach with two abstractions and get to one rock solid one.

On Mon, Oct 2, 2023 at 11:13 PM David Morávek 
<d...@apache.org<mailto:d...@apache.org>> wrote:
>
> Hi Maomao,
>
> I wonder whether it would make sense to take a stab at consolidating the S3
> filesystems instead and introduce a native one. The whole Hadoop wrapper
> around the S3 client exists for legacy reasons, and it adds complexity and
> probably an unnecessary performance penalty.
>
> If you take a look at the underlying presto implementation, it's actually
> not too complex to adapt to Flink interfaces (since you're proposing to
> maintain a copy of it anyway).
>
> Overall, the S3 FS is probably the most used one that we have so this could
> be rather high impact. It would also eliminate user confusion when choosing
> the implementation to use.
>
> WDYT?
>
> Best,
> D.
>
> On Fri, Sep 29, 2023 at 2:41 PM Min, Maomao <mimao...@amazon.com.invalid>
> wrote:
>
> > Hi Flink Dev,
> >
> > I’m Maomao, a developer from AWS EMR.
> >
> > Recently, our team has been working on adding AWS SDK V2 support for Flink’s
> > S3 Filesystem. During development, we found that our work was blocked by
> > Presto, because Presto still uses AWS SDK V1 and won’t add support for
> > AWS SDK V2 in the short term. To unblock this, our team proposed several
> > options, and I’ve created a JIRA issue here<
> > https://issues.apache.org/jira/browse/FLINK-33157>.
> >
> > Since our team plans to contribute this work back to the community later,
> > we’d like to collect feedback from the community about the options we
> > proposed in the long term so that the community won’t need to duplicate
> > this work in the future.
> >
> > Best,
> > Maomao
> >
> >
