Hi Gabor,

Apologies for the delayed response.

> - A migration guide would be excellent from the old connectors. That way
> users can see how much effort it is.

Yes, that’s one of the key aspects. I’ve tested the patch on S3. The
configuration remains exactly the same. The only change required is to
place the new `flink-s3-fs-native` JAR in the `plugins` directory and
remove the `flink-s3-fs-hadoop` JAR from there.
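For reference, the swap itself is just a couple of file moves. A rough
sketch below (the plugin sub-directory names and the JAR version are
placeholders; Flink only expects each filesystem plugin in its own folder
under `plugins/`):

    # remove the old Hadoop-based S3 plugin
    rm -r $FLINK_HOME/plugins/s3-fs-hadoop

    # add the new native S3 plugin (directory name is illustrative)
    mkdir -p $FLINK_HOME/plugins/s3-fs-native
    cp flink-s3-fs-native-<version>.jar $FLINK_HOME/plugins/s3-fs-native/
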
I haven’t documented a detailed design or migration plan yet. I’m waiting
for the first round of benchmark and comparison test results.

> - One of the key points from operational perspective is to have a way to
> make IOPS usage configurable. As on oversimplified explanation just to get
> a taste this can be kept under control in 2 ways and places:
>   1. In Hadoop s3a set `fs.s3a.limit.total`
>   2. In connector set `s3.multipart.upload.min.file.size` and
> `s3.multipart.upload.min.part.size`
> Do I understand it correctly that this is intended to be covered by the
> following configs?

> | s3.upload.min.part.size | 5242880 | Minimum part size for multipart
> uploads (5MB) |
> | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent
> uploads per stream |

Yes, the POC patch currently includes three configurations [1]:
1. `s3.upload.min.part.size`
2. `s3.upload.max.concurrent.uploads`
3. `s3.read.buffer.size`

The idea is to start by supporting configurable IOPS through these
parameters.
Do you think these minimal configs are sufficient to begin with?
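
For example, in `flink-conf.yaml` this would look roughly like the sketch
below (the values are purely illustrative; per the POC the defaults are
5 MB for the minimum part size and the number of CPU cores for concurrent
uploads):

    s3.upload.min.part.size: 10485760       # 10 MB parts (POC default: 5 MB)
    s3.upload.max.concurrent.uploads: 4     # cap parallel part uploads (POC default: CPU cores)
    s3.read.buffer.size: 1048576            # 1 MB read buffer (illustrative value)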

> > I am now drafting a formal benchmark plan based on these specifics and
> > will share it with this thread in the coming days for feedback.
> Waiting for the details.

Still waiting for my employer to approve resources for this 😅

Cheers,
Samrat

[1]
https://github.com/apache/flink/pull/27187/files#diff-f1e31c70c03cb943bc0e62fe456ca8d0b6bb63ae56c062d68f54ce2806b43f45R38


On Wed, Nov 5, 2025 at 5:34 PM Gabor Somogyi <[email protected]>
wrote:

> Hi Samrat,
>
> Thanks for the contribution! I've had a slight look at the code which is
> promising.
>
> I've a couple of questions/remarks:
> - A migration guide would be excellent from the old connectors. That way
> users can see how much effort it is.
> - One of the key points from operational perspective is to have a way to
> make IOPS usage
> configurable. As on oversimplified explanation just to get a taste this can
> be kept under control in 2 ways and places:
>   1. In Hadoop s3a set `fs.s3a.limit.total`
>   2. In connector set `s3.multipart.upload.min.file.size` and
> `s3.multipart.upload.min.part.size`
> Do I understand it correctly that this is intended to be covered by the
> following configs?
>
> | s3.upload.min.part.size | 5242880 | Minimum part size for multipart
> uploads (5MB) |
> | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent uploads
> per stream |
>
> > I am now drafting a formal benchmark plan based on these specifics and
> will share it with this thread in the coming days for feedback.
> Waiting for the details.
>
> BR,
> G
>
>
> On Wed, Nov 5, 2025 at 7:08 AM Samrat Deb <[email protected]> wrote:
>
> > Hi all,
> >
> > I have a working POC for the Native S3 filesystem, which is now available
> > as a draft PR [1].
> > The POC is functional and has been validated in a local setup with Minio.
> > It's important to note that it does not yet have complete test coverage.
> >
> > The immediate next step is to conduct a comprehensive benchmark to
> compare
> > its performance against the existing `flink-s3-fs-hadoop` and
> > `flink-s3-fs-presto` implementations.
> >
> > I've had a very meaningful discussion with Piotr Nowojski about this
> > offline. I am grateful for his detailed guidance on defining a rigorous
> > benchmarking strategy, including specific cluster configurations, job
> > workloads, and key metrics for evaluating both checkpoint/recovery
> > performance and pure throughput.
> > I am now drafting a formal benchmark plan based on these specifics and
> will
> > share it with this thread in the coming days for feedback.
> >
> > Cheers,
> > Samrat
> >
> >  [1] https://github.com/apache/flink/pull/27187
> >
> > On Wed, Oct 29, 2025 at 9:31 PM Samrat Deb <[email protected]>
> wrote:
> >
> > > thank you  Martijn for clarifying .
> > > i will proceed with creating a task.
> > >
> > > Thanks Mate for the pointer to Minio for testing.
> > > minio is good to use for testing .
> > >
> > >
> > > Cheers,
> > > Samrat
> > >
> > >
> > > On Mon, 27 Oct 2025 at 11:55 PM, Mate Czagany <[email protected]>
> > wrote:
> > >
> > >> Hi,
> > >>
> > >> Just to add to the MinIO licensing concerns, I could not see any
> recent
> > >> change to the license itself, they have changed the license from
> Apache
> > >> 2.0
> > >> to AGPL-3.0 in 2021, and the Docker image used by the tests (which is
> > from
> > >> 2022) already contains the AGPL-3.0 license. This should not be an
> issue
> > >> as
> > >> Flink does not distribute nor makes MinIO available over the network,
> > it's
> > >> only used by the tests.
> > >>
> > >> What's changed recently is that MinIO no longer publishes Docker
> images
> > to
> > >> the public [1], so it might be worth it to look into using alternative
> > >> solutions in the future, e.g. Garage [2].
> > >>
> > >> Best regards,
> > >> Mate
> > >>
> > >> [1]
> https://github.com/minio/minio/issues/21647#issuecomment-3418675115
> > >> [2] https://garagehq.deuxfleurs.fr/
> > >>
> > >> On Mon, Oct 27, 2025 at 5:48 PM Ferenc Csaky
> <[email protected]
> > >
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > Really nice to see people chime into this thread. I agree with
> Martijn
> > >> > about the
> > >> > development approach. There will be some iterations until we can
> > >> stabilize
> > >> > this anyways,
> > >> > so we can try to shoot getting out a good enough MVP, then fix
> issues
> > +
> > >> > reach feature
> > >> > parity with the existing implementations on the go.
> > >> >
> > >> > I am not a licensing expert but AFAIK the previous images that were
> > >> > released under the
> > >> > acceptable license can be continued to use. For most integration
> > tests,
> > >> we
> > >> > use an
> > >> > ancient image anyways [1]. There is another place where the latest
> img
> > >> > gets pulled [2],
> > >> > I guess it would be good to apply an explicit that tag there. But
> > AFAIK
> > >> > they stop
> > >> > publishing to Docker Hub, so I would anticipate we cannot end up
> > pulling
> > >> > an image with
> > >> > a forbidden license.
> > >> >
> > >> > Best,
> > >> > Ferenc
> > >> >
> > >> > [1]
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-test-utils-parent/flink-test-utils-junit/src/main/java/org/apache/flink/util/DockerImageVersions.java#L39
> > >> > [2]
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-end-to-end-tests/test-scripts/common_s3_minio.sh#L51
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Sunday, October 26th, 2025 at 22:05, Martijn Visser <
> > >> > [email protected]> wrote:
> > >> >
> > >> > >
> > >> > >
> > >> > > Hi Samrat,
> > >> > >
> > >> > > First of all, thanks for the proposal. It's long overdue to get
> this
> > >> in a
> > >> > > better state.
> > >> > >
> > >> > > With regards to the schemes, I would say to ship an initial
> release
> > >> that
> > >> > > does not include support for s3a and s3p, and focus first on
> getting
> > >> this
> > >> > > new implementation into a stable state. When that's done, as a
> > >> follow-up,
> > >> > > we can consider adding support for s3a and s3p on this
> > implementation,
> > >> > and
> > >> > > when that's there consider deprecating the older implementations.
> It
> > >> will
> > >> > > probably take multiple releases before we have this in a stable
> > state.
> > >> > >
> > >> > > Not directly related to this, but given that MinIO decided to
> change
> > >> > their
> > >> > > license, do we also need to refactor existing tests to not use
> MinIO
> > >> > > anymore but something else?
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Martijn
> > >> > >
> > >> > > On Sat, Oct 25, 2025 at 1:38 AM Samrat Deb [email protected]
> > >> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > One clarifying question regarding the URI schemes:
> > >> > > >
> > >> > > > Currently, the Flink ecosystem uses multiple schemes to
> > >> differentiate
> > >> > > > between S3 implementations: s3a:// for the Hadoop-based
> connector
> > >> and
> > >> > > > s3p://[1] for the Presto-based one, which is often recommended
> for
> > >> > > > checkpointing.
> > >> > > >
> > >> > > > A key goal of the proposed flink-s3-fs-native is to unify these
> > >> into a
> > >> > > > single implementation. With that in mind, what should be the
> > >> strategy
> > >> > for
> > >> > > > scheme support? Should the new native s3 filesystem register
> only
> > >> for
> > >> > the
> > >> > > > simple s3:// scheme, aiming to deprecate the others? Or would it
> > be
> > >> > > > beneficial to also support s3a:// and s3p:// to provide a
> smoother
> > >> > > > migration path for users who may have these schemes in their
> > >> existing
> > >> > job
> > >> > > > configurations?
> > >> > > > Cheers,
> > >> > > > Samrat
> > >> > > >
> > >> > > > [1] https://github.com/generalui/s3p
> > >> > > >
> > >> > > > On Wed, Oct 22, 2025 at 6:31 PM Piotr Nowojski
> > [email protected]
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Samrat,
> > >> > > > >
> > >> > > > > > 1. Even if the specifics are hazy, could you recall the
> > general
> > >> > > > > > nature of those concerns? For instance, were they related to
> > >> S3's
> > >> > > > > > eventual
> > >> > > > > > consistency model, which has since improved, the atomicity
> of
> > >> > Multipart
> > >> > > > > > Upload commits, or perhaps complex failure/recovery
> scenarios
> > >> > during
> > >> > > > > > the
> > >> > > > > > commit phase?
> > >> > > > >
> > >> > > > > and
> > >> > > > >
> > >> > > > > > *8. *The flink-s3-fs-presto connector explicitly throws an
> > >> > > > > > `UnsupportedOperationException` when
> > >> `createRecoverableWriter()` is
> > >> > > > > > called.
> > >> > > > > > Was this a deliberate design choice to keep the Presto
> > connector
> > >> > > > > > lightweight and optimized specifically for checkpointing, or
> > >> were
> > >> > there
> > >> > > > > > other technical challenges that prevented its implementation
> > at
> > >> the
> > >> > > > > > time?
> > >> > > > > > Any context on this would be very helpful
> > >> > > > >
> > >> > > > > I very vaguely remember that at least one of those concerns
> was
> > >> with
> > >> > > > > respect to how long
> > >> > > > > does it take for the S3 to make some certain operations
> visible.
> > >> > That you
> > >> > > > > think you have
> > >> > > > > uploaded and committed a file, but in reality it might not be
> > >> > visible for
> > >> > > > > tens of seconds.
> > >> > > > >
> > >> > > > > Sorry, I don't remember more (or even if there was more). I
> was
> > >> only
> > >> > > > > superficially involved
> > >> > > > > in the S3 connector back then - just participated/overheard
> some
> > >> > > > > discussions.
> > >> > > > >
> > >> > > > > > 2. It's clear that implementing an efficient
> > >> > > > > > PathsCopyingFileSystem[2]
> > >> > > > > > is
> > >> > > > > > a non-negotiable requirement for performance. Is there any
> > >> > benchmark
> > >> > > > > > numbers available that can be used as reference and evaluate
> > new
> > >> > > > > > implementation deviation ?
> > >> > > > >
> > >> > > > > I only have the numbers that I put in the original Flip [1]. I
> > >> don't
> > >> > > > > remember the benchmark
> > >> > > > > setup, but it must have been something simple. Like just let
> > some
> > >> job
> > >> > > > > accumulate 1GB of state
> > >> > > > > and measure how long the state downloading phase of recovery
> was
> > >> > taking.
> > >> > > > >
> > >> > > > > > 3. Do you recall the workload characteristics for that PoC?
> > >> > > > > > Specifically,
> > >> > > > > > was the 30-40% performance advantage of s5cmd observed when
> > >> copying
> > >> > > > > > many
> > >> > > > > > small files (like checkpoint state) or larger,
> multi-gigabyte
> > >> > files?
> > >> > > > >
> > >> > > > > It was just a regular mix of compacted RocksDB sst files, with
> > >> total
> > >> > > > > state
> > >> > > > > size 1 or at most
> > >> > > > > a couple of GBs. So most of the files were around ~64MB or
> > ~128MB,
> > >> > with a
> > >> > > > > couple of
> > >> > > > > smaller L0 files, and maybe one larger L2 file.
> > >> > > > >
> > >> > > > > > 4. The idea of a switchable implementation sounds great.
> Would
> > >> you
> > >> > > > > > envision this as a configuration flag (e.g.,
> > >> > > > > > s3.native.copy.strategy=s5cmd
> > >> > > > > > or s3.native.copy.strategy=sdk) that selects the backend
> > >> > implementation
> > >> > > > > > at
> > >> > > > > > runtime? Also on contrary is it worth adding configuration
> > that
> > >> > exposes
> > >> > > > > > some level of implementation level information ?
> > >> > > > >
> > >> > > > > I think something like that should be fine, assuming that
> > `s5cmd`
> > >> > will
> > >> > > > > again
> > >> > > > > prove significantly faster and/or more cpu efficient. If not,
> if
> > >> the
> > >> > > > > SDKv2
> > >> > > > > has
> > >> > > > > already improved and caught up with the `s5cmd`, then it
> > probably
> > >> > doesn't
> > >> > > > > make sense to keep `s5cmd` support.
> > >> > > > >
> > >> > > > > > 5. My understanding is that the key takeaway here is to
> avoid
> > >> the
> > >> > > > > > file-by-file stream-based copy used in the vanilla connector
> > and
> > >> > > > > > leverage
> > >> > > > > > bulk operations, which PathsCopyingFileSystem[2] enables.
> This
> > >> > seems
> > >> > > > > > most
> > >> > > > > > critical during state download on recovery. please suggest
> if
> > my
> > >> > > > > > inference
> > >> > > > > > is in right direction
> > >> > > > >
> > >> > > > > Yes, but you should also make the bult transfer configurable.
> > How
> > >> > many
> > >> > > > > bulk
> > >> > > > > transfers
> > >> > > > > can be happening in parallel etc.
> > >> > > > >
> > >> > > > > > 6. The warning about `s5cmd` causing OOMs sounds like
> > >> indication to
> > >> > > > > > consider `S3TransferManager`[3] implementation, which might
> > >> offer
> > >> > more
> > >> > > > > > granular control over buffering and in-flight requests. Do
> you
> > >> > think
> > >> > > > > > exploring more on `S3TransferManager` would be valuable ?
> > >> > > > >
> > >> > > > > I'm pretty sure if you start hundreds of bulk transfers in
> > >> parallel
> > >> > via
> > >> > > > > the
> > >> > > > > `S3TransferManager` you can get the same problems with running
> > >> out of
> > >> > > > > memory or exceeding available network throughput. I don't know
> > if
> > >> > > > > `S3TransferManager` is better or worse in that regard to be
> > >> honest.
> > >> > > > >
> > >> > > > > > 7. The insight on AWS aggressively dropping packets instead
> of
> > >> > > > > > gracefully
> > >> > > > > > throttling is invaluable. Currently i have limited
> > understanding
> > >> > on how
> > >> > > > > > aws
> > >> > > > > > behaves at throttling I will deep dive more into it and
> > >> > > > > > look for clarification based on findings or doubt. To
> counter
> > >> this,
> > >> > > > > > were
> > >> > > > > > you thinking of a configurable rate limiter within the
> > >> filesystem
> > >> > > > > > itself
> > >> > > > > > (e.g., setting max bandwidth or max concurrent requests), or
> > >> > something
> > >> > > > > > more
> > >> > > > > > dynamic that could adapt to network conditions?
> > >> > > > >
> > >> > > > > Flat rate limiting is tricky because AWS offers burst network
> > >> > capacity,
> > >> > > > > which
> > >> > > > > comes very handy, and in the vast majority of cases works
> fine.
> > >> But
> > >> > for
> > >> > > > > some jobs
> > >> > > > > if you exceed that burst capacity, AWS starts dropping your
> > >> packets
> > >> > and
> > >> > > > > then the
> > >> > > > > problems happen. On the other hand, if rate limit to your
> normal
> > >> > > > > capacity,
> > >> > > > > you
> > >> > > > > are leaving a lot of network throughput unused during
> > recoveries.
> > >> > > > >
> > >> > > > > At the same time AWS doesn't share details for the burst
> > >> capacity, so
> > >> > > > > it's
> > >> > > > > sometimes
> > >> > > > > tricky to configure the whole system properly. I don't have an
> > >> > universal
> > >> > > > > good answer
> > >> > > > > for that :(
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > Piotrek
> > >> > > > >
> > >> > > > > wt., 21 paź 2025 o 21:40 Samrat Deb [email protected]
> > >> > napisał(a):
> > >> > > > >
> > >> > > > > > Hi Gabor/ Ferenc
> > >> > > > > >
> > >> > > > > > Thank you for sharing the pointer and valuable feedback.
> > >> > > > > >
> > >> > > > > > The link to the custom `XmlResponsesSaxParser`[1] looks
> scary
> > 😦
> > >> > > > > > and contains hidden complexity.
> > >> > > > > >
> > >> > > > > > 1. Could you share some context on why this custom parser
> was
> > >> > > > > > necessary?
> > >> > > > > > Was it to work around a specific bug, a performance issue,
> or
> > an
> > >> > > > > > inconsistency in the S3 XML API responses that the default
> AWS
> > >> SDK
> > >> > > > > > parser
> > >> > > > > > couldn't handle at the time? With sdk v2 what are core
> > >> > functionality
> > >> > > > > > that
> > >> > > > > > is required to be intensively tested ?
> > >> > > > > >
> > >> > > > > > 2. You mentioned it has no Hadoop dependency, which is great
> > >> news.
> > >> > > > > > For
> > >> > > > > > a
> > >> > > > > > new native S3 connector, would integration simply require
> > >> > implementing
> > >> > > > > > a
> > >> > > > > > new S3DelegationTokenProvider/Receiver pair using the AWS
> SDK,
> > >> or
> > >> > are
> > >> > > > > > there
> > >> > > > > > more subtle integration points with the framework that
> should
> > be
> > >> > > > > > accounted?
> > >> > > > > >
> > >> > > > > > 3. I remember solving Serialized Throwable exception issue
> [2]
> > >> > > > > > leading
> > >> > > > > > to
> > >> > > > > > a new bug [3], where an initial fix led to a regression that
> > >> Gabor
> > >> > > > > > later
> > >> > > > > > solved with Ferenc providing a detailed root cause insights
> > [4]
> > >> 😅.
> > >> > > > > > Its hard to fully sure that all scenarios are covered
> > properly.
> > >> > This is
> > >> > > > > > one
> > >> > > > > > of the example, there can be other unknowns.
> > >> > > > > > what would be the best approach to test for and prevent such
> > >> > > > > > regressions
> > >> > > > > > or
> > >> > > > > > unknown unknowns, especially in the most sensitive parts of
> > the
> > >> > > > > > filesystem
> > >> > > > > > logic?
> > >> > > > > >
> > >> > > > > > Cheers,
> > >> > > > > > Samrat
> > >> > > > > >
> > >> > > > > > [1]
> > >> > > >
> > >> > > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java
> > >> > > >
> > >> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28513
> > >> > > > > > [3] https://github.com/apache/flink/pull/25231
> > >> > > > > > [4]
> > >> > https://github.com/apache/flink/pull/25231#issuecomment-2312059662
> > >> > > > > >
> > >> > > > > > On Tue, 21 Oct 2025 at 3:49 PM, Gabor Somogyi <
> > >> > > > > > [email protected]
> > >> > > > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi Samrat,
> > >> > > > > > >
> > >> > > > > > > +1 on the direction that we move away from hadoop.
> > >> > > > > > >
> > >> > > > > > > This is a long standing discussion to replace the
> mentioned
> > 2
> > >> > > > > > > connectors
> > >> > > > > > > with something better.
> > >> > > > > > > Both of them has it's own weaknesses, I've fixed several
> > >> blockers
> > >> > > > > > > inside
> > >> > > > > > > them.
> > >> > > > > > >
> > >> > > > > > > There are definitely magic inside them, please see this
> [1]
> > >> for
> > >> > > > > > > example
> > >> > > > > > > and
> > >> > > > > > > there are more🙂
> > >> > > > > > > I think the most sensitive part is the recovery because
> hard
> > >> to
> > >> > test
> > >> > > > > > > all
> > >> > > > > > > cases.
> > >> > > > > > >
> > >> > > > > > > @Ferenc
> > >> > > > > > >
> > >> > > > > > > > One thing that comes to my mind that will need some
> > changes
> > >> > and its
> > >> > > > > > > > involvement
> > >> > > > > > > > to this change is not trivial is the delegation token
> > >> > framework.
> > >> > > > > > > > Currently
> > >> > > > > > > > it
> > >> > > > > > > > is also tied to the Hadoop stuff and has some abstract
> > >> classes
> > >> > in the
> > >> > > > > > > > base
> > >> > > > > > > > S3 FS
> > >> > > > > > > > module.
> > >> > > > > > >
> > >> > > > > > > The delegation token framework has no dependency on hadoop
> > so
> > >> > there
> > >> > > > > > > is
> > >> > > > > > > no
> > >> > > > > > > blocker on the road,
> > >> > > > > > > but I'm here to help if any question appears.
> > >> > > > > > >
> > >> > > > > > > BR,
> > >> > > > > > > G
> > >> > > > > > >
> > >> > > > > > > [1]
> > >> > > >
> > >> > > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java#L95-L104
> > >> > > >
> > >> > > > > > > On Tue, Oct 14, 2025 at 8:19 PM Samrat Deb
> > >> [email protected]
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi All,
> > >> > > > > > > >
> > >> > > > > > > > Poorvank (cc'ed) and I are writing to start a discussion
> > >> about
> > >> > a
> > >> > > > > > > > potential
> > >> > > > > > > > improvement for Flink, creating a new, native S3
> > filesystem
> > >> > > > > > > > independent
> > >> > > > > > > > of
> > >> > > > > > > > Hadoop/Presto.
> > >> > > > > > > >
> > >> > > > > > > > The goal of this proposal is to address several
> challenges
> > >> > related
> > >> > > > > > > > to
> > >> > > > > > > > Flink's S3 integration, simplifying flink-s3-filesystem.
> > If
> > >> > this
> > >> > > > > > > > discussion
> > >> > > > > > > > gains positive traction, the next step would be to move
> > >> forward
> > >> > > > > > > > with
> > >> > > > > > > > a
> > >> > > > > > > > formalised FLIP.
> > >> > > > > > > >
> > >> > > > > > > > The Challenges with the Current S3 Connectors
> > >> > > > > > > > Currently, Flink offers two primary S3 filesystems,
> > >> > > > > > > > flink-s3-fs-hadoop[1]
> > >> > > > > > > > and flink-s3-fs-presto[2]. While functional, this
> > >> > dual-connector
> > >> > > > > > > > approach
> > >> > > > > > > > has few issues:
> > >> > > > > > > >
> > >> > > > > > > > 1. The flink-s3-fs-hadoop connector adds an additional
> > >> > dependency
> > >> > > > > > > > to
> > >> > > > > > > > manage. Upgrades like AWS SDK v2 are more dependent on
> > >> > > > > > > > Hadoop/Presto
> > >> > > > > > > > to
> > >> > > > > > > > support first and leverage in flink-s3-filesystem.
> > Sometimes
> > >> > it's
> > >> > > > > > > > restrictive to leverage features directly from the AWS
> > SDK.
> > >> > > > > > > >
> > >> > > > > > > > 2. The flink-s3-fs-presto connector was introduced to
> > >> mitigate
> > >> > the
> > >> > > > > > > > performance issues of the Hadoop connector, especially
> for
> > >> > > > > > > > checkpointing.
> > >> > > > > > > > However, it lacks a RecoverableWriter implementation.
> > >> > > > > > > > Sometimes it's confusing for Flink users, highlighting
> the
> > >> need
> > >> > > > > > > > for a
> > >> > > > > > > > single, unified solution.
> > >> > > > > > > >
> > >> > > > > > > > Proposed Solution:
> > >> > > > > > > > A Native, Hadoop-Free S3 Filesystem
> > >> > > > > > > >
> > >> > > > > > > > I propose we develop a new filesystem, let's call it
> > >> > > > > > > > flink-s3-fs-native,
> > >> > > > > > > > built directly on the modern AWS SDK for Java v2. This
> > >> approach
> > >> > > > > > > > would
> > >> > > > > > > > be
> > >> > > > > > > > free of any Hadoop or Presto dependencies. I have done a
> > >> small
> > >> > > > > > > > prototype
> > >> > > > > > > > to
> > >> > > > > > > > validate [3]
> > >> > > > > > > >
> > >> > > > > > > > This is motivated by trino<>s3 [4]. The Trino project
> > >> > successfully
> > >> > > > > > > > undertook a similar migration, moving from Hadoop-based
> > >> object
> > >> > > > > > > > storage
> > >> > > > > > > > clients to their own native implementations.
> > >> > > > > > > >
> > >> > > > > > > > The new Flink S3 filesystem would:
> > >> > > > > > > >
> > >> > > > > > > > 1. Provide a single, unified connector for all S3
> > >> interactions,
> > >> > > > > > > > from
> > >> > > > > > > > state
> > >> > > > > > > > backends to sinks.
> > >> > > > > > > >
> > >> > > > > > > > 2. Implement a high-performance S3RecoverableWriter
> using
> > >> S3's
> > >> > > > > > > > Multipart
> > >> > > > > > > > Upload feature, ensuring exactly-once sink semantics.
> > >> > > > > > > >
> > >> > > > > > > > 3. Offer a clean, self-contained dependency, drastically
> > >> > > > > > > > simplifying
> > >> > > > > > > > setup
> > >> > > > > > > > and eliminating external dependencies.
> > >> > > > > > > >
> > >> > > > > > > > A Phased Migration Path
> > >> > > > > > > > To ensure a smooth transition, we could adopt a phased
> > >> > approach on
> > >> > > > > > > > a
> > >> > > > > > > > very
> > >> > > > > > > > high level :
> > >> > > > > > > >
> > >> > > > > > > > Phase 1:
> > >> > > > > > > > Introduce the new native S3 filesystem as an optional,
> > >> parallel
> > >> > > > > > > > plugin.
> > >> > > > > > > > This would allow for community testing and adoption
> > without
> > >> > > > > > > > breaking
> > >> > > > > > > > existing setups.
> > >> > > > > > > >
> > >> > > > > > > > Phase 2:
> > >> > > > > > > > Once the native connector achieves feature parity and
> > proven
> > >> > > > > > > > stability,
> > >> > > > > > > > we
> > >> > > > > > > > will update the documentation to recommend it as the
> > default
> > >> > choice
> > >> > > > > > > > for
> > >> > > > > > > > all
> > >> > > > > > > > S3 use cases.
> > >> > > > > > > >
> > >> > > > > > > > Phase 3:
> > >> > > > > > > > In a future major release, the legacy flink-s3-fs-hadoop
> > and
> > >> > > > > > > > flink-s3-fs-presto connectors could be formally
> > deprecated,
> > >> > with
> > >> > > > > > > > clear
> > >> > > > > > > > migration guides provided for users.
> > >> > > > > > > >
> > >> > > > > > > > I would love to hear the community's thoughts on this.
> > >> > > > > > > >
> > >> > > > > > > > A few questions to start the discussion:
> > >> > > > > > > >
> > >> > > > > > > > 1. What are the biggest pain points with the current S3
> > >> > filesystem?
> > >> > > > > > > >
> > >> > > > > > > > 2. Are there any critical features from the Hadoop S3A
> > >> client
> > >> > that
> > >> > > > > > > > are
> > >> > > > > > > > essential to replicate in a native implementation?
> > >> > > > > > > >
> > >> > > > > > > > 3. Would a simplified, non-dependent S3 experience be a
> > >> > valuable
> > >> > > > > > > > improvement for Flink use cases?
> > >> > > > > > > >
> > >> > > > > > > > Cheers,
> > >> > > > > > > > Samrat
> > >> > > > > > > >
> > >> > > > > > > > [1]
> > >> > > >
> > >> > > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-hadoop
> > >> > > >
> > >> > > > > > > > [2]
> > >> > > >
> > >> > > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-presto
> > >> > > >
> > >> > > > > > > > [3] https://github.com/Samrat002/flink/pull/4
> > >> > > > > > > > [4]
> > >> > > > > > > >
> > >> >
> https://github.com/trinodb/trino/tree/master/lib/trino-filesystem-s3
> > >> >
> > >>
> > >
> >
>
