Re: [DISCUSSION] Native S3 Filesystem in Apache Flink

Samrat Deb Tue, 03 Feb 2026 13:42:27 -0800

Hi,
I benchmarked state checkpointing to S3, comparing the proposed native S3
implementation with flink-s3-fs-presto.
The results are promising: the native implementation performs better under
the setup used.
PTAL at the benchmark document for a detailed analysis with logs and setup
details [1].


As a next step, FLIP-555 [2] is out for review. PTAL.

Cheers,
Samrat

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-555%3A+Flink+Native+S3+FileSystem


On Wed, Nov 12, 2025 at 1:21 AM Samrat Deb <[email protected]> wrote:

> Hi Gabor,
>
> Apologies for the delayed response.
>
> > - A migration guide would be excellent from the old connectors. That way
> users can see how much effort it is.
>
> Yes, that’s one of the key aspects. I’ve tested the patch on S3. The
> configuration remains exactly the same. The only change required is to
> place the new `flink-s3-fs-native` JAR in the `plugins` directory and
> remove the `flink-s3-fs-hadoop` JAR from there.
> I haven’t documented a detailed design or migration plan yet. I’m waiting
> for the first round of benchmark and comparison test results.
>
> > - One of the key points from operational perspective is to have a way to
> > make IOPS usage
> > configurable. As an oversimplified explanation, just to give a taste, this
> > can be kept under control in two ways and places:
> >  1. In Hadoop s3a set `fs.s3a.limit.total`
> >  2. In connector set `s3.multipart.upload.min.file.size` and
> > `s3.multipart.upload.min.part.size`
> > Do I understand it correctly that this is intended to be covered by the
> > following configs?
>
> > | s3.upload.min.part.size | 5242880 | Minimum part size for multipart
> > uploads (5MB) |
> > | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent
> uploads
> > per stream |
>
> Yes, the POC patch currently includes three configurations[1]:
> 1. `s3.upload.min.part.size`
> 2. `s3.upload.max.concurrent.uploads`
> 3. `s3.read.buffer.size`
>
> The idea is to start by supporting configurable IOPS through these
> parameters.
> Do you think these minimal configs are sufficient to begin with?
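
As a rough illustration of why the part-size setting matters for IOPS, the sketch below estimates how many S3 requests a single multipart upload issues. It is not code from the POC: `plan_multipart_upload` is a hypothetical helper, its `part_size` argument merely plays the role of `s3.upload.min.part.size`, and it assumes the documented S3 multipart limits (5 MiB minimum for every part except the last, at most 10,000 parts per upload).

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB, matches the 5242880 default above
MAX_PARTS = 10_000               # S3 limit on parts per multipart upload


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding on large sizes."""
    return -(-a // b)


def plan_multipart_upload(object_size: int, part_size: int = MIN_PART_SIZE) -> dict:
    """Estimate part size, part count and request count for one upload.

    Hypothetical helper for illustration only.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    # Grow the part size if the object would otherwise exceed 10,000 parts.
    effective = max(part_size, ceil_div(object_size, MAX_PARTS))
    parts = ceil_div(object_size, effective)
    # One CreateMultipartUpload, one UploadPart per part, one CompleteMultipartUpload.
    return {"part_size": effective, "parts": parts, "requests": parts + 2}
```

With the 5 MiB default, a 128 MiB file is uploaded in 26 parts (28 requests); raising the part size to 16 MiB brings that down to 8 parts (10 requests), at the cost of larger in-memory buffers per part.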
>
> > > I am now drafting a formal benchmark plan based on these specifics and
> will share it with this thread in the coming days for feedback.
> > Waiting for the details.
>
> Still waiting for my employer to approve resources for this 😅
>
> Cheers,
> Samrat
>
> [1]
> https://github.com/apache/flink/pull/27187/files#diff-f1e31c70c03cb943bc0e62fe456ca8d0b6bb63ae56c062d68f54ce2806b43f45R38
>
>
> On Wed, Nov 5, 2025 at 5:34 PM Gabor Somogyi <[email protected]>
> wrote:
>
>> Hi Samrat,
>>
>> Thanks for the contribution! I've had a quick look at the code, which looks
>> promising.
>>
>> I've a couple of questions/remarks:
>> - A migration guide would be excellent from the old connectors. That way
>> users can see how much effort it is.
>> - One of the key points from operational perspective is to have a way to
>> make IOPS usage
>> configurable. As an oversimplified explanation, just to give a taste, this
>> can be kept under control in two ways and places:
>>   1. In Hadoop s3a set `fs.s3a.limit.total`
>>   2. In connector set `s3.multipart.upload.min.file.size` and
>> `s3.multipart.upload.min.part.size`
>> Do I understand it correctly that this is intended to be covered by the
>> following configs?
>>
>> | s3.upload.min.part.size | 5242880 | Minimum part size for multipart
>> uploads (5MB) |
>> | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent
>> uploads
>> per stream |
>>
>> > I am now drafting a formal benchmark plan based on these specifics and
>> will share it with this thread in the coming days for feedback.
>> Waiting for the details.
>>
>> BR,
>> G
>>
>>
>> On Wed, Nov 5, 2025 at 7:08 AM Samrat Deb <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I have a working POC for the Native S3 filesystem, which is now
>> available
>> > as a draft PR [1].
>> > The POC is functional and has been validated in a local setup with
>> Minio.
>> > It's important to note that it does not yet have complete test coverage.
>> >
>> > The immediate next step is to conduct a comprehensive benchmark to
>> compare
>> > its performance against the existing `flink-s3-fs-hadoop` and
>> > `flink-s3-fs-presto` implementations.
>> >
>> > I've had a very meaningful discussion with Piotr Nowojski about this
>> > offline. I am grateful for his detailed guidance on defining a rigorous
>> > benchmarking strategy, including specific cluster configurations, job
>> > workloads, and key metrics for evaluating both checkpoint/recovery
>> > performance and pure throughput.
>> > I am now drafting a formal benchmark plan based on these specifics and
>> will
>> > share it with this thread in the coming days for feedback.
>> >
>> > Cheers,
>> > Samrat
>> >
>> >  [1] https://github.com/apache/flink/pull/27187
>> >
>> > On Wed, Oct 29, 2025 at 9:31 PM Samrat Deb <[email protected]>
>> wrote:
>> >
>> > > Thank you, Martijn, for clarifying.
>> > > I will proceed with creating a task.
>> > >
>> > > Thanks, Mate, for the pointer on MinIO for testing.
>> > > MinIO is fine to use for testing.
>> > >
>> > >
>> > > Cheers,
>> > > Samrat
>> > >
>> > >
>> > > On Mon, 27 Oct 2025 at 11:55 PM, Mate Czagany <[email protected]>
>> > wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> Just to add to the MinIO licensing concerns: I could not see any recent
>> > >> change to the license itself. They changed the license from Apache 2.0
>> > >> to AGPL-3.0 in 2021, and the Docker image used by the tests (which is
>> > >> from 2022) already carries the AGPL-3.0 license. This should not be an
>> > >> issue, as Flink neither distributes MinIO nor makes it available over
>> > >> the network; it is only used by the tests.
>> > >>
>> > >> What's changed recently is that MinIO no longer publishes Docker images
>> > >> to the public [1], so it might be worth looking into alternative
>> > >> solutions in the future, e.g. Garage [2].
>> > >>
>> > >> Best regards,
>> > >> Mate
>> > >>
>> > >> [1]
>> https://github.com/minio/minio/issues/21647#issuecomment-3418675115
>> > >> [2] https://garagehq.deuxfleurs.fr/
>> > >>
>> > >> On Mon, Oct 27, 2025 at 5:48 PM Ferenc Csaky
>> <[email protected]
>> > >
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > Really nice to see people chime in on this thread. I agree with Martijn
>> > >> > about the development approach. There will be some iterations until we
>> > >> > can stabilize this anyway, so we can aim to get out a good enough MVP,
>> > >> > then fix issues and reach feature parity with the existing
>> > >> > implementations on the go.
>> > >> >
>> > >> > I am not a licensing expert, but AFAIK the previous images that were
>> > >> > released under the acceptable license can continue to be used. For most
>> > >> > integration tests, we use an ancient image anyway [1]. There is another
>> > >> > place where the latest image gets pulled [2]; I guess it would be good
>> > >> > to apply an explicit tag there. But AFAIK they stopped publishing to
>> > >> > Docker Hub, so I would anticipate we cannot end up pulling an image
>> > >> > with a forbidden license.
>> > >> >
>> > >> > Best,
>> > >> > Ferenc
>> > >> >
>> > >> > [1]
>> > >> >
>> > >>
>> >
>> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-test-utils-parent/flink-test-utils-junit/src/main/java/org/apache/flink/util/DockerImageVersions.java#L39
>> > >> > [2]
>> > >> >
>> > >>
>> >
>> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-end-to-end-tests/test-scripts/common_s3_minio.sh#L51
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Sunday, October 26th, 2025 at 22:05, Martijn Visser <
>> > >> > [email protected]> wrote:
>> > >> >
>> > >> > >
>> > >> > >
>> > >> > > Hi Samrat,
>> > >> > >
>> > >> > > First of all, thanks for the proposal. It's long overdue to get
>> this
>> > >> in a
>> > >> > > better state.
>> > >> > >
>> > >> > > With regards to the schemes, I would say to ship an initial
>> release
>> > >> that
>> > >> > > does not include support for s3a and s3p, and focus first on
>> getting
>> > >> this
>> > >> > > new implementation into a stable state. When that's done, as a
>> > >> follow-up,
>> > >> > > we can consider adding support for s3a and s3p on this
>> > implementation,
>> > >> > and
>> > >> > > when that's there consider deprecating the older
>> implementations. It
>> > >> will
>> > >> > > probably take multiple releases before we have this in a stable
>> > state.
>> > >> > >
>> > >> > > Not directly related to this, but given that MinIO decided to
>> change
>> > >> > their
>> > >> > > license, do we also need to refactor existing tests to not use
>> MinIO
>> > >> > > anymore but something else?
>> > >> > >
>> > >> > > Thanks,
>> > >> > >
>> > >> > > Martijn
>> > >> > >
>> > >> > > On Sat, Oct 25, 2025 at 1:38 AM Samrat Deb [email protected]
>> > >> wrote:
>> > >> > >
>> > >> > > > Hi all,
>> > >> > > >
>> > >> > > > One clarifying question regarding the URI schemes:
>> > >> > > >
>> > >> > > > Currently, the Flink ecosystem uses multiple schemes to
>> > >> differentiate
>> > >> > > > between S3 implementations: s3a:// for the Hadoop-based
>> connector
>> > >> and
>> > >> > > > s3p://[1] for the Presto-based one, which is often recommended
>> for
>> > >> > > > checkpointing.
>> > >> > > >
>> > >> > > > A key goal of the proposed flink-s3-fs-native is to unify these
>> > >> into a
>> > >> > > > single implementation. With that in mind, what should be the
>> > >> strategy
>> > >> > for
>> > >> > > > scheme support? Should the new native s3 filesystem register
>> only
>> > >> for
>> > >> > the
>> > >> > > > simple s3:// scheme, aiming to deprecate the others? Or would
>> it
>> > be
>> > >> > > > beneficial to also support s3a:// and s3p:// to provide a
>> smoother
>> > >> > > > migration path for users who may have these schemes in their
>> > >> existing
>> > >> > job
>> > >> > > > configurations?
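
The two options in this question can be sketched as a single switch. This is only an illustration under assumptions: `to_native_uri` and its `accept_legacy` flag are hypothetical names, not part of the proposal or of any Flink API.

```python
from urllib.parse import urlsplit, urlunsplit

NATIVE_SCHEME = "s3"
LEGACY_SCHEMES = {"s3a", "s3p"}  # used today by the Hadoop and Presto connectors


def to_native_uri(uri: str, accept_legacy: bool) -> str:
    """Return the URI as a native s3:// filesystem would address it.

    Hypothetical helper: with accept_legacy=False only s3:// is accepted;
    with accept_legacy=True, s3a:// and s3p:// are rewritten to s3://.
    """
    parts = urlsplit(uri)
    if parts.scheme == NATIVE_SCHEME:
        return uri
    if accept_legacy and parts.scheme in LEGACY_SCHEMES:
        return urlunsplit(parts._replace(scheme=NATIVE_SCHEME))
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")
```

Registering only `s3://` corresponds to `accept_legacy=False`: existing job configurations that use `s3a://` or `s3p://` paths would then have to be edited before switching plugins.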
>> > >> > > > Cheers,
>> > >> > > > Samrat
>> > >> > > >
>> > >> > > > [1] https://github.com/generalui/s3p
>> > >> > > >
>> > >> > > > On Wed, Oct 22, 2025 at 6:31 PM Piotr Nowojski
>> > [email protected]
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > Hi Samrat,
>> > >> > > > >
>> > >> > > > > > 1. Even if the specifics are hazy, could you recall the
>> > general
>> > >> > > > > > nature of those concerns? For instance, were they related
>> to
>> > >> S3's
>> > >> > > > > > eventual
>> > >> > > > > > consistency model, which has since improved, the atomicity
>> of
>> > >> > Multipart
>> > >> > > > > > Upload commits, or perhaps complex failure/recovery
>> scenarios
>> > >> > during
>> > >> > > > > > the
>> > >> > > > > > commit phase?
>> > >> > > > >
>> > >> > > > > and
>> > >> > > > >
>> > >> > > > > > *8. *The flink-s3-fs-presto connector explicitly throws an
>> > >> > > > > > `UnsupportedOperationException` when
>> > >> `createRecoverableWriter()` is
>> > >> > > > > > called.
>> > >> > > > > > Was this a deliberate design choice to keep the Presto
>> > connector
>> > >> > > > > > lightweight and optimized specifically for checkpointing,
>> or
>> > >> were
>> > >> > there
>> > >> > > > > > other technical challenges that prevented its
>> implementation
>> > at
>> > >> the
>> > >> > > > > > time?
>> > >> > > > > > Any context on this would be very helpful
>> > >> > > > >
>> > >> > > > > I very vaguely remember that at least one of those concerns was
>> > >> > > > > about how long it takes for S3 to make certain operations
>> > >> > > > > visible: you think you have uploaded and committed a file, but
>> > >> > > > > in reality it might not be visible for tens of seconds.
>> > >> > > > >
>> > >> > > > > Sorry, I don't remember more (or even if there was more). I
>> was
>> > >> only
>> > >> > > > > superficially involved
>> > >> > > > > in the S3 connector back then - just participated/overheard
>> some
>> > >> > > > > discussions.
>> > >> > > > >
>> > >> > > > > > 2. It's clear that implementing an efficient
>> > >> > > > > > PathsCopyingFileSystem[2] is a non-negotiable requirement for
>> > >> > > > > > performance. Are there any benchmark numbers available that
>> > >> > > > > > can be used as a reference to evaluate how the new
>> > >> > > > > > implementation deviates?
>> > >> > > > >
>> > >> > > > > I only have the numbers that I put in the original FLIP [1]. I
>> > >> > > > > don't remember the benchmark setup, but it must have been
>> > >> > > > > something simple, like letting some job accumulate 1GB of state
>> > >> > > > > and measuring how long the state downloading phase of recovery
>> > >> > > > > took.
>> > >> > > > >
>> > >> > > > > > 3. Do you recall the workload characteristics for that PoC?
>> > >> > > > > > Specifically,
>> > >> > > > > > was the 30-40% performance advantage of s5cmd observed when
>> > >> copying
>> > >> > > > > > many
>> > >> > > > > > small files (like checkpoint state) or larger,
>> multi-gigabyte
>> > >> > files?
>> > >> > > > >
>> > >> > > > > It was just a regular mix of compacted RocksDB sst files,
>> with
>> > >> total
>> > >> > > > > state
>> > >> > > > > size 1 or at most
>> > >> > > > > a couple of GBs. So most of the files were around ~64MB or
>> > ~128MB,
>> > >> > with a
>> > >> > > > > couple of
>> > >> > > > > smaller L0 files, and maybe one larger L2 file.
>> > >> > > > >
>> > >> > > > > > 4. The idea of a switchable implementation sounds great. Would
>> > >> > > > > > you envision this as a configuration flag (e.g.,
>> > >> > > > > > s3.native.copy.strategy=s5cmd or s3.native.copy.strategy=sdk)
>> > >> > > > > > that selects the backend implementation at runtime? Conversely,
>> > >> > > > > > is it worth adding configuration that exposes
>> > >> > > > > > implementation-level details?
>> > >> > > > >
>> > >> > > > > I think something like that should be fine, assuming that
>> > `s5cmd`
>> > >> > will
>> > >> > > > > again
>> > >> > > > > prove significantly faster and/or more cpu efficient. If
>> not, if
>> > >> the
>> > >> > > > > SDKv2
>> > >> > > > > has
>> > >> > > > > already improved and caught up with the `s5cmd`, then it
>> > probably
>> > >> > doesn't
>> > >> > > > > make sense to keep `s5cmd` support.
>> > >> > > > >
>> > >> > > > > > 5. My understanding is that the key takeaway here is to avoid
>> > >> > > > > > the file-by-file stream-based copy used in the vanilla
>> > >> > > > > > connector and leverage bulk operations, which
>> > >> > > > > > PathsCopyingFileSystem[2] enables. This seems most critical
>> > >> > > > > > during state download on recovery. Please let me know if my
>> > >> > > > > > inference is in the right direction.
>> > >> > > > >
>> > >> > > > > Yes, but you should also make the bulk transfer configurable:
>> > >> > > > > how many bulk transfers can happen in parallel, etc.
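
A minimal sketch of capping how many transfers run at once, assuming a plain thread pool. The names `copy_all`, `copy_one` and `max_parallel` are hypothetical and only stand in for whatever the filesystem would expose; a real implementation would be written in Java against the AWS SDK.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def copy_all(paths: Iterable[str], copy_one: Callable[[str], None], max_parallel: int) -> None:
    """Copy every path, running at most max_parallel copies at the same time.

    copy_one is a caller-supplied function that transfers a single file.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Consuming the iterator re-raises the first exception from any worker.
        for _ in pool.map(copy_one, paths):
            pass
```

Peak buffer memory then scales roughly with `max_parallel` times the bytes buffered per transfer, so the cap bounds memory as well as the number of in-flight requests.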
>> > >> > > > >
>> > >> > > > > > 6. The warning about `s5cmd` causing OOMs sounds like an
>> > >> > > > > > indication to consider an `S3TransferManager`[3]
>> > >> > > > > > implementation, which might offer more granular control over
>> > >> > > > > > buffering and in-flight requests. Do you think exploring
>> > >> > > > > > `S3TransferManager` further would be valuable?
>> > >> > > > >
>> > >> > > > > I'm pretty sure if you start hundreds of bulk transfers in
>> > >> parallel
>> > >> > via
>> > >> > > > > the
>> > >> > > > > `S3TransferManager` you can get the same problems with
>> running
>> > >> out of
>> > >> > > > > memory or exceeding available network throughput. I don't
>> know
>> > if
>> > >> > > > > `S3TransferManager` is better or worse in that regard to be
>> > >> honest.
>> > >> > > > >
>> > >> > > > > > 7. The insight on AWS aggressively dropping packets instead of
>> > >> > > > > > gracefully throttling is invaluable. Currently I have a
>> > >> > > > > > limited understanding of how AWS behaves when throttling; I
>> > >> > > > > > will dive deeper into it and come back with questions based
>> > >> > > > > > on my findings. To counter this, were you thinking of a
>> > >> > > > > > configurable rate limiter within the filesystem itself (e.g.,
>> > >> > > > > > setting max bandwidth or max concurrent requests), or
>> > >> > > > > > something more dynamic that could adapt to network conditions?
>> > >> > > > >
>> > >> > > > > Flat rate limiting is tricky because AWS offers burst network
>> > >> > > > > capacity, which comes in very handy and in the vast majority of
>> > >> > > > > cases works fine. But for some jobs, if you exceed that burst
>> > >> > > > > capacity, AWS starts dropping your packets, and then the
>> > >> > > > > problems happen. On the other hand, if you rate limit to your
>> > >> > > > > normal capacity, you are leaving a lot of network throughput
>> > >> > > > > unused during recoveries.
>> > >> > > > >
>> > >> > > > > At the same time, AWS doesn't share details about the burst
>> > >> > > > > capacity, so it's sometimes tricky to configure the whole system
>> > >> > > > > properly. I don't have a universally good answer for that :(
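
One common way to model "a sustained rate plus a burst allowance" is a token bucket. The sketch below was not proposed in this thread and is only an illustration: the class name and both parameters are hypothetical, and since AWS does not publish its burst capacity, any values passed in would be guesses that need tuning.

```python
class TokenBucket:
    """Allow bursts up to burst_bytes while refilling at rate_bytes_per_s."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float = 0.0):
        if rate_bytes_per_s <= 0 or burst_bytes <= 0:
            raise ValueError("rate and burst must be positive")
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with the full burst allowance
        self.last = now

    def try_acquire(self, nbytes: float, now: float) -> bool:
        """Consume nbytes if available at time `now` (in seconds).

        Returns True if the transfer may proceed, False if it must wait.
        """
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A recovery can spend the whole burst allowance immediately and is then held to the sustained rate until the bucket refills, which avoids capping throughput at the normal rate all the time.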
>> > >> > > > >
>> > >> > > > > Best,
>> > >> > > > > Piotrek
>> > >> > > > >
>> > >> > > > > On Tue, Oct 21, 2025 at 9:40 PM Samrat Deb [email protected] wrote:
>> > >> > > > >
>> > >> > > > > > Hi Gabor/ Ferenc
>> > >> > > > > >
>> > >> > > > > > Thank you for sharing the pointer and valuable feedback.
>> > >> > > > > >
>> > >> > > > > > The link to the custom `XmlResponsesSaxParser`[1] looks
>> scary
>> > 😦
>> > >> > > > > > and contains hidden complexity.
>> > >> > > > > >
>> > >> > > > > > 1. Could you share some context on why this custom parser was
>> > >> > > > > > necessary? Was it to work around a specific bug, a performance
>> > >> > > > > > issue, or an inconsistency in the S3 XML API responses that
>> > >> > > > > > the default AWS SDK parser couldn't handle at the time? With
>> > >> > > > > > SDK v2, which core functionality needs to be tested
>> > >> > > > > > intensively?
>> > >> > > > > >
>> > >> > > > > > 2. You mentioned it has no Hadoop dependency, which is great
>> > >> > > > > > news. For a new native S3 connector, would integration simply
>> > >> > > > > > require implementing a new S3DelegationTokenProvider/Receiver
>> > >> > > > > > pair using the AWS SDK, or are there more subtle integration
>> > >> > > > > > points with the framework that should be accounted for?
>> > >> > > > > >
>> > >> > > > > > 3. I remember solving the Serialized Throwable exception issue
>> > >> > > > > > [2], which led to a new bug [3]: the initial fix caused a
>> > >> > > > > > regression that Gabor later solved, with Ferenc providing
>> > >> > > > > > detailed root-cause insights [4] 😅.
>> > >> > > > > > It's hard to be fully sure that all scenarios are covered
>> > >> > > > > > properly. This is one example; there can be other unknowns.
>> > >> > > > > > What would be the best approach to test for and prevent such
>> > >> > > > > > regressions or unknown unknowns, especially in the most
>> > >> > > > > > sensitive parts of the filesystem logic?
>> > >> > > > > >
>> > >> > > > > > Cheers,
>> > >> > > > > > Samrat
>> > >> > > > > >
>> > >> > > > > > [1]
>> > >> > > >
>> > >> > > >
>> > >> >
>> > >>
>> >
>> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java
>> > >> > > >
>> > >> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28513
>> > >> > > > > > [3] https://github.com/apache/flink/pull/25231
>> > >> > > > > > [4]
>> > >> > https://github.com/apache/flink/pull/25231#issuecomment-2312059662
>> > >> > > > > >
>> > >> > > > > > On Tue, 21 Oct 2025 at 3:49 PM, Gabor Somogyi <
>> > >> > > > > > [email protected]
>> > >> > > > > >
>> > >> > > > > > wrote:
>> > >> > > > > >
>> > >> > > > > > > Hi Samrat,
>> > >> > > > > > >
>> > >> > > > > > > +1 on the direction that we move away from Hadoop.
>> > >> > > > > > >
>> > >> > > > > > > Replacing the two mentioned connectors with something better
>> > >> > > > > > > is a long-standing discussion.
>> > >> > > > > > > Both of them have their own weaknesses; I've fixed several
>> > >> > > > > > > blockers inside them.
>> > >> > > > > > >
>> > >> > > > > > > There is definitely magic inside them, please see this [1]
>> > >> > > > > > > for example, and there is more 🙂
>> > >> > > > > > > I think the most sensitive part is the recovery, because it
>> > >> > > > > > > is hard to test all cases.
>> > >> > > > > > >
>> > >> > > > > > > @Ferenc
>> > >> > > > > > >
>> > >> > > > > > > > One thing that comes to mind is the delegation token
>> > >> > > > > > > > framework: it will need some changes, and its involvement
>> > >> > > > > > > > in this change is not trivial. Currently it is also tied
>> > >> > > > > > > > to the Hadoop stuff and has some abstract classes in the
>> > >> > > > > > > > base S3 FS module.
>> > >> > > > > > >
>> > >> > > > > > > The delegation token framework has no dependency on Hadoop,
>> > >> > > > > > > so there is no blocker on the road, but I'm here to help if
>> > >> > > > > > > any questions come up.
>> > >> > > > > > >
>> > >> > > > > > > BR,
>> > >> > > > > > > G
>> > >> > > > > > >
>> > >> > > > > > > [1]
>> > >> > > >
>> > >> > > >
>> > >> >
>> > >>
>> >
>> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java#L95-L104
>> > >> > > >
>> > >> > > > > > > On Tue, Oct 14, 2025 at 8:19 PM Samrat Deb
>> > >> [email protected]
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Hi All,
>> > >> > > > > > > >
>> > >> > > > > > > > Poorvank (cc'ed) and I are writing to start a
>> discussion
>> > >> about
>> > >> > a
>> > >> > > > > > > > potential
>> > >> > > > > > > > improvement for Flink, creating a new, native S3
>> > filesystem
>> > >> > > > > > > > independent
>> > >> > > > > > > > of
>> > >> > > > > > > > Hadoop/Presto.
>> > >> > > > > > > >
>> > >> > > > > > > > The goal of this proposal is to address several
>> challenges
>> > >> > related
>> > >> > > > > > > > to
>> > >> > > > > > > > Flink's S3 integration, simplifying
>> flink-s3-filesystem.
>> > If
>> > >> > this
>> > >> > > > > > > > discussion
>> > >> > > > > > > > gains positive traction, the next step would be to move
>> > >> forward
>> > >> > > > > > > > with
>> > >> > > > > > > > a
>> > >> > > > > > > > formalised FLIP.
>> > >> > > > > > > >
>> > >> > > > > > > > The Challenges with the Current S3 Connectors
>> > >> > > > > > > > Currently, Flink offers two primary S3 filesystems,
>> > >> > > > > > > > flink-s3-fs-hadoop[1] and flink-s3-fs-presto[2]. While
>> > >> > > > > > > > functional, this dual-connector approach has a few issues:
>> > >> > > > > > > >
>> > >> > > > > > > > 1. The flink-s3-fs-hadoop connector adds an additional
>> > >> > > > > > > > dependency to manage. Upgrades such as AWS SDK v2 must
>> > >> > > > > > > > first be supported by Hadoop/Presto before
>> > >> > > > > > > > flink-s3-filesystem can leverage them, which sometimes
>> > >> > > > > > > > makes it hard to use features directly from the AWS SDK.
>> > >> > > > > > > >
>> > >> > > > > > > > 2. The flink-s3-fs-presto connector was introduced to
>> > >> > > > > > > > mitigate the performance issues of the Hadoop connector,
>> > >> > > > > > > > especially for checkpointing.
>> > >> > > > > > > > However, it lacks a RecoverableWriter implementation.
>> > >> > > > > > > > Having two connectors with different capabilities is
>> > >> > > > > > > > sometimes confusing for Flink users, highlighting the need
>> > >> > > > > > > > for a single, unified solution.
>> > >> > > > > > > >
>> > >> > > > > > > > Proposed Solution:
>> > >> > > > > > > > A Native, Hadoop-Free S3 Filesystem
>> > >> > > > > > > >
>> > >> > > > > > > > I propose we develop a new filesystem, let's call it
>> > >> > > > > > > > flink-s3-fs-native, built directly on the modern AWS SDK
>> > >> > > > > > > > for Java v2. This approach would be free of any Hadoop or
>> > >> > > > > > > > Presto dependencies. I have built a small prototype to
>> > >> > > > > > > > validate this [3].
>> > >> > > > > > > >
>> > >> > > > > > > > This is motivated by trino<>s3 [4]. The Trino project
>> > >> > successfully
>> > >> > > > > > > > undertook a similar migration, moving from Hadoop-based
>> > >> object
>> > >> > > > > > > > storage
>> > >> > > > > > > > clients to their own native implementations.
>> > >> > > > > > > >
>> > >> > > > > > > > The new Flink S3 filesystem would:
>> > >> > > > > > > >
>> > >> > > > > > > > 1. Provide a single, unified connector for all S3
>> > >> interactions,
>> > >> > > > > > > > from
>> > >> > > > > > > > state
>> > >> > > > > > > > backends to sinks.
>> > >> > > > > > > >
>> > >> > > > > > > > 2. Implement a high-performance S3RecoverableWriter
>> using
>> > >> S3's
>> > >> > > > > > > > Multipart
>> > >> > > > > > > > Upload feature, ensuring exactly-once sink semantics.
>> > >> > > > > > > >
>> > >> > > > > > > > 3. Offer a clean, self-contained dependency,
>> drastically
>> > >> > > > > > > > simplifying
>> > >> > > > > > > > setup
>> > >> > > > > > > > and eliminating external dependencies.
>> > >> > > > > > > >
>> > >> > > > > > > > A Phased Migration Path
>> > >> > > > > > > > To ensure a smooth transition, we could adopt a phased
>> > >> > > > > > > > approach, at a very high level:
>> > >> > > > > > > >
>> > >> > > > > > > > Phase 1:
>> > >> > > > > > > > Introduce the new native S3 filesystem as an optional,
>> > >> parallel
>> > >> > > > > > > > plugin.
>> > >> > > > > > > > This would allow for community testing and adoption
>> > without
>> > >> > > > > > > > breaking
>> > >> > > > > > > > existing setups.
>> > >> > > > > > > >
>> > >> > > > > > > > Phase 2:
>> > >> > > > > > > > Once the native connector achieves feature parity and
>> > proven
>> > >> > > > > > > > stability,
>> > >> > > > > > > > we
>> > >> > > > > > > > will update the documentation to recommend it as the
>> > default
>> > >> > choice
>> > >> > > > > > > > for
>> > >> > > > > > > > all
>> > >> > > > > > > > S3 use cases.
>> > >> > > > > > > >
>> > >> > > > > > > > Phase 3:
>> > >> > > > > > > > In a future major release, the legacy
>> flink-s3-fs-hadoop
>> > and
>> > >> > > > > > > > flink-s3-fs-presto connectors could be formally
>> > deprecated,
>> > >> > with
>> > >> > > > > > > > clear
>> > >> > > > > > > > migration guides provided for users.
>> > >> > > > > > > >
>> > >> > > > > > > > I would love to hear the community's thoughts on this.
>> > >> > > > > > > >
>> > >> > > > > > > > A few questions to start the discussion:
>> > >> > > > > > > >
>> > >> > > > > > > > 1. What are the biggest pain points with the current S3
>> > >> > filesystem?
>> > >> > > > > > > >
>> > >> > > > > > > > 2. Are there any critical features from the Hadoop S3A
>> > >> client
>> > >> > that
>> > >> > > > > > > > are
>> > >> > > > > > > > essential to replicate in a native implementation?
>> > >> > > > > > > >
>> > >> > > > > > > > 3. Would a simplified, non-dependent S3 experience be a
>> > >> > valuable
>> > >> > > > > > > > improvement for Flink use cases?
>> > >> > > > > > > >
>> > >> > > > > > > > Cheers,
>> > >> > > > > > > > Samrat
>> > >> > > > > > > >
>> > >> > > > > > > > [1]
>> > >> > > >
>> > >> > > >
>> > >> >
>> > >>
>> >
>> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-hadoop
>> > >> > > >
>> > >> > > > > > > > [2]
>> > >> > > >
>> > >> > > >
>> > >> >
>> > >>
>> >
>> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-presto
>> > >> > > >
>> > >> > > > > > > > [3] https://github.com/Samrat002/flink/pull/4
>> > >> > > > > > > > [4]
>> > >> > > > > > > >
>> > >> >
>> https://github.com/trinodb/trino/tree/master/lib/trino-filesystem-s3
>> > >> >
>> > >>
>> > >
>> >
>>
>
