Hi,

I benchmarked state checkpointing to S3, comparing the proposed native S3 implementation with flink-s3-fs-presto. The results are promising: the native implementation performs better under the setup used. PTAL at the benchmark document for the detailed analysis, logs, and setup details [1].

As a next step, FLIP-555 [2] is out for review. PTAL.

Cheers,
Samrat

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-555%3A+Flink+Native+S3+FileSystem

On Wed, Nov 12, 2025 at 1:21 AM Samrat Deb <[email protected]> wrote:

> Hi Gabor,
>
> Apologies for the delayed response.
>
> > - A migration guide would be excellent from the old connectors. That way users can see how much effort it is.
>
> Yes, that’s one of the key aspects. I’ve tested the patch on S3. The configuration remains exactly the same. The only change required is to place the new `flink-s3-fs-native` JAR in the `plugins` directory and remove the `flink-s3-fs-hadoop` JAR from there.
> I haven’t documented a detailed design or migration plan yet. I’m waiting for the first round of benchmark and comparison test results.
>
> > - One of the key points from operational perspective is to have a way to make IOPS usage configurable. As an oversimplified explanation just to get a taste this can be kept under control in 2 ways and places:
> > 1. In Hadoop s3a set `fs.s3a.limit.total`
> > 2. In connector set `s3.multipart.upload.min.file.size` and `s3.multipart.upload.min.part.size`
> > Do I understand it correctly that this is intended to be covered by the following configs?
> >
> > | s3.upload.min.part.size | 5242880 | Minimum part size for multipart uploads (5MB) |
> > | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent uploads per stream |
>
> Yes, the POC patch currently includes three configurations[1]:
> 1. `s3.upload.min.part.size`
> 2. `s3.upload.max.concurrent.uploads`
> 3. `s3.read.buffer.size`
>
> The idea is to start by supporting configurable IOPS through these parameters.
> Do you think these minimal configs are sufficient to begin with?
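
As a rough illustration of why the part size and concurrency settings quoted above matter for IOPS: with S3 multipart upload, the part size determines how many UploadPart requests a file generates, and the concurrency cap bounds how many of them are in flight at once. The sketch below is plain arithmetic and not code from the PR. The helper name `plan_multipart_upload`, the default of 4 concurrent uploads, and the assumption that files no larger than one part go out as a single PutObject are all hypothetical; only the 5242880-byte default comes from the table above.

```python
import math

# Default quoted in the thread for s3.upload.min.part.size (5 MiB).
MIN_PART_SIZE = 5 * 1024 * 1024


def plan_multipart_upload(file_size, part_size=MIN_PART_SIZE, max_concurrent=4):
    """Estimate S3 requests for one file (hypothetical helper, not Flink API)."""
    if file_size < 0 or part_size <= 0 or max_concurrent <= 0:
        raise ValueError("file_size must be >= 0; part_size and max_concurrent must be > 0")
    if file_size <= part_size:
        # Assumption: small objects are written with a single PutObject.
        return {"parts": 1, "requests": 1, "waves": 1}
    parts = math.ceil(file_size / part_size)
    return {
        "parts": parts,
        # CreateMultipartUpload + one UploadPart per part + CompleteMultipartUpload
        "requests": parts + 2,
        # Number of sequential batches when at most max_concurrent parts upload at once
        "waves": math.ceil(parts / max_concurrent),
    }
```

For a 64 MiB file this gives 13 parts (15 requests) at the 5 MiB default and 4 parts (6 requests) at a 16 MiB part size, so a larger part size trades buffer memory for fewer requests.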
>
> > > I am now drafting a formal benchmark plan based on these specifics and will share it with this thread in the coming days for feedback.
> > Waiting for the details.
>
> Still waiting for my employer to approve resources for the purpose 😅
>
> Cheers,
> Samrat
>
> [1] https://github.com/apache/flink/pull/27187/files#diff-f1e31c70c03cb943bc0e62fe456ca8d0b6bb63ae56c062d68f54ce2806b43f45R38
>
> On Wed, Nov 5, 2025 at 5:34 PM Gabor Somogyi <[email protected]> wrote:
>
>> Hi Samrat,
>>
>> Thanks for the contribution! I've had a slight look at the code which is promising.
>>
>> I've a couple of questions/remarks:
>> - A migration guide would be excellent from the old connectors. That way users can see how much effort it is.
>> - One of the key points from operational perspective is to have a way to make IOPS usage configurable. As an oversimplified explanation just to get a taste this can be kept under control in 2 ways and places:
>> 1. In Hadoop s3a set `fs.s3a.limit.total`
>> 2. In connector set `s3.multipart.upload.min.file.size` and `s3.multipart.upload.min.part.size`
>> Do I understand it correctly that this is intended to be covered by the following configs?
>>
>> | s3.upload.min.part.size | 5242880 | Minimum part size for multipart uploads (5MB) |
>> | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent uploads per stream |
>>
>> > I am now drafting a formal benchmark plan based on these specifics and will share it with this thread in the coming days for feedback.
>> Waiting for the details.
>>
>> BR,
>> G
>>
>> On Wed, Nov 5, 2025 at 7:08 AM Samrat Deb <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I have a working POC for the Native S3 filesystem, which is now available as a draft PR [1].
>> > The POC is functional and has been validated in a local setup with Minio.
>> > It's important to note that it does not yet have complete test coverage.
>> >
>> > The immediate next step is to conduct a comprehensive benchmark to compare its performance against the existing `flink-s3-fs-hadoop` and `flink-s3-fs-presto` implementations.
>> >
>> > I've had a very meaningful discussion with Piotr Nowojski about this offline. I am grateful for his detailed guidance on defining a rigorous benchmarking strategy, including specific cluster configurations, job workloads, and key metrics for evaluating both checkpoint/recovery performance and pure throughput.
>> > I am now drafting a formal benchmark plan based on these specifics and will share it with this thread in the coming days for feedback.
>> >
>> > Cheers,
>> > Samrat
>> >
>> > [1] https://github.com/apache/flink/pull/27187
>> >
>> > On Wed, Oct 29, 2025 at 9:31 PM Samrat Deb <[email protected]> wrote:
>> >
>> > > Thank you, Martijn, for clarifying.
>> > > I will proceed with creating a task.
>> > >
>> > > Thanks, Mate, for the pointer to MinIO for testing.
>> > > MinIO is good to use for testing.
>> > >
>> > > Cheers,
>> > > Samrat
>> > >
>> > > On Mon, 27 Oct 2025 at 11:55 PM, Mate Czagany <[email protected]> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> Just to add to the MinIO licensing concerns, I could not see any recent change to the license itself. They changed the license from Apache 2.0 to AGPL-3.0 in 2021, and the Docker image used by the tests (which is from 2022) already contains the AGPL-3.0 license. This should not be an issue, as Flink neither distributes MinIO nor makes it available over the network; it's only used by the tests.
>> > >>
>> > >> What's changed recently is that MinIO no longer publishes Docker images to the public [1], so it might be worth it to look into using alternative solutions in the future, e.g. Garage [2].
>> > >>
>> > >> Best regards,
>> > >> Mate
>> > >>
>> > >> [1] https://github.com/minio/minio/issues/21647#issuecomment-3418675115
>> > >> [2] https://garagehq.deuxfleurs.fr/
>> > >>
>> > >> On Mon, Oct 27, 2025 at 5:48 PM Ferenc Csaky <[email protected]> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > Really nice to see people chime into this thread. I agree with Martijn about the development approach. There will be some iterations until we can stabilize this anyway, so we can try to shoot for getting out a good enough MVP, then fix issues + reach feature parity with the existing implementations on the go.
>> > >> >
>> > >> > I am not a licensing expert, but AFAIK the previous images that were released under the acceptable license can continue to be used. For most integration tests, we use an ancient image anyway [1]. There is another place where the latest image gets pulled [2]; I guess it would be good to apply an explicit tag there. But AFAIK they stopped publishing to Docker Hub, so I would anticipate we cannot end up pulling an image with a forbidden license.
>> > >> >
>> > >> > Best,
>> > >> > Ferenc
>> > >> >
>> > >> > [1] https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-test-utils-parent/flink-test-utils-junit/src/main/java/org/apache/flink/util/DockerImageVersions.java#L39
>> > >> > [2] https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-end-to-end-tests/test-scripts/common_s3_minio.sh#L51
>> > >> >
>> > >> > On Sunday, October 26th, 2025 at 22:05, Martijn Visser <[email protected]> wrote:
>> > >> >
>> > >> > > Hi Samrat,
>> > >> > >
>> > >> > > First of all, thanks for the proposal. It's long overdue to get this in a better state.
>> > >> > >
>> > >> > > With regards to the schemes, I would say to ship an initial release that does not include support for s3a and s3p, and focus first on getting this new implementation into a stable state. When that's done, as a follow-up, we can consider adding support for s3a and s3p on this implementation, and when that's there consider deprecating the older implementations. It will probably take multiple releases before we have this in a stable state.
>> > >> > >
>> > >> > > Not directly related to this, but given that MinIO decided to change their license, do we also need to refactor existing tests to not use MinIO anymore but something else?
>> > >> > >
>> > >> > > Thanks,
>> > >> > >
>> > >> > > Martijn
>> > >> > >
>> > >> > > On Sat, Oct 25, 2025 at 1:38 AM Samrat Deb [email protected] wrote:
>> > >> > >
>> > >> > > > Hi all,
>> > >> > > >
>> > >> > > > One clarifying question regarding the URI schemes:
>> > >> > > >
>> > >> > > > Currently, the Flink ecosystem uses multiple schemes to differentiate between S3 implementations: s3a:// for the Hadoop-based connector and s3p://[1] for the Presto-based one, which is often recommended for checkpointing.
>> > >> > > >
>> > >> > > > A key goal of the proposed flink-s3-fs-native is to unify these into a single implementation. With that in mind, what should be the strategy for scheme support? Should the new native s3 filesystem register only for the simple s3:// scheme, aiming to deprecate the others? Or would it be beneficial to also support s3a:// and s3p:// to provide a smoother migration path for users who may have these schemes in their existing job configurations?
>> > >> > > > Cheers,
>> > >> > > > Samrat
>> > >> > > >
>> > >> > > > [1] https://github.com/generalui/s3p
>> > >> > > >
>> > >> > > > On Wed, Oct 22, 2025 at 6:31 PM Piotr Nowojski [email protected] wrote:
>> > >> > > >
>> > >> > > > > Hi Samrat,
>> > >> > > > >
>> > >> > > > > > 1. Even if the specifics are hazy, could you recall the general nature of those concerns? For instance, were they related to S3's eventual consistency model, which has since improved, the atomicity of Multipart Upload commits, or perhaps complex failure/recovery scenarios during the commit phase?
>> > >> > > > >
>> > >> > > > > and
>> > >> > > > >
>> > >> > > > > > 8. The flink-s3-fs-presto connector explicitly throws an `UnsupportedOperationException` when `createRecoverableWriter()` is called. Was this a deliberate design choice to keep the Presto connector lightweight and optimized specifically for checkpointing, or were there other technical challenges that prevented its implementation at the time? Any context on this would be very helpful.
>> > >> > > > >
>> > >> > > > > I very vaguely remember that at least one of those concerns was about how long it takes for S3 to make certain operations visible: you think you have uploaded and committed a file, but in reality it might not be visible for tens of seconds.
>> > >> > > > >
>> > >> > > > > Sorry, I don't remember more (or even if there was more). I was only superficially involved in the S3 connector back then - just participated/overheard some discussions.
>> > >> > > > >
>> > >> > > > > > 2. It's clear that implementing an efficient PathsCopyingFileSystem[2] is a non-negotiable requirement for performance. Are there any benchmark numbers available that can be used as a reference to evaluate how the new implementation deviates?
>> > >> > > > >
>> > >> > > > > I only have the numbers that I put in the original FLIP [1]. I don't remember the benchmark setup, but it must have been something simple.
>> > >> > > > > Like just let some job accumulate 1GB of state and measure how long the state downloading phase of recovery was taking.
>> > >> > > > >
>> > >> > > > > > 3. Do you recall the workload characteristics for that PoC? Specifically, was the 30-40% performance advantage of s5cmd observed when copying many small files (like checkpoint state) or larger, multi-gigabyte files?
>> > >> > > > >
>> > >> > > > > It was just a regular mix of compacted RocksDB sst files, with a total state size of 1 GB, or at most a couple of GBs. So most of the files were around ~64MB or ~128MB, with a couple of smaller L0 files, and maybe one larger L2 file.
>> > >> > > > >
>> > >> > > > > > 4. The idea of a switchable implementation sounds great. Would you envision this as a configuration flag (e.g., s3.native.copy.strategy=s5cmd or s3.native.copy.strategy=sdk) that selects the backend implementation at runtime? Conversely, is it worth adding configuration that exposes some implementation-level information?
>> > >> > > > >
>> > >> > > > > I think something like that should be fine, assuming that `s5cmd` will again prove significantly faster and/or more CPU efficient. If not, if the SDKv2 has already improved and caught up with `s5cmd`, then it probably doesn't make sense to keep `s5cmd` support.
>> > >> > > > >
>> > >> > > > > > 5. My understanding is that the key takeaway here is to avoid the file-by-file stream-based copy used in the vanilla connector and leverage bulk operations, which PathsCopyingFileSystem[2] enables. This seems most critical during state download on recovery. Please suggest if my inference is in the right direction.
>> > >> > > > >
>> > >> > > > > Yes, but you should also make the bulk transfer configurable: how many bulk transfers can be happening in parallel, etc.
>> > >> > > > >
>> > >> > > > > > 6. The warning about `s5cmd` causing OOMs sounds like an indication to consider an `S3TransferManager`[3] implementation, which might offer more granular control over buffering and in-flight requests. Do you think exploring `S3TransferManager` further would be valuable?
>> > >> > > > >
>> > >> > > > > I'm pretty sure if you start hundreds of bulk transfers in parallel via the `S3TransferManager` you can get the same problems with running out of memory or exceeding available network throughput. I don't know if `S3TransferManager` is better or worse in that regard to be honest.
>> > >> > > > >
>> > >> > > > > > 7. The insight on AWS aggressively dropping packets instead of gracefully throttling is invaluable. Currently I have a limited understanding of how AWS behaves when throttling. I will dive deeper into it and come back for clarification based on my findings or doubts. To counter this, were you thinking of a configurable rate limiter within the filesystem itself (e.g., setting max bandwidth or max concurrent requests), or something more dynamic that could adapt to network conditions?
>> > >> > > > >
>> > >> > > > > Flat rate limiting is tricky because AWS offers burst network capacity, which comes in very handy, and in the vast majority of cases works fine. But for some jobs, if you exceed that burst capacity, AWS starts dropping your packets and then the problems happen. On the other hand, if you rate limit to your normal capacity, you are leaving a lot of network throughput unused during recoveries.
>> > >> > > > >
>> > >> > > > > At the same time AWS doesn't share details for the burst capacity, so it's sometimes tricky to configure the whole system properly. I don't have a universal good answer for that :(
>> > >> > > > >
>> > >> > > > > Best,
>> > >> > > > > Piotrek
>> > >> > > > >
>> > >> > > > > On Tue, 21 Oct 2025 at 21:40, Samrat Deb [email protected] wrote:
>> > >> > > > >
>> > >> > > > > > Hi Gabor/Ferenc,
>> > >> > > > > >
>> > >> > > > > > Thank you for sharing the pointer and valuable feedback.
>> > >> > > > > >
>> > >> > > > > > The link to the custom `XmlResponsesSaxParser`[1] looks scary 😦 and contains hidden complexity.
>> > >> > > > > >
>> > >> > > > > > 1. Could you share some context on why this custom parser was necessary?
>> > >> > > > > > Was it to work around a specific bug, a performance issue, or an inconsistency in the S3 XML API responses that the default AWS SDK parser couldn't handle at the time? With SDK v2, which core functionality needs to be tested intensively?
>> > >> > > > > >
>> > >> > > > > > 2. You mentioned it has no Hadoop dependency, which is great news. For a new native S3 connector, would integration simply require implementing a new S3DelegationTokenProvider/Receiver pair using the AWS SDK, or are there more subtle integration points with the framework that should be accounted for?
>> > >> > > > > >
>> > >> > > > > > 3. I remember solving the Serialized Throwable exception issue [2], which led to a new bug [3]: an initial fix led to a regression that Gabor later solved, with Ferenc providing detailed root-cause insights [4] 😅. It's hard to be fully sure that all scenarios are covered properly. This is one example; there can be other unknowns. What would be the best approach to test for and prevent such regressions or unknown unknowns, especially in the most sensitive parts of the filesystem logic?
>> > >> > > > > >
>> > >> > > > > > Cheers,
>> > >> > > > > > Samrat
>> > >> > > > > >
>> > >> > > > > > [1] https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java
>> > >> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28513
>> > >> > > > > > [3] https://github.com/apache/flink/pull/25231
>> > >> > > > > > [4] https://github.com/apache/flink/pull/25231#issuecomment-2312059662
>> > >> > > > > >
>> > >> > > > > > On Tue, 21 Oct 2025 at 3:49 PM, Gabor Somogyi <[email protected]> wrote:
>> > >> > > > > >
>> > >> > > > > > > Hi Samrat,
>> > >> > > > > > >
>> > >> > > > > > > +1 on the direction that we move away from Hadoop.
>> > >> > > > > > >
>> > >> > > > > > > This is a long-standing discussion to replace the mentioned 2 connectors with something better.
>> > >> > > > > > > Both of them have their own weaknesses; I've fixed several blockers inside them.
>> > >> > > > > > >
>> > >> > > > > > > There is definitely magic inside them, please see this [1] for example, and there is more 🙂
>> > >> > > > > > > I think the most sensitive part is the recovery because it is hard to test all cases.
>> > >> > > > > > >
>> > >> > > > > > > @Ferenc
>> > >> > > > > > >
>> > >> > > > > > > > One thing that comes to my mind that will need some changes and its involvement to this change is not trivial is the delegation token framework.
>> > >> > > > > > > > Currently it is also tied to the Hadoop stuff and has some abstract classes in the base S3 FS module.
>> > >> > > > > > >
>> > >> > > > > > > The delegation token framework has no dependency on Hadoop, so there is no blocker on the road, but I'm here to help if any question appears.
>> > >> > > > > > >
>> > >> > > > > > > BR,
>> > >> > > > > > > G
>> > >> > > > > > >
>> > >> > > > > > > [1] https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java#L95-L104
>> > >> > > > > > >
>> > >> > > > > > > On Tue, Oct 14, 2025 at 8:19 PM Samrat Deb [email protected] wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Hi All,
>> > >> > > > > > > >
>> > >> > > > > > > > Poorvank (cc'ed) and I are writing to start a discussion about a potential improvement for Flink, creating a new, native S3 filesystem independent of Hadoop/Presto.
>> > >> > > > > > > >
>> > >> > > > > > > > The goal of this proposal is to address several challenges related to Flink's S3 integration, simplifying flink-s3-filesystem. If this discussion gains positive traction, the next step would be to move forward with a formalised FLIP.
>> > >> > > > > > > >
>> > >> > > > > > > > The Challenges with the Current S3 Connectors
>> > >> > > > > > > > Currently, Flink offers two primary S3 filesystems, flink-s3-fs-hadoop[1] and flink-s3-fs-presto[2]. While functional, this dual-connector approach has a few issues:
>> > >> > > > > > > >
>> > >> > > > > > > > 1. The flink-s3-fs-hadoop connector adds an additional dependency to manage. Upgrades such as AWS SDK v2 depend on Hadoop/Presto supporting them first before flink-s3-filesystem can leverage them. This sometimes restricts using features directly from the AWS SDK.
>> > >> > > > > > > >
>> > >> > > > > > > > 2. The flink-s3-fs-presto connector was introduced to mitigate the performance issues of the Hadoop connector, especially for checkpointing. However, it lacks a RecoverableWriter implementation.
>> > >> > > > > > > > Having two connectors is sometimes confusing for Flink users, which highlights the need for a single, unified solution.
>> > >> > > > > > > >
>> > >> > > > > > > > Proposed Solution:
>> > >> > > > > > > > A Native, Hadoop-Free S3 Filesystem
>> > >> > > > > > > >
>> > >> > > > > > > > I propose we develop a new filesystem, let's call it flink-s3-fs-native, built directly on the modern AWS SDK for Java v2. This approach would be free of any Hadoop or Presto dependencies.
>> > >> > > > > > > > I have done a small prototype to validate this [3].
>> > >> > > > > > > >
>> > >> > > > > > > > This is motivated by trino<>s3 [4]. The Trino project successfully undertook a similar migration, moving from Hadoop-based object storage clients to their own native implementations.
>> > >> > > > > > > >
>> > >> > > > > > > > The new Flink S3 filesystem would:
>> > >> > > > > > > >
>> > >> > > > > > > > 1. Provide a single, unified connector for all S3 interactions, from state backends to sinks.
>> > >> > > > > > > >
>> > >> > > > > > > > 2. Implement a high-performance S3RecoverableWriter using S3's Multipart Upload feature, ensuring exactly-once sink semantics.
>> > >> > > > > > > >
>> > >> > > > > > > > 3. Offer a clean, self-contained dependency, drastically simplifying setup and eliminating external dependencies.
>> > >> > > > > > > >
>> > >> > > > > > > > A Phased Migration Path
>> > >> > > > > > > > To ensure a smooth transition, we could adopt a phased approach at a very high level:
>> > >> > > > > > > >
>> > >> > > > > > > > Phase 1:
>> > >> > > > > > > > Introduce the new native S3 filesystem as an optional, parallel plugin. This would allow for community testing and adoption without breaking existing setups.
>> > >> > > > > > > >
>> > >> > > > > > > > Phase 2:
>> > >> > > > > > > > Once the native connector achieves feature parity and proven stability, we will update the documentation to recommend it as the default choice for all S3 use cases.
>> > >> > > > > > > >
>> > >> > > > > > > > Phase 3:
>> > >> > > > > > > > In a future major release, the legacy flink-s3-fs-hadoop and flink-s3-fs-presto connectors could be formally deprecated, with clear migration guides provided for users.
>> > >> > > > > > > >
>> > >> > > > > > > > I would love to hear the community's thoughts on this.
>> > >> > > > > > > >
>> > >> > > > > > > > A few questions to start the discussion:
>> > >> > > > > > > >
>> > >> > > > > > > > 1. What are the biggest pain points with the current S3 filesystem?
>> > >> > > > > > > >
>> > >> > > > > > > > 2. Are there any critical features from the Hadoop S3A client that are essential to replicate in a native implementation?
>> > >> > > > > > > >
>> > >> > > > > > > > 3. Would a simplified, non-dependent S3 experience be a valuable improvement for Flink use cases?
>> > >> > > > > > > >
>> > >> > > > > > > > Cheers,
>> > >> > > > > > > > Samrat
>> > >> > > > > > > >
>> > >> > > > > > > > [1] https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-hadoop
>> > >> > > > > > > > [2] https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-presto
>> > >> > > > > > > > [3] https://github.com/Samrat002/flink/pull/4
>> > >> > > > > > > > [4] https://github.com/trinodb/trino/tree/master/lib/trino-filesystem-s3
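
Piotr's point earlier in the thread about flat rate limiting versus burst capacity can be made concrete with a token bucket, which allows short bursts above the sustained rate and throttles once the burst budget is spent. This is only a sketch of the general technique, not something proposed in the thread or present in the PR. The class name, parameters, and numbers are hypothetical, and since AWS does not publish its burst rules the `rate` and `burst` values would have to be tuned per deployment. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Sustained `rate` bytes/s with up to `burst` bytes of headroom (illustrative only)."""

    def __init__(self, rate, burst, now=0.0):
        if rate <= 0 or burst <= 0:
            raise ValueError("rate and burst must be positive")
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full burst budget
        self.last = float(now)

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)

    def try_acquire(self, nbytes, now):
        """Spend tokens and return True if `nbytes` may be sent at time `now`."""
        self._refill(now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

    def delay_until_available(self, nbytes, now):
        """Seconds to wait before `nbytes` fits; larger-than-burst requests never fit."""
        if nbytes > self.burst:
            raise ValueError("request exceeds burst capacity")
        self._refill(now)
        return max(0.0, (nbytes - self.tokens) / self.rate)
```

With `rate=100` and `burst=500`, a recovery can send 500 bytes immediately, after which further transfers are paced at 100 bytes per second until the bucket refills. The trade-off Piotr describes remains: a small `burst` leaves throughput unused, a large one risks exceeding what the network tolerates.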
