gaborgsomogyi commented on code in PR #27937:
URL: https://github.com/apache/flink/pull/27937#discussion_r3098868117
##########
docs/content/docs/deployment/filesystems/s3.md:
##########
@@ -64,94 +64,208 @@ env.configure(config);
Note that these examples are *not* exhaustive and you can use S3 in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI (unless otherwise stated).
-For most use cases, you may use one of our `flink-s3-fs-hadoop` and
`flink-s3-fs-presto` S3 filesystem plugins which are self-contained and easy to
set up.
-For some cases, however, e.g., for using S3 as YARN's resource storage dir, it
may be necessary to set up a specific Hadoop S3 filesystem implementation.
+## S3 FileSystem Implementations
-### Hadoop/Presto S3 File Systems plugins
+Flink provides three independent S3 filesystem implementations:
-{{< hint info >}}
-You don't have to configure this manually if you are running [Flink on
EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html).
-{{< /hint >}}
+| Implementation | Checkpointing | FileSink | Notes |
+|---------------|:---:|:---:|-------|
+| **Native S3** (`flink-s3-fs-native`) | ✓ | ✓ | **Experimental** in Flink
2.3. Built on AWS SDK v2; no Hadoop dependency. |
+| **Presto S3** (`flink-s3-fs-presto`) | ✓ | x | Production-proven for
checkpointing. |
+| **Hadoop S3** (`flink-s3-fs-hadoop`) | ✓ | ✓ | Mature; the only stable
implementation that provides `RecoverableWriter` for the FileSink. |
-Flink provides two file systems to talk to Amazon S3, `flink-s3-fs-presto` and
`flink-s3-fs-hadoop`.
-Both implementations are self-contained with no dependency footprint, so there
is no need to add Hadoop to the classpath to use them.
+Previously, users had to choose between Presto (recommended for checkpointing
throughput) and Hadoop (the only implementation with `RecoverableWriter`,
required by the [FileSink]({{< ref "docs/connectors/datastream/filesystem"
>}})). The Native S3 implementation unifies both capabilities in a single
plugin and
[benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396)
show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to
Presto at state sizes up to 15 GB.
Review Comment:
Maybe we can be less exact about the numbers, since they can change quickly
as we add new features or when users have different use cases. We can mention
that measurements show a performance gain and stability without giving exact
details. This doc is meant to sell the feature and explain how to use it. If
users are interested in the details (the majority aren't), they can find them
on their own without links.