Re: [PR] [FLINK-39118] Add documentation for Native s3 FileSystem [flink]

via GitHub Wed, 15 Apr 2026 00:54:32 -0700


gaborgsomogyi commented on code in PR #27841:
URL: https://github.com/apache/flink/pull/27841#discussion_r3084816648



##########
docs/content.zh/docs/deployment/filesystems/s3.md:
##########
@@ -50,127 +48,319 @@ env.fromSource(
     "s3-input"
 );
 
-// 写入 S3 bucket
+// Write to S3 bucket
 stream.sinkTo(
-    FileSink.forRowFormat(
-        new Path("s3://<bucket>/<endpoint>"), new SimpleStringEncoder<>()
-    ).build()
+        FileSink.forRowFormat(
+            new Path("s3://<bucket>/<endpoint>"), new SimpleStringEncoder<>()
+        ).build()
 );
 
-
-// 使用 S3 作为 checkpoint storage
+// Use S3 as checkpoint storage
 Configuration config = new Configuration();
 config.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
 config.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://<your-bucket>/<endpoint>");
 env.configure(config);
 ```
 
-注意这些例子并*不详尽*，S3 同样可以用在其他场景，包括 [JobManager 高可用配置]({{< ref "docs/deployment/ha/overview" >}}) 或 [RocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend)，以及所有 Flink 需要使用文件系统 URI 的位置。
+Note that these examples are *not* exhaustive and you can use S3 in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI (unless otherwise stated).
+
+## S3 FileSystem Implementations
+
+Flink provides three independent S3 filesystem implementations, each with different trade-offs:
+
+- **Native S3 FileSystem** (`flink-s3-fs-native`): Built directly on AWS SDK v2 with async I/O and parallel transfers removing the dependency from hadoop entirely. This implementation supports both checkpointing and the FileSystem sink. The Native S3 FileSystem aims to provide integrated support for checkpointing as well as FileSystem sink, removing the need to use Presto S3 FileSystem for checkpointing and Hadoop S3 FileSystem for the FileSystem sink. [Benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396) show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to the Presto implementation at state sizes up to 15 GB. **Experimental** in Flink 2.3.
+- **Presto S3 FileSystem** (`flink-s3-fs-presto`): Based on Presto project code, recommended for checkpointing.
+- **Hadoop S3 FileSystem** (`flink-s3-fs-hadoop`): Based on Hadoop project code, has FileSystem sink support.
+
+All three are self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use them.
+
+## Common Configuration
 
-在大部分使用场景下，可使用 `flink-s3-fs-hadoop` 或 `flink-s3-fs-presto` 两个独立且易于设置的 S3 文件系统插件。然而在某些情况下，例如使用 S3 作为 YARN 的资源存储目录时，可能需要配置 Hadoop S3 文件系统。
+### Configure Access Credentials
 
-### Hadoop/Presto S3 文件系统插件
+After setting up the S3 FileSystem implementation, you need to make sure that Flink is allowed to access your S3 buckets.
+
+#### Identity and Access Management (IAM) (Recommended)
+
+The recommended way of setting up credentials on AWS is via [Identity and Access Management (IAM)](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html). You can use IAM features to securely give Flink instances the credentials that they need to access S3 buckets. Details about how to do this are beyond the scope of this documentation. Please refer to the AWS user guide. What you are looking for are [IAM Roles](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html).
+
+If you set this up correctly, you can manage access to S3 within AWS and don't need to distribute any access keys to Flink.
+
+#### Access Keys (Discouraged)

Review Comment:
   I wouldn't call this "Discouraged", since using access keys is not insecure, though I agree that IAM is the preferred way.



##########
docs/content.zh/docs/deployment/filesystems/s3.md:
##########
@@ -50,127 +48,319 @@ env.fromSource(
     "s3-input"
 );
 
-// 写入 S3 bucket
+// Write to S3 bucket
 stream.sinkTo(
-    FileSink.forRowFormat(
-        new Path("s3://<bucket>/<endpoint>"), new SimpleStringEncoder<>()
-    ).build()
+        FileSink.forRowFormat(
+            new Path("s3://<bucket>/<endpoint>"), new SimpleStringEncoder<>()
+        ).build()
 );
 
-
-// 使用 S3 作为 checkpoint storage
+// Use S3 as checkpoint storage
 Configuration config = new Configuration();
 config.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
 config.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://<your-bucket>/<endpoint>");
 env.configure(config);
 ```
 
-注意这些例子并*不详尽*，S3 同样可以用在其他场景，包括 [JobManager 高可用配置]({{< ref "docs/deployment/ha/overview" >}}) 或 [RocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend)，以及所有 Flink 需要使用文件系统 URI 的位置。
+Note that these examples are *not* exhaustive and you can use S3 in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI (unless otherwise stated).
+
+## S3 FileSystem Implementations
+
+Flink provides three independent S3 filesystem implementations, each with different trade-offs:
+
+- **Native S3 FileSystem** (`flink-s3-fs-native`): Built directly on AWS SDK v2 with async I/O and parallel transfers removing the dependency from hadoop entirely. This implementation supports both checkpointing and the FileSystem sink. The Native S3 FileSystem aims to provide integrated support for checkpointing as well as FileSystem sink, removing the need to use Presto S3 FileSystem for checkpointing and Hadoop S3 FileSystem for the FileSystem sink. [Benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396) show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to the Presto implementation at state sizes up to 15 GB. **Experimental** in Flink 2.3.
+- **Presto S3 FileSystem** (`flink-s3-fs-presto`): Based on Presto project code, recommended for checkpointing.
+- **Hadoop S3 FileSystem** (`flink-s3-fs-hadoop`): Based on Hadoop project code, has FileSystem sink support.
+
+All three are self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use them.
+
+## Common Configuration
 
-在大部分使用场景下，可使用 `flink-s3-fs-hadoop` 或 `flink-s3-fs-presto` 两个独立且易于设置的 S3 文件系统插件。然而在某些情况下，例如使用 S3 作为 YARN 的资源存储目录时，可能需要配置 Hadoop S3 文件系统。
+### Configure Access Credentials
 
-### Hadoop/Presto S3 文件系统插件
+After setting up the S3 FileSystem implementation, you need to make sure that Flink is allowed to access your S3 buckets.
+
+#### Identity and Access Management (IAM) (Recommended)
+
+The recommended way of setting up credentials on AWS is via [Identity and Access Management (IAM)](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html). You can use IAM features to securely give Flink instances the credentials that they need to access S3 buckets. Details about how to do this are beyond the scope of this documentation. Please refer to the AWS user guide. What you are looking for are [IAM Roles](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html).
+
+If you set this up correctly, you can manage access to S3 within AWS and don't need to distribute any access keys to Flink.
+
+#### Access Keys (Discouraged)
+
+Access to S3 can be granted via your **access and secret key pair**. Please note that this is discouraged since the [introduction of IAM roles](https://blogs.aws.amazon.com/security/post/Tx1XG3FX6VMU6O5/A-safer-way-to-distribute-AWS-credentials-to-EC2).
+
+You need to configure both `s3.access-key` and `s3.secret-key` in Flink's [configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}):
+
+```yaml
+s3.access-key: your-access-key
+s3.secret-key: your-secret-key
+```
+
+You can limit this configuration to JobManagers by using [delegation tokens]({{< ref "docs/deployment/security/security-delegation-token" >}}):

Review Comment:
   I think delegation tokens are worth a separate bullet point before this, because:
   * they use different configuration options
   * they are less secure than IAM but better than access keys
   
   It can be a short section, like this one, with an example.
   


