Re: [PR] [FLINK-39118] Add documentation for Native s3 FileSystem [flink]

via GitHub Thu, 16 Apr 2026 02:21:49 -0700


Samrat002 commented on code in PR #27841:
URL: https://github.com/apache/flink/pull/27841#discussion_r3092114319



##########
docs/content.zh/docs/deployment/filesystems/s3.md:
##########
@@ -50,127 +48,327 @@ env.fromSource(
     "s3-input"
 );
 
-// 写入 S3 bucket
+// Write to S3 bucket
 stream.sinkTo(
-    FileSink.forRowFormat(
-        new Path("s3://<bucket>/<endpoint>"), new SimpleStringEncoder<>()
-    ).build()
+        FileSink.forRowFormat(
+            new Path("s3://<bucket>/<endpoint>"), new SimpleStringEncoder<>()
+        ).build()
 );
 
-
-// 使用 S3 作为 checkpoint storage
+// Use S3 as checkpoint storage
 Configuration config = new Configuration();
 config.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
 config.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://<your-bucket>/<endpoint>");
 env.configure(config);
 ```
 
-注意这些例子并*不详尽*，S3 同样可以用在其他场景，包括 [JobManager 高可用配置]({{< ref "docs/deployment/ha/overview" >}}) 或 [RocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend)，以及所有 Flink 需要使用文件系统 URI 的位置。
+Note that these examples are *not* exhaustive and you can use S3 in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI (unless otherwise stated).
+
+## S3 FileSystem Implementations
+
+Flink provides three independent S3 filesystem implementations, each with different trade-offs:
+
+- **Native S3 FileSystem** (`flink-s3-fs-native`): Built directly on AWS SDK v2 with async I/O and parallel transfers removing the dependency from hadoop entirely. This implementation supports both checkpointing and the FileSystem sink. The Native S3 FileSystem aims to provide integrated support for checkpointing as well as FileSystem sink, removing the need to use Presto S3 FileSystem for checkpointing and Hadoop S3 FileSystem for the FileSystem sink. [Benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396) show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to the Presto implementation at state sizes up to 15 GB. **Experimental** in Flink 2.3.
+- **Presto S3 FileSystem** (`flink-s3-fs-presto`): Based on Presto project code, recommended for checkpointing.
+- **Hadoop S3 FileSystem** (`flink-s3-fs-hadoop`): Based on Hadoop project code, has FileSystem sink support.
+
+All three are self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use them.
+
+## Common Configuration
+
+### Configure Access Credentials
+
+After setting up the S3 FileSystem implementation, you need to make sure that Flink is allowed to access your S3 buckets.
+
+#### Identity and Access Management (IAM) (Recommended)
+
+The recommended way of setting up credentials on AWS is via [Identity and Access Management (IAM)](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html). You can use IAM features to securely give Flink instances the credentials that they need to access S3 buckets. Details about how to do this are beyond the scope of this documentation. Please refer to the AWS user guide. What you are looking for are [IAM Roles](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html).
+
+If you set this up correctly, you can manage access to S3 within AWS and don't need to distribute any access keys to Flink.
+
+#### Delegation Tokens
+
+[Delegation tokens]({{< ref "docs/deployment/security/security-delegation-token" >}}) provide time-bounded, automatically negotiated credentials. The JobManager uses configured long-lived credentials (`s3.access-key`, `s3.secret-key`, and `s3.region` must all be set) to call AWS STS and obtain short-lived session tokens, which are then automatically distributed to TaskManagers — avoiding the need to configure long-lived credentials on each TaskManager directly. Configure the credentials provider for your S3 implementation to allow TaskManagers to consume the distributed tokens:
+
+```yaml
+# Long-lived credentials required by the delegation token issuer on the JobManager
+s3.access-key: your-access-key
+s3.secret-key: your-secret-key
+s3.region: us-east-1

Review Comment:
   Ahh, got it.
   
   Each S3 implementation has its own delegation token provider with a dedicated configuration prefix:
   - native: `security.delegation.token.provider.s3-native.*`
   - Hadoop: `security.delegation.token.provider.s3-hadoop.*`
   - Presto: `security.delegation.token.provider.s3-presto.*`
   
   So the access key, secret key, and region should be set under the prefix that corresponds to the implementation in use.
   
   Updated the patch 👍🏻 
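   
   For illustration, with the native implementation the settings described above might look like the sketch below. The prefix is taken from this comment; the key suffixes (`access-key`, `secret-key`, `region`) are assumed from its wording rather than copied from the patch, and all values are placeholders:
   
   ```yaml
   # Sketch only: the suffixes after the s3-native prefix are assumptions
   security.delegation.token.provider.s3-native.access-key: your-access-key
   security.delegation.token.provider.s3-native.secret-key: your-secret-key
   security.delegation.token.provider.s3-native.region: us-east-1
   ```
   
   For Hadoop or Presto, the same three settings would presumably move under the `s3-hadoop` or `s3-presto` prefix instead.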



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

