davidradl commented on code in PR #27937:
URL: https://github.com/apache/flink/pull/27937#discussion_r3094171049
##########
docs/content/docs/deployment/filesystems/s3.md:
##########
@@ -64,94 +64,288 @@ env.configure(config);
Note that these examples are *not* exhaustive and you can use S3 in other
places as well, including your [high availability setup]({{< ref
"docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref
"docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that
Flink expects a FileSystem URI (unless otherwise stated).
-For most use cases, you may use one of our `flink-s3-fs-hadoop` and
`flink-s3-fs-presto` S3 filesystem plugins which are self-contained and easy to
set up.
-For some cases, however, e.g., for using S3 as YARN's resource storage dir, it
may be necessary to set up a specific Hadoop S3 filesystem implementation.
+## S3 FileSystem Implementations
-### Hadoop/Presto S3 File Systems plugins
+Flink provides three independent S3 filesystem implementations, each with
different trade-offs:
+
+- **Native S3 FileSystem** (`flink-s3-fs-native`): Built directly on AWS SDK
v2 with async I/O and parallel transfers, removing the dependency on Hadoop
entirely. This implementation supports both checkpointing and the FileSystem
sink, removing the need to use the Presto S3 FileSystem for checkpointing and
the Hadoop S3 FileSystem for the FileSystem sink.
[Benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396)
show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to the
Presto implementation at state sizes up to 15 GB. **Experimental** in Flink 2.3.
+- **Presto S3 FileSystem** (`flink-s3-fs-presto`): Based on Presto project
code, recommended for checkpointing.
+- **Hadoop S3 FileSystem** (`flink-s3-fs-hadoop`): Based on Hadoop project
code, supports the FileSystem sink.
+
+All three are self-contained with no dependency footprint, so there is no need
to add Hadoop to the classpath to use them.
+
+## Common Configuration
+
+### Configure Access Credentials
+
+After setting up the S3 FileSystem implementation, you need to make sure that
Flink is allowed to access your S3 buckets.
+
+#### Identity and Access Management (IAM) (Recommended)
+
+The recommended way of setting up credentials on AWS is via [Identity and
Access Management
(IAM)](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html). You
can use IAM features to securely give Flink instances the credentials that they
need to access S3 buckets. Details about how to do this are beyond the scope of
this documentation. Please refer to the AWS user guide. What you are looking
for are [IAM
Roles](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html).
+
+If you set this up correctly, you can manage access to S3 within AWS and don't
need to distribute any access keys to Flink.
+
+#### Delegation Tokens
+
+[Delegation tokens]({{< ref
"docs/deployment/security/security-delegation-token" >}}) provide time-bounded,
automatically negotiated credentials. The JobManager uses long-lived
credentials to call AWS STS and obtain short-lived session tokens, which are
then automatically distributed to TaskManagers.
+
+Each S3 implementation has its own delegation token provider with a dedicated
configuration prefix. You must set the `access-key`, `secret-key`, and `region`
under the corresponding prefix for the implementation you are using:
+
+```yaml
+# For Native S3 implementation
+security.delegation.token.provider.s3-native.access-key: your-access-key
+security.delegation.token.provider.s3-native.secret-key: your-secret-key
+security.delegation.token.provider.s3-native.region: us-east-1
+
+# For Hadoop implementation
+security.delegation.token.provider.s3-hadoop.access-key: your-access-key
+security.delegation.token.provider.s3-hadoop.secret-key: your-secret-key
+security.delegation.token.provider.s3-hadoop.region: us-east-1
+
+# For Presto implementation
+security.delegation.token.provider.s3-presto.access-key: your-access-key
+security.delegation.token.provider.s3-presto.secret-key: your-secret-key
+security.delegation.token.provider.s3-presto.region: us-east-1
+```
+
+All three values (`access-key`, `secret-key`, `region`) must be set for
delegation tokens to be issued. The `DynamicTemporaryAWSCredentialsProvider` is
automatically included in the credentials provider chain for each
implementation, so TaskManagers will consume the distributed tokens without
additional configuration.
+
+#### Access Keys
+
+Access to S3 can be granted via your **access and secret key pair**. While
access keys are not inherently insecure, IAM roles are preferred as they avoid
the need to manage and distribute static credentials. See the [introduction of
IAM
roles](https://blogs.aws.amazon.com/security/post/Tx1XG3FX6VMU6O5/A-safer-way-to-distribute-AWS-credentials-to-EC2)
for more context.
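+
+As a minimal sketch, assuming the `s3.access-key` and `s3.secret-key` options
documented for the existing Hadoop and Presto plugins, a static key pair could
be set in the Flink configuration as follows. Whether the Native S3 FileSystem
reads the same option names is an assumption here, and the values are
placeholders:
+
+```yaml
+# Assumed option names (taken from the Hadoop/Presto plugin docs); placeholder values
+s3.access-key: your-access-key
+s3.secret-key: your-secret-key
+```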
Review Comment:
I think putting this text into the docs would help clarify this without
going into too much detail.