[ https://issues.apache.org/jira/browse/FLINK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143244#comment-16143244 ]
ASF GitHub Bot commented on FLINK-6306:
---------------------------------------

GitHub user sjwiesman opened a pull request:

    https://github.com/apache/flink/pull/4607

    [FLINK-6306][connectors] Sink for eventually consistent file systems

## What is the purpose of the change

This pull request implements a sink for writing out to an eventually consistent filesystem, such as Amazon S3, with exactly once semantics.

## Brief change log

- The sink stages files on a consistent filesystem (local, HDFS, etc.).
- Once per checkpoint, files are copied to the eventually consistent filesystem.
- When a checkpoint completion notification is sent, the files are marked consistent. Otherwise, they are left because delete is not a consistent operation.
- It is up to consumers to choose their semantics: at least once by reading all files, or exactly once by only reading files marked consistent.

## Verifying this change

This change added tests and can be verified as follows:

- Added tests based on the existing BucketingSink test suite.
- Added tests that verify semantics based on different checkpointing combinations (successful, concurrent, timed out, and failed).
- Added integration test that verifies exactly once holds during failure.
- Manually verified by having run in production for several months.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no

## Documentation

- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented?
  JavaDocs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sjwiesman/flink FLINK-6306

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4607

----
commit 347ea767195d74efc39964c02ace1bbe10d8aa0a
Author: Seth Wiesman <swies...@mediamath.com>
Date:   2017-08-27T21:36:04Z

    [FLINK-6306][connectors] Sink for eventually consistent file systems

----

> Sink for eventually consistent file systems
> -------------------------------------------
>
>                 Key: FLINK-6306
>                 URL: https://issues.apache.org/jira/browse/FLINK-6306
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Seth Wiesman
>            Assignee: Seth Wiesman
>         Attachments: eventually-consistent-sink
>
>
> Currently Flink provides the BucketingSink as an exactly once method for
> writing out to a file system. It provides these guarantees by moving files
> through several stages and deleting or truncating files that get into a bad
> state. While this is a powerful abstraction, it causes issues with eventually
> consistent file systems such as Amazon's S3, where most operations (i.e., rename,
> delete, truncate) are not guaranteed to become consistent within a reasonable
> amount of time. Flink should provide a sink that provides exactly once writes
> to a file system where only PUT operations are considered consistent.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
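
The flow in the pull request's brief change log (stage on a consistent file system, copy once per checkpoint, mark files consistent on the completion notification, never delete) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the code from the pull request: the class name `StagedSinkSketch`, the `.consistent` marker suffix, the `part-<checkpoint>-<n>` file naming, and the simplified `restore()` step are all hypothetical, and no Flink API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model of the staging flow; names are illustrative only. */
class StagedSinkSketch {
    // Assumption: the source only says files are "marked consistent", not how.
    static final String MARKER_SUFFIX = ".consistent";

    // Stand-in for the consistent staging file system (local, HDFS, ...).
    private final List<String> staged = new ArrayList<>();
    // Stand-in for the eventually consistent target; only PUT-style writes are issued.
    private final Map<String, String> target = new TreeMap<>();
    // Files copied per checkpoint that still await a completion notification.
    private final NavigableMap<Long, List<String>> pending = new TreeMap<>();
    private int fileCounter = 0;

    void write(String record) {
        staged.add(record);
    }

    /** Once per checkpoint: copy the staged records to the target as a new object. */
    void snapshot(long checkpointId) {
        if (staged.isEmpty()) {
            return;
        }
        String name = "part-" + checkpointId + "-" + fileCounter++;
        target.put(name, String.join("\n", staged));
        pending.computeIfAbsent(checkpointId, id -> new ArrayList<>()).add(name);
        staged.clear();
    }

    /** On completion: PUT a marker for every file copied up to this checkpoint. */
    void notifyCheckpointComplete(long checkpointId) {
        NavigableMap<Long, List<String>> confirmed = pending.headMap(checkpointId, true);
        for (List<String> files : confirmed.values()) {
            for (String file : files) {
                target.put(file + MARKER_SUFFIX, "");
            }
        }
        confirmed.clear(); // the view is backed by 'pending', so this removes the entries
    }

    /** Simplified recovery: forget unconfirmed work; unmarked files are left, not deleted. */
    void restore() {
        staged.clear();
        pending.clear();
    }

    /** Reads every data file (at least once) or only marked files (exactly once). */
    List<String> read(boolean onlyConsistent) {
        List<String> records = new ArrayList<>();
        for (Map.Entry<String, String> entry : target.entrySet()) {
            String name = entry.getKey();
            if (name.endsWith(MARKER_SUFFIX)) {
                continue; // markers carry no records
            }
            if (onlyConsistent && !target.containsKey(name + MARKER_SUFFIX)) {
                continue; // copied, but its checkpoint was never confirmed
            }
            records.addAll(Arrays.asList(entry.getValue().split("\n")));
        }
        return records;
    }

    public static void main(String[] args) {
        StagedSinkSketch sink = new StagedSinkSketch();
        sink.write("a");
        sink.write("b");
        sink.snapshot(1);
        sink.notifyCheckpointComplete(1);
        sink.write("c");
        sink.snapshot(2);   // checkpoint 2 fails, so no notification arrives
        sink.restore();
        sink.write("c");    // the record is replayed after recovery
        sink.snapshot(3);
        sink.notifyCheckpointComplete(3);
        System.out.println("all files:        " + sink.read(false));
        System.out.println("consistent files: " + sink.read(true));
    }
}
```

In this model the file copied for the failed checkpoint stays on the target without a marker, so a consumer reading all files sees the replayed record twice, while a consumer reading only marked files sees each record once. That is the consumer-side choice between at least once and exactly once described in the change log.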