[ https://issues.apache.org/jira/browse/BEAM-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048676#comment-16048676 ]
ASF GitHub Bot commented on BEAM-92:
------------------------------------

GitHub user reuvenlax opened a pull request:

    https://github.com/apache/beam/pull/3356

    [BEAM-92] Allow value-dependent files in FileBasedSink

    This is modeled off of the pattern used in BigQueryIO. The user can
    provide a DynamicDestinations class which can map the input into a
    user-defined destination type, and the destination into a
    FilenamePolicy.

    Some refactoring of FilenamePolicy is done as well to make this more
    useful - e.g. allowing FilenamePolicy to pick its own base directory
    instead of having it passed in from the sink (this is marked as
    @Experimental, so such changes are allowed).

    Not yet in this PR: we should allow the sinks to take user-defined
    types for mapping. We should also provide a convenience method for
    dynamic output using the DefaultFilenamePolicy (e.g. passing in KVs).

    R: @jkff

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/reuvenlax/incubator-beam dynamic_file_based_sink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3356.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3356

----
commit 531a013f6f1e6865566f648d3486d77a7de20369
Author: Reuven Lax <re...@google.com>
Date:   2017-06-10T00:11:32Z

    Add DynamicDestinations support to FileBasedSink

commit 616f9ede5afa7e771f50595bcd290b907744d57f
Author: Reuven Lax <re...@google.com>
Date:   2017-06-10T18:23:32Z

    More fixups.

commit c6855fa130c86c94de74c36976a667de7f8d6219
Author: Reuven Lax <re...@google.com>
Date:   2017-06-12T16:42:26Z

    Fix some tests.

commit f4d4bf5f9238f7e9627363daa14158cdcc68d64b
Author: Reuven Lax <re...@google.com>
Date:   2017-06-13T21:34:08Z

    Remove baseDirectory parameter from FilenamePolicy. The FilenamePolicy can choose it's own base directory.

commit 219a3d649428f673bf35f043990d61000d5fd64e
Author: Reuven Lax <re...@google.com>
Date:   2017-06-14T00:06:01Z

    No longer pass in "extension" to FilenamePolicy. Instead we pass in file metadata class, including a getSuggestedExtension method that is based on the compression type.

commit 1c5ed9e39e891b4418f597d531f442806b4f22b8
Author: Reuven Lax <re...@google.com>
Date:   2017-06-14T00:41:22Z

    Add a withTempDirectory override to TextIO and AvroIO. This way users aren't forced to provide a dummy file prefix, just to specify a temp directory.

commit 14bf19544801ce0791ffd2f95574ba9b5b9d33c6
Author: Reuven Lax <re...@google.com>
Date:   2017-06-14T00:49:34Z

    Fix validation code.

commit 93727530ac83e70570f6da1214fa20c5871b97e6
Author: Reuven Lax <re...@google.com>
Date:   2017-06-14T00:52:08Z

    Fix fix javadoc.

commit da6b0e648fac9d9def26c37aa755b73a5a3f9cef
Author: Reuven Lax <re...@google.com>
Date:   2017-06-14T02:29:00Z

    Fix CheckStyle violations.

commit d2624f729cb7f7dc49f9dda65f6db7744f21e3c6
Author: Reuven Lax <re...@google.com>
Date:   2017-06-14T04:01:26Z

    Fix some failures.

----

> Data-dependent sinks
> --------------------
>
>                 Key: BEAM-92
>                 URL: https://issues.apache.org/jira/browse/BEAM-92
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Reuven Lax
>
> Current sink API writes all data to a single destination, but there are many
> use cases where different pieces of data need to be routed to different
> destinations where the set of destinations is data-dependent (so can't be
> implemented with a Partition transform).
> One internally discussed proposal was an API of the form:
> {code}
> PCollection<Void> PCollection<T>.apply(
>     Write.using(DoFn<T, SinkT> where,
>                 MapFn<SinkT, WriteOperation<WriteResultT, T>> how))
> {code}
> so an item T gets written to a destination (or multiple destinations)
> determined by "where"; and the writing strategy is determined by "how" that
> produces a WriteOperation (current API - global init/write/global finalize
> hooks) for any given destination.
> This API also has other benefits:
> * allows the SinkT to be computed dynamically (in "where"), rather than
> specified at pipeline construction time
> * removes the necessity for a Sink class entirely
> * is sequenceable w.r.t. downstream transforms (you can stick transforms onto
> the returned PCollection<Void>, while the current Write.to() returns a PDone)

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
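For readers skimming the thread, the two-step mapping described in the pull
request (each element is mapped to a destination, and each destination is
mapped to a filename policy) can be sketched roughly as below. This is a
plain-Java illustration and not Beam code: DynamicRoutingSketch, Destinations,
NamePolicy, routeByLevel and the /out/... paths are all hypothetical names
invented for the example, and the actual DynamicDestinations and
FilenamePolicy signatures in the PR may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: none of these types are Beam classes. */
final class DynamicRoutingSketch {

  /** Hypothetical stand-in for a per-destination filename policy. */
  interface NamePolicy {
    String filenameFor(int shard, int numShards);
  }

  /** Hypothetical stand-in for the "element -> destination -> policy" mapping. */
  interface Destinations<T, D> {
    D destinationFor(T element);

    NamePolicy policyFor(D destination);
  }

  /** Groups elements by destination, then names one single-shard file per destination. */
  static <T, D> Map<String, List<T>> route(List<T> elements, Destinations<T, D> dests) {
    Map<D, List<T>> byDestination = new LinkedHashMap<>();
    for (T element : elements) {
      byDestination
          .computeIfAbsent(dests.destinationFor(element), d -> new ArrayList<>())
          .add(element);
    }
    Map<String, List<T>> files = new LinkedHashMap<>();
    for (Map.Entry<D, List<T>> entry : byDestination.entrySet()) {
      // Assumption for brevity: every destination is written as shard 0 of 1.
      String name = dests.policyFor(entry.getKey()).filenameFor(0, 1);
      files.put(name, entry.getValue());
    }
    return files;
  }

  /** Example mapping: the destination is the first token of each log line. */
  public static Map<String, List<String>> routeByLevel(List<String> lines) {
    return route(lines, new Destinations<String, String>() {
      @Override
      public String destinationFor(String line) {
        return line.split(" ", 2)[0];
      }

      @Override
      public NamePolicy policyFor(String level) {
        // The policy picks its own base directory from the destination.
        return (shard, numShards) -> String.format(
            Locale.ROOT, "/out/%s/part-%05d-of-%05d.txt", level, shard, numShards);
      }
    });
  }

  public static void main(String[] args) {
    Map<String, List<String>> files =
        routeByLevel(Arrays.asList("INFO started", "ERROR disk full", "INFO done"));
    files.forEach((name, contents) -> System.out.println(name + " -> " + contents));
  }
}
```

The point of the sketch is only that the set of output files is discovered
from the data at run time, which is what a fixed-size Partition transform
cannot express.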