[ https://issues.apache.org/jira/browse/BEAM-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003764#comment-16003764 ]
ASF GitHub Bot commented on BEAM-2052: -------------------------------------- GitHub user jkff opened a pull request: https://github.com/apache/beam/pull/3023 [BEAM-2052] Allow dynamic sharding in windowed file sinks This is a slightly modified and rearranged version of @reuvenlax 's #2647 . My concerns about it are: 1) In direct runner, the integration tests of dynamic sharding are vacuous, because direct runner replaces unspecified sharding with fixed sharding at https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/WriteWithShardingFactory.java (applied at https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java#L217). However, this is a testing-only concern: other runners don't have this override, so overall the testing is non-vacuous, this is just hard to test against direct runner and I suspect that we probably want these tests to be non-vacuous in direct runner too. 2) When I removed that override for testing purposes, I noticed that there's a very large number of files being written - primarily, I guess, because the bundles are very small. So large a number of files that the test time for batch with dynamic sharding grows from 21 seconds to 5 minutes. In particular, we write many, many files for each window/pane - presumably because in streaming runners and in direct runner, there's at least 1 bundle per key, and we create at least 1 file per bundle in WriteFiles.Write(Windowed,Unwindowed)Bundles. Reuven, can you please comment on whether this "at least 1 file per key" is expected behavior in a streaming runner? I suspect that it's not, but then I'm not sure how to fix the PR semantically. CC: @reuvenlax @davorbonaci @dhalperi You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkff/incubator-beam finish-pr-2647-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/beam/pull/3023.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3023 ---- commit c03781cbbba1d80e1ad5c93165bdad6bebd05c53 Author: Reuven Lax <re...@google.com> Date: 2017-04-05T19:13:44Z Implement dynamic-sharding for windowed file outputs, and add an integration test. commit c43cc4abe7ef81a7a9155ac686eed46af24eb7c0 Author: Reuven Lax <re...@google.com> Date: 2017-05-09T20:02:12Z Renames FileBasedSink inner classes FileBasedWriteOperation -> WriteOperation, FileBasedWriter -> Writer commit 3347c6e49725a3648bd944b9543425518e2f77e1 Author: Eugene Kirpichov <kirpic...@google.com> Date: 2017-05-09T22:10:07Z Simpler code for setting shard numbers on results in FileBasedSink commit b775df16594d30538c2b5b0af0d17a179060960c Author: Eugene Kirpichov <kirpic...@google.com> Date: 2017-05-09T22:25:57Z Splits WriteBundles into windowed/unwindowed versions ---- > Windowed file sinks should support dynamic sharding > --------------------------------------------------- > > Key: BEAM-2052 > URL: https://issues.apache.org/jira/browse/BEAM-2052 > Project: Beam > Issue Type: Bug > Components: sdk-java-core > Reporter: Reuven Lax > Assignee: Reuven Lax > Fix For: 2.0.0 > > > Currently windowed file sinks (WriteFiles and FileBasedSink) require > withNumShards to be set explicitly. We should remove this requirement, and > allow dynamic output. -- This message was sent by Atlassian JIRA (v6.3.15#6346)