[ 
https://issues.apache.org/jira/browse/BEAM-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003764#comment-16003764
 ] 

ASF GitHub Bot commented on BEAM-2052:
--------------------------------------

GitHub user jkff opened a pull request:

    https://github.com/apache/beam/pull/3023

    [BEAM-2052] Allow dynamic sharding in windowed file sinks

    This is a slightly modified and rearranged version of @reuvenlax 's #2647 .
    
    My concerns about it are:
    
    1) In direct runner, the integration tests of dynamic sharding are vacuous, 
because direct runner replaces unspecified sharding with fixed sharding at 
https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/WriteWithShardingFactory.java
 (applied at 
https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java#L217).
 However, this is a testing-only concern: other runners don't have this 
override, so overall the testing is non-vacuous, this is just hard to test 
against direct runner and I suspect that we probably want these tests to be 
non-vacuous in direct runner too.
    
    2) When I removed that override for testing purposes, I noticed that 
there's a very large number of files being written - primarily, I guess, 
because the bundles are very small. So large a number of files that the test 
time for batch with dynamic sharding grows from 21 seconds to 5 minutes. In 
particular, we write many, many files for each window/pane - presumably because 
in streaming runners and in direct runner, there's at least 1 bundle per key, 
and we create at least 1 file per bundle in 
WriteFiles.Write(Windowed,Unwindowed)Bundles.
    
    Reuven, can you please comment on whether this "at least 1 file per key" is 
expected behavior in a streaming runner? I suspect that it's not, but then I'm 
not sure how to fix the PR semantically.
    
    CC: @reuvenlax @davorbonaci @dhalperi 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkff/incubator-beam finish-pr-2647-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3023.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3023
    
----
commit c03781cbbba1d80e1ad5c93165bdad6bebd05c53
Author: Reuven Lax <re...@google.com>
Date:   2017-04-05T19:13:44Z

    Implement dynamic-sharding for windowed file outputs, and add an 
integration test.

commit c43cc4abe7ef81a7a9155ac686eed46af24eb7c0
Author: Reuven Lax <re...@google.com>
Date:   2017-05-09T20:02:12Z

    Renames FileBasedSink inner classes
    
    FileBasedWriteOperation -> WriteOperation, FileBasedWriter -> Writer

commit 3347c6e49725a3648bd944b9543425518e2f77e1
Author: Eugene Kirpichov <kirpic...@google.com>
Date:   2017-05-09T22:10:07Z

    Simpler code for setting shard numbers on results in FileBasedSink

commit b775df16594d30538c2b5b0af0d17a179060960c
Author: Eugene Kirpichov <kirpic...@google.com>
Date:   2017-05-09T22:25:57Z

    Splits WriteBundles into windowed/unwindowed versions

----


> Windowed file sinks should support dynamic sharding
> ---------------------------------------------------
>
>                 Key: BEAM-2052
>                 URL: https://issues.apache.org/jira/browse/BEAM-2052
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Reuven Lax
>            Assignee: Reuven Lax
>             Fix For: 2.0.0
>
>
> Currently windowed file sinks (WriteFiles and FileBasedSink) require 
> withNumShards to be set explicitly. We should remove this requirement, and 
> allow dynamic output.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to