[ 
https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373644#comment-15373644
 ] 

Kenneth Knowles commented on BEAM-434:
--------------------------------------

OK, I'm pretty convinced I was wrong, by the argument that users are going to 
copy/paste/modify the example and assume each piece of it is important and 
should be retained in their own code.

I _do_ think it is very important that users know that the runner controls the 
number of bundles & shards and that if you want a particular number then you 
have to hardcode it. But I want users to know that this is a special case with 
real downsides. My thinking had been that making it explicit in the example 
would make it clear that the reason there are very few shards is because we 
hardcoded it. But it would also imply that this is something one should do by 
default, the opposite of the desired message.

So now I favor a variant of [~dhalp...@google.com]'s option 3, which is an 
implementation detail of "the direct runner should - via whatever means - limit 
the number of output shards of Write (not just text, but probably most or all) 
to a simple human-readable number".

But I think having a fixed number in the absence of code fixing that number 
would also set the wrong expectation. Thus I think it is very important to 
follow [~frances]'s idea to make the number variable. I'd suggest a range of 3 
to 7. Somehow two shards just doesn't seem "sharded" enough for me. Using the 
usual override approach, as proposed, is probably the easiest implementation 
technique, though that choice is best left to [~tgroh].
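
To make the "variable number in a range of 3 to 7" idea concrete, here is a 
minimal sketch. It is not Beam or direct runner code: the class name, the 
constants, and the method pickShardCount are all hypothetical, and how a runner 
override would actually consume the value is left open.

```java
import java.util.Random;

// Hypothetical helper, not part of Beam: picks a small, varying shard count
// so that example output stays human readable without looking hardcoded.
final class ShardCountSketch {
    // Bounds taken from the range suggested in the comment above (3 to 7).
    static final int MIN_SHARDS = 3;
    static final int MAX_SHARDS = 7;

    // Returns a shard count in [MIN_SHARDS, MAX_SHARDS], both ends inclusive.
    static int pickShardCount(Random rng) {
        // nextInt(bound) yields a value in [0, bound), so shift by MIN_SHARDS.
        return MIN_SHARDS + rng.nextInt(MAX_SHARDS - MIN_SHARDS + 1);
    }

    public static void main(String[] args) {
        // A fresh Random per run means the count differs between executions.
        System.out.println("shards for this run: " + pickShardCount(new Random()));
    }
}
```

Because the count changes from run to run, a user looking at the output has no 
reason to assume any particular number of files is guaranteed.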

> Limit the number of output files a beam-examples execution writes
> -----------------------------------------------------------------
>
>                 Key: BEAM-434
>                 URL: https://issues.apache.org/jira/browse/BEAM-434
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restriction on 
> the number of shards, a pipeline might generate many output files (depending 
> on your input). For WordCount, for example, you'll get as many output files 
> as there are unique words in your input.
> Since I think examples are expected to execute in a friendly manner, so that 
> you can "see" what they do, rather than optimize for performance, I suggest 
> using `withoutSharding()` when writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample



