[ https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373644#comment-15373644 ]
Kenneth Knowles commented on BEAM-434:
--------------------------------------

OK, I'm pretty convinced I was wrong, by the argument that users are going to copy/paste/modify the example and assume each piece of it is important and should be retained in their own code.

I _do_ think it is very important that users know that the runner controls the number of bundles & shards and that if you want a particular number then you have to hardcode it. But I want users to know that this is a special case with real downsides. My thinking had been that making it explicit in the example would make it clear that the reason there are very few shards is because we hardcoded it. But it would also imply that this is something one should do by default, the opposite of the desired message.

So now I favor a variant of [~dhalp...@google.com]'s option 3, which is an implementation detail of "the direct runner should - via whatever means - limit the number of output shards of Write (not just text, but probably most or all) to a simple human readable number".

But I think having a fixed number in the absence of code fixing that number would also set the wrong expectation. Thus I think it is very important to follow [~frances]'s idea to make the number variable. I'd suggest a range of 3 to 7. Somehow two shards just doesn't seem "sharded" enough for me.

Using the usual override approach, as proposed, is probably the easiest implementation technique. That last will be best decided by [~tgroh].
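To make the "variable number, range of 3 to 7" suggestion concrete, here is a rough sketch of how a runner might pick a shard count when the user has not fixed one. This is plain Java with no Beam dependency; the class name `ShardCountSketch`, the method `pickShardCount`, and the use of `null` to mean "not specified" are all hypothetical and are not taken from the direct runner's actual code.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch only: not the direct runner's implementation.
public class ShardCountSketch {
    // Bounds taken from the "range of 3 to 7" suggestion in the comment.
    static final int MIN_SHARDS = 3;
    static final int MAX_SHARDS = 7;

    /**
     * Returns the number of output shards to write.
     * A non-null argument models a count the user hardcoded; it is honored as-is.
     * A null argument models "unspecified", so the runner picks a small random count.
     */
    static int pickShardCount(Integer userRequested) {
        if (userRequested != null) {
            if (userRequested < 1) {
                throw new IllegalArgumentException("shard count must be positive");
            }
            return userRequested;
        }
        // nextInt's upper bound is exclusive, hence MAX_SHARDS + 1.
        return ThreadLocalRandom.current().nextInt(MIN_SHARDS, MAX_SHARDS + 1);
    }

    public static void main(String[] args) {
        System.out.println("runner-chosen shards: " + pickShardCount(null));
        System.out.println("user-fixed shards: " + pickShardCount(1));
    }
}
```

Because the runner-chosen count changes from run to run, a user who needs a specific number of files still has to ask for it explicitly, which matches the expectation the comment wants to set.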
> Limit the number of output files a beam-examples execution writes
> -----------------------------------------------------------------
>
> Key: BEAM-434
> URL: https://issues.apache.org/jira/browse/BEAM-434
> Project: Beam
> Issue Type: Bug
> Components: examples-java
> Reporter: Amit Sela
> Assignee: Amit Sela
> Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restrictions on
> the number of shards, it might generate many output files (depending on your
> input), for WordCount for example, you'll get as many output files as unique
> words in your input.
> Since I think examples are expected to execute in a friendly manner to "see"
> what it does and not optimize for performance in some way, I suggest to use
> `withoutSharding()` when writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)