[ 
https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372987#comment-15372987
 ] 

Ismaël Mejía commented on BEAM-434:
-----------------------------------

The examples should be simpler, and in particular the 'hello world = wordcount' 
one, so the less options the better.

If the user wants to define the sharding he could do so, once he progresses on 
Beam he will know about such stuff, I think the important decision is what 
should be the default behavior if numSharding is not defined for a Text.Write 
in the DirectRunner (or other runners) ?

Of course we can go the way of restricting this only in DirectRunner (#3) 
because distributed/parallel writes are less of an issue there. However from a 
user point of view it is really confusing that other runners also implement 
this differently, so I don't know if it makes sense to standarize this, e.g. 
Spark produces more files if the output is big (for the reasons Amit 
explained), but Flink produces only one (at least in my tests).


> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-434
>                 URL: https://issues.apache.org/jira/browse/BEAM-434
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restrictions on 
> the number of shards, it might generate many output files (depending on your 
> input), for WordCount for example, you'll get as many output files as unique 
> words in your input.
> Since I think examples are expected to execute in a friendly manner to "see" 
> what it does and not optimize for performance in some way, I suggest to use 
> `withoutSharding()` when writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to