[jira] [Comment Edited] (BEAM-434) When examples write output to file it creates many output files instead of one

Amit Sela (JIRA) Mon, 11 Jul 2016 00:28:25 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370313#comment-15370313
 ]


Amit Sela edited comment on BEAM-434 at 7/11/16 7:27 AM:
---------------------------------------------------------

I simply followed 
https://github.com/apache/incubator-beam/tree/master/examples/java#building-and-running
 which I guess uses the DirectRunner:
mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=/tmp/kinglear.txt --output=/tmp/example.out"
Ended up with 5230 output files... such as example.out-02102-of-05230

BTW, I found it out after using the new additions to the Spark runner in this 
PR: https://github.com/apache/incubator-beam/pull/495 but because I wasn't sure 
if that's a runner issue, I tried the official examples.

No, Spark doesn't force a different parallel task for a single key, but 
generally applies steps in the DAG to partitions of the data (those are called 
"tasks"). You could write your own partitioner to do that... but you probably 
shouldn't. You could also initiate a repartition of the data, but we don't do 
it in the runner (for now), and you could set the number of partitions for 
shuffle operations but there is a default number set by the TaskScheduler - 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L61
  

Finally, I would argue that examples are not for training but rather for 
engagement. You could add a  BOLD disclaimer about the fact that "a single file 
output is not recommended for use in production" but as a first time user, I 
think the best experience is:

Clone
Build
Run example
"cat output.txt"
See result and be happy :)

That's my point of view as an OSS user.
 
 If you don't want to hard-code `withoutSharding` you could add it as an 
arguments and have the example use `withNumShards`



was (Author: amitsela):
I simply followed 
https://github.com/apache/incubator-beam/tree/master/examples/java#building-and-running
 which I guess uses the DirectRunner:
mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=/tmp/kinglear.txt --output=/tmp/example.out"
Ended up with 5230 output files... such as example.out-02102-of-05230

BTW, I found it out after using the new additions to the Spark runner in this 
PR: https://github.com/apache/incubator-beam/pull/495 but because I wasn't sure 
if that's a runner issue, I tried the official examples.

No, Spark doesn't force a different parallel task for a single key, but 
generally applies steps in the DAG to partitions of the data (those are called 
"tasks"). You could write your own partitioner to do that... but you probably 
shouldn't. You could also initiate a repartition of the data, but we don't do 
it in the runner (for now), and you could set the number of partitions for 
shuffle operations but there is a default number set by the TaskScheduler - 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L61
  

Finally, I would argue that examples are not for training but rather for 
engagement. You could add a  BOLD disclaimer about the fact that "a single file 
output is not recommended for use in production" but as a first time user, I 
think the best experience is:

Clone
Build
Run example
"cat output.txt"
See result and be happy :)

That's my point of view as an OSS user.
 
 If you don't want to hard-code ``withoutSharding`` you could add it as an 
arguments and have the example use ``withNumShards``


> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-434
>                 URL: https://issues.apache.org/jira/browse/BEAM-434
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restrictions on 
> the number of shards, it might generate many output files (depending on your 
> input), for WordCount for example, you'll get as many output files as unique 
> words in your input.
> Since I think examples are expected to execute in a friendly manner to "see" 
> what it does and not optimize for performance in some way, I suggest to use 
> `withoutSharding()` when writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (BEAM-434) When examples write output to file it creates many output files instead of one

Reply via email to