The Java DirectRunner enforces additional strictness checks that are
expensive (such as re-encoding all elements to make sure that the coder is
compatible).

Retry your run with the pipeline options --enforceImmutability=false
--enforceEncodability=false to turn those checks off.
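
As a rough sketch of where those flags go, assuming the Maven archetype project from the reproduction steps quoted below: they are pipeline options, so they belong inside -Dexec.args next to --inputFile and --output. The variable names INPUT and EXEC_ARGS are only illustrative; the snippet prints the resulting command rather than running it, so it can be checked first.

```shell
# Illustrative variable names (assumptions); the flags, main class and
# paths are the ones already used in this thread.
INPUT="gs://apache-beam-samples/shakespeare/kinglear.txt"
EXEC_ARGS="--inputFile=${INPUT} --output=counts"

# Append the two DirectRunner options that disable the extra checks.
EXEC_ARGS="${EXEC_ARGS} --enforceImmutability=false --enforceEncodability=false"

# Print the full command for inspection; run it from the word-count-beam
# directory once it looks right.
echo "time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args=\"${EXEC_ARGS}\" -Pdirect-runner"
```

Comparing the "real" time of this run against the original run should show how much of the gap comes from the checks themselves.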

On Sat, Apr 11, 2020 at 11:45 AM Krystian Kichewko <
[email protected]> wrote:

> Hello!
>
> I'm trying to learn Apache Beam, and I was looking into the examples when I
> noticed something unusual:
>
> It seems that the "word count" example is much faster using Python than Java.
>
> Python example pipeline on King Lear:
>
> real 0m9.294s
> user 0m2.822s
> sys 0m0.370s
>
> Java example pipeline on King Lear:
>
> real 1m35.780s
> user 4m10.089s
> sys 0m1.743s
>
> As you can see, it is about 9 sec vs 96 sec of real time, and Java uses even
> more CPU time because it uses all of the CPU cores.
>
> Is this some kind of limitation of Java's direct runner? Or am I doing
> something wrong? Is this intended? Should I file a bug?
>
> Or maybe this difference is eliminated in real-life pipelines?
>
> I got similar results when testing using Google Colab:
> https://beam.apache.org/get-started/try-apache-beam/
>
> When you run it on all of Shakespeare's works in the bucket
> (gs://dataflow-samples/shakespeare/*), the difference is even greater:
>
> Python:
>
> real 0m47.900s
> user 0m18.350s
> sys 0m0.579s
>
> Java:
>
> real 14m28.201s
> user 28m3.206s
> sys 0m7.597s
>
>
> How to reproduce:
>
> Python 3.7:
>
> docker run -it --rm python:3.7-buster /bin/bash
> pip3 install apache-beam[gcp]
> mkdir -p /tmp/foo
> cd /tmp/foo
> time python -m apache_beam.examples.wordcount --input
> gs://dataflow-samples/shakespeare/kinglear.txt --output ./count
>
> real 0m9.294s
> user 0m2.822s
> sys 0m0.370s
>
>
> Java:
>
> docker run -it --rm ubuntu:16.04 /bin/bash
> apt update
> apt install default-jdk maven
> mkdir -p /tmp/foo
> cd /tmp/foo
> mvn archetype:generate -DarchetypeGroupId=org.apache.beam
> -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples
> -DarchetypeVersion=2.19.0 -DgroupId=org.example
> -DartifactId=word-count-beam -Dversion="0.1"
> -Dpackage=org.apache.beam.examples -DinteractiveMode=false
> cd word-count-beam
> mvn compile
> time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
> -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt
> --output=counts" -Pdirect-runner
>
> Execute twice, because the first time Maven will download dependencies:
>
> time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
> -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt
> --output=counts" -Pdirect-runner
>
> real 1m35.780s
> user 4m10.089s
> sys 0m1.743s
>
>
> Thanks,
> Krystian Kichewko
>
