Hello!

I'm trying to learn Apache Beam, and I was looking into the examples when I
noticed something unusual:

It seems that the "word count" example is much faster in Python than in Java.

Python example pipeline on King Lear:

real 0m9.294s
user 0m2.822s
sys 0m0.370s

Java example pipeline on King Lear:

real 1m35.780s
user 4m10.089s
sys 0m1.743s

As you can see, it is about 9 sec vs 96 sec of real time, and the Java run
uses even more CPU time because it uses all of the CPU cores.

Is this some kind of limitation of Java's direct runner? Or am I doing
something wrong? Is this intended? Should I file a bug?

Or does this difference disappear in real-life pipelines?

I got similar results when testing using Google Colab:
https://beam.apache.org/get-started/try-apache-beam/

When you run the pipelines on all of Shakespeare's works in the bucket
(gs://dataflow-samples/shakespeare/*), the difference is even greater:

Python:

real 0m47.900s
user 0m18.350s
sys 0m0.579s

Java:

real 14m28.201s
user 28m3.206s
sys 0m7.597s
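
To put the timings above side by side, here is a small sketch that converts the `time` output to seconds and prints the Java/Python ratio for each run. The helper names (`parse_time`, `ratio`) are made up for illustration; the numbers are copied from the runs above.

```python
import re


def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '1m35.780s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Timings copied from the runs above, as (real, user) per SDK.
TIMINGS = {
    "King Lear": {
        "python": ("0m9.294s", "0m2.822s"),
        "java": ("1m35.780s", "4m10.089s"),
    },
    "all Shakespeare": {
        "python": ("0m47.900s", "0m18.350s"),
        "java": ("14m28.201s", "28m3.206s"),
    },
}


def ratio(runs: dict, field: int) -> float:
    """Java time divided by Python time; field 0 is real, field 1 is user."""
    return parse_time(runs["java"][field]) / parse_time(runs["python"][field])


for name, runs in TIMINGS.items():
    print(f"{name}: real {ratio(runs, 0):.1f}x, user {ratio(runs, 1):.1f}x")
```

On these numbers the Java run takes roughly 10x the wall-clock time for King Lear and roughly 18x for the whole bucket.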


How to reproduce:

Python 3.7:

docker run -it --rm python:3.7-buster /bin/bash
pip3 install apache-beam[gcp]
mkdir -p /tmp/foo
cd /tmp/foo
time python -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt --output ./count

real 0m9.294s
user 0m2.822s
sys 0m0.370s


Java:

docker run -it --rm ubuntu:16.04 /bin/bash
apt update
apt install default-jdk maven
mkdir -p /tmp/foo
cd /tmp/foo
mvn archetype:generate -DarchetypeGroupId=org.apache.beam \
  -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
  -DarchetypeVersion=2.19.0 -DgroupId=org.example \
  -DartifactId=word-count-beam -Dversion="0.1" \
  -Dpackage=org.apache.beam.examples -DinteractiveMode=false
cd word-count-beam
mvn compile
time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt --output=counts" \
  -Pdirect-runner

Run it twice, because the first run also downloads the Maven dependencies:

time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt --output=counts" \
  -Pdirect-runner

real 1m35.780s
user 4m10.089s
sys 0m1.743s
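
Since the first Maven run includes dependency downloads, it can help to time several runs of the same command and compare them. Below is a minimal sketch of such a harness using only the Python standard library. The `time_command` helper is hypothetical, and the command in the `__main__` block is a placeholder rather than the real pipeline invocation.

```python
import subprocess
import sys
import time


def time_command(argv, runs=2):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not timed silently.
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; replace it with the
    # Python or Maven invocation from the steps above.
    for i, seconds in enumerate(time_command([sys.executable, "-c", "pass"]), 1):
        print(f"run {i}: {seconds:.3f}s")
```

If the later runs are much faster than the first one, the first measurement mostly reflects one-time setup cost rather than the pipeline itself.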


Thanks,
Krystian Kichewko
