Hello! I'm trying to learn Apache Beam, and I was looking into the examples when I noticed something unusual:
It seems that the "word count" example is much faster using Python than Java.

Python example pipeline on King Lear:
real 0m9.294s
user 0m2.822s
sys 0m0.370s

Java example pipeline on King Lear:
real 1m35.780s
user 4m10.089s
sys 0m1.743s

As you can see, it is about 9 sec vs 96 sec of real time, and the Java run uses even more CPU time because it uses all CPU cores.

Is this some kind of limitation of Java's direct runner? Or am I doing something wrong? Is this intended? Should I file a bug? Or maybe this difference is eliminated in real-life pipelines?

I got similar results when testing using Google Colab:
https://beam.apache.org/get-started/try-apache-beam/

When you execute on all of Shakespeare's books in the bucket (gs://dataflow-samples/shakespeare/*), the difference is even greater:

Python:
real 0m47.900s
user 0m18.350s
sys 0m0.579s

Java:
real 14m28.201s
user 28m3.206s
sys 0m7.597s

How to reproduce:

Python 3.7:

  docker run -it --rm python:3.7-buster /bin/bash
  pip3 install apache-beam[gcp]
  mkdir -p /tmp/foo
  cd /tmp/foo
  time python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt --output ./count

real 0m9.294s
user 0m2.822s
sys 0m0.370s

Java:

  docker run -it --rm ubuntu:16.04 /bin/bash
  apt update
  apt install default-jdk maven
  mkdir -p /tmp/foo
  cd /tmp/foo
  mvn archetype:generate -DarchetypeGroupId=org.apache.beam -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples -DarchetypeVersion=2.19.0 -DgroupId=org.example -DartifactId=word-count-beam -Dversion="0.1" -Dpackage=org.apache.beam.examples -DinteractiveMode=false
  cd word-count-beam
  mvn compile
  time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt --output=counts" -Pdirect-runner

Execute twice, because the first time Maven will download dependencies:

  time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt --output=counts" -Pdirect-runner

real 1m35.780s
user 4m10.089s
sys 0m1.743s

Thanks,
Krystian Kichewko
