Hi,

The first big difference is that the Spark runner still uses RDDs, whereas when you use Spark directly you are working with Datasets. Many of Spark's optimizations are tied to the Dataset API.
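To make the contrast concrete, here is a minimal sketch of the same word count written both ways: against the RDD API (roughly the style the Spark runner translates Beam pipelines into) and against the Dataset API (which Spark's Catalyst optimizer can plan and optimize). This is illustrative only; the input path, app name, and local master are assumptions, not details from the thread.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

public class WordCountComparison {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("wordcount-comparison")
            .master("local[*]")        // illustrative; the thread ran with --deploy-mode client
            .getOrCreate();

        String input = "input.txt";    // hypothetical input file

        // RDD-style word count: opaque Java lambdas that Spark cannot inspect,
        // so no Catalyst query optimization applies.
        JavaRDD<String> lines = spark.read().textFile(input).javaRDD();
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
             .mapToPair(word -> new scala.Tuple2<>(word, 1L))
             .reduceByKey(Long::sum)
             .saveAsTextFile("rdd-counts");

        // Dataset-style word count: declarative relational operators that
        // Catalyst can optimize and Tungsten can execute efficiently.
        Dataset<Row> counts = spark.read().textFile(input)
            .select(explode(split(col("value"), "\\s+")).as("word"))
            .groupBy("word")
            .count();
        counts.write().csv("dataset-counts");

        spark.stop();
    }
}
```

The logic is identical, but only the second version goes through Spark's query planner, which is one reason a hand-written Spark Dataset job can outperform the RDD-based plan the current Spark runner produces.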
I started a large refactoring of the Spark runner to leverage Spark 2.x (and Datasets). It's not ready yet, as it also includes other improvements (the portability layer with the Job API, a first take on the State API, ...).

Anyway, by Spark WordCount, do you mean the one included in the Spark distribution?

Regards
JB

On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> Hi,
>     I'm testing Beam on Spark.
>     I used the Spark example WordCount to process a 1 GB data file; it took 1 minute.
>     However, when I used the Beam example WordCount to process the same file, it took 30 minutes.
>     My Spark parameters are: --deploy-mode client --executor-memory 1g --num-executors 1 --driver-memory 1g
>     My Spark version is 2.3.1, Beam version is 2.5.
>     Is there any optimization method?
>     Thank you.

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com