Thanks for the details. I will take a look tomorrow (I have another issue to investigate on the Spark runner today for the Beam 2.7.0 release).
Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I tested a 300 MB data file, using a command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One Beam task costs
> 10s; however, the equivalent Spark task costs only 0.4s.
>
> *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org
> *Subject:* Re: How to optimize the performance of Beam on
> Spark (Internet mail)
>
> Hi,
>
> Did you compare the stages in the Spark UI in order to identify which
> stage is taking time?
>
> Do you use spark-submit in both cases for the bootstrapping?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples through the Spark API and the Beam API,
> > like "WordCount", "Join", "OrderBy", "Union" ...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did that and tried again, but the
> > Beam job still runs very slowly.
> > Here is my Beam code and Spark code:
> > Beam "WordCount":
> >
> > Spark "WordCount":
> >
> > I will try the other examples later.
> >
> > Regards
> > devin
> >
> > *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark (Internet mail)
> >
> > Hi,
> >
> > The first huge difference is that the Spark runner still uses RDDs,
> > whereas directly using Spark you are using Datasets. A bunch of
> > optimizations in Spark are related to Datasets.
> > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > (and Datasets).
> > It's not yet ready, as it includes other improvements (the portability
> > layer with the Job API, a first take on the State API, ...).
> >
> > Anyway, by Spark WordCount, do you mean the one included in the Spark
> > distribution?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I used the Spark example code WordCount to process a 1 GB data
> > > file; it cost 1 minute.
> > > However, using the Beam example code WordCount to process the
> > > same file cost 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1; the Beam version is 2.5.
> > > Is there any optimization method?
> > > Thank you.
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
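For reference, the tuning advice quoted in the thread (removing withNumShards(1) from the pipeline code and raising spark.default.parallelism) translates into a spark-submit invocation along these lines. This is a sketch, not the poster's actual command: the jar name, executor count, and HDFS paths are placeholders, and it assumes the stock Beam examples WordCount (org.apache.beam.examples.WordCount) bundled with the Spark runner.

```shell
# Sketch only: jar name, executor count, and paths are placeholders.
# --runner, --inputFile, and --output are Beam WordCount pipeline options;
# the rest are standard spark-submit flags.
./spark-submit \
  --master yarn --deploy-mode client \
  --class org.apache.beam.examples.WordCount \
  --executor-memory 1g --num-executors 4 --driver-memory 1g \
  --conf spark.default.parallelism=32 \
  beam-examples-bundled.jar \
  --runner=SparkRunner \
  --inputFile=hdfs:///path/to/input.txt \
  --output=hdfs:///path/to/counts
```

With a single 1g executor, as in the original test, the parallelism setting cannot help much, since there is only one slot to run tasks in; more executors give the extra partitions somewhere to run.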