Thank you, Devin. Can you also please try Beam with more Spark executors if you are able?
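
As a rough sketch of what I mean, the command quoted below could be rerun with a higher executor count. The values 4 and 2, the jar name beam-test.jar, and the --runner=SparkRunner pipeline argument are assumptions for illustration and are not taken from your setup. The class name and memory settings come from your command, and spark.default.parallelism=32 is the setting Tim suggested.

```shell
# Sketch only: same job as in the quoted command, but with more executors.
# Assumptions (hypothetical): 4 executors, 2 cores each, and a jar named
# beam-test.jar that contains com.test.BeamTest.
# --runner=SparkRunner is passed to the Beam pipeline as an application
# argument; drop it if the runner is already set in the code.
./spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.test.BeamTest \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 1g \
  --driver-memory 1g \
  --conf spark.default.parallelism=32 \
  beam-test.jar \
  --runner=SparkRunner
```

Comparing the per-stage times in the Spark UI for this run against the single-executor run should show whether the Beam job scales with the extra executors.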

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) <devind...@tencent.com> wrote:

> Thanks for your help!
> I will test other examples of Beam on Spark in the future and then feed
> back the results.
> Regards
> devin
>
> *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) <devind...@tencent.com>; dev <dev@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet mail)
>
> Thanks for the details.
>
> I will take a look later tomorrow (I have another issue to investigate
> on the Spark runner today for the Beam 2.7.0 release).
>
> Regards
> JB
>
> On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> > Hi,
> > I tested with a 300MB data file.
> > I used a command like:
> > ./spark-submit --master yarn --deploy-mode client --class com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
> >
> > I set only one executor, so the tasks run in sequence. One Beam task takes 10s.
> > However, a Spark task takes only 0.4s.
> >
> > *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> > *Date:* 2018-09-19 12:22
> > *To:* dev@beam.apache.org
> > *Subject:* Re: How to optimize the performance of Beam on Spark(Internet mail)
> >
> > Hi,
> >
> > Did you compare the stages in the Spark UI in order to identify which
> > stage is taking time?
> >
> > Do you use spark-submit in both cases for the bootstrapping?
> >
> > I will do a test here as well.
> >
> > Regards
> > JB
> >
> > On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > > Hi,
> > > Thanks for your reply.
> > > Our team plans to use Beam instead of Spark, so I'm testing the
> > > performance of the Beam API.
> > > I'm coding some examples with the Spark API and the Beam API, like
> > > "WordCount", "Join", "OrderBy", "Union" ...
> > > I use the same resources and configuration to run these jobs.
> > > Tim said I should remove "withNumShards(1)" and
> > > set spark.default.parallelism=32. I did that and tried again, but the
> > > Beam job is still running very slowly.
> > > Here is my Beam code and Spark code:
> > > Beam "WordCount":
> > >
> > > Spark "WordCount":
> > >
> > > I will try the other examples later.
> > >
> > > Regards
> > > devin
> > >
> > > *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> > > *Date:* 2018-09-18 22:43
> > > *To:* dev@beam.apache.org
> > > *Subject:* Re: How to optimize the performance of Beam on Spark(Internet mail)
> > >
> > > Hi,
> > >
> > > The first huge difference is the fact that the Spark runner still uses
> > > RDDs, whereas when using Spark directly you are using Datasets. A bunch of
> > > optimizations in Spark are related to Datasets.
> > >
> > > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > > (and Datasets).
> > > It's not yet ready as it includes other improvements (the portability
> > > layer with Job API, a first check of state API, ...).
> > >
> > > Anyway, by Spark wordcount, do you mean the one included in the Spark
> > > distribution?
> > >
> > > Regards
> > > JB
> > >
> > > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > > Hi,
> > > > I'm testing Beam on Spark.
> > > > I used the Spark example code WordCount to process a 1G data file; it
> > > > takes 1 minute.
> > > > However, when I use the Beam example code WordCount to process the same
> > > > file, it takes 30 minutes.
> > > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > > --num-executors 1 --driver-memory 1g
> > > > My Spark version is 2.3.1, Beam version is 2.5
> > > > Is there any optimization method?
> > > > Thank you.
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com