Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

Tim Robertson Wed, 19 Sep 2018 02:04:51 -0700

Thank you Devin

Can you also please try Beam with more Spark executors if you are able?


On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) <devind...@tencent.com>
wrote:

> Thanks for your help!
> I will test other examples of Beam on Spark in the future and then
> report back the results.
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) <devind...@tencent.com>; dev <dev@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
>
> Thanks for the details.
>
> I will take a look later tomorrow (I have another issue to investigate
> on the Spark runner today for the Beam 2.7.0 release).
>
> Regards
> JB
>
> On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> > Hi,
> >     I tested with a 300 MB data file, using a command like:
> >     ./spark-submit --master yarn --deploy-mode client --class
> >     com.test.BeamTest --executor-memory 1g --num-executors 1
> >     --driver-memory 1g
> >
> >     I set only one executor, so the tasks run in sequence. One task
> >     takes 10 s. However, a Spark task takes only 0.4 s.
> >
> >
> >
> >     *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> >     *Date:* 2018-09-19 12:22
> >     *To:* dev@beam.apache.org
> >     *Subject:* Re: How to optimize the performance of Beam on
> >     Spark(Internet mail)
> >
> >     Hi,
> >
> >     Did you compare the stages in the Spark UI in order to identify which
> >     stage is taking the time?
> >
> >     Do you use spark-submit in both cases for the bootstrapping?
> >
> >     I will do a test here as well.
> >
> >     Regards
> >     JB
> >
> >     On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> >     > Hi,
> >     >     Thanks for your reply.
> >     >     Our team plans to use Beam instead of Spark, so I'm testing the
> >     > performance of the Beam API.
> >     >     I'm coding some examples with both the Spark API and the Beam
> >     > API, like "WordCount", "Join", "OrderBy", "Union" ...
> >     >     I use the same resources and configuration to run these jobs.
> >     >     Tim said I should remove "withNumShards(1)" and
> >     > set spark.default.parallelism=32. I did so and tried again, but the
> >     > Beam job is still running very slowly.
> >     >     Here is my Beam code and Spark code:
> >     >    Beam "WordCount":
> >     >
> >     >    Spark "WordCount":
> >     >
> >     >    I will try the other examples later.
> >     >
> >     > Regards
> >     > devin
> >     >
> >     >
> >     >     *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> >     >     *Date:* 2018-09-18 22:43
> >     >     *To:* dev@beam.apache.org
> >     >     *Subject:* Re: How to optimize the performance of Beam on
> >     >     Spark(Internet mail)
> >     >
> >     >     Hi,
> >     >
> >     >     The first huge difference is that the Spark runner still uses
> >     >     RDDs, whereas when using Spark directly you are using Datasets.
> >     >     Many of Spark's optimizations are tied to Datasets.
> >     >
> >     >     I started a large refactoring of the Spark runner to leverage
> >     >     Spark 2.x (and Datasets).
> >     >     It's not ready yet, as it includes other improvements (the
> >     >     portability layer with the Job API, a first check of the state
> >     >     API, ...).
> >     >
> >     >     Anyway, by Spark wordcount, do you mean the one included in the
> >     >     Spark distribution?
> >     >
> >     >     Regards
> >     >     JB
> >     >
> >     >     On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> >     >     > Hi,
> >     >     >     I'm testing Beam on Spark.
> >     >     >     I used the Spark example WordCount to process a 1 GB data
> >     >     > file; it took 1 minute.
> >     >     >     However, when I used the Beam example WordCount to process
> >     >     > the same file, it took 30 minutes.
> >     >     >     My Spark parameters are: --deploy-mode client
> >     >     > --executor-memory 1g --num-executors 1 --driver-memory 1g
> >     >     >     My Spark version is 2.3.1; my Beam version is 2.5.
> >     >     >     Is there any way to optimize this?
> >     >     > Thank you.
> >     >     >
> >     >     >
> >     >
> >     >     --
> >     >     Jean-Baptiste Onofré
> >     >     jbono...@apache.org
> >     >     http://blog.nanthrax.net
> >     >     Talend - http://www.talend.com
> >     >
> >
> >     --
> >     Jean-Baptiste Onofré
> >     jbono...@apache.org
> >     http://blog.nanthrax.net
> >     Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
