Hi,

Did you compare the stages in the Spark UI to identify which stage is
taking the time?

Do you use spark-submit in both cases for the bootstrapping?

I will do a test here as well.
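For reference, the kind of submission I have in mind looks roughly like this (a sketch only: the main class, jar name, master, and paths are placeholders, not taken from your setup; only --runner=SparkRunner is the actual Beam pipeline option that selects the Spark runner):

```shell
# Hypothetical spark-submit invocation for a Beam pipeline on the Spark runner.
# Class name, jar, master, and paths are placeholders -- adjust to your build.
spark-submit \
  --class org.example.BeamWordCount \
  --master yarn \
  --deploy-mode client \
  --executor-memory 1g \
  --num-executors 1 \
  --driver-memory 1g \
  beam-wordcount-bundled.jar \
  --runner=SparkRunner \
  --inputFile=hdfs:///data/input.txt \
  --output=hdfs:///data/counts
```

Submitting both the native Spark job and the Beam job this way keeps the resource allocation identical, so any remaining difference comes from the runner itself.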

Regards
JB

On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> Hi,
>     Thanks for your reply.
>     Our team plans to use Beam instead of Spark, so I'm testing the
> performance of the Beam API.
>     I'm writing some examples with the Spark API and the Beam API, like
> "WordCount", "Join", "OrderBy", "Union"...
>     I use the same resources and configuration to run these jobs.
>     Tim said I should remove "withNumShards(1)" and
> set spark.default.parallelism=32. I did that and tried again, but the Beam
> job still runs very slowly.
>     Here is my Beam code and Spark code:
>    Beam "WordCount":
>     
>    Spark "WordCount":
> 
>    I will try the other examples later.
>     
> Regards
> devin
> 
>      
>     *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net>
>     *Date:* 2018-09-18 22:43
>     *To:* dev@beam.apache.org <mailto:dev@beam.apache.org>
>     *Subject:* Re: How to optimize the performance of Beam on
>     Spark(Internet mail)
> 
>     Hi,
> 
>     The first huge difference is that the Spark runner still uses
>     RDDs, whereas when using Spark directly you are using Datasets. A bunch
>     of optimizations in Spark are related to Datasets.
> 
>     I started a large refactoring of the Spark runner to leverage Spark 2.x
>     (and Datasets).
>     It's not yet ready, as it includes other improvements (the portability
>     layer with the Job API, a first check of the state API, ...).
> 
>     Anyway, by Spark WordCount, do you mean the one included in the Spark
>     distribution?
> 
>     Regards
>     JB
> 
>     On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
>     > Hi,
>     >     I'm testing Beam on Spark.
>     >     I used the Spark example WordCount code to process a 1 GB data
>     > file; it took 1 minute.
>     >     However, I used the Beam example WordCount code to process the
>     > same file; it took 30 minutes.
>     >     My Spark parameters are: --deploy-mode client
>      --executor-memory 1g
>     > --num-executors 1 --driver-memory 1g
>     >     My Spark version is 2.3.1 and my Beam version is 2.5.
>     >     Is there any optimization method?
>     > Thank you.
>     >
>     >    
> 
>     --
>     Jean-Baptiste Onofré
>     jbono...@apache.org
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 
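Since the code attachments above did not come through on the list, here is the shape of Beam WordCount I am assuming for my test (a minimal sketch based on the stock Beam example, not devin's actual code; the class name and HDFS paths are placeholders). Note there is no withNumShards(1) on the write, so the Spark runner can write output shards in parallel; it needs the beam-runners-spark dependency and --runner=SparkRunner to run on Spark:

```java
// Minimal Beam WordCount sketch (assumed shape; paths/class name are placeholders).
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
  public static void main(String[] args) {
    // Pass --runner=SparkRunner on the command line to run on Spark.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("hdfs:///data/input.txt"))
     // Split each line into words, dropping non-letter separators.
     .apply("SplitWords", FlatMapElements
         .into(TypeDescriptors.strings())
         .via(line -> Arrays.asList(line.split("[^\\p{L}]+"))))
     .apply("CountWords", Count.perElement())
     .apply("FormatResults", MapElements
         .into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     // No withNumShards(1): let the runner pick the shard count.
     .apply("WriteCounts", TextIO.write().to("hdfs:///data/counts"));

    p.run().waitUntilFinish();
  }
}
```

If your pipeline differs from this shape, please share it inline (attachments are often stripped by the list), so we can compare like for like.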

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
