Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for the Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
>     I tested a 300MB data file.
>     I used a command like:
>     ./spark-submit --master yarn --deploy-mode client  --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g 
>    
>  I set only one executor, so tasks run in sequence. One Beam task takes
> 10s, whereas the equivalent Spark task takes only 0.4s.
> 
> 
> 
>     *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net>
>     *Date:* 2018-09-19 12:22
>     *To:* dev@beam.apache.org <mailto:dev@beam.apache.org>
>     *Subject:* Re: How to optimize the performance of Beam on
>     Spark(Internet mail)
> 
>     Hi,
> 
>     Did you compare the stages in the Spark UI in order to identify which
>     stage is taking the time?
> 
>     Do you use spark-submit in both cases for the bootstrapping?
> 
>     I will do a test here as well.
> 
>     Regards
>     JB
> 
>     On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
>     > Hi,
>     >     Thanks for your reply.
>     >     Our team plans to use Beam instead of Spark, so I'm testing the
>     > performance of the Beam API.
>     >     I'm coding some examples with both the Spark API and the Beam
>     > API, like "WordCount", "Join", "OrderBy", "Union" ...
>     >     I use the same resources and configuration to run these jobs.
>     >     Tim said I should remove "withNumShards(1)" and
>     > set spark.default.parallelism=32. I did that and tried again, but the
>     > Beam job is still running very slowly.
>     >     Here is my Beam code and Spark code:
>     >    Beam "WordCount":
>     >     
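>     >     (A minimal, hypothetical sketch of what a Beam WordCount along
>     >     these lines might look like; the class name and input/output
>     >     paths are placeholders, not the actual job from this thread:)
>     >
>     >     import java.util.Arrays;
>     >
>     >     import org.apache.beam.sdk.Pipeline;
>     >     import org.apache.beam.sdk.io.TextIO;
>     >     import org.apache.beam.sdk.options.PipelineOptionsFactory;
>     >     import org.apache.beam.sdk.transforms.Count;
>     >     import org.apache.beam.sdk.transforms.FlatMapElements;
>     >     import org.apache.beam.sdk.transforms.MapElements;
>     >     import org.apache.beam.sdk.values.KV;
>     >     import org.apache.beam.sdk.values.TypeDescriptors;
>     >
>     >     public class BeamWordCount {
>     >       public static void main(String[] args) {
>     >         // Runner and other options come from the command line,
>     >         // e.g. --runner=SparkRunner when submitted via spark-submit.
>     >         Pipeline p = Pipeline.create(
>     >             PipelineOptionsFactory.fromArgs(args).create());
>     >
>     >         p.apply(TextIO.read().from("hdfs:///tmp/input.txt"))  // placeholder path
>     >          .apply(FlatMapElements.into(TypeDescriptors.strings())
>     >              .via((String line) -> Arrays.asList(line.split("\\s+"))))
>     >          .apply(Count.perElement())
>     >          .apply(MapElements.into(TypeDescriptors.strings())
>     >              .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
>     >          // No withNumShards(1), following Tim's earlier suggestion.
>     >          .apply(TextIO.write().to("hdfs:///tmp/beam-output"));  // placeholder path
>     >
>     >         p.run().waitUntilFinish();
>     >       }
>     >     }
>     >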
>     >    Spark "WordCount":
>     >
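>     >     (Similarly, a rough sketch of a Spark WordCount, roughly along
>     >     the lines of the JavaWordCount example bundled with Spark; the
>     >     class name and paths are placeholders, not the actual job from
>     >     this thread:)
>     >
>     >     import java.util.Arrays;
>     >
>     >     import org.apache.spark.api.java.JavaPairRDD;
>     >     import org.apache.spark.sql.SparkSession;
>     >     import scala.Tuple2;
>     >
>     >     public class SparkWordCount {
>     >       public static void main(String[] args) {
>     >         SparkSession spark = SparkSession.builder()
>     >             .appName("SparkWordCount").getOrCreate();
>     >
>     >         // Read lines, split into words, and count each word with
>     >         // classic RDD operations.
>     >         JavaPairRDD<String, Long> counts = spark.read()
>     >             .textFile("hdfs:///tmp/input.txt")  // placeholder path
>     >             .javaRDD()
>     >             .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
>     >             .mapToPair(word -> new Tuple2<>(word, 1L))
>     >             .reduceByKey(Long::sum);
>     >
>     >         counts.map(kv -> kv._1() + ": " + kv._2())
>     >               .saveAsTextFile("hdfs:///tmp/spark-output");  // placeholder path
>     >
>     >         spark.stop();
>     >       }
>     >     }
>     >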
>     >    I will try the other example later.
>     >     
>     > Regards
>     > devin
>     >
>     >      
>     >     *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net>
>     >     *Date:* 2018-09-18 22:43
>     >     *To:* dev@beam.apache.org <mailto:dev@beam.apache.org>
>     >     *Subject:* Re: How to optimize the performance of Beam on
>     >     Spark(Internet mail)
>     >
>     >     Hi,
>     >
>     >     The first huge difference is the fact that the Spark runner
>     >     still uses RDDs, whereas when directly using Spark you are using
>     >     datasets. A bunch of optimizations in Spark are related to
>     >     datasets.
>     >
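>     >     As an illustration of that difference (this is not the Beam
>     >     Spark runner's actual code, just a hypothetical sketch with
>     >     placeholder paths), the same word count expressed directly with
>     >     the Dataset API, which the Catalyst optimizer can plan as a
>     >     whole, might look like:
>     >
>     >     import org.apache.spark.sql.Dataset;
>     >     import org.apache.spark.sql.Row;
>     >     import org.apache.spark.sql.SparkSession;
>     >     import static org.apache.spark.sql.functions.col;
>     >     import static org.apache.spark.sql.functions.explode;
>     >     import static org.apache.spark.sql.functions.split;
>     >
>     >     public class DatasetWordCount {
>     >       public static void main(String[] args) {
>     >         SparkSession spark = SparkSession.builder()
>     >             .appName("DatasetWordCount").getOrCreate();
>     >
>     >         // textFile() yields a Dataset<String> with a single "value"
>     >         // column; the whole query below is planned by Catalyst.
>     >         Dataset<Row> counts = spark.read()
>     >             .textFile("hdfs:///tmp/input.txt")  // placeholder path
>     >             .select(explode(split(col("value"), "\\s+")).as("word"))
>     >             .groupBy("word")
>     >             .count();
>     >
>     >         counts.write().csv("hdfs:///tmp/dataset-output");  // placeholder path
>     >         spark.stop();
>     >       }
>     >     }
>     >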
>     >     I started a large refactoring of the Spark runner to leverage
>     >     Spark 2.x (and datasets).
>     >     It's not yet ready as it includes other improvements (the
>     >     portability layer with the Job API, a first check of the state
>     >     API, ...).
>     >
>     >     Anyway, by Spark WordCount, do you mean the one included in the
>     >     Spark distribution?
>     >
>     >     Regards
>     >     JB
>     >
>     >     On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
>     >     > Hi,
>     >     >     I'm testing Beam on Spark. 
>     >     >     I use the Spark example WordCount code to process a 1GB
>     >     > data file; it takes 1 minute.
>     >     >     However, when I use the Beam example WordCount code to
>     >     > process the same file, it takes 30 minutes.
>     >     >     My Spark parameters are: --deploy-mode client
>     >     > --executor-memory 1g --num-executors 1 --driver-memory 1g
>     >     >     My Spark version is 2.3.1 and my Beam version is 2.5.
>     >     >     Is there any optimization method?
>     >     > Thank you.
>     >     >
>     >     >    
>     >
>     >     --
>     >     Jean-Baptiste Onofré
>     >     jbono...@apache.org
>     >     http://blog.nanthrax.net
>     >     Talend - http://www.talend.com
>     >
> 
>     --
>     Jean-Baptiste Onofré
>     jbono...@apache.org
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
