Hi,

The first huge difference is the fact that the spark runner still uses
RDD whereas directly using spark, you are using dataset. A bunch of
optimization in spark are related to dataset.

I started a large refactoring of the spark runner to leverage Spark 2.x
(and dataset).
It's not yet ready as it includes other improvements (the portability
layer with Job API, a first check of state API, ...).

Anyway, by Spark wordcount, you mean the one included in the spark
distribution ?

Regards
JB

On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> Hi,
>     I'm testing Beam on Spark. 
>     I use spark example code WordCount processing 1G data file, cost 1
> minutes.
>     However, I use Beam example code WordCount processing the same file,
> cost 30minutes.
>     My Spark parameter is :  --deploy-mode client  --executor-memory 1g
> --num-executors 1 --driver-memory 1g
>     My Spark version is 2.3.1,  Beam version is 2.5
>     Is there any optimization method?
> Thank you.
> 
>    

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Reply via email to