Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Tim Robertson
Thanks for sharing those results. The second set (executors at 20-30) looks similar to what I would have expected. BEAM-5036 definitely plays a part here, as the data is not moved on HDFS efficiently (fix in PR awaiting review now [1]). To give an idea of the impact, here are some numbers from my o

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Robert Bradshaw
Something here on the Beam side is clearly linear in the input size, as if there's a bottleneck where we're not able to get any parallelization. Is the Spark variant running in parallel? On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) wrote: > Hi > I have completed my test. > 1. Spark paramet

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread 段丁瑞
am On Spark in the future and then feed back the results. Regards devin From: Jean-Baptiste Onofré<mailto:j...@nanthrax.net> Date: 2018-09-19 16:32 To: devinduan(段丁瑞)<mailto:devind...@tencent.com>; dev<mailto:dev@beam.apache.org> Subject: Re: How to optimize the performance of

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread Tim Robertson
vin > > > *From:* Jean-Baptiste Onofré > *Date:* 2018-09-19 16:32 > *To:* devinduan(段丁瑞) ; dev > *Subject:* Re: How to optimize the performance of Beam on Spark(Internet > mail) > > Thanks for the details. > > I will take a look later tomorrow (I have another issue to

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread 段丁瑞
rg> Subject: Re: How to optimize the performance of Beam on Spark(Internet mail) Thanks for the details. I will take a look later tomorrow (I have another issue to investigate on the Spark runner today for Beam 2.7.0 release). Regards JB On 19/09/2018 08:31, devinduan(段丁瑞) wrote: > Hi, >

Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread Jean-Baptiste Onofré
ste Onofré <mailto:j...@nanthrax.net> > *Date:* 2018-09-19 12:22 > *To:* dev@beam.apache.org <mailto:dev@beam.apache.org> > *Subject:* Re: How to optimize the performance of Beam on > Spark(Internet mail) > > Hi, > > did you compare the stag

Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-18 Thread Jean-Baptiste Onofré
": > > Spark "WordCount": > > I will try the other example later. > > Regards > devin > > > *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net> > *Date:* 2018-09-18 22:43 > *To:* dev@beam.apache.org <mailto:dev@b

Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Jean-Baptiste Onofré
Hi, The first big difference is that the Spark runner still uses RDDs, whereas when using Spark directly you are using Datasets. Many of Spark's optimizations are tied to Datasets. I started a large refactoring of the Spark runner to leverage Spark 2.x (and Datasets). It's not yet ready as

Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Tim Robertson
Hi devinduan, The known issues Robert links there are actually HDFS related and not specific to Spark. The improvement we're seeking is that the final copy of the output file can be optimised by using a "move" instead of a "copy", and I expect to have it fixed for Beam 2.8.0. On a small dataset like

Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Robert Bradshaw
There are known performance issues with Beam on Spark that are being worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's possible you're hitting something different, but it would be worth investigating. See also https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performan

How to optimize the performance of Beam on Spark

2018-09-17 Thread 段丁瑞
Hi, I'm testing Beam on Spark. I ran the Spark example WordCount on a 1 GB data file, and it took 1 minute. However, running the Beam example WordCount on the same file took 30 minutes. My Spark parameters are: --deploy-mode client --executor-memory 1g --num-executors 1 --d