Thanks for sharing those results.
The second set (executors at 20-30) looks similar to what I would have
expected.
BEAM-5036 definitely plays a part here as the data is not moved on HDFS
efficiently (fix in PR awaiting review now [1]).
To give an idea of the impact, here are some numbers from my o
Something here on the Beam side is clearly linear in the input size, as if
there's a bottleneck where we're not able to get any parallelization. Is
the Spark variant running in parallel?
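To make the parallelization question concrete: a word count parallelizes by sharding the input, counting each shard independently, and merging the partial counts. The sketch below is plain Python, purely illustrative (not Beam or Spark code); the shard/merge structure stands in for what a runner's workers would do, and threads are used only to keep the example self-contained:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    """Count words in one shard of input lines (the per-worker step)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def parallel_word_count(lines, shards=4):
    """Split input into shards, count each concurrently, merge the results."""
    chunks = [lines[i::shards] for i in range(shards)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=shards) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total
```

The merged result is identical for any shard count; if a job's runtime grows linearly with input size while a comparable Spark job's does not, the suspicion above is that the "count each shard concurrently" step is effectively running serially.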
On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) wrote:
> Hi
> I have completed my test.
> 1. Spark paramet
am On Spark in the future and then feed back
the results.
Regards
devin
From: Jean-Baptiste Onofré<mailto:j...@nanthrax.net>
Date: 2018-09-19 16:32
To: devinduan(段丁瑞)<mailto:devind...@tencent.com>;
dev<mailto:dev@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
>
>
> *From:* Jean-Baptiste Onofré
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) ; dev
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thanks for the details.
I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).
Regards
JB
On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
>
> *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org <mailto:dev@beam.apache.org>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stag
> Beam "WordCount":
>
> Spark "WordCount":
>
> I will try the other example later.
>
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net>
> *Date:* 2018-09-18 22:43
> *To:* dev@beam.apache.org <mailto:dev@beam.apache.org>
Hi,
The first huge difference is the fact that the Spark runner still uses
RDDs, whereas directly using Spark you are using Datasets. A bunch of
optimizations in Spark are related to Datasets.
I started a large refactoring of the Spark runner to leverage Spark 2.x
(and Datasets).
It's not yet ready as
Hi devinduan
The known issues Robert links there are actually HDFS related and not
specific to Spark. The improvement we're seeking is that the final copy of
the output file can be optimised by using a "move" instead of a "copy", and I
expect to have it fixed for Beam 2.8.0. On a small dataset like
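For context on why the move-vs-copy change matters: a rename is a metadata-only operation whose cost is independent of file size, while a copy re-reads and re-writes every byte of the output. A minimal local-filesystem sketch of the two finalization approaches (plain Python, purely illustrative; the file names are made up, and HDFS semantics differ in detail, but the cost distinction is the same):

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
# Hypothetical staged shard name, loosely modeled on sharded output files.
staged = os.path.join(workdir, "wordcount-00000-of-00001.tmp")
final = os.path.join(workdir, "wordcount-00000-of-00001")

with open(staged, "w") as f:
    f.write("the 42\nbeam 7\n")

# Copy approach: every byte of the staged output is read and rewritten,
# so finalization cost grows with output size.
copied = final + ".copy"
shutil.copyfile(staged, copied)

# Move approach: a single rename; cost is independent of file size.
os.replace(staged, final)
```

After the rename, the staged file is gone and the final file holds the same bytes, with no second pass over the data.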
There are known performance issues with Beam on Spark that are being worked
on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's possible
you're hitting something different, but would be worth investigating. See
also
https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performan
Hi,
I'm testing Beam on Spark.
I used the Spark example code WordCount to process a 1G data file; it took
1 minute. However, when I used the Beam example code WordCount to process
the same file, it took 30 minutes.
My Spark parameter is : --deploy-mode client --executor-memory 1g
--num-executors 1 --d
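For reference, a single 1g executor gives the runner almost nothing to fan out over, which matches the "executors at 20-30" observation earlier in the thread. A sketch of a submission with more parallelism follows; the jar name, paths, and main class are placeholders/assumptions, not taken from the thread:

```shell
# All paths, the jar name, and the main class below are placeholders.
spark-submit \
  --deploy-mode client \
  --num-executors 20 \
  --executor-cores 2 \
  --executor-memory 2g \
  --class org.apache.beam.examples.WordCount \
  word-count-bundled.jar \
  --runner=SparkRunner \
  --inputFile=hdfs:///data/input-1g.txt \
  --output=hdfs:///data/counts
```

The flags before the jar are consumed by spark-submit; the flags after it are Beam pipeline options passed to the example's main class.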