Same need here, using the Flink runner. We are processing a PCollection (extracting features per element), then combining the features into groups and running the next operator on those groups. Each group contains ~50 elements, so the parallelism of the operator upstream of the GroupByKey should be higher, to stay balanced with the downstream operator.
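[For readers following along, here is a plain-Python sketch of the pipeline shape being described - extract a feature per element, group, then process each group. This is not Beam code; the names (extract_feature, process_group, the keying scheme) are illustrative stand-ins.]

```python
from collections import defaultdict

def extract_feature(element):
    # Stand-in for the per-element feature-extraction step
    # (the high-parallelism step upstream of the GroupByKey).
    return element["key"], element["value"] * 2

def process_group(features):
    # Stand-in for the downstream operator that runs once per group.
    return sum(features)

def run(elements):
    grouped = defaultdict(list)
    for element in elements:
        key, feature = extract_feature(element)
        grouped[key].append(feature)  # the GroupByKey boundary
    # One downstream invocation per group (~50 features per group in
    # the real job), so far fewer parallel units than upstream.
    return {key: process_group(feats) for key, feats in grouped.items()}

elements = [{"key": i % 2, "value": i} for i in range(4)]
result = run(elements)  # {0: 4, 1: 8}
```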
On Tue, Apr 18, 2023 at 19:17 Jeff Zhang <zjf...@gmail.com> wrote:
Hi Reuven,
It would be better to set parallelism per operator. As I mentioned before, there may be multiple GroupBy and join operators in one pipeline, and their parallelism can differ because their input data sizes differ.
On Wed, Apr 19, 2023 at 3:59 AM Reuven Lax <re...@google.com> wrote:
Jeff - does setting the global default work for you, or do you need per-operator control? Seems like it would make sense to add this to ResourceHints.
On Tue, Apr 18, 2023 at 12:35 PM Robert Bradshaw
<rober...@google.com> wrote:
Yeah, I don't think we have a good per-operator API
for this. If we were to add it, it probably belongs
in ResourceHints.
On Sun, Apr 16, 2023 at 11:28 PM Reuven Lax
<re...@google.com> wrote:
Looking at FlinkPipelineOptions, there is a
parallelism option you can set. I believe this
sets the default parallelism for all Flink operators.
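[A minimal invocation sketch of that option, for reference - the script name is a placeholder. Note this sets one default for the whole pipeline; it is not a per-operator control.]

```shell
# Hypothetical Beam-on-Flink invocation; --parallelism here is the
# FlinkPipelineOptions default applied to all operators.
python my_pipeline.py \
  --runner=FlinkRunner \
  --parallelism=8
```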
On Sun, Apr 16, 2023 at 7:20 PM Jeff Zhang
<zjf...@gmail.com> wrote:
Thanks Holden, this would work for Spark, but Flink doesn't have such a mechanism, so I am looking for a general solution on the Beam side.
On Mon, Apr 17, 2023 at 10:08 AM Holden Karau
<hol...@pigscanfly.ca> wrote:
To a (small) degree Spark's "new" AQE might be able to help, depending on what kind of operations Beam is compiling the pipeline down to.
Have you tried setting spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled?
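[For reference, the two settings Holden names would typically be passed like this - app.py is a placeholder, and as he notes, whether AQE kicks in depends on what the Beam pipeline compiles down to.]

```shell
# Hypothetical spark-submit flags enabling adaptive query execution
# and post-shuffle partition coalescing.
spark-submit \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  app.py
```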
On Mon, Apr 17, 2023 at 10:34 AM Reuven Lax via user <user@beam.apache.org> wrote:
I see. Robert - what is the story for parallelism controls on GBK with the Spark or Flink runners?
On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
No, I don't use Dataflow; I use Spark & Flink.
On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:
Are you running on the Dataflow runner? If so, Dataflow - unlike Spark and Flink - dynamically modifies the parallelism as the operator runs, so there is no need to have such controls. In fact these specific controls wouldn't make much sense for the way Dataflow implements these operators.
On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:
Just for performance tuning, like in Spark and Flink.
On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <user@beam.apache.org> wrote:
What are you trying to achieve by setting the parallelism?
On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
Thanks Reuven, what I mean is setting the parallelism at the operator level. And the input size of an operator is unknown at compile time unless it is a source operator.
Here's an example from Flink:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
Spark also supports setting operator-level parallelism (see groupByKey and reduceByKey):
https://spark.apache.org/docs/latest/rdd-programming-guide.html
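[To make the Spark-style control concrete: passing numPartitions to groupByKey/reduceByKey fixes how many partitions the shuffle produces, which bounds that operator's parallelism. A plain-Python sketch of the underlying hash partitioning - not PySpark, names illustrative:]

```python
def partition_by_key(pairs, num_partitions):
    # Hash-partition key/value pairs into a fixed number of partitions,
    # as a numPartitions argument does for a Spark shuffle: all pairs
    # with the same key land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
partitions = partition_by_key(pairs, 2)
# len(partitions) == 2, so at most 2 downstream tasks run in parallel,
# regardless of how many upstream tasks produced the pairs.
```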
On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:
The maximum parallelism is always determined by the parallelism of your data. If you do a GroupByKey, for example, the number of keys in your data determines the maximum parallelism.
Beyond the limitations in your data, it depends on your execution engine. If you're using Dataflow, Dataflow is designed to automatically determine the parallelism (e.g. work will be dynamically split and moved around between workers, the number of workers will autoscale, etc.), so there's no need to explicitly set the parallelism of the execution.
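[The key-count bound can be illustrated with a tiny plain-Python sketch (illustrative, not runner code): after a GroupByKey there is one group per distinct key, so no more than that many units of work can proceed in parallel downstream.]

```python
def max_groupbykey_parallelism(pairs):
    # One group per distinct key after a GroupByKey, so the distinct
    # key count is the ceiling on downstream parallelism.
    return len({key for key, _ in pairs})

pairs = [("a", 1), ("b", 2), ("a", 3)]
limit = max_groupbykey_parallelism(pairs)  # 2 distinct keys
```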
On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
Besides the global parallelism of a Beam job, is there any way to set the parallelism of individual operators like group by and join? I understand the parallelism setting depends on the underlying execution engine, but it is very common to set the parallelism of operators like group by and join in both Spark & Flink.
--
Best Regards
Jeff Zhang
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau