Status of dynamic worker scaling with Kafka consumers

Adam Bellemare Thu, 06 Aug 2020 09:58:22 -0700

Hi Folks

When processing events from Kafka, it seems that, from my reading, the
distribution of partitions maps directly to the worker via the concept of
'splits' :


https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedSource.java#L54

>From the code:

> The partitions are evenly distributed among the splits. The number of
splits returned is {@code
> min(desiredNumSplits, totalNumPartitions)}, though better not to depend
on the exact count.

> <p>It is important to assign the partitions deterministically so that we
can support resuming a
> split from last checkpoint. The Kafka partitions are sorted by {@code
<topic, partition>} and then
> assigned to splits in round-robin order.

I'm not intimately familiar with Beam's execution model, but my reading of
this code suggests that:
1) Beam allocates partitions to workers once, at creation time
2) This implies that once started, the worker count cannot be changed as
the partitions are not redistributed
3) Any state is tied to the split, which is in turn tied to the worker.
This means outside of, say, a global window
<https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/transforms/windowing/GlobalWindow.html>,
materialized kafka state is "localized" to a worker.

Follow up Q:
4) Is this independent of the runner? I am much more familiar with Apache
Spark as a runner than say, Dataflow.

If any could confirm or refute my 3 statements and 1 question, it would go
a long way towards validating my understanding of Beam's current
relationship to scaling and partitioned data locality with Kafka.

Thanks

Status of dynamic worker scaling with Kafka consumers

Reply via email to