Re: [PROPOSAL] Batched DoFns in the Python SDK

Chad Dombrova Fri, 17 Dec 2021 08:57:21 -0800

Hi Brian,
We implemented a feature that's similar to this, but with a different
motivation: scheduled tasks.  We had the same need of creating batches of
logical elements, but rather than perform SIMD-optimized computations, we
want to produce remotely scheduled tasks.  It's my hope that the underlying
job of slicing a stream into batches of elements can achieve both goals.
We've been using this change in production for awhile now, but I would
really love to get this onto master.  Can you have a look and see how it
compares to what you have in mind?

Here's the info I sent out about this earlier:

Beam's niche is low latency, high throughput workloads, but Beam has
incredible promise as an orchestrator of long running work that gets sent
to a scheduler.  We've created a modified version of Beam that allows the
python SDK worker to outsource tasks to a scheduler, like Kubernetes batch
jobs[1], Argo[2], or Google's own OpenCue[3].

The basic idea is that any element in a stream can be tagged to be executed
outside of the normal SdkWorker as an atomic "task".  A task is one
invocation of a stage, composed of one or more DoFns, against a slice of
the data stream, composed of one or more tagged elements.   The upshot is
that we're able to slice up the processing of a stream across potentially
*many* workers, with the trade-off being the added overhead of starting up
a worker process for each task.

For more info on how we use our modified version of Beam to make visual
effects for feature films, check out the talk[4] I gave at the Beam Summit.

Here's our design doc:
https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing

And here's the github branch:
https://github.com/LumaPictures/beam/tree/taskworker_public

[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
[2] https://argoproj.github.io/
[3] https://cloud.google.com/opencue
[4] https://www.youtube.com/watch?v=gvbQI3I03a8&ab_channel=ApacheBeam

-chad

On Wed, Dec 15, 2021 at 9:59 AM Brian Hulette <bhule...@google.com> wrote:

> Hi all,
>
> I've drafted a proposal to add a "Batched DoFn" concept to the Beam Python
> SDK. The primary motivation is to make it more natural to draft vectorized
> Beam pipelines by allowing DoFns to produce and/or consume batches of
> logical elements. You can find the proposal here:
>
> https://s.apache.org/batched-dofns
>
> Please take a look and let me know what you think.
>
> Thanks!
> Brian
>

Re: [PROPOSAL] Batched DoFns in the Python SDK

Reply via email to