Re: Getting Started With Implementing a Runner

Joey Tran Fri, 23 Jun 2023 13:43:08 -0700

>
> Totally doable by one person, especially given the limited feature set you
> mention above.
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE
> is a good starting point as to what the relationship between a Runner and
> the SDK is at a level of detail sufficient for implementation (told from
> the perspective of an SDK, but the story is largely about the interface
> which is directly applicable).



Great slides, I really appreciate the illustrations.

I hadn't realized there was a concept of an "SDK Worker"; I had imagined
that once the Runner started execution of a workflow, it was Runner all the
way down. Is the Fn API the only way to implement a runner? Our execution
environment is a bit constrained in such a way that we can't expose the
APIs required to implement the Fn API. To be forthright, we basically only
have the ability to start a worker either with a known Pub/Sub topic to
expect data from and a Pub/Sub topic to write to; or with a bundle of data
to process and return the outputs for. We're constrained from really any
additional communication with a worker beyond that.

On Fri, Jun 23, 2023 at 4:02 PM Robert Bradshaw <[email protected]> wrote:

> On Fri, Jun 23, 2023 at 11:15 AM Joey Tran <[email protected]>
> wrote:
>
>> Thanks all for the responses!
>>
>> If Beam Runner Authoring Guide is rather high-level for you, then, at
>>> first, I’d suggest to answer two questions for yourself:
>>> - Am I going to implement a portable runner or native one?
>>>
>>
>> Portable sounds great, but the answer depends on how much additional cost
>> it'd require to implement portable over non-portable, even considering
>> future deprecation (unless deprecation is happening tomorrow). I'm not
>> familiar enough to know what the additional cost is so I don't have a firm
>> answer.
>>
>
> I would say it would not be that expensive to write it in a "portable
> compatible" way (i.e. it uses the publicly-documented protocol as the
> interface rather than reaching into internal details) even if it doesn't
> use gRPC and fire up separate processes/docker images for the workers
> (preferring to do all of that inline like the Python portable direct
> runner does).
>
>
>> - Which SDK I should use for this runner?
>>>
>> I'd be developing this runner against the python SDK and if the runner
>> only worked with the python SDK that'd be okay in the short term
>>
>
> Yes. And if you do it the above way, it should be easy to extend (or not)
> if/when the need arises.
>
>
>> Also, we don’t know if this new runner will be contributed back to Beam,
>>> what is a runtime and what actually is a final goal of it.
>>
>> Likely won't be contributed back to Beam (not sure if it'd actually be
>> useful to a wide audience anyways).
>>
>> The context is we've been developing an in-house large-scale pipeline
>> framework that encapsulates both the programming model and the
>> runner/execution of data workflows. As it's grown, I keep finding myself
>> just reimplementing features and abstractions Beam has already implemented,
>> so I wanted to explore adopting Beam. Our execution environment is very
>> particular though and our workflows require it (due to the way we license
>> our software), so my plan was to try to create a very basic runner that
>> uses our execution environment. The runner could have very few features
>> e.g. no streaming, no metrics, no side inputs, etc. After that I'd probably
>> introduce a shim for some of our internally implemented transforms and
>> assess from there.
>>
>> Not sure if this is a lofty goal or not, so happy to hear your thoughts
>> as to whether this seems reasonable and achievable without a large
>> concerted effort or even if the general idea makes any sense. (I recognize
>> that it might not be *easy*, but I don't have the resources to dedicate
>> more than myself to work on a PoC)
>>
>
> Totally doable by one person, especially given the limited feature set you
> mention above.
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE
> is a good starting point as to what the relationship between a Runner and
> the SDK is at a level of detail sufficient for implementation (told from
> the perspective of an SDK, but the story is largely about the interface
> which is directly applicable).
>
> Given the limited feature set you proposed, this is similar to the
> original Python portable runner which took a week or two to put together
> (granted a lot has been added since then), or the typescript direct runner
> (
> https://github.com/apache/beam/blob/ea9147ad2946f72f7d52924cba2820e9aae7cd91/sdks/typescript/src/apache_beam/runners/direct_runner.ts
> ) which was done (in its basic form, no support for side inputs and such)
> in less than a week. Granted, as these are local runners, this illustrates
> only the Beam-side complexity of things (not the work of actually
> implementing a distributed shuffle, starting and assigning work to multiple
> workers, etc.), but presumably that's the kind of thing your execution
> environment already takes care of.
>
> As for some more concrete pointers, you could probably leverage a lot of
> what's there by invoking create_stages
>
>
> https://github.com/apache/beam/blob/v2.48.0/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py#L362
>
> which will do optimization, fusion, etc. and then implementing your own
> version of run_stages
>
>
> https://github.com/apache/beam/blob/v2.48.0/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py#L392
>
> to execute these in topological order on your compute infrastructure. (If
> you're not doing streaming, this is much more straightforward than all the
> bundler scheduler stuff that currently exists in that code).
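
As a rough illustration of the "execute these in topological order" idea above, here is a minimal sketch in plain Python. It does not use Beam's real classes: `Stage`, `run_stages_in_order`, and `submit_bundle` are hypothetical names invented for this example, and the stages Beam's `create_stages` actually produces carry much more information (transforms, coders, and so on). The assumption is only that each stage knows which upstream stages feed it and that the execution environment can take a bundle of data and return the outputs.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class Stage:
    """Hypothetical stand-in for a fused stage; NOT Beam's actual Stage class."""
    name: str
    fn: Callable[[List], List]  # work applied to one bundle of elements
    inputs: List[str] = field(default_factory=list)  # names of upstream stages


def run_stages_in_order(
    stages: List[Stage],
    submit_bundle: Callable[[Stage, List], List],
) -> Dict[str, List]:
    """Run each stage once all of its upstream stages have finished.

    `submit_bundle` is a hypothetical hook: it hands a stage plus its input
    bundle to the execution environment and returns that stage's outputs.
    """
    by_name = {stage.name: stage for stage in stages}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    sorter = TopologicalSorter({s.name: set(s.inputs) for s in stages})
    outputs: Dict[str, List] = {}
    for name in sorter.static_order():  # predecessors are yielded first
        stage = by_name[name]
        # Concatenate upstream outputs; a root stage gets an empty bundle.
        bundle = [x for upstream in stage.inputs for x in outputs[upstream]]
        outputs[name] = submit_bundle(stage, bundle)
    return outputs


def local_submit(stage: Stage, bundle: List) -> List:
    """Runs the stage in-process; a real runner would ship it to a worker."""
    return stage.fn(bundle)


# Stages are deliberately listed out of order to show the sort at work.
stages = [
    Stage("square", lambda xs: [x * x for x in xs], inputs=["source"]),
    Stage("source", lambda _: [1, 2, 3]),
    Stage("total", lambda xs: [sum(xs)], inputs=["square"]),
]
results = run_stages_in_order(stages, local_submit)
print(results["total"])  # [14]
```

In this sketch a batch-only runner reduces to one loop, with `submit_bundle` as the single point where the execution environment is involved. Swapping `local_submit` for a function that starts a worker with a bundle and collects its outputs would be the part specific to that environment.
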
>
>
>
>>
>>
>>
>>
>>
>> On Fri, Jun 23, 2023 at 12:17 PM Alexey Romanenko <
>> [email protected]> wrote:
>>
>>>
>>>
>>> On 23 Jun 2023, at 17:40, Robert Bradshaw via user <[email protected]>
>>> wrote:
>>>
>>> On Fri, Jun 23, 2023, 7:37 AM Alexey Romanenko <[email protected]>
>>> wrote:
>>>
>>>> If Beam Runner Authoring Guide is rather high-level for you, then, at
>>>> first, I’d suggest to answer two questions for yourself:
>>>> - Am I going to implement a portable runner or native one?
>>>>
>>>
>>> The answer to this should be portable, as non-portable ones will be
>>> deprecated.
>>>
>>>
>>> Well, actually this is a question that I don’t remember us discussing
>>> here in detail before or reaching a common agreement on.
>>>
>>> Actually, I’m not sure that I understand clearly what is meant by
>>> “deprecation” in this case. For example, Portable Spark Runner is actually
>>> heavily based on native Spark RDD runner and its translations. So, which
>>> part should be deprecated and what is a reason for that?
>>>
>>> Well, anyway I guess it’s off topic here.
>>>
>>> Also, we don’t know if this new runner will be contributed back to Beam,
>>> what is a runtime and what actually is a final goal of it.
>>> So I agree that more details on this would be useful.
>>>
>>> —
>>> Alexey
>>>
>>>
>>> - Which SDK I should use for this runner?
>>>>
>>>
>>> The answer to the above question makes this one moot :).
>>>
>>> On a more serious note, could you tell us a bit more about the runner
>>> you're looking at implementing?
>>>
>>>
>>>> Then, depending on answers, I’d suggest to take as an example one of
>>>> the most similar Beam runners and use it as a more detailed source of
>>>> information along with Beam runner doc mentioned before.
>>>>
>>>> —
>>>> Alexey
>>>>
>>>> On 22 Jun 2023, at 14:39, Joey Tran <[email protected]> wrote:
>>>>
>>>> Hi Beam community!
>>>>
>>>> I'm interested in trying to implement a runner with my company's
>>>> execution environment but I'm struggling to get started. I've read the docs
>>>> page
>>>> <https://beam.apache.org/contribute/runner-guide/#testing-your-runner>
>>>> on implementing a runner but it's quite high level. Anyone have any
>>>> concrete suggestions on getting started?
>>>>
>>>> I've started by cloning and running the hello world example
>>>> <https://github.com/apache/beam-starter-python>. I've then subclassed `
>>>> PipelineRunner
>>>> <https://github.com/apache/beam/blob/9d0fc05d0042c2bb75ded511497e1def8c218c33/sdks/python/apache_beam/runners/runner.py#L103>`
>>>> to create my own custom runner but at this point I'm a bit stuck. My custom
>>>> runner just looks like
>>>>
>>>> class CustomRunner(runner.PipelineRunner):
>>>>     def run_pipeline(self, pipeline,
>>>>                      options):
>>>>         self.visit_transforms(pipeline, options)
>>>>
>>>> And when using it I get an error about not having implemented "Impulse"
>>>>
>>>> NotImplementedError: Execution of [<Impulse(PTransform)
>>>> label=[Impulse]>] not implemented in runner <my_app.app.CustomRunner object
>>>> at 0x135d9ff40>.
>>>>
>>>> Am I going about this the right way? Are there tests I can run my
>>>> custom runner against to validate it beyond just running the hello world
>>>> example? I'm finding myself just digging through the beam source to try to
>>>> piece together how a runner works and I'm struggling to get a foothold. Any
>>>> guidance would be greatly appreciated, especially if anyone has any
>>>> experience implementing their own python runner.
>>>>
>>>> Thanks in advance! Also, could I get a Slack invite?
>>>> Cheers,
>>>> Joey
>>>>
>>>>
>>>>
>>>
