Hi Can,
Thanks for the update. Interesting question. Flink has an optimization
built in called chaining which works together nicely with Beam.
Essentially, operators which share the same partitioning get executed
one after another inside a master operator. This saves resources.
Interestingly, Beam's Fuser for portable Runners does something similar.
AFAIK there is no built-in solution for the old-style Runners. I think
it would be possible to build something like this on top of the existing
translation.
Cheers,
Max
On 20.03.19 13:07, Can Gencer wrote:
Hi again,
We've made some progress on the runner since writing this more than a
month ago, the repo is available here publicly:
https://github.com/hazelcast/hazelcast-jet-beam-runner
Still very much a work in progress though. One of the issues I wanted to
raise is that currently we're translating each PTransform to a Jet
Vertex (could be consider analogous to a Flink operator or a vertex in
Tez). This is sub-optimal, since Beam creates lots of transforms for
computations that could be performed inside the same Vertex, such as
subsequent mapping transforms and many others. Ideally you only need
distinct vertices where the data is re-partitioned and/or shuffled. I'm
curious if Beam offers some way of translating the PTransform graph to a
more minimal set of transforms, i.e. some kind of planner or would this
have to be custom code? We've done a similar integration with Cascading
in the past and it offered a planner which given a set of rules would
partition the Cascading DAG into a minimal set of vertices for the same
DAG. Curious if Beam has any similar functionality?
On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <[email protected]
<mailto:[email protected]>> wrote:
Elaborating on what Robert alluded to: when I wrote that runner
author guide, portability was in its infancy. Now Beam Python can be
run on Flink. So that guide is primarily focused on the "deserialize
a Java DoFn and call its methods" approach. A decent amount of it is
still really important to know, but is now the responsibility of the
"SDK harness", aka language-specific coprocessor. For Python & Go &
<insert new SDK language here> you really want to use the
portability protos and the portable Flink runner is the best model.
Kenn
On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <[email protected]
<mailto:[email protected]>> wrote:
On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <[email protected]
<mailto:[email protected]>> wrote:
>
> We at Hazelcast are looking into writing a Beam runner for
Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
wanted to introduce myself as we'll likely have questions as we
start development.
Welcome!
Hazelcast looks interesting, a Beam runner for it would be very
cool.
> Some of the things I'm wondering about currently:
>
> * Currently there seems to be a guide available at
https://beam.apache.org/contribute/runner-guide/ , is this up to
date? Is there anything in specific to be aware of when starting
with a new runner that's not covered here?
That looks like a pretty good starting point. At a quick glance, I
don't see anything that looks out of date. Another resource that
might
be helpful is a talk from last year on writing an SDK (but as it
mostly covers the runner-sdk interaction, it's also quite useful for
understanding the runner side:
https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
And please feel free to ask any questions on this list as well; we'd
be happy to help.
> * Should we be targeting the latest master which is at
2.12-SNAPSHOT or a stable version?
I would target the latest master.
> * After a runner is developed, how is the maintenance
typically handled, as the runners seems to be part of Beam codebase?
Either is possible. Several runner adapters are part of the Beam
codebase, but for example the IMB Streams Beam runner is not. There
are certainly pros and cons (certainly early on when the APIs
themselves were under heavy development it was easier to keep things
in sync in the same codebase, but things have mostly stabilized
now).
A runner only becomes part of the Beam codebase if there are members
of the community committed to maintaining it (which could include
you). Both approaches are fine.
- Robert