Hi again, We've made some progress on the runner since writing this more than a month ago, the repo is available here publicly: https://github.com/hazelcast/hazelcast-jet-beam-runner
Still very much a work in progress though. One of the issues I wanted to raise is that currently we're translating each PTransform to a Jet Vertex (could be consider analogous to a Flink operator or a vertex in Tez). This is sub-optimal, since Beam creates lots of transforms for computations that could be performed inside the same Vertex, such as subsequent mapping transforms and many others. Ideally you only need distinct vertices where the data is re-partitioned and/or shuffled. I'm curious if Beam offers some way of translating the PTransform graph to a more minimal set of transforms, i.e. some kind of planner or would this have to be custom code? We've done a similar integration with Cascading in the past and it offered a planner which given a set of rules would partition the Cascading DAG into a minimal set of vertices for the same DAG. Curious if Beam has any similar functionality? On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <[email protected]> wrote: > Elaborating on what Robert alluded to: when I wrote that runner author > guide, portability was in its infancy. Now Beam Python can be run on Flink. > So that guide is primarily focused on the "deserialize a Java DoFn and call > its methods" approach. A decent amount of it is still really important to > know, but is now the responsibility of the "SDK harness", aka > language-specific coprocessor. For Python & Go & <insert new SDK language > here> you really want to use the portability protos and the portable Flink > runner is the best model. > > Kenn > > > On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <[email protected]> > wrote: > >> On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <[email protected]> wrote: >> > >> > We at Hazelcast are looking into writing a Beam runner for Hazelcast >> Jet (https://github.com/hazelcast/hazelcast-jet). I wanted to introduce >> myself as we'll likely have questions as we start development. >> >> Welcome! >> >> Hazelcast looks interesting, a Beam runner for it would be very cool. >> >> > Some of the things I'm wondering about currently: >> > >> > * Currently there seems to be a guide available at >> https://beam.apache.org/contribute/runner-guide/ , is this up to date? >> Is there anything in specific to be aware of when starting with a new >> runner that's not covered here? >> >> That looks like a pretty good starting point. At a quick glance, I >> don't see anything that looks out of date. Another resource that might >> be helpful is a talk from last year on writing an SDK (but as it >> mostly covers the runner-sdk interaction, it's also quite useful for >> understanding the runner side: >> >> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p >> And please feel free to ask any questions on this list as well; we'd >> be happy to help. >> >> > * Should we be targeting the latest master which is at 2.12-SNAPSHOT or >> a stable version? >> >> I would target the latest master. >> >> > * After a runner is developed, how is the maintenance typically >> handled, as the runners seems to be part of Beam codebase? >> >> Either is possible. Several runner adapters are part of the Beam >> codebase, but for example the IMB Streams Beam runner is not. There >> are certainly pros and cons (certainly early on when the APIs >> themselves were under heavy development it was easier to keep things >> in sync in the same codebase, but things have mostly stabilized now). >> A runner only becomes part of the Beam codebase if there are members >> of the community committed to maintaining it (which could include >> you). Both approaches are fine. >> >> - Robert >> >
