Re: Thoughts on a reference runner to invest in?

Maximilian Michels Tue, 12 Feb 2019 03:35:03 -0800

Do you consider job submission and artifact staging part of the ReferenceRunner? If so, these parts have been reused or served as a model for the portable FlinkRunner. So they had some value.

A reference implementation helps Runner authors to understand and reuse the code. However, I agree that the Flink implementation is more helpful to Runner authors than a ReferenceRunner which was designed for single-node testing.


I think there are three parts which help to push forward portability:

1) Good library support for new portable Runners (Java)
2) A reference implementation of a distributed Runner (Flink)
3) An easy way for users to run/test portable Pipelines (Python via FnApiRunner)

The main motivation for the portability layer is supporting additional languages besides Java. Most users will be using Python, so focusing on a good reference Runner in Python is key.


-Max

On 12.02.19 10:11, Robert Bradshaw wrote:

This is certainly an interesting question, and I definitely have my opinions, but am curious as to what others think as well.
One thing that I think wasn't as clear from the outset is distinguishing between the development of runners/core-java and development of a Java reference runner itself. With the work on moving Flink to portability, it turned out that work on the latter was not a prerequisite for work on the former, and runners/core-java is the artifact that other runners want to build on. I think that it is also the case, as suggested, that a distributed runner's use of this shared library is a better reference point (for other distributed runners) than one using the direct runner (e.g. there is a much more obvious delineation between the runner's responsibility and Beam code than in the direct runner where the boundaries between orchestration, execution, and other concerns are not as clear).
As well as serving as a reference to runner implementers, the reference runner can also be useful for prototyping (here I think Python holds an advantage, but we're getting into subjective areas now), documenting (or ideally augmenting the documentation of) the spec (here I'd say a smaller advantage to Python, but neither runner is clean, straightforward, and documented enough to serve this purpose well yet), and serving as a lightweight universal local runner against which to develop (and possibly use long term in place of a direct runner) new SDKs (here you'll get a wide variety of answers whether Python or Java is easier to take on as a dependency for a third language, or we could just package it up in a Docker image and take Docker as a dependency).
Another more pragmatic note is that one thing that helped both the Flink runner and the FnApiRunner move forward is that they were driven by actual use cases--Lyft has actual Python (necessitating portable) pipelines they want to run on Flink, and the FnApiRunner is the direct runner for Python. The Java ULR (at least where it is now) sits in an awkward place where its only role is to be a reference rather than be used, which (in a world of limited resources) makes it harder to justify investment.
- Robert
On Tue, Feb 12, 2019 at 3:53 AM Kenneth Knowles <k...@apache.org> wrote:
    Interesting silence here. You've got it right that the reason we
    initially chose Java was because of the cross-runner sharing. The
    reference runner could be the first target runner for any new
    feature and then its work could be directly (or indirectly, via
    copy/paste/modify if that works better) used in other runners.
    Examples:

      - The implementations of (pre-portability) state & timers in
    runners/core-java and prototyped in the Java DirectRunner made it a
    matter of a couple of days to implement on other runners, and they
    saw pretty quick adoption.
      - Probably the same could be said for the first drafts of the
    runners, which re-used a bunch of runners/core-java and had each
    others' translation code as a reference.

    I'm interested in whether anyone would be willing to confirm that the
    silence is because the FlinkRunner has forged ahead and the Dataflow
    worker is open source. It makes sense that the code from a distributed runner
    is an even better reference point if you are building another
    distributed runner. From the look of it, the SamzaRunner had no
    trouble getting started on portability.

    Kenn

    On Mon, Feb 11, 2019 at 6:04 PM Daniel Oliveira
    <danolive...@google.com> wrote:

        Yeah, the FnApiRunner is what I'm leaning towards too. I wasn't
        sure how much demand there was for an actual reference
        implementation in Java though, so I was hoping there were runner
        authors that would want to chime in.

        On the other hand, the Flink runner could serve as a reference
        implementation for portable features since it's further along,
        so maybe it's not an issue regardless.

        On Mon, Feb 11, 2019 at 1:09 PM Sam Rohde <sro...@google.com> wrote:

            Thanks for starting this thread. If I had to guess, I would
            say there is more of a demand for Python as it's more widely
            used for data science/analytics. Being pragmatic, the
            FnApiRunner already has more feature work than the Java runner,
            so we should go with that.

            -Sam

            On Fri, Feb 8, 2019 at 10:07 AM Daniel Oliveira
            <danolive...@google.com> wrote:

                Hello Beam dev community,

                For those who don't know me, I work for Google and I've
                been working on the Java reference runner, which is a
                portable, local Java runner (it's basically the direct
                runner with the portability APIs implemented). Our goal
                in working on this was to have a portable runner which
                ran locally so it could be used by users for testing
                portable pipelines, devs for testing new features with
                portability, and for runner authors to provide a simple
                reference implementation of a portable runner.

                Due to various circumstances though, progress on the
                Java reference runner has been pretty slow, and a Python
                runner which does pretty much the same things was made
                to aid portability development in Python (called the
                FnApiRunner). This runner is currently further along in
                feature work than the Java reference runner, so we've
                been reevaluating if we should switch to investing in it
                instead.

                My question to the community is: Which runner do you
                think would be more valuable to the dev community and
                Beam users? For those of you who are runner authors, do
                you have a preference for what language you'd like to
                see a reference implementation in?

                Thanks,
                Daniel Oliveira
