Do you consider job submission and artifact staging part of the
ReferenceRunner? If so, these parts have been reused or served as a
model for the portable FlinkRunner. So they had some value.
A reference implementation helps Runner authors to understand and reuse
the code. However, I agree that the Flink implementation is more helpful
to Runner authors than a ReferenceRunner which was designed for
single-node testing.
I think there are three parts which help to push forward portability:
1) Good library support for new portable Runners (Java)
2) A reference implementation of a distributed Runner (Flink)
3) An easy way for users to run/test portable Pipelines (Python via
FnApiRunner)
The main motivation for the portability layer is supporting languages
in addition to Java. Most users will be using Python, so focusing on a
good reference Runner in Python is key.
-Max
On 12.02.19 10:11, Robert Bradshaw wrote:
This is certainly an interesting question, and I definitely have my
opinions, but am curious as to what others think as well.
One thing that I think wasn't as clear from the outset is the distinction
between the development of runners/core-java and development of a Java
reference runner itself. With the work on moving Flink to
portability, it turned out that work on the latter was not a
prerequisite for work on the former, and runners/core-java is the
artifact that other runners want to build on. I think that it is also
the case, as suggested, that a distributed runner's use of this shared
library is a better reference point (for other distributed runners) than
one using the direct runner (e.g. there is a much more obvious
delineation between the runner's responsibility and Beam code than in
the direct runner where the boundaries between orchestration, execution,
and other concerns are not as clear).
As well as serving as a reference to runner implementers, the reference
runner can also be useful for prototyping (here I think Python holds an
advantage, but we're getting into subjective areas now), documenting (or
ideally augmenting the documentation of) the spec (here I'd say a
smaller advantage to Python, but neither runner is clean, straightforward,
and documented enough to serve this purpose well yet), and serving as a
lightweight universal local runner against which to develop new SDKs,
and possibly to use long term in place of a direct runner (here
you'll get a wide variety of answers on whether Python or Java is easier
to take on as a dependency for a third language, or we could just package
it up in a docker image and take docker as a dependency).
Another more pragmatic note is that one thing that helped both the Flink
runner and the FnApiRunner move forward is that they were driven by actual
use cases: Lyft has actual Python (necessitating portable) pipelines they
want to run on Flink, and the FnApiRunner is the direct runner for
Python. The Java ULR (at least where it is now) sits in an awkward place
where its only role is to be a reference rather than be used, which (in
a world of limited resources) makes it harder to justify investment.
- Robert
On Tue, Feb 12, 2019 at 3:53 AM Kenneth Knowles <k...@apache.org> wrote:
Interesting silence here. You've got it right that the reason we
initially chose Java was because of the cross-runner sharing. The
reference runner could be the first target runner for any new
feature and then its work could be directly (or indirectly via
copy/paste/modify if it works better) used in other runners.
Examples:
- The implementations of (pre-portability) state & timers, written in
runners/core-java and prototyped in the Java DirectRunner, made it a
matter of a couple of days to implement them on other runners, and they
saw pretty quick adoption.
- Probably the same could be said for the first drafts of the
runners, which re-used a bunch of runners/core-java and had each
others' translation code as a reference.
I'm interested if anyone would be willing to confirm whether the silence
is because the FlinkRunner has forged ahead and the Dataflow worker is
open source. It makes sense that the code from a distributed runner
is an even better reference point if you are building another
distributed runner. From the look of it, the SamzaRunner had no
trouble getting started on portability.
Kenn
On Mon, Feb 11, 2019 at 6:04 PM Daniel Oliveira
<danolive...@google.com> wrote:
Yeah, the FnApiRunner is what I'm leaning towards too. I wasn't
sure how much demand there was for an actual reference
implementation in Java though, so I was hoping there were runner
authors that would want to chime in.
On the other hand, the Flink runner could serve as a reference
implementation for portable features since it's further along,
so maybe it's not an issue regardless.
On Mon, Feb 11, 2019 at 1:09 PM Sam Rohde <sro...@google.com> wrote:
Thanks for starting this thread. If I had to guess, I would
say there is more demand for Python, as it's more widely
used by data scientists and for analytics. Being pragmatic, the
FnApiRunner already has more feature work than the Java runner, so
we should go with that.
-Sam
On Fri, Feb 8, 2019 at 10:07 AM Daniel Oliveira
<danolive...@google.com> wrote:
Hello Beam dev community,
For those who don't know me, I work for Google and I've
been working on the Java reference runner, which is a
portable, local Java runner (it's basically the direct
runner with the portability APIs implemented). Our goal
in working on this was to have a portable runner which
ran locally so it could be used by users for testing
portable pipelines, by devs for testing new features with
portability, and by runner authors as a simple
reference implementation of a portable runner.
Due to various circumstances though, progress on the
Java reference runner has been pretty slow, and a Python
runner which does pretty much the same things was made
to aid portability development in Python (called the
FnApiRunner). This runner is currently further along in
feature work than the Java reference runner, so we've
been reevaluating if we should switch to investing in it
instead.
My question to the community is: Which runner do you
think would be more valuable to the dev community and
Beam users? For those of you who are runner authors, do
you have a preference for what language you'd like to
see a reference implementation in?
Thanks,
Daniel Oliveira