On Tue, Jan 30, 2018 at 11:44 AM, Kenneth Knowles <k...@google.com> wrote:
> (just dev@)
>
> *Low-level IL*
> I wanted to comment more on the common intermediate layer idea of Ben's.
> This is an awesome idea but I'm not sure it is Beam so much as Tez or Onyx.
> I imagine most runners have some such representation internally.
>
> Our layers in the stack with a low-level IL:
>
> 6. High-level DSLs - SQL, CEP, etc
> 5. High-level PL libraries - Java, Python, Go SDKs
> 4. High-level whole-pipeline IL - Today's pipeline definition with
>    associative/commutative combiners, etc
> 3. Low-level whole-pipeline IL - DAG of possibly-stateful operators and
>    different sorts of edges
> 2. ---- runner (with runner-specific optimizations) ----
> 1. Low-level single-worker IL - Fn API
> 0. UDF invocation - SDK harness
>
> I see a few issues:
>
> - The higher levels afford more radical optimization, and we thus far claim
>   that the runner owns optimization.
> - Beam mostly maps the high-level IL directly to high-level APIs of
>   existing engines. Mapping low-level to high-level will pay in performance.
> - There might not be that big a gap between the high-level IL and the
>   low-level IL, because Beam at this point has a lot of low-level features.
>
> Interestingly, we are actually already being forced down this path. With
> portability it no longer works to translate directly to a runner's
> primitives; we need to fuse together same-language transforms first. So we
> are currently doing at least naive fusion in a shared library prior to
> the runner, translating the high-level Pipeline into a lower-level
> representation (still a Pipeline proto as I understand it, but not really
> because it isn't using Beam's primitives).
Note that doing fusion is not required; rather, we are doing this because some runners don't expose the notion of a "fused graph" at the public API level (e.g. on the workers), and even if they did, it'd probably be a more complicated surface area than a single uber operation.

> When this is done, the portable
> Pipeline proto might, in practice, only really be used as an input to this
> shared optimizer, and runners will be translating its low-level output IL to
> their own physical plans!
>
> Kenn
>
>
> On Tue, Jan 30, 2018 at 11:24 AM, Kenneth Knowles <k...@google.com> wrote:
>>
>> I've got some thoughts :-)
>>
>> Here is how I see the direction(s):
>>
>> - Requirements to be relevant: known scale, SQL, retractions (required
>>   for correct answers)
>> - Core value-add: portability! I don't know that there is any other
>>   project ambitiously trying to run Python and Go on "every" data processing
>>   engine.
>> - Experiments: SDF and dynamic work rebalancing. Just like event time
>>   processing, when it matters to users these will become widespread and then
>>   Beam's runner can easily make the features portable.
>>
>> So let's do portability really well on all our most active runners. I have
>> a radical proposal for how we should think about it:
>>
>> A portable Beam runner should be defined to be a _service_ hosting the
>> Beam job management APIs.
>>
>> In that sense, we have zero runners today. Even Dataflow is just a service
>> hosting its own API with a client-side library for converting a Beam
>> pipeline into a Dataflow pipeline.
>> Re-orienting our thinking this way is not
>> actually a huge change in code, but emphasizes:
>>
>> - our "runners/core" etc should focus on making these services easy
>>   (Thomas G is doing great work here right now)
>> - a user selecting a runner should be thought of more as just pointing at
>>   a different endpoint
>> - our testing infrastructure should become much more service-oriented,
>>   standing these up even for local testing
>> - ditto Luke's point about making a crisp line of SDK/runner
>>   responsibility
>>
>> Kenn
>>
>>
>> On Fri, Jan 26, 2018 at 12:58 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> 1) Rather than making it easier to write features, I think more users
>>> would care about being able to move their pipeline between different runners,
>>> and one of the key missing features is dynamic work rebalancing in all
>>> runners (except Dataflow).
>>> Also, portability is meant to help draw a crisp line between the
>>> responsibilities of the Runner and the SDK, which would help make it easier
>>> to write features in an SDK and to support features in Runners.
>>>
>>> 2) To realize portability there are a lot of JIRAs being tracked under
>>> the portability label[1] that need addressing to be able to run an existing
>>> pipeline in a portable manner before we even get to more advanced features.
>>>
>>> 1:
>>> https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
>>>
>>> 3) Ben, do you want to design and run a couple of polls (similar to the
>>> Java 8 poll) to get feedback from our users based upon the list of major
>>> features being developed?
>>>
>>> 4) Yes, plenty. It would be worthwhile to have someone walk through the
>>> open JIRAs and mark them with a label and also summarize what groups they
>>> fall under as there are plenty of good ideas there.
>>>
>>> On Tue, Jan 23, 2018 at 5:25 PM, Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>> In terms of features, I think a key thing we should focus on is making
>>>> simple things simple. Beam is very powerful, but it doesn't always
>>>> make easy things easy. Features like schema'd PCollections could go a
>>>> long way here. Also fully fleshing out/smoothing our runner
>>>> portability story is part of this too.
>>>>
>>>> For beam 3.x we could also reason about if there's any complexity that
>>>> doesn't hold its weight (e.g. side inputs on CombineFns).
>>>>
>>>> On Mon, Jan 22, 2018 at 9:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> > Hi Ben,
>>>> >
>>>> > about the "technical roadmap", we have a thread about "Beam 3.x roadmap".
>>>> >
>>>> > It already provides ideas for points 3 & 4.
>>>> >
>>>> > Regards
>>>> > JB
>>>> >
>>>> > On 01/22/2018 09:15 PM, Ben Chambers wrote:
>>>> >> Thanks Davor for starting the state of the project discussions [1].
>>>> >>
>>>> >> In this fork of the state of the project discussion, I’d like to start the
>>>> >> discussion of the feature roadmap for 2018 (and beyond).
>>>> >>
>>>> >> To kick off the discussion, I think the features could be divided into several
>>>> >> areas, as follows:
>>>> >>
>>>> >> 1. Enabling Contributions: How do we make it easier to add new features to the
>>>> >>    supported runners? Can we provide a common intermediate layer below the
>>>> >>    existing functionality that features are translated to so that runners only
>>>> >>    need to support the intermediate layer and new features only need to target
>>>> >>    it? What other ways can we make it easier to contribute to the development
>>>> >>    of Beam?
>>>> >>
>>>> >> 2. Realizing Portability: What gaps are there in the promise of portability?
>>>> >>    For example in [1] we discussed the fact that users must write per-runner
>>>> >>    code to push system metrics from runners to their monitoring platform. This
>>>> >>    limits their ability to actually change runners. Credential management for
>>>> >>    different environments also falls into this category.
>>>> >>
>>>> >> 3. Large Features: What major features (like Beam SQL, Beam Python, etc.) would
>>>> >>    increase the Beam user base in 2018?
>>>> >>
>>>> >> 4. Improvements: What small changes could make Beam more appealing to users?
>>>> >>    Are there API improvements we could make or common mistakes we could detect
>>>> >>    and/or prevent?
>>>> >>
>>>> >> Thanks in advance for participating in the discussion. I believe that 2018 could
>>>> >> be a great year for Beam, providing easier, more complete runner portability and
>>>> >> features that make Beam easier to use for everyone.
>>>> >>
>>>> >> Ben
>>>> >>
>>>> >> [1]
>>>> >> https://lists.apache.org/thread.html/f750f288af8dab3f468b869bf5a3f473094f4764db419567f33805d0@%3Cdev.beam.apache.org%3E
>>>> >>
>>>> >> [2]
>>>> >> https://lists.apache.org/thread.html/01a80d62f2df6b84bfa41f05e15fda900178f882877c294fed8be91e@%3Cdev.beam.apache.org%3E
>>>> >
>>>> > --
>>>> > Jean-Baptiste Onofré
>>>> > jbono...@apache.org
>>>> > http://blog.nanthrax.net
>>>> > Talend - http://www.talend.com