Re: [DISCUSS] State of the project: Feature roadmap for 2018

Ben Chambers Tue, 30 Jan 2018 12:59:30 -0800

I think I agree with this, but wanted to point out a few things:

1. High-level DSLs may target the IL directly, rather than going through
the high-level PL libraries. This would allow them to make more direct use
of the capabilities of the IL.


2. I agree that the portability work is basically introducing what could
become a Beam IL, and it probably doesn't make sense to introduce another
layer. I guess the question is to what extent that IL is "Beam model
specific" vs. to what extent it is more general and/or closer to how
runners are organized.

The closer it is to Beam, the easier it is to translate Beam programs into
that IL. The closer it is to the runners, the easier it will be to
implement each runner. I think I may just be advocating that the ILs used in
the portability work be evaluated for how well they fit the runners, rather
than how well they fit the Beam model.
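
To make the "naive fusion" step Kenn mentions below a bit more concrete, here
is a rough sketch of the idea. It assumes a purely linear pipeline (a real
pipeline is a DAG), and every name in it (Transform, FusedStage, fuse_linear)
is hypothetical and not part of any Beam API. It only shows how adjacent
same-language transforms could be grouped into stages before a runner sees
them.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class Transform:
    """Hypothetical stand-in for one pipeline step; not a Beam class."""
    name: str
    language: str  # SDK that must execute this step, e.g. "python" or "java"


@dataclass(frozen=True)
class FusedStage:
    """A run of adjacent same-language transforms executed as one unit."""
    language: str
    transforms: Tuple[str, ...]


def fuse_linear(pipeline: List[Transform]) -> List[FusedStage]:
    """Greedily fuse neighbouring transforms that share an SDK language.

    Assumption: the pipeline is a simple chain, so "neighbouring" just means
    consecutive list entries. groupby only merges consecutive equal keys,
    which is exactly the behaviour wanted here.
    """
    return [
        FusedStage(language, tuple(t.name for t in run))
        for language, run in groupby(pipeline, key=lambda t: t.language)
    ]


pipeline = [
    Transform("Read", "java"),
    Transform("Parse", "python"),
    Transform("Filter", "python"),
    Transform("Write", "java"),
]

stages = fuse_linear(pipeline)
for stage in stages:
    print(stage.language, "->", " + ".join(stage.transforms))
```

The list of fused stages is the kind of lower-level representation I would
want evaluated against how runners are organized: each stage is something a
runner can hand to a single SDK harness without caring which Beam primitives
are inside it.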

-- Ben

On Tue, Jan 30, 2018 at 11:44 AM Kenneth Knowles wrote:

> (just dev@)
>
> *Low-level IL*
> I wanted to comment more on the common intermediate layer idea of Ben's.
> This is an awesome idea but I'm not sure it is Beam so much as Tez or Onyx.
> I imagine most runners have some such representation internally.
>
> Our layers in the stack with a low-level IL:
>
> 6. High-level DSLs - SQL, CEP, etc
> 5. High-level PL libraries - Java, Python, Go SDKs
> 4. High-level whole-pipeline IL - Today's pipeline definition with
> associative/commutative combiners, etc
> 3. Low-level whole-pipeline IL - DAG of possibly-stateful operators and
> different sorts of edges
> 2. ---- runner (with runner-specific optimizations) ----
> 1. Low-level single-worker IL - Fn API
> 0. UDF invocation - SDK harness
>
> I see a few issues:
>
>  - The higher levels afford more radical optimization, and we thus far
> claim that the runner owns optimization.
>  - Beam mostly maps the high-level IL directly to high-level APIs of
> existing engines. Mapping low-level to high-level will pay in performance.
>  - There might not be that big a gap between the high-level IL and the
> low-level IL, because Beam at this point has a lot of low-level features.
>
> Interestingly, we are actually already being forced down this path. With
> portability it no longer works to translate directly to a runner's
> primitives; we need to fuse together same-language transforms first. So we
> are currently doing at least naive fusion in a shared library prior
> to the runner, translating the high-level Pipeline into a lower-level
> representation (still a Pipeline proto as I understand it, but not really
> because it isn't using Beam's primitives). When this is done, the portable
> Pipeline proto might, in practice, only really be used as an input to this
> shared optimizer, and runners will be translating its low-level output IL
> to their own physical plans!
>
> Kenn
>
>
>
> On Tue, Jan 30, 2018 at 11:24 AM, Kenneth Knowles wrote:
>
>> I've got some thoughts :-)
>>
>> Here is how I see the direction(s):
>>
>>  - Requirements to be relevant: known scale, SQL, retractions (required
>> for correct answers)
>>  - Core value-add: portability! I don't know that there is any other
>> project ambitiously trying to run Python and Go on "every" data processing
>> engine.
>>  - Experiments: SDF and dynamic work rebalancing. Just like event time
>> processing, when it matters to users these will become widespread and then
>> Beam's runner can easily make the features portable.
>>
>> So let's do portability really well on all our most active runners. I
>> have a radical proposal for how we should think about it:
>>
>>     A portable Beam runner should be defined to be a _service_ hosting
>> the Beam job management APIs.
>>
>> In that sense, we have zero runners today. Even Dataflow is just a
>> service hosting its own API with a client-side library for converting a
>> Beam pipeline into a Dataflow pipeline. Re-orienting our thinking this way
>> is not actually a huge change in code, but emphasizes:
>>
>>  - our "runners/core" etc should focus on making these services easy
>> (Thomas G is doing great work here right now)
>>  - a user selecting a runner should be thought of more as just pointing
>> at a different endpoint
>>  - our testing infrastructure should become much more service-oriented,
>> standing these up even for local testing
>>  - ditto Luke's point about making a crisp line of SDK/runner
>> responsibility
>>
>> Kenn
>>
>>
>> On Fri, Jan 26, 2018 at 12:58 PM, Lukasz Cwik wrote:
>>
>>> 1) Instead of making it easier to write features, I think more users
>>> would care about being able to move their pipeline between different
>>> runners. One of the key missing features is dynamic work rebalancing in
>>> all runners (except Dataflow).
>>> Also, portability is meant to help make a crisp line between what are
>>> the responsibilities of the Runner and the SDK which would help make it
>>> easier to write features in an SDK and to support features in Runners.
>>>
>>> 2) To realize portability there are a lot of JIRAs being tracked under
>>> the portability label[1] that need addressing to be able to run an existing
>>> pipeline in a portable manner before we even get to more advanced features.
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
>>>
>>> 3) Ben, do you want to design and run a couple of polls (similar to the
>>> Java 8 poll) to get feedback from our users based upon the list of major
>>> features being developed?
>>>
>>> 4) Yes, plenty. It would be worthwhile to have someone walk through the
>>> open JIRAs and mark them with a label and also summarize what groups they
>>> fall under as there are plenty of good ideas there.
>>>
>>> On Tue, Jan 23, 2018 at 5:25 PM, Robert Bradshaw wrote:
>>>
>>>> In terms of features, I think a key thing we should focus on is making
>>>> simple things simple. Beam is very powerful, but it doesn't always
>>>> make easy things easy. Features like schema'd PCollections could go a
>>>> long way here. Also fully fleshing out/smoothing our runner
>>>> portability story is part of this too.
>>>>
>>>> For Beam 3.x we could also consider whether there's any complexity that
>>>> doesn't pull its weight (e.g. side inputs on CombineFns).
>>>>
>>>> On Mon, Jan 22, 2018 at 9:20 PM, Jean-Baptiste Onofré wrote:
>>>> > Hi Ben,
>>>> >
>>>> > about the "technical roadmap", we have a thread about "Beam 3.x
>>>> > roadmap".
>>>> >
>>>> > It already provides ideas for points 3 & 4.
>>>> >
>>>> > Regards
>>>> > JB
>>>> >
>>>> > On 01/22/2018 09:15 PM, Ben Chambers wrote:
>>>> >> Thanks Davor for starting the state of the project discussions [1].
>>>> >>
>>>> >> In this fork of the state of the project discussion, I’d like to
>>>> >> start the discussion of the feature roadmap for 2018 (and beyond).
>>>> >>
>>>> >> To kick off the discussion, I think the features could be divided
>>>> >> into several areas, as follows:
>>>> >>
>>>> >>  1. Enabling Contributions: How do we make it easier to add new
>>>> >>     features to the supported runners? Can we provide a common
>>>> >>     intermediate layer below the existing functionality that
>>>> >>     features are translated to so that runners only need to support
>>>> >>     the intermediate layer and new features only need to target it?
>>>> >>     What other ways can we make it easier to contribute to the
>>>> >>     development of Beam?
>>>> >>
>>>> >>  2. Realizing Portability: What gaps are there in the promise of
>>>> >>     portability? For example in [1] we discussed the fact that users
>>>> >>     must write per-runner code to push system metrics from runners
>>>> >>     to their monitoring platform. This limits their ability to
>>>> >>     actually change runners. Credential management for different
>>>> >>     environments also falls into this category.
>>>> >>
>>>> >>  3. Large Features: What major features (like Beam SQL, Beam Python,
>>>> >>     etc.) would increase the Beam user base in 2018?
>>>> >>
>>>> >>  4. Improvements: What small changes could make Beam more appealing
>>>> >>     to users? Are there API improvements we could make or common
>>>> >>     mistakes we could detect and/or prevent?
>>>> >>
>>>> >> Thanks in advance for participating in the discussion. I believe
>>>> >> that 2018 could be a great year for Beam, providing easier, more
>>>> >> complete runner portability and features that make Beam easier to
>>>> >> use for everyone.
>>>> >>
>>>> >> Ben
>>>> >>
>>>> >> [1]
>>>> >> https://lists.apache.org/thread.html/f750f288af8dab3f468b869bf5a3f473094f4764db419567f33805d0@%3Cdev.beam.apache.org%3E
>>>> >>
>>>> >> [2]
>>>> >> https://lists.apache.org/thread.html/01a80d62f2df6b84bfa41f05e15fda900178f882877c294fed8be91e@%3Cdev.beam.apache.org%3E
>>>> >
>>>> > --
>>>> > Jean-Baptiste Onofré
>>>> > http://blog.nanthrax.net
>>>> > Talend - http://www.talend.com
>>>>
>>>
>>>
>>
>
