Re: [DISCUSS] State of the project: Feature roadmap for 2018

Robert Bradshaw Tue, 30 Jan 2018 12:48:26 -0800

On Tue, Jan 30, 2018 at 11:44 AM, Kenneth Knowles <k...@google.com> wrote:
> (just dev@)
>
> *Low-level IL*
> I wanted to comment more on the common intermediate layer idea of Ben's.
> This is an awesome idea but I'm not sure it is Beam so much as Tez or Onyx.
> I imagine most runners have some such representation internally.
>
> Our layers in the stack with a low-level IL:
>
> 6. High-level DSLs - SQL, CEP, etc
> 5. High-level PL libraries - Java, Python, Go SDKs
> 4. High-level whole-pipeline IL - Today's pipeline definition with
> associative/commutative combiners, etc
> 3. Low-level whole-pipeline IL - DAG of possibly-stateful operators and
> different sorts of edges
> 2. ---- runner (with runner-specific optimizations) ----
> 1. Low-level single-worker IL - Fn API
> 0. UDF invocation - SDK harness
>
> I see a few issues:
>
>  - The higher levels afford more radical optimization, and we thus far claim
> that the runner owns optimization.
>  - Beam mostly maps the high-level IL directly to high-level APIs of
> existing engines. Mapping low-level to high-level will pay in performance.
>  - There might not be that big a gap between the high-level IL and the
> low-level IL, because Beam at this point has a lot of low-level features.
>
> Interestingly, we are actually already being forced down this path. With
> portability it no longer works to translate directly to a runner's
> primitives; we need to fuse together same-language transforms first. So we
> are currently doing at least naive fusion in a shared library prior to
> the runner, translating the high-level Pipeline into a lower-level
> representation (still a Pipeline proto as I understand it, but not really
> because it isn't using Beam's primitives).


Note that doing fusion is not required; rather, we are doing this
because some runners don't expose the notion of a "fused graph" at the
public API level (e.g. on the workers), and even if they did, it'd
probably be a more complicated surface area than a single uber
operation.
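
To make "naive fusion" concrete, here is a rough sketch of what collapsing
same-language transforms into single uber operations could look like. This is
an illustration only, not the actual shared library: the `Transform` class,
the `environment` labels, and the `fuse` function are all hypothetical names,
and it assumes the transforms arrive in topological order and that only
linear producer/consumer chains are fused.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    """Hypothetical stand-in for one node of a pipeline graph."""
    name: str
    environment: str  # assumed label for the SDK language, e.g. "python"
    inputs: list = field(default_factory=list)  # names of upstream transforms


def fuse(transforms):
    """Greedily group transforms into stages.

    A transform joins its producer's stage only when it has exactly one
    input, that producer runs in the same environment, and the producer
    has no other consumers. Everything else starts a new stage.
    `transforms` is assumed to be topologically ordered.
    """
    by_name = {t.name: t for t in transforms}

    # Count who consumes each transform's output.
    consumers = {}
    for t in transforms:
        for upstream in t.inputs:
            consumers.setdefault(upstream, []).append(t.name)

    stages = []    # each stage is a list of transform names
    stage_of = {}  # transform name -> index into `stages`
    for t in transforms:
        if len(t.inputs) == 1:
            producer = by_name[t.inputs[0]]
            same_env = producer.environment == t.environment
            sole_consumer = len(consumers[producer.name]) == 1
            if same_env and sole_consumer:
                index = stage_of[producer.name]
                stages[index].append(t.name)
                stage_of[t.name] = index
                continue
        stage_of[t.name] = len(stages)
        stages.append([t.name])
    return stages


if __name__ == "__main__":
    pipeline = [
        Transform("Read", "java"),
        Transform("Parse", "python", ["Read"]),
        Transform("Filter", "python", ["Parse"]),
        Transform("Write", "java", ["Filter"]),
    ]
    for stage in fuse(pipeline):
        print(stage)
```

In this toy pipeline the two Python transforms end up in one stage, so the
runner would only ever see three opaque operations rather than a fused
subgraph it has to understand.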

> When this is done, the portable
> Pipeline proto might, in practice, only really be used as an input to this
> shared optimizer, and runners will be translating its low-level output IL to
> their own physical plans!
>
> Kenn
>
>
>
> On Tue, Jan 30, 2018 at 11:24 AM, Kenneth Knowles <k...@google.com> wrote:
>>
>> I've got some thoughts :-)
>>
>> Here is how I see the direction(s):
>>
>>  - Requirements to be relevant: known scale, SQL, retractions (required
>> for correct answers)
>>  - Core value-add: portability! I don't know that there is any other
>> project ambitiously trying to run Python and Go on "every" data processing
>> engine.
>>  - Experiments: SDF and dynamic work rebalancing. Just like event time
>> processing, when it matters to users these will become widespread and then
>> Beam's runner can easily make the features portable.
>>
>> So let's do portability really well on all our most active runners. I have
>> a radical proposal for how we should think about it:
>>
>>     A portable Beam runner should be defined to be a _service_ hosting the
>> Beam job management APIs.
>>
>> In that sense, we have zero runners today. Even Dataflow is just a service
>> hosting its own API with a client-side library for converting a Beam
>> pipeline into a Dataflow pipeline. Re-orienting our thinking this way is not
>> actually a huge change in code, but emphasizes:
>>
>>  - our "runners/core" etc should focus on making these services easy
>> (Thomas G is doing great work here right now)
>>  - a user selecting a runner should be thought of more as just pointing at
>> a different endpoint
>>  - our testing infrastructure should become much more service-oriented,
>> standing these up even for local testing
>>  - ditto Luke's point about making a crisp line of SDK/runner
>> responsibility
>>
>> Kenn
>>
>>
>> On Fri, Jan 26, 2018 at 12:58 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> 1) Instead of making it easier to write features I think more users
>>> would care about being able to move their pipeline between different runners
>>> and one of the key missing features is dynamic work rebalancing in all
>>> runners (except Dataflow).
>>> Also, portability is meant to help make a crisp line between what are the
>>> responsibilities of the Runner and the SDK which would help make it easier
>>> to write features in an SDK and to support features in Runners.
>>>
>>> 2) To realize portability there are a lot of JIRAs being tracked under
>>> the portability label[1] that need addressing to be able to run an existing
>>> pipeline in a portable manner before we even get to more advanced features.
>>>
>>> 1:
>>> https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
>>>
>>> 3) Ben, do you want to design and run a couple of polls (similar to the
>>> Java 8 poll) to get feedback from our users based upon the list of major
>>> features being developed?
>>>
>>> 4) Yes, plenty. It would be worthwhile to have someone walk through the
>>> open JIRAs and mark them with a label and also summarize what groups they
>>> fall under as there are plenty of good ideas there.
>>>
>>> On Tue, Jan 23, 2018 at 5:25 PM, Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>>
>>>> In terms of features, I think a key thing we should focus on is making
>>>> simple things simple. Beam is very powerful, but it doesn't always
>>>> make easy things easy. Features like schema'd PCollections could go a
>>>> long way here. Also fully fleshing out/smoothing our runner
>>>> portability story is part of this too.
>>>>
>>>> For Beam 3.x we could also reason about whether there's any complexity that
>>>> doesn't hold its weight (e.g. side inputs on CombineFns).
>>>>
>>>> On Mon, Jan 22, 2018 at 9:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>> > Hi Ben,
>>>> >
>>>> > about the "technical roadmap", we have a thread about "Beam 3.x
>>>> > roadmap".
>>>> >
>>>> > It already provides ideas for points 3 & 4.
>>>> >
>>>> > Regards
>>>> > JB
>>>> >
>>>> > On 01/22/2018 09:15 PM, Ben Chambers wrote:
>>>> >> Thanks Davor for starting the state of the project discussions [1].
>>>> >>
>>>> >>
>>>> >> In this fork of the state of the project discussion, I’d like to start
>>>> >> the discussion of the feature roadmap for 2018 (and beyond).
>>>> >>
>>>> >>
>>>> >> To kick off the discussion, I think the features could be divided into
>>>> >> several areas, as follows:
>>>> >>
>>>> >>  1. Enabling Contributions: How do we make it easier to add new
>>>> >>     features to the supported runners? Can we provide a common
>>>> >>     intermediate layer below the existing functionality that features
>>>> >>     are translated to so that runners only need to support the
>>>> >>     intermediate layer and new features only need to target it? What
>>>> >>     other ways can we make it easier to contribute to the development
>>>> >>     of Beam?
>>>> >>
>>>> >>  2. Realizing Portability: What gaps are there in the promise of
>>>> >>     portability? For example in [1] we discussed the fact that users
>>>> >>     must write per-runner code to push system metrics from runners to
>>>> >>     their monitoring platform. This limits their ability to actually
>>>> >>     change runners. Credential management for different environments
>>>> >>     also falls into this category.
>>>> >>
>>>> >>  3. Large Features: What major features (like Beam SQL, Beam Python,
>>>> >>     etc.) would increase the Beam user base in 2018?
>>>> >>
>>>> >>  4. Improvements: What small changes could make Beam more appealing to
>>>> >>     users? Are there API improvements we could make or common mistakes
>>>> >>     we could detect and/or prevent?
>>>> >>
>>>> >>
>>>> >> Thanks in advance for participating in the discussion. I believe that
>>>> >> 2018 could be a great year for Beam, providing easier, more complete
>>>> >> runner portability and features that make Beam easier to use for
>>>> >> everyone.
>>>> >>
>>>> >>
>>>> >> Ben
>>>> >>
>>>> >>
>>>> >> [1]
>>>> >>
>>>> >> https://lists.apache.org/thread.html/f750f288af8dab3f468b869bf5a3f473094f4764db419567f33805d0@%3Cdev.beam.apache.org%3E
>>>> >>
>>>> >> [2]
>>>> >>
>>>> >> https://lists.apache.org/thread.html/01a80d62f2df6b84bfa41f05e15fda900178f882877c294fed8be91e@%3Cdev.beam.apache.org%3E
>>>> >
>>>> > --
>>>> > Jean-Baptiste Onofré
>>>> > jbono...@apache.org
>>>> > http://blog.nanthrax.net
>>>> > Talend - http://www.talend.com
>>>
>>>
>>
>
