The approach you're describing sounds like the user's Runner of Choice
would use a user-side version of the runner core, without changing the
Runner of Choice itself?

So a user would update their version of the SDK, and the runner would have
to pull the core component from the user pipeline?

That sounds like it increases pipeline size and decreases pipeline
portability, especially for pipelines that are not in the same language as
the runner-core, such as Python and Go pipelines.

It's not clear to me what runners would be doing in that scenario either.
Do you have a proposal about where the interface boundaries would be?

On Wed, May 16, 2018, 10:05 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> The runner core doesn't fully align with that or, rephrased more
> accurately, it doesn't go as far as it could in my view. Having to call
> it is still an issue, since it requires a runner update instead of
> getting the new feature for free. The next step sounds like *one* runner
> where implementations plug in their translations. It would reverse the
> current pattern and prepare Beam for the future. One good example of
> such an implementation is the SDF, which can "just" reuse DoFn
> primitives to wire its support through runners.
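>
> To make the idea concrete, here is a minimal sketch of what pluggable
> translations could look like. All names here (TransformTranslator,
> TranslationContext, TranslatorRegistry) are hypothetical, not an
> existing Beam API:
>
>   import java.util.HashMap;
>   import java.util.Map;
>   import org.apache.beam.sdk.transforms.PTransform;
>
>   // Hypothetical: each engine contributes translators instead of a whole runner.
>   interface TranslationContext {}
>
>   interface TransformTranslator<T extends PTransform<?, ?>> {
>     void translate(T transform, TranslationContext context);
>   }
>
>   final class TranslatorRegistry {
>     private final Map<Class<?>, TransformTranslator<?>> translators = new HashMap<>();
>
>     <T extends PTransform<?, ?>> void register(
>         Class<T> type, TransformTranslator<T> translator) {
>       translators.put(type, translator);
>     }
>
>     // The single generic runner looks up translations here while walking the DAG.
>     @SuppressWarnings("unchecked")
>     <T extends PTransform<?, ?>> TransformTranslator<T> lookup(T transform) {
>       return (TransformTranslator<T>) translators.get(transform.getClass());
>     }
>   }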
>
> On Thu, May 17, 2018 at 02:01, Jesse Anderson <je...@bigdatainstitute.io>
> wrote:
>
>> This -> "I'd like that each time you think that you ask yourself "does
>> it need?"."
>>
>> On Wed, May 16, 2018 at 4:53 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Thanks for your email, Romain. It helps us understand your goals and
>>> where you're coming from. I'd also like to see a thinner core, and
>>> agree it's beneficial to reduce dependencies where possible, especially
>>> when supporting the use case where the pipeline is constructed in an
>>> environment other than an end-user's main.
>>>
>>> It seems a lot of the portability work, despite being on the surface
>>> driven by multi-language support, aligns well with many of these goals.
>>> For example, all the work going on in runners-core to provide a rich
>>> library that all (Java, and perhaps non-Java) runners can leverage to
>>> do DAG preprocessing (fusion, combiner lifting, ...) and handle the
>>> low-level details of managing worker subprocesses. As you state, the
>>> more we can put into these libraries, the more all runners can get
>>> "for free" by interacting with them, while still providing the
>>> flexibility to adapt to their differing models and strengths.
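>>>
>>> As one concrete illustration, a runner could lean on that shared
>>> library roughly like this (a hedged sketch: the class and method names
>>> below approximate the runners-core-construction modules and may differ
>>> in detail; translateStage is a hypothetical engine-specific hook):
>>>
>>>   // Turn the user's pipeline into the language-independent proto form,
>>>   // then let the shared library do fusion before engine translation.
>>>   RunnerApi.Pipeline proto = PipelineTranslation.toProto(pipeline);
>>>   FusedPipeline fused = GreedyPipelineFuser.fuse(proto);
>>>   for (ExecutableStage stage : fused.getFusedStages()) {
>>>     translateStage(stage);  // engine-specific work starts here
>>>   }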
>>>
>>> Getting this right is, for me at least, one of the highest priorities
>>> for Beam.
>>>
>>> - Robert
>>> On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <k...@google.com> wrote:
>>>
>>> > Hi Romain,
>>>
>>> > This gives a clear view of your perspective. I also recommend you
>>> > ask around to those who have been working on Beam and big data
>>> > processing for a long time to learn more about their perspective.
>>>
>>> > Your "Beam Analysis" is pretty accurate about what we've been trying to
>>> build. I would say (a) & (b) as "any language on any runner" and (c) is
>>> our
>>> plan of how to do it: define primitives which are fundamental to parallel
>>> processing and formalize a language-independent representation, with
>>> adapters for each language and data processing engine.
>>>
>>> > Of course anyone in the community may have their own particular
>>> > goal. We don't control what they work on, and we are grateful for
>>> > their efforts.
>>>
>>> > Technically, there is plenty to agree with. I think as you learn
>>> > about Beam you will find that many of your suggestions are already
>>> > handled in some way. You may also continue to learn about the
>>> > specific reasons things are done in a different way than you
>>> > expected. These should help you find out how to build what you want
>>> > to build.
>>>
>>> > Kenn
>>>
>>> > On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau
>>> > <rmannibu...@gmail.com> wrote:
>>>
>>> >> Hi guys,
>>>
>>> >> Since this is not the first thread where we end up not
>>> >> understanding each other, I'd like to take the opportunity to
>>> >> clarify what I'm looking for in a more formal way. I assume our
>>> >> misunderstandings come from the fact that I mainly tried to fix
>>> >> issues one by one, instead of painting the big picture I'm after.
>>> >> (My rationale was that I couldn't invest more time in it, but I'm
>>> >> starting to think that was not a good choice.) I really hope this
>>> >> helps.
>>>
>>> >> 1. Beam analysis
>>>
>>> >> Beam has three main goals:
>>>
>>> >> a. Being a portable API across runners (I also call them
>>> >> "implementations", as opposed to the "API")
>>> >> b. Bringing some interoperability between languages and therefore
>>> >> users
>>> >> c. Providing primitives (GroupByKey, for instance), I/O, and generic
>>> >> processing items
>>>
>>> >> Indeed this doesn't cover all of Beam's features but, at a high
>>> >> level, it is what Beam brings.
>>>
>>> >> In terms of advantages, and why one would choose Beam instead of
>>> >> Spark, for instance, the benefit is mainly not being vendor-locked
>>> >> on one side, and enabling more users on the other (note that point c
>>> >> is just catching up with vendor ecosystems).
>>>
>>> >> 2. Portable API across environments
>>>
>>> >> It is key here to keep in mind that Beam is not an environment or a
>>> >> runner. It is, by design, a library *embedded* in other
>>> >> environments.
>>>
>>> >> a. This means that Beam must keep its stack as clean as possible.
>>> >> If that is still ambiguous: Beam must be dependency-free.
>>>
>>> >> Until now the workaround has been to shade dependencies. This is
>>> >> not a solution, since it leads to big jobs of hundreds of megabytes,
>>> >> which prevents scaling since we deploy over the network. It makes
>>> >> all deployment, management, and storage a pain on the ops side. The
>>> >> other pitfall of shading (or shadowing, since we are on Gradle now)
>>> >> is that it completely breaks company tooling and prevents
>>> >> vulnerability scanning and dependency upgrades - not handled by the
>>> >> dev team - from working correctly. This is a major issue for any
>>> >> software targeting a professional level and should not be
>>> >> underestimated.
>>>
>>> >> From that point we can get scared, but with Java 8 there is no real
>>> >> need for tons of dependencies in the SDK core - this is for Java,
>>> >> but it should be true for most languages, since Beam's requirements
>>> >> are light here.
>>>
>>> >> However, it may also require rethinking the SDK core's modularity:
>>> >> why are some IOs in there? Do we need a big fat SDK core?
>>>
>>> >> b. API or "put it all"?
>>>
>>> >> The current API is in sdk-core, but this actually prevents modular
>>> >> development, since there are primitives and some IOs in the core.
>>> >> What would be sane is to extract the actual API from the core into a
>>> >> beam-api. This way we match every kind of user consumption:
>>>
>>> >> - IO developers (they only need the SDF)
>>> >> - pipeline writers (they only need the pipeline + IO)
>>> >> - etc...
>>>
>>> >> Making it an API requires some changes, but probably nothing crazy,
>>> >> and it would make Beam more consumable and potentially reusable in
>>> >> other environments.
>>>
>>> >> I won't detail the API points here, since that is not the goal (I
>>> >> think I tracked most of them in
>>> >> https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436
>>> >> if you are interested).
>>>
>>> >> c. The environment is not only about jars
>>>
>>> >> Beam has two main execution environments:
>>>
>>> >> - the "pipeline.run" one
>>> >> - the pipeline execution (runner)
>>>
>>> >> The latter (the runner) is quite well known and already has some challenges:
>>>
>>> >> - it can be a plain main execution, so nothing crazy to manage
>>> >> - it can use sub-classloaders to execute, scale, and isolate jobs
>>> >> - etc. (we can think of an OSGi flavor, for instance)
>>>
>>> >> The former (pipeline.run) is way more challenging, since you must support:
>>>
>>> >> - flat mains
>>> >> - JavaEE containers
>>> >> - OSGi containers
>>> >> - custom, weird environments (the Spring Boot jar launcher)
>>> >> - ...
>>>
>>> >> This all leads to two very key consequences and programming rules
>>> >> to respect:
>>>
>>> >> - lifecycle: any component must ensure its lifecycle is very well
>>> >> respected (we must avoid "the JVM will clean up anyway" kind of
>>> >> thinking)
>>> >> - no blind caches or static abuse; this must fit *all* environments
>>> >> (PipelineOptionsFactory is a good example of that) - see the sketch
>>> >> after this list
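>>>
>>> >> A minimal sketch of what respecting that looks like for a DoFn
>>> >> (Connection and ConnectionPool are hypothetical placeholders; the
>>> >> lifecycle annotations are the real Java SDK ones):
>>>
>>> >>   import org.apache.beam.sdk.transforms.DoFn;
>>> >>
>>> >>   // Explicit lifecycle: no statics, no "the JVM will clean up anyway".
>>> >>   class WellBehavedFn extends DoFn<String, String> {
>>> >>     private transient Connection connection;  // hypothetical client type
>>> >>
>>> >>     @Setup
>>> >>     public void setup() {
>>> >>       // Acquired per DoFn instance, never stashed in a static field.
>>> >>       connection = ConnectionPool.acquire();  // hypothetical pool
>>> >>     }
>>> >>
>>> >>     @ProcessElement
>>> >>     public void process(ProcessContext c) {
>>> >>       c.output(connection.transform(c.element()));
>>> >>     }
>>> >>
>>> >>     @Teardown
>>> >>     public void teardown() {
>>> >>       connection.close();  // released explicitly, in every environment
>>> >>     }
>>> >>   }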
>>>
>>> >> 3. Make it painless for integrators/the community
>>>
>>> >> Beam's success is bound to the fact that runners exist. A quite
>>> >> important concern is that Beam keeps adding features and saying
>>> >> "runners will implement them". Each time you think that, I'd like
>>> >> you to ask yourself "is it needed?".
>>>
>>> >> I'll take two examples:
>>>
>>> >> - language portability support: there is no need to do it in all
>>> >> runners; you can have a generic runner delegating the tasks to the
>>> >> right per-runner implementation, and therefore, by adding the
>>> >> language portability feature there, you support all existing runners
>>> >> out of the box without impacting them
>>> >> - the metrics pusher: this one got some discussion and led to a
>>> >> polling implementation which doesn't work in runners that don't have
>>> >> a waiting "driver" (Hazelcast, Spark in client mode, etc.). Now it
>>> >> is going to be added to the portable API, if I got it right... If
>>> >> you think about it, you can just instrument the pipeline by
>>> >> modifying the DAG before translating it, and therefore it works on
>>> >> all runners for free as well - see the sketch after this list
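>>>
>>> >> For instance, such a DAG preprocessor can be a plain pipeline
>>> >> traversal run before translation. The visitor API below is the real
>>> >> Java SDK one; registerForMetrics is a hypothetical hook:
>>>
>>> >>   import org.apache.beam.sdk.Pipeline;
>>> >>   import org.apache.beam.sdk.runners.TransformHierarchy;
>>> >>
>>> >>   // Instrument the DAG before handing it to the runner, so metrics
>>> >>   // pushing works on every runner without runner changes.
>>> >>   pipeline.traverseTopologically(
>>> >>       new Pipeline.PipelineVisitor.Defaults() {
>>> >>         @Override
>>> >>         public void visitPrimitiveTransform(TransformHierarchy.Node node) {
>>> >>           registerForMetrics(node.getFullName());  // hypothetical hook
>>> >>         }
>>> >>       });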
>>>
>>> >> These two simple examples show that the work should probably go
>>> >> into adding (ordered) DAG preprocessors and making the runner
>>> >> something enrichable, rather than into ad-hoc solutions for each
>>> >> feature.
>>>
>>> >> 4. Be more reactive
>>>
>>> >> If you look at the I/Os, most of them can support asynchronous
>>> >> handling. The gain is being aligned with the actual I/O, not merely
>>> >> being "asynchronous" by starting a new thread. Using that allows
>>> >> Beam to scale far more and to use the machine's resources more
>>> >> efficiently.
>>>
>>> >> However, this has a big pitfall: the whole programming model must
>>> >> be reactive. Indeed, we can implicitly support a conversion from a
>>> >> non-reactive to a reactive model for simple cases (think of a DoFn
>>> >> multiplying an int by 2), but the I/O should be reactive, and Beam
>>> >> should be reactive in its completion handling, to benefit from it. A
>>> >> sketch of a bundle-level variant of that bridge follows.
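>>>
>>> >> The sketch: fire each request asynchronously, then synchronize once
>>> >> per bundle instead of once per element. AsyncClient and lookup() are
>>> >> hypothetical; the DoFn lifecycle methods are the real Java SDK ones:
>>>
>>> >>   import java.util.ArrayList;
>>> >>   import java.util.List;
>>> >>   import java.util.concurrent.CompletableFuture;
>>> >>   import org.apache.beam.sdk.transforms.DoFn;
>>> >>   import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
>>> >>   import org.joda.time.Instant;
>>> >>
>>> >>   class AsyncLookupFn extends DoFn<String, String> {
>>> >>     private static final class Pending {
>>> >>       final CompletableFuture<String> result;
>>> >>       final Instant timestamp;
>>> >>       final BoundedWindow window;
>>> >>       Pending(CompletableFuture<String> r, Instant t, BoundedWindow w) {
>>> >>         result = r; timestamp = t; window = w;
>>> >>       }
>>> >>     }
>>> >>
>>> >>     private transient AsyncClient client;  // hypothetical async client
>>> >>     private transient List<Pending> pending;
>>> >>
>>> >>     @Setup
>>> >>     public void setup() { client = new AsyncClient(); }
>>> >>
>>> >>     @StartBundle
>>> >>     public void startBundle() { pending = new ArrayList<>(); }
>>> >>
>>> >>     @ProcessElement
>>> >>     public void process(ProcessContext c, BoundedWindow window) {
>>> >>       // Do not block the bundle thread; remember the in-flight call.
>>> >>       pending.add(new Pending(client.lookup(c.element()), c.timestamp(), window));
>>> >>     }
>>> >>
>>> >>     @FinishBundle
>>> >>     public void finishBundle(FinishBundleContext ctx) {
>>> >>       // Single synchronization point per bundle.
>>> >>       for (Pending p : pending) {
>>> >>         ctx.output(p.result.join(), p.timestamp, p.window);
>>> >>       }
>>> >>     }
>>> >>
>>> >>     @Teardown
>>> >>     public void teardown() { client.close(); }
>>> >>   }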
>>>
>>>
>>>
>>> >> Summary: to sum up this mail, which tries to share the philosophy
>>> >> I'm approaching Beam with more than particular issues, I'd say that
>>> >> I strongly think that, to be a success, Beam must embrace what it
>>> >> is: a portable layer on top of existing implementations. This means
>>> >> it must define a clear and minimal API for each kind of usage, and
>>> >> probably expose it per user kind (so actually N APIs). It must
>>> >> embrace the environments it runs in and accept the constraints they
>>> >> bring. And finally, it should be less intrusive in all its layers
>>> >> and try to add features more transversally when possible (and it is
>>> >> possible in a lot of cases). If you bring features for free with new
>>> >> releases, everybody wins; if you announce features and no runner
>>> >> supports them, you lose (and lose users).
>>>
>>>
>>>
>>> >> Hope it helps,
>>> >> Romain Manni-Bucau
>>> >> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>
>>
