This -> "I'd like that, each time you think that, you ask yourself: 'is it needed?'"
On Wed, May 16, 2018 at 4:53 PM Robert Bradshaw <rober...@google.com> wrote:

> Thanks for your email, Romain. It helps me understand your goals and
> where you're coming from. I'd also like to see a thinner core, and agree
> it's beneficial to reduce dependencies where possible, especially when
> supporting the use case where the pipeline is constructed in an
> environment other than an end-user's main.
>
> It seems a lot of the portability work, despite being driven on the
> surface by multi-language support, aligns well with many of these goals.
> For example, all the work going on in runners-core to provide a rich
> library that all (Java, and perhaps non-Java) runners can leverage to do
> DAG preprocessing (fusion, combiner lifting, ...) and handle the
> low-level details of managing worker subprocesses. As you state, the
> more we can put into these libraries, the more all runners can get "for
> free" by interacting with them, while still providing the flexibility to
> adapt to their differing models and strengths.
>
> Getting this right is, for me at least, one of the highest priorities
> for Beam.
>
> - Robert
>
> On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <k...@google.com> wrote:
>
> > Hi Romain,
> >
> > This gives a clear view of your perspective. I also recommend you ask
> > around among those who have been working on Beam and big data
> > processing for a long time to learn more about their perspective.
> >
> > Your "Beam analysis" is pretty accurate about what we've been trying
> > to build. I would summarize (a) & (b) as "any language on any runner",
> > and (c) is our plan for how to do it: define primitives which are
> > fundamental to parallel processing and formalize a
> > language-independent representation, with adapters for each language
> > and data processing engine.
> >
> > Of course anyone in the community may have their own particular goal.
> > We don't control what they work on, and we are grateful for their
> > efforts.
> >
> > Technically, there is plenty to agree with.
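[Editor's note: the "DAG preprocessing (fusion, combiner lifting, ...)" Robert mentions can be illustrated with a deliberately tiny sketch. This is a toy stage list with invented names, not the real runners-core library, which operates on the Beam pipeline representation; it only shows the shape of a fusion pass, assuming consecutive element-wise stages are fusible and a shuffle is a barrier.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of runner-side DAG preprocessing: fusing consecutive
// element-wise ("map") stages into one stage so the runner schedules
// fewer tasks. The "map:<name>" / "gbk" encoding is invented for this
// sketch; real runners work on a richer pipeline graph.
public class FusionDemo {
    // Stages encoded as strings: "map:<name>" is fusible, anything
    // else (e.g. "gbk", a shuffle) acts as a fusion barrier.
    public static List<String> fuse(List<String> stages) {
        List<String> out = new ArrayList<>();
        for (String stage : stages) {
            int last = out.size() - 1;
            if (stage.startsWith("map:") && last >= 0
                    && out.get(last).startsWith("map:")) {
                // Merge the element-wise stage into its predecessor.
                out.set(last, out.get(last) + "+" + stage.substring(4));
            } else {
                out.add(stage);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two adjacent maps fuse; the shuffle stays a barrier.
        System.out.println(fuse(List.of("map:parse", "map:filter", "gbk", "map:format")));
        // prints: [map:parse+filter, gbk, map:format]
    }
}
```

A real pass also has to respect side inputs, triggers and sibling consumers before merging stages, which is exactly the kind of logic a shared library saves every runner from re-implementing.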
> > I think as you learn about Beam you will find that many of your
> > suggestions are already handled in some way. You may also, at times,
> > learn the specific reasons things are done in a different way than you
> > expected. These should help you find how to build what you want to
> > build.
> >
> > Kenn
> >
> > On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau <
> > rmannibu...@gmail.com> wrote:
> >
> >> Hi guys,
> >>
> >> Since it is not the first time we have a thread where we end up not
> >> understanding each other, I'd like to take this as an opportunity to
> >> clarify what I'm looking for, in a more formal way. This assumes our
> >> misunderstandings come from the fact that I mainly tried to fix
> >> issues one by one, instead of painting the big picture I'm after.
> >> (My rationale was that I was not able to invest more time in it, but
> >> I start to think that was not a good choice.) I really hope it helps.
> >>
> >> 1. Beam analysis
> >>
> >> Beam has three main goals:
> >>
> >> a. Being a portable API across runners (I also call them
> >> "implementations", as opposed to "API")
> >> b. Bringing some interoperability between languages and therefore
> >> between users
> >> c. Providing primitives (group by, for instance), I/O and generic
> >> processing items
> >>
> >> Indeed this doesn't cover all of Beam's features but, at a high
> >> level, it is what Beam brings.
> >>
> >> In terms of advantages, and why one would choose Beam instead of
> >> Spark, for instance, the benefit is mainly to not be vendor locked
> >> on one side and to enable more users on the other side (note that
> >> point c is just catching up with the vendors' ecosystems).
> >>
> >> 2. Portable API across environments
> >>
> >> It is key, here, to keep in mind that Beam is not an environment or
> >> a runner. It is, by design, a library *embedded* in other
> >> environments.
> >>
> >> a. This means that Beam must keep its stack as clean as possible. In
> >> case it is still ambiguous: Beam must be dependency free.
> >>
> >> Until now the workaround has been to shade dependencies.
> >> This is not a solution since it leads to huge job jars of hundreds
> >> of megabytes, which prevents scaling since we deploy over the
> >> network. It makes all deployments, management, and storage a pain on
> >> the ops side. The other pitfall of shades (or shadowing, since we
> >> are on Gradle now) is that it completely breaks any company tooling
> >> and prevents vulnerability scanning or dependency upgrades - not
> >> handled by the dev team - from working correctly. This is a major
> >> issue for any software targeting a professional level and should not
> >> be underestimated.
> >>
> >> From that point we can get scared, but with Java 8 there is no real
> >> point in having tons of dependencies for the SDK core - this is for
> >> Java but should be true for most languages, since Beam's
> >> requirements are light here.
> >>
> >> However it can also require rethinking the SDK core modularity: why
> >> is there some IO in there? Do we need a big fat SDK core?
> >>
> >> b. API or "put it all"?
> >>
> >> The current API lives in sdk-core, but that actually prevents
> >> modular development, since there are primitives and some IO in the
> >> core. What would be sane is to extract the actual API from the core
> >> into a beam-api. This way we match all kinds of user consumption:
> >>
> >> - IO developers (they only need the SDF)
> >> - pipeline writers (they only need the pipeline + IO)
> >> - etc...
> >>
> >> Making it an API requires some changes, but probably nothing crazy,
> >> and it would make Beam more consumable and potentially reusable in
> >> other environments.
> >>
> >> I'll not detail the API points here since it is not the goal (I
> >> think I tracked most of them in
> >> https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436
> >> if you are interested).
> >>
> >> c.
> >> Environment is not only about jars
> >>
> >> Beam has two main execution environments:
> >>
> >> - the "pipeline.run" one
> >> - the pipeline execution one (the runner)
> >>
> >> The last one is quite well known and already has some challenges:
> >>
> >> - it can be a main execution, so nothing crazy to manage
> >> - it can use subclassloaders to execute jobs, scale and isolate jobs
> >> - etc... (we can think of an OSGi flavor, for instance)
> >>
> >> The first one is way more challenging since you must match:
> >>
> >> - flat mains
> >> - JavaEE containers
> >> - OSGi containers
> >> - custom weird environments (the Spring Boot jar launcher)
> >> - ...
> >>
> >> This all leads to two very key consequences and programming rules to
> >> respect:
> >>
> >> - lifecycle: any component must ensure its lifecycle is very well
> >> respected (we must avoid "the JVM will clean up anyway" kinds of
> >> thinking)
> >> - no blind caches or static abuse; this must fit *all* environments
> >> (PipelineOptionsFactory is a good example of that)
> >>
> >> 3. Make it hurtless for integrators/community
> >>
> >> Beam's success is bound to the fact that runners exist. A concern
> >> which is quite important is that Beam keeps adding features and
> >> saying "runners will implement them". I'd like that, each time you
> >> think that, you ask yourself: "is it needed?"
> >>
> >> I'll take two examples:
> >>
> >> - the portable language support: there is no need to do it in all
> >> runners; you can have a generic runner delegating the tasks to the
> >> right implementation@runner and therefore, by adding the language
> >> portability feature there, you support all existing runners OOTB
> >> without impacting them
> >> - the metrics pusher: this one got some discussion and led to a
> >> polling implementation which doesn't work in all runners lacking a
> >> waiting "driver" (Hazelcast, Spark in client mode, etc...).
> >> Now it is going to be added to the portable API, if I got it
> >> right... If you think about it, you can instead just instrument the
> >> pipeline by modifying the DAG before translating it, and therefore
> >> it works on all runners for free as well.
> >>
> >> These two simple examples show that the work should probably go into
> >> adding (ordered) DAG preprocessors and making the runner something
> >> enrichable, rather than into ad-hoc solutions for each feature.
> >>
> >> 4. Be more reactive
> >>
> >> If you look at the I/O, most of them can support asynchronous
> >> handling. The gain is to be aligned with the actual I/O and not
> >> merely be asynchronous by starting a new thread. Using that allows
> >> scaling way more and using the machine's resources more efficiently.
> >>
> >> However it has a big pitfall: the whole programming model must be
> >> reactive. Indeed, we can support an implicit conversion from a
> >> non-reactive to a reactive model for simple cases (think of a DoFn
> >> multiplying an int by 2), but the I/O should be reactive and Beam
> >> should be reactive in its completion handling to benefit from it.
> >>
> >> Summary: if I try to summarize this mail, which tries to share the
> >> philosophy I'm approaching Beam with, more than particular issues,
> >> I'd say that I strongly think that, to be a success, Beam must
> >> embrace what it is: a portable layer on top of existing
> >> implementations. It means that it must define a clear and minimal
> >> API for each kind of usage and probably expose one API per user kind
> >> (so actually N APIs). It must embrace the environments it runs in
> >> and assume the constraints they bring. And finally it should be less
> >> intrusive in all its layers and try to add features more
> >> transversally when possible (and it is possible in a lot of cases).
> >> If you bring features for free with new releases, everybody wins; if
> >> you announce features and no runner supports them, then you lose
> >> (and lose users).
> >>
> >> Hope it helps,
> >> Romain Manni-Bucau
> >> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
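[Editor's note: the "instrument the pipeline by modifying the DAG before translating it" idea from point 3 can be sketched very loosely as a preprocessor that wraps each transform with a metrics-counting decorator before the runner sees it. All names here are hypothetical and the DAG is reduced to a list of functions; this is not Beam's API, only the shape of the argument that metrics can be added generically rather than per-runner.]

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: a generic DAG preprocessor that wraps every element-wise
// transform with a counting decorator, so any runner that executes the
// rewritten DAG reports element metrics "for free". A pusher thread
// could periodically flush ELEMENTS to a metrics backend.
public class MetricsInstrumenter {
    public static final AtomicLong ELEMENTS = new AtomicLong();

    // Rewrite the (toy) DAG: each transform is replaced by a wrapper
    // that bumps the counter and then delegates to the original.
    public static <T> List<Function<T, T>> instrument(List<Function<T, T>> dag) {
        return dag.stream()
                .map(fn -> (Function<T, T>) t -> {
                    ELEMENTS.incrementAndGet();
                    return fn.apply(t);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Function<Integer, Integer>> dag = List.of(x -> x + 1, x -> x * 2);
        Integer v = 3;
        // The "runner" just executes whatever DAG it is handed;
        // it knows nothing about metrics.
        for (Function<Integer, Integer> fn : instrument(dag)) {
            v = fn.apply(v);
        }
        System.out.println(v + ", elements seen: " + ELEMENTS.get());
        // prints: 8, elements seen: 2
    }
}
```

The design point the thread makes is visible here: the runner loop is untouched, so the feature arrives via the preprocessor rather than via per-runner support.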