This -> "I'd like that, each time you think that, you ask yourself: 'is it needed?'"
On Wed, May 16, 2018 at 4:53 PM Robert Bradshaw <rober...@google.com> wrote:

> Thanks for your email, Romain. It helps me understand your goals and
> where you're coming from. I'd also like to see a thinner core, and agree
> it's beneficial to reduce dependencies where possible, especially when
> supporting the use case where the pipeline is constructed in an
> environment other than an end-user's main.
>
> It seems a lot of the portability work, despite being driven on the
> surface by multi-language support, aligns well with many of these goals.
> For example, all the work going on in runners-core to provide a rich
> library that all (Java, and perhaps non-Java) runners can leverage to do
> DAG preprocessing (fusion, combiner lifting, ...) and handle the
> low-level details of managing worker subprocesses. As you state, the
> more we can put into these libraries, the more all runners can get "for
> free" by interacting with them, while still providing the flexibility to
> adapt to their differing models and strengths.
>
> Getting this right is, for me at least, one of the highest priorities
> for Beam.
>
> - Robert
>
> On Wed, May 16, 2018 at 11:51 AM Kenneth Knowles <k...@google.com> wrote:
>
> > Hi Romain,
> >
> > This gives a clear view of your perspective. I also recommend you ask
> > around among those who have been working on Beam and big data
> > processing for a long time to learn more about their perspective.
> >
> > Your "Beam analysis" is pretty accurate about what we've been trying
> > to build. I would summarize (a) & (b) as "any language on any runner",
> > and (c) is our plan for how to do it: define primitives which are
> > fundamental to parallel processing and formalize a
> > language-independent representation, with adapters for each language
> > and data processing engine.
> >
> > Of course anyone in the community may have their own particular goal.
> > We don't control what they work on, and we are grateful for their
> > efforts.
> >
> > Technically, there is plenty to agree with.
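[Editor's note: the "DAG preprocessing (fusion, combiner lifting, ...)" Robert mentions can be illustrated with a deliberately tiny sketch. This is a toy stage list with invented names, not the real runners-core library, which operates on the Beam pipeline representation; it only shows the shape of a fusion pass, assuming consecutive element-wise stages are fusible and a shuffle is a barrier.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of runner-side DAG preprocessing: fusing consecutive
// element-wise ("map") stages into one stage so the runner schedules
// fewer tasks. The "map:<name>" / "gbk" encoding is invented for this
// sketch; real runners work on a richer pipeline graph.
public class FusionDemo {
    // Stages encoded as strings: "map:<name>" is fusible, anything
    // else (e.g. "gbk", a shuffle) acts as a fusion barrier.
    public static List<String> fuse(List<String> stages) {
        List<String> out = new ArrayList<>();
        for (String stage : stages) {
            int last = out.size() - 1;
            if (stage.startsWith("map:") && last >= 0
                    && out.get(last).startsWith("map:")) {
                // Merge the element-wise stage into its predecessor.
                out.set(last, out.get(last) + "+" + stage.substring(4));
            } else {
                out.add(stage);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two adjacent maps fuse; the shuffle stays a barrier.
        System.out.println(fuse(List.of("map:parse", "map:filter", "gbk", "map:format")));
        // prints: [map:parse+filter, gbk, map:format]
    }
}
```

A real pass also has to respect side inputs, triggers and sibling consumers before merging stages, which is exactly the kind of logic a shared library saves every runner from re-implementing.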
> > I think as you learn about Beam you will find that many of your
> > suggestions are already handled in some way. You may also, at times,
> > learn the specific reasons things are done in a different way than you
> > expected. These should help you find how to build what you want to
> > build.
> >
> > Kenn
> >
> > On Wed, May 16, 2018 at 1:14 AM Romain Manni-Bucau <
> > rmannibu...@gmail.com> wrote:
> >
> >> Hi guys,
> >>
> >> Since it is not the first time we have a thread where we end up not
> >> understanding each other, I'd like to take this as an opportunity to
> >> clarify what I'm looking for, in a more formal way. This assumes our
> >> misunderstandings come from the fact that I mainly tried to fix
> >> issues one by one, instead of painting the big picture I'm after.
> >> (My rationale was that I was not able to invest more time in it, but
> >> I start to think that was not a good choice.) I really hope it helps.
> >>
> >> 1. Beam analysis
> >>
> >> Beam has three main goals:
> >>
> >> a. Being a portable API across runners (I also call them
> >> "implementations", as opposed to "API")
> >> b. Bringing some interoperability between languages and therefore
> >> between users
> >> c. Providing primitives (group by, for instance), I/O and generic
> >> processing items
> >>
> >> Indeed this doesn't cover all of Beam's features but, at a high
> >> level, it is what Beam brings.
> >>
> >> In terms of advantages, and why one would choose Beam instead of
> >> Spark, for instance, the benefit is mainly to not be vendor locked
> >> on one side and to enable more users on the other side (note that
> >> point c is just catching up with the vendors' ecosystems).
> >>
> >> 2. Portable API across environments
> >>
> >> It is key, here, to keep in mind that Beam is not an environment or
> >> a runner. It is, by design, a library *embedded* in other
> >> environments.
> >>
> >> a. This means that Beam must keep its stack as clean as possible. In
> >> case it is still ambiguous: Beam must be dependency free.
> >>
> >> Until now the workaround has been to shade dependencies.
> >> This is not a solution since it leads to huge job jars of hundreds
> >> of megabytes, which prevents scaling since we deploy over the
> >> network. It makes all deployments, management, and storage a pain on
> >> the ops side. The other pitfall of shades (or shadowing, since we
> >> are on Gradle now) is that it completely breaks any company tooling
> >> and prevents vulnerability scanning or dependency upgrades - not
> >> handled by the dev team - from working correctly. This is a major
> >> issue for any software targeting a professional level and should not
> >> be underestimated.
> >>
> >> From that point we can get scared, but with Java 8 there is no real
> >> point in having tons of dependencies for the SDK core - this is for
> >> Java but should be true for most languages, since Beam's
> >> requirements are light here.
> >>
> >> However it can also require rethinking the SDK core modularity: why
> >> is there some IO in there? Do we need a big fat SDK core?
> >>
> >> b. API or "put it all"?
> >>
> >> The current API lives in sdk-core, but that actually prevents
> >> modular development, since there are primitives and some IO in the
> >> core. What would be sane is to extract the actual API from the core
> >> into a beam-api. This way we match all kinds of user consumption:
> >>
> >> - IO developers (they only need the SDF)
> >> - pipeline writers (they only need the pipeline + IO)
> >> - etc...
> >>
> >> Making it an API requires some changes, but probably nothing crazy,
> >> and it would make Beam more consumable and potentially reusable in
> >> other environments.
> >>
> >> I'll not detail the API points here since it is not the goal (I
> >> think I tracked most of them in
> >> https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436
> >> if you are interested).
> >>
> >> c.
> >> Environment is not only about jars
> >>
> >> Beam has two main execution environments:
> >>
> >> - the "pipeline.run" one
> >> - the pipeline execution one (the runner)
> >>
> >> The last one is quite well known and already has some challenges:
> >>
> >> - it can be a main execution, so nothing crazy to manage
> >> - it can use subclassloaders to execute jobs, scale and isolate jobs
> >> - etc... (we can think of an OSGi flavor, for instance)
> >>
> >> The first one is way more challenging since you must match:
> >>
> >> - flat mains
> >> - JavaEE containers
> >> - OSGi containers
> >> - custom weird environments (the Spring Boot jar launcher)
> >> - ...
> >>
> >> This all leads to two very key consequences and programming rules to
> >> respect:
> >>
> >> - lifecycle: any component must ensure its lifecycle is very well
> >> respected (we must avoid "the JVM will clean up anyway" kinds of
> >> thinking)
> >> - no blind caches or static abuse; this must fit *all* environments
> >> (PipelineOptionsFactory is a good example of that)
> >>
> >> 3. Make it hurtless for integrators/community
> >>
> >> Beam's success is bound to the fact that runners exist. A concern
> >> which is quite important is that Beam keeps adding features and
> >> saying "runners will implement them". I'd like that, each time you
> >> think that, you ask yourself: "is it needed?"
> >>
> >> I'll take two examples:
> >>
> >> - the portable language support: there is no need to do it in all
> >> runners; you can have a generic runner delegating the tasks to the
> >> right implementation@runner and therefore, by adding the language
> >> portability feature there, you support all existing runners OOTB
> >> without impacting them
> >> - the metrics pusher: this one got some discussion and led to a
> >> polling implementation which doesn't work in all runners lacking a
> >> waiting "driver" (Hazelcast, Spark in client mode, etc...).
> >> Now it is going to be added to the portable API, if I got it
> >> right... If you think about it, you can instead just instrument the
> >> pipeline by modifying the DAG before translating it, and therefore
> >> it works on all runners for free as well.
> >>
> >> These two simple examples show that the work should probably go into
> >> adding (ordered) DAG preprocessors and making the runner something
> >> enrichable, rather than into ad-hoc solutions for each feature.
> >>
> >> 4. Be more reactive
> >>
> >> If you look at the I/O, most of them can support asynchronous
> >> handling. The gain is to be aligned with the actual I/O and not
> >> merely be asynchronous by starting a new thread. Using that allows
> >> scaling way more and using the machine's resources more efficiently.
> >>
> >> However it has a big pitfall: the whole programming model must be
> >> reactive. Indeed, we can support an implicit conversion from a
> >> non-reactive to a reactive model for simple cases (think of a DoFn
> >> multiplying an int by 2), but the I/O should be reactive and Beam
> >> should be reactive in its completion handling to benefit from it.
> >>
> >> Summary: if I try to summarize this mail, which tries to share the
> >> philosophy I'm approaching Beam with, more than particular issues,
> >> I'd say that I strongly think that, to be a success, Beam must
> >> embrace what it is: a portable layer on top of existing
> >> implementations. It means that it must define a clear and minimal
> >> API for each kind of usage and probably expose one API per user kind
> >> (so actually N APIs). It must embrace the environments it runs in
> >> and assume the constraints they bring. And finally it should be less
> >> intrusive in all its layers and try to add features more
> >> transversally when possible (and it is possible in a lot of cases).
> >> If you bring features for free with new releases, everybody wins; if
> >> you announce features and no runner supports them, then you lose
> >> (and lose users).
> >>
> >> Hope it helps,
> >> Romain Manni-Bucau
> >> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
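[Editor's note: the "instrument the pipeline by modifying the DAG before translating it" idea from point 3 can be sketched very loosely as a preprocessor that wraps each transform with a metrics-counting decorator before the runner sees it. All names here are hypothetical and the DAG is reduced to a list of functions; this is not Beam's API, only the shape of the argument that metrics can be added generically rather than per-runner.]

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: a generic DAG preprocessor that wraps every element-wise
// transform with a counting decorator, so any runner that executes the
// rewritten DAG reports element metrics "for free". A pusher thread
// could periodically flush ELEMENTS to a metrics backend.
public class MetricsInstrumenter {
    public static final AtomicLong ELEMENTS = new AtomicLong();

    // Rewrite the (toy) DAG: each transform is replaced by a wrapper
    // that bumps the counter and then delegates to the original.
    public static <T> List<Function<T, T>> instrument(List<Function<T, T>> dag) {
        return dag.stream()
                .map(fn -> (Function<T, T>) t -> {
                    ELEMENTS.incrementAndGet();
                    return fn.apply(t);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Function<Integer, Integer>> dag = List.of(x -> x + 1, x -> x * 2);
        Integer v = 3;
        // The "runner" just executes whatever DAG it is handed;
        // it knows nothing about metrics.
        for (Function<Integer, Integer> fn : instrument(dag)) {
            v = fn.apply(v);
        }
        System.out.println(v + ", elements seen: " + ELEMENTS.get());
        // prints: 8, elements seen: 2
    }
}
```

The design point the thread makes is visible here: the runner loop is untouched, so the feature arrives via the preprocessor rather than via per-runner support.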