Beam high level directions (was "Graal instead of docker?")

Romain Manni-Bucau Wed, 16 May 2018 01:15:07 -0700

Hi guys,

Since it is not the first time we have a thread where we end up not
understanding each other, I'd like to take this as an opportunity to
clarify what I'm looking for, in a more formal way. This assumes our
misunderstandings come from the fact that I mainly tried to fix issues one
by one, instead of painting the big picture I'm after. (My rationale was
that I could not invest more time in that, but I'm starting to think it
was not a good choice.) I really hope it helps.


1. Beam analysis

Beam has three main goals:

a. Being a portable API across runners (I also call them "implementations",
as opposed to the "API")
b. Bringing some interoperability between languages and therefore users
c. Providing primitives (group-by for instance), I/O and generic processing
items

Of course this doesn't cover all of Beam's features but, at a high level,
it is what it brings.

In terms of advantages, and why one would choose Beam instead of Spark, for
instance, the benefit is mainly to avoid vendor lock-in on one side and to
enable more users on the other side (note that, with these statements,
point c is just catching up with the vendors' ecosystems).

2. Portable API across environments

It is key, here, to keep in mind that Beam is not an environment or a
runner. It is, by design, a library *embedded* in other environments.

a. This means that Beam must keep its stack as clean as possible. If it is
still ambiguous: Beam must be dependency free.

Until now the workaround has been to shade dependencies. This is not a
solution since it leads to big jobs of hundreds of megabytes, which
prevents scaling since we deploy over the network. It makes deployment,
management and storage a pain on the ops side. The other pitfall of shading
(or shadowing, since we are on Gradle now) is that it completely breaks
company tooling and prevents vulnerability scanning or dependency upgrades
(those not handled by the dev team) from working correctly. This is a major
issue for any software targeting a professional level and it should not be
underestimated.

From that point we can get scared, but with Java 8 there is no real point
in having tons of dependencies for the SDK core. This is for Java but it
should be true for most languages since Beam's requirements are light here.

However it can also require rethinking the SDK core modularity: why is
there some IO here? Do we need a big fat SDK core?

b. API or "put it all"?

The current API is in sdk-core, but this actually prevents modular
development since there are primitives and some IO in the core. What would
be sane is to extract the actual API from the core and get a beam-api. This
way we match all kinds of consumers:

- IO developers (they only need the SDF)
- pipeline writers (they only need the pipeline + IO)
- etc...

Making it an API requires some changes, but probably nothing crazy, and
it would make Beam more consumable and potentially reusable in other
environments.

I won't detail the API points here since it is not the goal (I think I
tracked most of them in
https://gist.github.com/rmannibucau/ab7543c23b6f57af921d98639fbcd436 if you
are interested).

c. Environment is not only about jars

Beam has two main execution environments:

- the "pipeline.run" one
- the pipeline execution (runner)

The last one is quite known and already has some challenges:

- can be a main execution so nothing crazy to manage
- can use subclassloaders to execute jobs, scale and isolate jobs
- etc... (we can think of an OSGi flavor for instance)

The first one is way more challenging since you must match:

- flat mains
- JavaEE containers
- OSGi containers
- custom weird environments (spring boot jar launcher)
- ...

All this leads to two key consequences, which are programming rules to
respect:

- lifecycle: any component must ensure its lifecycle is well respected
(we must avoid "the JVM will clean up anyway" kind of thinking)
- no blind cache or static abuse: this must fit *all* environments
(PipelineOptionsFactory is a good example of that)
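
To make these two rules concrete, here is a minimal sketch in plain Java.
ScopedRegistry is a hypothetical component, not a Beam class: it keeps its
cache in an instance field instead of a static one and releases it
explicitly in close(), so it behaves the same in a flat main, a container
or a test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical component (not part of Beam): it owns its cache and its lifecycle.
final class ScopedRegistry implements AutoCloseable {
    // Instance field rather than a static one: each container, classloader
    // or test gets its own state, and nothing survives close().
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private volatile boolean closed;

    Object lookup(String key, Function<String, Object> loader) {
        if (closed) {
            throw new IllegalStateException("registry already closed");
        }
        return cache.computeIfAbsent(key, loader);
    }

    int size() {
        return cache.size();
    }

    @Override
    public void close() {
        // Explicit cleanup instead of relying on JVM shutdown.
        closed = true;
        cache.clear();
    }
}

class LifecycleSketch {
    public static void main(String[] args) {
        ScopedRegistry registry = new ScopedRegistry();
        try (ScopedRegistry scoped = registry) {
            scoped.lookup("options", key -> new Object());
            System.out.println("entries while open: " + scoped.size());
        }
        System.out.println("entries after close: " + registry.size());
    }
}
```

The owner of the registry decides when it dies, which is what an embedding
environment (JavaEE, OSGi, ...) needs in order to undeploy cleanly.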

3. Make it painless for integrators/community

Beam's success is bound to the fact that runners exist. A quite important
concern is that Beam keeps adding features and saying "runners will
implement them". Each time you think that, I'd like you to ask yourself
"do they really need to?".

I'll take two examples:

- the language portable support: there is no need to do it in all runners;
you can have a generic runner delegating the tasks to the right
implementation@runner and therefore, by adding the language portability
feature, you support all existing runners out of the box without impacting
them
- the metrics pusher: this one got some discussion and led to a polling
implementation which doesn't work in runners that have no waiting
"driver" (hazelcast, spark in client mode, etc...). Now it is going to be
added to the portable API, if I got it right... If you think about it, you
can just instrument the pipeline by modifying the DAG before translating
it, and therefore work on all runners for free as well.

These two simple examples show that the work should probably be done on
adding (sorted) DAG preprocessors and on making the runner something that
can be enriched, rather than on ad-hoc solutions for each feature.
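
A rough sketch of what such sorted preprocessors could look like, in plain
Java. Everything here is an assumption made for illustration:
DagPreprocessor, MetricsInstrumentation and PortableStepWrapper are
hypothetical names, not existing Beam types, and the DAG is simplified to a
list of step names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical extension point (not an existing Beam interface).
interface DagPreprocessor {
    int order();                               // lower values run first
    List<String> rewrite(List<String> steps);  // returns a new, rewritten DAG
}

// Adds a metrics step to the DAG so no runner has to implement pushing itself.
final class MetricsInstrumentation implements DagPreprocessor {
    public int order() { return 100; }
    public List<String> rewrite(List<String> steps) {
        List<String> out = new ArrayList<>(steps);
        out.add("PushMetrics");
        return out;
    }
}

// Wraps steps written in another language into a generic delegating step.
final class PortableStepWrapper implements DagPreprocessor {
    public int order() { return 10; }
    public List<String> rewrite(List<String> steps) {
        List<String> out = new ArrayList<>();
        for (String step : steps) {
            out.add(step.startsWith("py:") ? "Delegate(" + step + ")" : step);
        }
        return out;
    }
}

class PreprocessorSketch {
    // Applies every preprocessor in ascending order() before runner translation.
    static List<String> preprocess(List<String> steps, List<DagPreprocessor> preprocessors) {
        List<DagPreprocessor> sorted = new ArrayList<>(preprocessors);
        sorted.sort(Comparator.comparingInt(DagPreprocessor::order));
        List<String> current = steps;
        for (DagPreprocessor preprocessor : sorted) {
            current = preprocessor.rewrite(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> dag = Arrays.asList("Read", "py:Transform", "Write");
        List<DagPreprocessor> chain =
            Arrays.asList(new MetricsInstrumentation(), new PortableStepWrapper());
        // prints [Read, Delegate(py:Transform), Write, PushMetrics]
        System.out.println(preprocess(dag, chain));
    }
}
```

The runner only ever sees the rewritten DAG, so a feature added this way
does not require any change in the runner itself.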

4. Be more reactive

If you check the I/Os, most of them can support asynchronous handling. The
gain is to be aligned with the actual I/O, and not to be asynchronous only
by starting a new thread. Using that allows scaling way more and using the
resources of the machine more efficiently.

However it has a big pitfall: the whole programming model must be reactive.
Indeed, we can implicitly support a conversion from a non-reactive to a
reactive model for simple cases (think of a DoFn multiplying an int by 2),
but the I/O should be reactive and Beam should be reactive in its
completion to benefit from it.
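
As an illustration of the implicit conversion mentioned above, here is a
minimal sketch using only the JDK's CompletionStage. It is not Beam API:
lift and asyncRead are hypothetical helpers, and asyncRead uses
supplyAsync only as a stand-in for a real non-blocking client, which would
complete the stage from its I/O callback instead of from a pool thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Function;

class ReactiveSketch {
    // Implicit conversion: lifts a plain synchronous function into the
    // asynchronous model by wrapping its result in a completed stage.
    static <I, O> Function<I, CompletionStage<O>> lift(Function<I, O> sync) {
        return input -> CompletableFuture.completedFuture(sync.apply(input));
    }

    // Stand-in for a reactive read (assumption, see the note above).
    static CompletionStage<Integer> asyncRead() {
        return CompletableFuture.supplyAsync(() -> 21);
    }

    // The completion of the whole chain is itself a stage: nothing blocks here.
    static CompletionStage<Integer> pipeline() {
        Function<Integer, CompletionStage<Integer>> timesTwo = lift(x -> x * 2);
        return asyncRead().thenCompose(timesTwo);
    }

    public static void main(String[] args) {
        // Blocking happens only at the very edge, for the demo.
        int result = pipeline().toCompletableFuture().join();
        System.out.println(result); // prints 42
    }
}
```

The simple function did not need to be rewritten, but the read and the
final completion both had to be expressed as stages for the chain to stay
non-blocking.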



Summary: if I try to summarize this mail, which tries to share the
philosophy I'm approaching Beam with more than particular issues, I'd say
that I strongly think that, to be a success, Beam must embrace what it is:
a portable layer on top of existing implementations. It means that it must
define a clear and minimal API for each kind of usage and probably expose
it by user kind (so actually N APIs). It must embrace the environments it
runs in and accept the constraints they bring. And finally it should be
less intrusive in all its layers and try to add features more transversally
when possible (and it is possible in a lot of cases). If you bring features
for free with new releases, everybody wins; if you announce features and no
runner supports them, then you lose (and lose users).



Hope it helps,
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>
