I also think that at a high level the success of Beam as a
project/community and as a piece of software depends on having
multiple viable runners with a healthy set of users and contributors.
The pieces that are missing, in my view:
*User-focused comparison of runners (and IOs)*
+1 to Jesse's points. Automated capability tests don't really help here.
Benchmarks will be part of the story but are worth very little on
their own. Focusing on these is just choosing to measure things that
are easy to measure instead of addressing what is important, which is
in the end almost always qualitative.
*Automated integration tests on clusters*
We do need to know that runners and IOs "work" in a basic yes/no
manner on every commit/release, beyond unit tests. I am not really
willing to strongly claim to a potential user that something "works"
without this level of automation.
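To make the yes/no bar concrete, here is a rough sketch of the kind of check I mean, built on the Java SDK's PAssert and picking the runner purely from command-line options (the class name, elements, and flags are only illustrative, not an existing test):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

// Illustrative cluster smoke test: run with e.g. --runner=FlinkRunner plus the
// runner's own cluster options; passing means "works" at this basic level.
public class ClusterSmokeIT {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    PCollection<Long> count =
        p.apply(Create.of("a", "b", "c")).apply(Count.<String>globally());

    // The assertion fails the job if the result is wrong.
    PAssert.thatSingleton(count).isEqualTo(3L);

    p.run().waitUntilFinish();
  }
}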
*More uniform operational experiences*
Setting up your Spark/Flink/Apex deployment will naturally be different.
Launching a Beam pipeline on it should not be.
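As a sketch of what that looks like from the user side, the pipeline code below stays identical and only the launch flags change; the flag names depend on each runner's options class, so treat them as illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

// The pipeline code stays the same; only the launch flags change, e.g.
//   --runner=SparkRunner --sparkMaster=spark://host:7077
//   --runner=FlinkRunner --flinkMaster=host:6123
//   --runner=ApexRunner
public class RunnerAgnosticLaunch {
  public static void main(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("the same pipeline, whatever the runner"));
    p.run().waitUntilFinish();
  }
}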
*Portability: Any SDK on any runner*
We now have one SDK on master and one SDK on a dev branch that both support portable execution to some degree. Unfortunately we have no major
open source runner that supports portability*. "Java on any runner" is
not compelling enough any more, if it ever was.
----
Reviews: I agree our response latency is too slow. I do not agree that
our quality bar is too high; I think we should raise it
*significantly*. Our codebase fails tests for long periods. Our tests
need to be green enough that we are comfortable blocking merges *even
for unrelated failures*. We should be able to cut a release any time,
modulo known blocker-level bugs.
Runner dev: I think Etienne's point about making it more uniform to add features to all runners is actually quite important, since the
portability framework is a lot harder than "translate a Beam ParDo to
XYZ's FlatMap" where they are both Java. And even the support code
we've been building is not obvious to use and probably won't be for
the foreseeable future. This fits well into the "Ben thread" on
technical ideas so I'll comment there.
Kenn
*We do have a local batch-only portable runner in Python
On Fri, Jan 26, 2018 at 10:09 AM, Lukasz Cwik <lc...@google.com> wrote:
Etienne, for the cross-runner coherence, the portability framework is attempting to create an API across all runners for job management and job execution. A lot of work still needs to be done to define and implement these APIs and migrate runners and SDKs to support them, since the current set of Java APIs is ad hoc in usage and purpose. In my opinion, development should really be focused on migrating runners and SDKs to use these APIs to get developer coherence. Work is slowly progressing on integrating them into the Java, Python, and Go SDKs and there are several JIRA issues in this regard, but involvement from more people could help.
Some helpful pointers are:
https://s.apache.org/beam-runner-api
https://s.apache.org/beam-fn-api
https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
On Fri, Jan 26, 2018 at 7:21 AM, Etienne Chauchot <echauc...@apache.org> wrote:
Hi all,
Does anyone have comments about my point about dev coherence
across the runners?
Thanks
Etienne
On 22/01/2018 at 16:16, Etienne Chauchot wrote:
Thanks Davor for bringing this discussion up!
I particularly like that you listed the different areas of improvement and proposed assigning people based on their preferences.
I wanted to add a point about consistency across runners, but from the dev point of view: I've been working on a cross-runner feature lately (metrics push, agnostic of the runner) for which I compared the behavior of the runners and wired the feature into the Flink and Spark runners themselves. I must admit that I had a hard time figuring out how to wire it up in the different runners, and that it was completely different between them. Also, their use (or non-use) of runner-core facilities varies. The same goes for the architecture of the tests: some runners, like Spark, own their ValidatesRunner tests in the runner module, while other runners run ValidatesRunner tests that are owned by the sdk-core module. I also noticed some differences in the way streaming tests are done: for some runners, triggering streaming mode requires using an equivalent of the direct runner's TestStream in the pipeline, while for others setting streaming=true in pipelineOptions is enough.
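For illustration, the two ways of ending up in streaming mode mentioned above could look like this (a sketch against the Java SDK; the class and method names are only illustrative, and which approach a given runner honors is exactly the inconsistency described):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.testing.TestStream;

public class StreamingModeSketch {

  // Approach 1: drive streaming semantics from the pipeline itself with
  // TestStream, as the direct runner's tests do.
  static Pipeline withTestStream() {
    Pipeline p = Pipeline.create();
    TestStream<String> events =
        TestStream.create(StringUtf8Coder.of())
            .addElements("a", "b")
            .advanceWatermarkToInfinity();
    p.apply(events);
    return p;
  }

  // Approach 2: flip the streaming flag in the options (the equivalent of
  // passing --streaming=true on the command line).
  static Pipeline withStreamingFlag() {
    StreamingOptions options =
        PipelineOptionsFactory.create().as(StreamingOptions.class);
    options.setStreaming(true);
    return Pipeline.create(options);
  }
}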
=> Long story short, IMHO it could be interesting to enhance the runner API to contain more than run(). This would have the benefit of increasing coherence between runners. That said, we would need to find the right balance between too many methods in the runner API (which would reduce the flexibility of runner implementations) and too few (which would reduce coherence between the runners).
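To be clear about what "more than run()" could mean, here is a purely hypothetical sketch; none of these methods beyond run() exist in Beam today, and the names are invented only to illustrate the idea:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricResults;

// Hypothetical contract: a single well-known place where every runner would
// wire up cross-cutting concerns (streaming mode, metrics export, ...),
// instead of each runner doing it its own way.
public interface ExtendedRunnerContract {
  // What the current PipelineRunner contract effectively requires.
  PipelineResult run(Pipeline pipeline);

  // Hypothetical: does this execution run in streaming mode?
  boolean isStreamingExecution(Pipeline pipeline);

  // Hypothetical: a uniform handle for exporting/pushing metrics.
  MetricResults metricsFor(PipelineResult result);
}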
=> In addition, to enhance coherence between the runners (from the dev point of view), having all the runners run the exact same ValidatesRunner tests in both batch and streaming modes would be awesome!
Another thing: big +1 to having a programmatic way of defining the capability matrix, as Romain suggested.
Also agree with Ismaël's point about overly flexible concepts across runners (termination, bundling, ...).
Also a big +1 to what Jesse wrote. I was myself in the user/architect position in the past, and I can confirm that all the points he mentioned are accurate.
Best,
Etienne
On 16/01/2018 at 17:39, Ismaël Mejía wrote:
Thanks Davor for opening this discussion and a HUGE +1 to doing this every year or in cycles. I will fork this thread into a new one for the culture / project management issues as suggested.

On the subject of the diversity of users across runners, I think this requires more attention to unification and implies work in at least the following areas:
* Automated validation and consistent semantics among runners

Users should be confident that moving their code from one runner to the other just works, and the only way to ensure this is by having a runner pass ValidatesRunner/TCK tests and with this 'graduate' such support, as Romain suggested. The capability matrix is really nice but it is not a programmatic way to do this. Also, individual features usually work, but feature combinations produce issues, so we need more exact semantics to avoid these.

Some parts of Beam's semantics are loose (e.g. bundle partitioning, pipeline termination, etc.). I suppose this was a design decision to allow flexibility in runner implementations, but it becomes inconvenient when users move among runners and get different results. I am not sure the current tradeoff is worth the usability sacrifice for the end user.
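For reference, the kind of test meant here follows the Java SDK's existing @Category(ValidatesRunner.class) convention; a minimal sketch (the class and test names are illustrative):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.ValidatesRunner;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

@RunWith(JUnit4.class)
public class ExampleValidatesRunnerTest {

  @Rule public final transient TestPipeline p = TestPipeline.create();

  // The ValidatesRunner category is what lets each runner's build pick up and
  // execute the same test against that runner.
  @Test
  @Category(ValidatesRunner.class)
  public void testSimpleMap() {
    PCollection<Integer> lengths =
        p.apply(Create.of("a", "bb", "ccc"))
            .apply(MapElements.via(
                new SimpleFunction<String, Integer>() {
                  @Override
                  public Integer apply(String input) {
                    return input.length();
                  }
                }));
    PAssert.that(lengths).containsInAnyOrder(1, 2, 3);
    p.run();
  }
}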
* Make user experience across runners a priority

Today all runners not only behave in different ways, but the way users publish and package their applications also differs. Of course this is not a trivial problem because deployment is normally an end-user problem, but we can improve in this area, e.g. by guaranteeing a consistent deployment mechanism across runners and making IO integration easier; for example, when using multiple IOs and switching runners it is easy to run into conflicts, and we should try to minimize this for end users.
* Simplify operational tasks among runners

We need to add a minimum degree of consistent observability across runners. Of course Beam has metrics to do this, but it is not enough; an end user who starts on one runner and moves to another has to deal with a totally different set of logs and operational issues. We can try to improve this too, of course without trying to cover the full spectrum, but at least bringing some minimum set of observability. I hope the current work on portability will bring some improvements in this area. This is crucial for users, who probably spend more time running (and dealing with issues in) their jobs than writing them.
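As a baseline, the runner-agnostic surface that exists today is the Metrics API; a short sketch of what user code can rely on across runners (the DoFn and metric names are illustrative), with everything beyond that still runner-specific:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: these metrics are defined once in user code, but how each
// runner surfaces them operationally (logs, UIs, monitoring systems) differs.
public class ParseFn extends DoFn<String, String> {
  private final Counter malformed = Metrics.counter(ParseFn.class, "malformed-records");
  private final Distribution recordSize = Metrics.distribution(ParseFn.class, "record-size-bytes");

  @ProcessElement
  public void processElement(ProcessContext c) {
    String record = c.element();
    recordSize.update(record.length());
    if (record.isEmpty()) {
      malformed.inc();
      return;
    }
    c.output(record);
  }
}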
We also need integration tests that simulate common user scenarios and some distribution use cases. For example, probably the most common data store used for streaming is Kafka (at least in open source). We should have an IT that tests some common issues that can arise when you use Kafka: what happens if a Kafka broker goes down, does Beam continue to read without issues? What about a new leader election, do we continue to work as expected? Few projects have something like this, and it would send a clear message that Beam cares about reliability too.
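A sketch of the read side such an IT would exercise, using KafkaIO (broker addresses and topic name are illustrative; the interesting part, killing a broker or forcing a leader election and then checking for lost or duplicated records, would live in the harness around this pipeline):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaResilienceIT {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The test harness would disrupt the Kafka cluster while this runs and
    // then verify that every produced record shows up exactly once downstream
    // (e.g. by writing `records` somewhere the harness can check).
    PCollection<KV<String, String>> records =
        p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092,broker-2:9092")
            .withTopic("resilience-test")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());

    p.run().waitUntilFinish();
  }
}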
Apart from these, I think we also need to work on:

* Simpler APIs + user-friendly libraries

I want to add a big thanks to Jesse for his list of criteria that people look for when they choose a framework for data processing. The first point, 'Will this dramatically improve the problems I'm trying to solve?', is super important. Of course Beam has portability and a rich model as its biggest assets, but I have been consistently asked in conferences whether Beam has libraries for graph processing, CEP, machine learning, or a Scala API.

Of course we have had some progress with the recent addition of SQL, and hopefully schema-aware PCollections will help there too, but there is still some way to go. These libraries cannot be the core focus given Beam's portability goals, but they are sometimes what makes users decide whether to use a tool or not, so better to have them than not.
These are the most important issues from my point of view. My excuses for the long email, but this was the perfect moment to discuss these.

One extra point: I think we should write and agree on a concise roadmap and take a look at our progress on it at the middle and the end of the year, as other communities do.

Regards,
Ismaël
On Mon, Jan 15, 2018 at 7:49 PM, Jesse Anderson <je...@bigdatainstitute.io> wrote:
I think a focus on the runners is what's key to Beam's adoption. The runners are the foundation on which Beam sits. If the runners don't work properly, Beam won't work.

A focus on improved unit tests is a good start, but isn't all that's needed. Compatibility matrices will help you see how your runner of choice stacks up, but that requires too much knowledge of Beam's internals to be interpretable.

Imagine you're an (enterprise) architect looking at adopting Beam. What do you look at, or what do you look for, before going deeper? What would make you stick your neck out to adopt Beam? In my experience, there are several pass/fails along the way.
Here are a few of the common ones I've seen:

Will this dramatically improve the problems I'm trying to solve? (not writing APIs / a better programming model / Beam's better handling of windowing)

Can I get commercial support for Beam? (This is changing soon)

Are other people using Beam with the same configuration and use case as me? (e.g. I'm using Spark with Beam to process imagery. Are others doing this in production?)

Is there good documentation, and are there books on the subject? (Tyler's and others' book will improve this)

Can I get my team trained on this new technology? (I have Beam training and Google has some cursory training)
I think the one the community can improve on the most is the social proof of Beam. I've tried to do this (http://www.jesse-anderson.com/2017/06/beam-2-0-q-and-a/ and http://www.jesse-anderson.com/2016/07/question-and-answers-with-the-apache-beam-team/). We need to get the message out more about people using Beam in production, which configuration they have, and what their results were. I think we have the social proof on Dataflow, but not as much on Spark/Flink/Apex.
I think it's important to note that these checks don't look at the hardcore language or API semantics that we're working on. These are much later stage issues, if they're ever used at all.

In my experience with other open source adoption at enterprises, it starts with architects and works its way around the organization from there.

Thanks,
Jesse
On Mon, Jan 15, 2018 at 8:14 AM Ted Yu <yuzhih...@gmail.com> wrote:
bq. are hard to detect in our unit-test framework

Looks like more integration tests would help discover bugs / regressions more quickly. If the committer reviewing a PR has a concern in this regard, the concern should be stated on the PR so that the contributor (and reviewer) can spend more time solidifying the solution.

bq. I've gone and fixed these issues myself when merging

We can make the checkstyle rules stricter so that code wouldn't pass the build without addressing commonly known issues.

Cheers
On Sun, Jan 14, 2018 at 12:37 PM, Reuven Lax <re...@google.com> wrote:
I agree with the sentiment, but I don't completely agree with the criteria.

I think we need to be much better about reviewing PRs. Some PRs languish for too long before the reviewer gets to them (and I've been guilty of this too), which does not send a good message. Also, new PRs sometimes languish because there is no reviewer assigned; maybe we could write a gitbot to automatically assign a reviewer to every new PR?
Also, I think that the bar for merging a PR from a contributor should not be "the PR is perfect." It's perfectly fine to merge a PR that still has some issues (especially if the issues are stylistic). In the past when I've done this, I've gone and fixed these issues myself when merging. It was a bit more work for me to fix these things myself, but it was a small price to pay in order to portray Beam as a welcoming place for contributions.
On the other hand, "the build does not break" is - in my opinion - too weak of a criterion for merging. A few reasons for this:

* Beam is a data-processing framework, and data integrity is paramount. If a reviewer sees an issue that could lead to data loss (or duplication, or corruption), I don't think that PR should be merged. Historically many such issues only actually manifest at scale, and are hard to detect in our unit-test framework. (We also need to invest in more at-scale tests to catch such issues.)

* Beam guarantees backwards compatibility for users (except across major versions). If a bad API gets merged and released (and the chance of "forgetting" about it before the release is cut is unfortunately high), we are stuck with it. This is less of an issue for many other open-source projects that do not make such a compatibility guarantee, as they are able to simply remove or fix the API in the next version.
I think we still need honest review of PRs, with the criteria being stronger than "the build doesn't break." However, reviewers also need to be reasonable about what they ask for.

Reuven
On Sun, Jan 14, 2018 at 11:19 AM, Ted Yu <yuzhih...@gmail.com> wrote:
bq. if a PR is basically right (it does what it should) without breaking the build, then it has to be merged fast

+1 on the above. This would give contributors positive feedback.
On Sun, Jan 14, 2018 at 8:13 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Davor,

Thanks a lot for this e-mail.

I would like to emphasize two areas where we have to improve:

1. Apache way and community. We still have to focus on and be dedicated to our communities (both user & dev). Helping, encouraging, and growing our communities is key for the project. Building bridges between communities is also very important. We have to be more "accessible": sometimes simplifying our discussions and showing more interest and openness toward proposals would help as well. I think we do a good job already: we just have to improve.
2. Execution: a successful project is a project with regular activity in terms of releases, fixes, and improvements.

Regarding PRs, I think today we have PRs that stay open for a long time, and I think for three reasons:
- some are not ready or not good enough; no question about these ones
- some need a reviewer and a speed-up: we have to keep an eye on the open PRs and review them asap
- some are under review but with a lot of "ping pong" and long discussion, not always justified.

I already said this on the mailing list but, as in other Apache projects, if a PR is basically right (it does what it should) without breaking the build, then it has to be merged fast. If it requires additional changes (tests, polishing, improvements, ...), they can be addressed in new PRs.
As already mentioned in the Beam 2.3.0 thread, we have to adopt a regular schedule for releases. It's a best effort to have a release every 2 months, whatever the release contains. That's essential to maintain good activity in the project and for the third-party projects using Beam.

Again, don't get me wrong: we already do a good job! These are just areas where I think we have to improve.

Anyway, thanks for all the hard work we are doing all together!

Regards
JB
On 13/01/2018 05:12, Davor Bonaci wrote:

Hi everyone --

Apache Beam was established as a top-level project a year ago (on December 21, to be exact). This first anniversary is a great opportunity for us to look back at the past year, celebrate its successes, learn from any mistakes we have made, and plan for the next 1+ years.

I’d like to invite everyone in the community, particularly users and observers on this mailing list, to participate in this discussion. Apache Beam is your project and I, for one, would much appreciate your candid thoughts and comments. Just as some other projects do, I’d like to make this “state of the project” discussion an annual tradition in this community.
In terms of successes, the availability of the first stable release, version 2.0.0, was the biggest and most important milestone last year. Additionally, we have expanded the project’s breadth with new components, including several new runners, SDKs, and DSLs, and interconnected a large number of storage/messaging systems with new Beam IOs. In terms of community growth, crossing 200 lifetime individual contributors and achieving 76 contributors to a single release were other highlights. We have doubled the number of committers, and invited a handful of new PMC members. Thanks to each and every one of you for making all of this possible in our first year.
On the other hand, in such a young project as Beam, there are naturally many areas for improvement. This is the principal purpose of this thread (and any of its forks). To organize the separate discussions, I’d suggest forking separate threads for different discussion areas:

* Culture and governance (anything related to people and their behavior)
* Community growth (what can we do to further grow a diverse and vibrant community)
* Technical execution (anything related to releases, their frequency, website, infrastructure)
* Feature roadmap for 2018 (what can we do to make the project more attractive to users, Beam 3.0, etc.).
I know many passionate folks who particularly care about each of these areas, but let me call on some folks from the community to get things started: Ismael for culture, Gris for community, JB for technical execution, and Ben for feature roadmap.

Perhaps we can use this thread to discuss project-wide vision. To seed that discussion, I’d start somewhat provocatively -- we aren’t doing so well on the diversity of users across runners, which is very important to the realization of the project’s vision. Would you agree, and would you be willing to make it the project’s #1 priority for the next 1-2 years?

Thanks -- and please join us in what would hopefully be a productive and informative discussion that shapes the future of this project!

Davor