Thanks Davor for bringing this discussion up!

I particularly like that you listed the different areas of improvement and proposed assigning people based on their interests.

I wanted to add a point about consistency across runners, but from the dev point of view: I've been working on a cross-runner feature lately (runner-agnostic metrics push) for which I compared the behavior of the runners and wired the feature into the Flink and Spark runners themselves. I must admit that I had a hard time figuring out how to wire it up, and that the wiring was completely different between the runners. Their use (or non-use) of runners-core facilities also varies. Even the architecture of the tests differs: some runners, like Spark, own their ValidatesRunner tests in the runner module, while others run ValidatesRunner tests owned by the sdk-core module. I also noticed differences in the way streaming tests are done: some runners need an equivalent of the direct runner's TestStream in the pipeline to trigger streaming mode, while for others setting streaming=true in the PipelineOptions is enough.
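To make the difference concrete, here is a minimal sketch of the two styles using the Java SDK (the pipeline contents are made up for illustration; only TestStream and the streaming flag are the actual mechanisms):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.testing.TestStream;

public class StreamingModeExamples {

  // Style 1: the pipeline carries an unbounded source (TestStream) and the
  // runner infers streaming mode from the shape of the pipeline.
  static Pipeline withTestStream() {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.apply(
        TestStream.create(StringUtf8Coder.of())
            .addElements("a", "b")
            .advanceWatermarkToInfinity());
    return p;
  }

  // Style 2: the pipeline itself is unchanged; streaming is requested purely
  // through the pipeline options flag.
  static Pipeline withStreamingFlag() {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs("--streaming=true").as(StreamingOptions.class);
    return Pipeline.create(options);
  }
}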

=> Long story short, IMHO it could be interesting to enhance the runner API to contain more than run(). This would help increase coherence between runners. That said, we would need to find the right balance between too many methods in the runner API (which would reduce the flexibility of runner implementations) and too few (which would reduce coherence between the runners).
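For illustration only, a hypothetical extension of the runner contract could look like the sketch below. Today's PipelineRunner exposes only run(Pipeline); the extra methods here are invented for this discussion and are not a proposal for actual names or signatures:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricResults;

// Hypothetical sketch, not the current PipelineRunner API.
public interface ExtendedRunner {

  // Same contract as today.
  PipelineResult run(Pipeline pipeline);

  // Let the runner reject unsupported constructs before submission.
  void validate(Pipeline pipeline);

  // A uniform, runner-agnostic hook for exposing metrics.
  MetricResults metrics(PipelineResult result);

  // Explicit, consistent cancellation/termination semantics.
  void cancel(PipelineResult result) throws Exception;
}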

=> In addition, to improve coherence between the runners (from the dev point of view), having all the runners run the exact same ValidatesRunner tests in both batch and streaming modes would be awesome!

Another thing: a big +1 to having a programmatic way of defining the capability matrix, as Romain suggested.

I also agree with Ismaël's point about overly flexible concepts across runners (termination, bundling, ...).

Also a big +1 to what Jesse wrote. I was in the user/architect position myself in the past, and I can confirm that all the points he mentioned are accurate.

Best,

Etienne


On 16/01/2018 at 17:39, Ismaël Mejía wrote:
Thanks Davor for opening this discussion, and a HUGE +1 to doing this every
year or in cycles. I will fork this thread into a new one for the
Culture / Project management issues, as suggested.

On the subject of the diversity of users across runners, I think this
requires more attention to unification and implies work in at least
the following areas:

* Automated validation and consistent semantics among runners

Users should be confident that moving their code from one runner to
another just works, and the only way to ensure this is to have a
runner pass the ValidatesRunner/TCK tests and thereby 'graduate' such
support, as Romain suggested. The capability matrix is really nice but
it is not a programmatic way to do this. Also, individual features
usually work while feature combinations produce issues, so we need
more exact semantics to avoid these.
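For reference, a ValidatesRunner test in the Java SDK looks roughly like the minimal sketch below (the actual suites are far more exhaustive; the class and assertion here are made up):

import java.io.Serializable;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.ValidatesRunner;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class SimpleValidatesRunnerTest implements Serializable {

  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  @Category(ValidatesRunner.class)
  public void testGlobalCount() {
    PCollection<Long> count =
        p.apply(Create.of("a", "b", "c")).apply(Count.<String>globally());

    // The same assertion must hold on every runner that claims this feature.
    PAssert.that(count).containsInAnyOrder(3L);

    p.run().waitUntilFinish();
  }
}

Having every runner execute exactly this kind of suite, in both batch and streaming, is what would make the 'graduation' verifiable.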

Some parts of Beam's semantics are loose (e.g. bundle partitioning,
pipeline termination, etc.). I suppose this has been a design decision
to allow flexibility in runner implementations, but it becomes
inconvenient when users move among runners and get different results.
I am not sure the current tradeoff is worth the usability sacrifice
for the end user.

* Make user experience across runners a priority

Today, not only do all runners behave in different ways, but the way
users publish and package their applications also differs. Of course
this is not a trivial problem, because deployment is normally an
end-user concern, but we can improve in this area, e.g. by
guaranteeing a consistent deployment mechanism across runners and by
making IO integration easier. For example, when using multiple IOs and
switching runners it is easy to run into conflicts; we should try to
minimize this for end users.

* Simplify operational tasks among runners

We need to add a minimum degree of consistent observability across
runners. Of course Beam has metrics for this, but it is not enough: an
end user who starts on one runner and moves to another has to deal
with a totally different set of logs and operational issues. We can
try to improve this too, without attempting to cover the full
spectrum, but at least bringing a minimum set of observability. I
hope that the current work on portability will bring some improvements
in this area. This is crucial for users, who probably spend more time
running (and dealing with) issues in their jobs than writing them.
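The piece that is already consistent today is the declaration side of the Metrics API; what differs is how each runner surfaces the values operationally. A minimal sketch (the DoFn and counter names are made up):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// The metric declaration is identical on every runner; only the way the
// runner exposes the resulting values differs today.
class CountMalformedRecords extends DoFn<String, String> {

  private final Counter malformed =
      Metrics.counter(CountMalformedRecords.class, "malformed-records");

  @ProcessElement
  public void processElement(ProcessContext c) {
    String record = c.element();
    if (record.isEmpty()) {
      malformed.inc(); // Reported through whatever backend the runner provides.
      return;
    }
    c.output(record);
  }
}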

We need integration tests that simulate common user scenarios and some
distributed failure cases. Probably the most common data store used
for streaming is Kafka (at least in open source). We should have an IT
that tests the common issues that can arise when you use Kafka: what
happens if a Kafka broker goes down, does Beam continue to read
without issue? What about a new leader election, do we continue to
work as expected? Etc. Few projects have something like this, and it
would send a clear message that Beam cares about reliability too.
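As an example of scope, the Beam side of such a test can stay very small; the interesting part is the external harness that kills a broker mid-run. A sketch (KafkaIO is the real connector; the brokers, topic, and fault-injection harness are assumptions of this example):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaResilienceIT {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The pipeline under test: a plain unbounded read from Kafka.
    p.apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("broker-1:9092,broker-2:9092,broker-3:9092")
            .withTopic("resilience-test") // hypothetical topic
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());

    // An external harness (not shown) kills broker-1 mid-run and then checks
    // that the pipeline keeps consuming after the leader re-election.
    p.run().waitUntilFinish();
  }
}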

Apart from these, I think we also need to work on:

* Simpler APIs + user-friendly libraries

I want to add a big thanks to Jesse for his list of criteria that
people use when they choose a framework for data processing. The
first point, 'Will this dramatically improve the problems I'm trying
to solve?', is super important. Of course Beam has portability and a
rich model as its biggest assets, but I have been consistently asked
at conferences whether Beam has libraries for graph processing, CEP,
machine learning, or a Scala API.

Of course we have had some progress with the recent addition of SQL,
and hopefully schema-aware PCollections will help there too, but there
is still some way to go. This may not be crucial given Beam's
portability goals, but these libraries are sometimes what makes users
decide whether or not to adopt a tool, so better to have them than
not.

These are the most important issues from my point of view. My
apologies for the long email, but this was the perfect moment to
discuss them.

One extra point: I think we should write and agree on a concise
roadmap and review our progress against it in the middle and at the
end of the year, as other communities do.

Regards,
Ismaël

On Mon, Jan 15, 2018 at 7:49 PM, Jesse Anderson
<je...@bigdatainstitute.io> wrote:
I think a focus on the runners is what's key to Beam's adoption. The runners
are the foundation on which Beam sits. If the runners don't work properly,
Beam won't work.

A focus on improved unit tests is a good start, but it isn't what's really
needed. Compatibility matrices will help you see how your runner of choice
stacks up, but they require too much knowledge of Beam's internals to be
interpretable.

Imagine you're an (enterprise) architect looking at adopting Beam. What do
you look at, or what do you look for, before going deeper? What would make
you stick your neck out to adopt Beam? In my experience, there are several
pass/fail checks along the way.

Here are a few of the common ones I've seen:

* Will this dramatically improve the problems I'm trying to solve? (not
writing APIs / a better programming model / Beam's better handling of
windowing)
* Can I get commercial support for Beam? (This is changing soon.)
* Are other people using Beam with the same configuration and use case as
me? (e.g. I'm using Spark with Beam to process imagery. Are others doing
this in production?)
* Is there good documentation and books on the subject? (Tyler's and
others' book will improve this.)
* Can I get my team trained on this new technology? (I have Beam training
and Google has some cursory training.)

I think the one the community can improve on the most is the social proof of
Beam. I've tried to do this
(http://www.jesse-anderson.com/2017/06/beam-2-0-q-and-a/ and
http://www.jesse-anderson.com/2016/07/question-and-answers-with-the-apache-beam-team/).
We need to get the message out more about people using Beam in production,
which configuration they have, and what their results were. I think we have
the social proof on Dataflow, but not as much on Spark/Flink/Apex.

I think it's important to note that these checks don't look at the hardcore
language or API semantics that we're working on. These are much later stage
issues, if they're ever used at all.

In my experience with other open source adoption at enterprises, it starts
with architects and works its way around the organization from there.

Thanks,

Jesse

On Mon, Jan 15, 2018 at 8:14 AM Ted Yu <yuzhih...@gmail.com> wrote:
bq. are hard to detect in our unit-test framework

Looks like more integration tests would help discover bugs / regressions
more quickly. If the committer reviewing a PR has a concern in this regard,
the concern should be stated on the PR so that the contributor (and
reviewer) can spend more time solidifying the solution.

bq. I've gone and fixed these issues myself when merging

We can adopt stricter checkstyle rules so that the code wouldn't pass the
build without addressing commonly known issues.

Cheers

On Sun, Jan 14, 2018 at 12:37 PM, Reuven Lax <re...@google.com> wrote:
I agree with the sentiment, but I don't completely agree with the
criteria.

I think we need to be much better about reviewing PRs. Some PRs languish
for too long before the reviewer gets to them (and I've been guilty of this
too), which does not send a good message. Also, new PRs sometimes languish
because there is no reviewer assigned; maybe we could write a gitbot to
automatically assign a reviewer to every new PR?

Also, I think that the bar for merging a PR from a contributor should not
be "the PR is perfect." It's perfectly fine to merge a PR that still has
some issues (especially if the issues are stylistic). In the past when I've
done this, I've gone and fixed these issues myself when merging. It was a
bit more work for me to fix these things myself, but it was a small price to
pay in order to portray Beam as a welcoming place for contributions.

On the other hand, "the build does not break" is - in my opinion - too
weak of a criterion for merging. A few reasons for this:

   * Beam is a data-processing framework, and data integrity is paramount.
If a reviewer sees an issue that could lead to data loss (or duplication, or
corruption), I don't think that PR should be merged. Historically many such
issues only actually manifest at scale, and are hard to detect in our
unit-test framework. (we also need to invest in more at-scale tests to catch
such issues).

   * Beam guarantees backwards compatibility for users (except across
major versions). If a bad API gets merged and released (and the chances of
"forgetting" about it before the release is cut is unfortunately high), we
are stuck with it. This is less of an issue for many other open-source
projects that do not make such a compatibility guarantee, as they are able
to simply remove or fix the API in the next version.

I think we still need honest review of PRs, with the criteria being
stronger than "the build doesn't break." However reviewers also need to be
reasonable about what they ask for.

Reuven

On Sun, Jan 14, 2018 at 11:19 AM, Ted Yu <yuzhih...@gmail.com> wrote:
bq. if a PR is basically right (it does what it should) without breaking
the build, then it has to be merged fast

+1 on above.
This would give contributors positive feedback.

On Sun, Jan 14, 2018 at 8:13 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
Hi Davor,

Thanks a lot for this e-mail.

I would like to emphasize two areas where we have to improve:

1. Apache way and community. We still have to focus on and be dedicated
to our communities (both user & dev). Helping, encouraging, and growing our
communities is key for the project. Building bridges between communities is
also very important. We have to be more "accessible": sometimes simplifying
our discussions and showing more interest and open-mindedness toward
proposals would help as well. I think we already do a good job: we just
have to improve.

2. Execution: a successful project is a project with regular activity
in terms of releases, fixes, and improvements.
Regarding PRs, I think today we have PRs that stay open for too long, and I
think for three reasons:
- some are not ready or not good enough; no question on these ones
- some need a reviewer and a speed-up: we have to keep an eye on the open
PRs and review them asap
- some are under review but with a lot of "ping pong" and long
discussion, not always justified. I already said this on the mailing list
but, as in other Apache projects, if a PR is basically right (it does what
it should) without breaking the build, then it has to be merged fast. If it
requires additional changes (tests, polishing, improvements, ...), they
can be addressed in new PRs.
As already mentioned in the Beam 2.3.0 thread, we have to adopt a
regular release schedule. It should be a best effort to have a release every
2 months, whatever the release contains. That's essential to maintain
good activity in the project and for the third-party projects using Beam.

Again, don't get me wrong: we already do a good job! These are just areas
where I think we have to improve.

Anyway, thanks for all the hard work we are doing together!

Regards
JB


On 13/01/2018 05:12, Davor Bonaci wrote:
Hi everyone --
Apache Beam was established as a top-level project a year ago (on
December 21, to be exact). This first anniversary is a great opportunity for
us to look back at the past year, celebrate its successes, learn from any
mistakes we have made, and plan for the next 1+ years.

I’d like to invite everyone in the community, particularly users and
observers on this mailing list, to participate in this discussion. Apache
Beam is your project and I, for one, would much appreciate your candid
thoughts and comments. Just as some other projects do, I’d like to make this
“state of the project” discussion an annual tradition in this community.

In terms of successes, the availability of the first stable release,
version 2.0.0, was the biggest and most important milestone last year.
Additionally, we have expanded the project’s breadth with new components,
including several new runners, SDKs, and DSLs, and interconnected a large
number of storage/messaging systems with new Beam IOs. In terms of community
growth, crossing 200 lifetime individual contributors and achieving 76
contributors to a single release were other highlights. We have doubled the
number of committers, and invited a handful of new PMC members. Thanks to
each and every one of you for making all of this possible in our first year.

On the other hand, in such a young project as Beam, there are
naturally many areas for improvement. This is the principal purpose of this
thread (and any of its forks). To organize the separate discussions, I'd
suggest forking separate threads for different discussion areas:
* Culture and governance (anything related to people and their
behavior)
* Community growth (what can we do to further grow a diverse and
vibrant community)
* Technical execution (anything related to releases, their frequency,
website, infrastructure)
* Feature roadmap for 2018 (what can we do to make the project more
attractive to users, Beam 3.0, etc.).

I know many passionate folks who particularly care about each of these
areas, but let me call on some folks from the community to get things
started: Ismael for culture, Gris for community, JB for technical execution,
and Ben for feature roadmap.

Perhaps we can use this thread to discuss project-wide vision. To seed
that discussion, I’d start somewhat provocatively -- we aren’t doing so well
on the diversity of users across runners, which is very important to the
realization of the project’s vision. Would you agree, and would you be
willing to make it the project’s #1 priority for the next 1-2 years?

Thanks -- and please join us in what would hopefully be a productive
and informative discussion that shapes the future of this project!

Davor

