Thanks Davor for opening this discussion, and a HUGE +1 to doing this every year or in cycles. I will fork this thread into a new one for the Culture / Project management issues, as suggested.
About the diversity of users across runners: I think this requires more attention to unification, and implies work in at least three areas:

* Automated validation and consistent semantics among runners. Users should be confident that moving their code from one runner to another just works, and the only way to ensure this is by having a runner pass the ValidatesRunner/TCK tests and with this 'graduate' such support, as Romain suggested. The capability-matrix is really nice, but it is not a programmatic way to do this. Also, individual features usually work in isolation, but feature combinations produce issues, so we need more exact semantics to avoid these. Some parts of Beam's semantics are loose (e.g. bundle partitioning, pipeline termination, etc.); I suppose this was a design decision to allow flexibility in runner implementations, but it becomes inconvenient when users move among runners and get different results. I am not sure the current tradeoff is worth the usability sacrifice for the end user.

* Make user experience across runners a priority. Today runners not only behave in different ways, but the way users publish and package their applications also differs. Of course this is not a trivial problem, because deployment is normally an end-user concern, but we can improve in this area, e.g. by guaranteeing a consistent deployment mechanism across runners and by making IO integration easier. For example, when using multiple IOs and switching runners it is easy to run into conflicts; we should try to minimize this for end users.

* Simplify operational tasks among runners. We need a minimum degree of consistent observability across runners. Of course Beam has metrics for this, but it is not enough: an end user who starts on one runner and moves to another has to deal with a totally different set of logs and operational issues.
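To illustrate the "programmatic capability matrix" point above, here is a minimal sketch in plain Python. The runner names, capability entries, and graduation threshold are hypothetical placeholders, not Beam's actual capability-matrix format; the point is only that a matrix expressed as data can be turned into an executable pass/fail check instead of a static web page:

```python
# Hypothetical sketch: a capability matrix as data that tests can assert on.
# Runner names and capabilities below are illustrative, not Beam's real matrix.

CAPABILITY_MATRIX = {
    "DirectRunner": {"stateful_pardo": True, "timers": True, "metrics": True},
    "SomeRunner": {"stateful_pardo": True, "timers": False, "metrics": True},
}

# Capabilities a runner must support before we 'graduate' its support.
REQUIRED_FOR_GRADUATION = ["stateful_pardo", "timers", "metrics"]

def graduated(runner: str) -> bool:
    """A runner 'graduates' only if it supports every required capability."""
    caps = CAPABILITY_MATRIX.get(runner, {})
    return all(caps.get(c, False) for c in REQUIRED_FOR_GRADUATION)

if __name__ == "__main__":
    for runner in CAPABILITY_MATRIX:
        status = "graduated" if graduated(runner) else "NOT graduated"
        print(f"{runner}: {status}")
```

A CI job could then fail whenever a runner claims graduated support in the documentation but the check (backed by the actual ValidatesRunner results) disagrees.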
We can try to improve this too, of course without trying to cover the full spectrum, but at least bringing some minimum set of observability. I hope the current work on portability will bring some improvements in this area. This is crucial for users, who probably spend more time running their jobs (and dealing with issues in them) than writing them.

We also need integration tests that simulate common user scenarios and distributed failure cases. For example, probably the most common data store used for streaming is Kafka (at least in open source). We should have an IT that exercises common issues that can arise when you use Kafka: if a Kafka broker goes down, does Beam continue to read without issue? After a new leader election, do we continue to work as expected? Few projects have something like this, and it would send a clear message that Beam cares about reliability too.

Apart from these, I think we also need to work on:

* Simpler APIs + user-friendly libraries. A big thanks to Jesse for his list of criteria that people consider when choosing a framework for data processing. His first point, 'Will this dramatically improve the problems I'm trying to solve?', is super important. Of course Beam has portability and a rich model as its biggest assets, but I have been consistently asked at conferences whether Beam has libraries for graph processing, CEP, machine learning, or a Scala API. We have made some progress with the recent addition of SQL, and hopefully schema-aware PCollections will help there too, but there is still some way to go. These libraries cannot be a top priority given Beam's portability goals, but they are sometimes what makes users decide whether to use a tool at all, so better to have them than not.

These are the most important issues from my point of view. My excuses for the long email, but this was the perfect moment to discuss these.
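The Kafka resilience IT mentioned above could assert behavior along the lines of this toy sketch. This is plain stdlib Python, not Beam's KafkaIO or a real Kafka client; the `FlakyBroker` and `read_all` names are invented for illustration. It only shows the property such a test would check: after a simulated leader election, the reader resumes from the same offset with no lost or duplicated records:

```python
# Toy stand-in for the failover scenario a real Kafka IT would exercise.
# This is NOT Beam's KafkaIO; all names here are hypothetical.

class FlakyBroker:
    """Simulates a broker that is briefly unavailable during a leader election."""

    def __init__(self, records, fail_at):
        self.records = records
        self.fail_at = fail_at      # offset at which the 'election' happens
        self.failed_once = False

    def read(self, offset):
        if offset == self.fail_at and not self.failed_once:
            self.failed_once = True
            raise ConnectionError("leader election in progress")
        if offset < len(self.records):
            return self.records[offset]
        return None                 # end of topic


def read_all(broker, max_retries=3):
    """Reads to the end of the topic, retrying the same offset on failure,
    so no record is lost or duplicated across the simulated failover."""
    out, offset, retries = [], 0, 0
    while True:
        try:
            rec = broker.read(offset)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue                # retry the same offset after 'reconnecting'
        if rec is None:
            return out
        out.append(rec)
        offset += 1
```

A real IT would do the same thing against an actual Kafka cluster (e.g. killing a broker container mid-pipeline) and assert that the pipeline's output matches the input exactly.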
One extra point: I think we should write and agree on a concise roadmap, and review our progress against it at the middle and the end of the year, as other communities do.

Regards,
Ismaël

On Mon, Jan 15, 2018 at 7:49 PM, Jesse Anderson <je...@bigdatainstitute.io> wrote:
> I think a focus on the runners is what's key to Beam's adoption. The runners
> are the foundation on which Beam sits. If the runners don't work properly,
> Beam won't work.
>
> A focus on improved unit tests is a good start, but isn't what's needed.
> Compatibility matrices will help see how your runner of choice stacks up,
> but that requires too much knowledge of Beam's internals to be
> interpretable.
>
> Imagine you're an (enterprise) architect looking at adopting Beam. What do
> you look at or what do you look for before going deeper? What would make you
> stick your neck out to adopt Beam? In my experience, there are several
> pass/fails along the way.
>
> Here are a few of the common ones I've seen:
>
> Will this dramatically improve the problems I'm trying to solve? (not
> writing APIs/better programming model/Beam's better handling of windowing)
> Can I get commercial support for Beam? (This is changing soon)
> Are other people using Beam with the same configuration and use case as me?
> (e.g. I'm using Spark with Beam to process imagery. Are others doing this
> in production?)
> Is there good documentation and books on the subject? (Tyler's and others'
> book will improve this)
> Can I get my team trained on this new technology? (I have Beam training and
> Google has some cursory training)
>
> I think the one the community can improve on the most is the social proof of
> Beam. I've tried to do this
> (http://www.jesse-anderson.com/2017/06/beam-2-0-q-and-a/ and
> http://www.jesse-anderson.com/2016/07/question-and-answers-with-the-apache-beam-team/).
> We need to get the message out more about people using Beam in production,
> which configuration they have, and what their results were.
> I think we have the social proof on Dataflow, but not as much on
> Spark/Flink/Apex.
>
> I think it's important to note that these checks don't look at the hardcore
> language or API semantics that we're working on. These are much later stage
> issues, if they're ever used at all.
>
> In my experience with other open source adoption at enterprises, it starts
> with architects and works its way around the organization from there.
>
> Thanks,
>
> Jesse
>
> On Mon, Jan 15, 2018 at 8:14 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> bq. are hard to detect in our unit-test framework
>>
>> Looks like more integration tests would help discover bugs / regressions
>> more quickly. If the committer reviewing the PR has a concern in this
>> regard, the concern should be stated on the PR so that the contributor
>> (and reviewer) can spend more time solidifying the solution.
>>
>> bq. I've gone and fixed these issues myself when merging
>>
>> We can make stricter checkstyle rules so that the code wouldn't pass the
>> build without addressing commonly known issues.
>>
>> Cheers
>>
>> On Sun, Jan 14, 2018 at 12:37 PM, Reuven Lax <re...@google.com> wrote:
>>>
>>> I agree with the sentiment, but I don't completely agree with the
>>> criteria.
>>>
>>> I think we need to be much better about reviewing PRs. Some PRs languish
>>> for too long before the reviewer gets to them (and I've been guilty of
>>> this too), which does not send a good message. Also, new PRs sometimes
>>> languish because there is no reviewer assigned; maybe we could write a
>>> gitbot to automatically assign a reviewer to every new PR?
>>>
>>> Also, I think that the bar for merging a PR from a contributor should not
>>> be "the PR is perfect." It's perfectly fine to merge a PR that still has
>>> some issues (especially if the issues are stylistic). In the past when
>>> I've done this, I've gone and fixed these issues myself when merging.
>>> It was a bit more work for me to fix these things myself, but it was a
>>> small price to pay in order to portray Beam as a welcoming place for
>>> contributions.
>>>
>>> On the other hand, "the build does not break" is - in my opinion - too
>>> weak of a criterion for merging. A few reasons for this:
>>>
>>> * Beam is a data-processing framework, and data integrity is paramount.
>>> If a reviewer sees an issue that could lead to data loss (or duplication,
>>> or corruption), I don't think that PR should be merged. Historically many
>>> such issues only actually manifest at scale, and are hard to detect in
>>> our unit-test framework. (we also need to invest in more at-scale tests
>>> to catch such issues).
>>>
>>> * Beam guarantees backwards compatibility for users (except across
>>> major versions). If a bad API gets merged and released (and the chances
>>> of "forgetting" about it before the release is cut are unfortunately
>>> high), we are stuck with it. This is less of an issue for many other
>>> open-source projects that do not make such a compatibility guarantee, as
>>> they are able to simply remove or fix the API in the next version.
>>>
>>> I think we still need honest review of PRs, with the criteria being
>>> stronger than "the build doesn't break." However reviewers also need to
>>> be reasonable about what they ask for.
>>>
>>> Reuven
>>>
>>> On Sun, Jan 14, 2018 at 11:19 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>> bq. if a PR is basically right (it does what it should) without breaking
>>>> the build, then it has to be merged fast
>>>>
>>>> +1 on above.
>>>> This would give contributors positive feedback.
>>>>
>>>> On Sun, Jan 14, 2018 at 8:13 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>>
>>>>> Hi Davor,
>>>>>
>>>>> Thanks a lot for this e-mail.
>>>>>
>>>>> I would like to emphasize two areas where we have to improve:
>>>>>
>>>>> 1. Apache way and community.
>>>>> We still have to stay focused on and dedicated to our communities
>>>>> (both user & dev). Helping, encouraging, and growing our communities
>>>>> is key for the project. Building bridges between communities is also
>>>>> very important. We have to be more "accessible": sometimes simplifying
>>>>> our discussions and showing more interest and an open mind toward
>>>>> proposals would help as well. I think we do a good job already: we
>>>>> just have to improve.
>>>>>
>>>>> 2. Execution: a successful project is a project with regular activity
>>>>> in terms of releases, fixes, and improvements.
>>>>> Regarding PRs, I think today we have PRs that stay open for a long
>>>>> time, for three reasons:
>>>>> - some are not ready or not good enough; no question on these ones
>>>>> - some need a reviewer and a speed-up: we have to keep an eye on the
>>>>> open PRs and review them asap
>>>>> - some are under review but with a lot of "ping pong" and long
>>>>> discussion, not always justified. I already said this on the mailing
>>>>> list but, as for other Apache projects, if a PR is basically right (it
>>>>> does what it should) without breaking the build, then it has to be
>>>>> merged fast. If it requires additional changes (tests, polishing,
>>>>> improvements, ...), they can be addressed in new PRs.
>>>>> As already mentioned in the Beam 2.3.0 thread, we have to adopt a
>>>>> regular schedule for releases. It's a best effort to have a release
>>>>> every 2 months, whatever the release will contain. That's essential to
>>>>> maintain good activity in the project and for the third-party projects
>>>>> using Beam.
>>>>>
>>>>> Again, don't get me wrong: we already do a good job! It's just an area
>>>>> where I think we have to improve.
>>>>>
>>>>> Anyway, thanks for all the hard work we are doing all together!
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 13/01/2018 05:12, Davor Bonaci wrote:
>>>>>>
>>>>>> Hi everyone --
>>>>>> Apache Beam was established as a top-level project a year ago (on
>>>>>> December 21, to be exact). This first anniversary is a great
>>>>>> opportunity for us to look back at the past year, celebrate its
>>>>>> successes, learn from any mistakes we have made, and plan for the
>>>>>> next 1+ years.
>>>>>>
>>>>>> I’d like to invite everyone in the community, particularly users and
>>>>>> observers on this mailing list, to participate in this discussion.
>>>>>> Apache Beam is your project and I, for one, would much appreciate
>>>>>> your candid thoughts and comments. Just as some other projects do,
>>>>>> I’d like to make this “state of the project” discussion an annual
>>>>>> tradition in this community.
>>>>>>
>>>>>> In terms of successes, the availability of the first stable release,
>>>>>> version 2.0.0, was the biggest and most important milestone last
>>>>>> year. Additionally, we have expanded the project’s breadth with new
>>>>>> components, including several new runners, SDKs, and DSLs, and
>>>>>> interconnected a large number of storage/messaging systems with new
>>>>>> Beam IOs. In terms of community growth, crossing 200 lifetime
>>>>>> individual contributors and achieving 76 contributors to a single
>>>>>> release were other highlights. We have doubled the number of
>>>>>> committers, and invited a handful of new PMC members. Thanks to each
>>>>>> and every one of you for making all of this possible in our first
>>>>>> year.
>>>>>>
>>>>>> On the other hand, in such a young project as Beam, there are
>>>>>> naturally many areas for improvement. This is the principal purpose
>>>>>> of this thread (and any of its forks).
>>>>>> To organize the separate discussions, I’d suggest forking separate
>>>>>> threads for different discussion areas:
>>>>>> * Culture and governance (anything related to people and their
>>>>>> behavior)
>>>>>> * Community growth (what can we do to further grow a diverse and
>>>>>> vibrant community)
>>>>>> * Technical execution (anything related to releases, their
>>>>>> frequency, website, infrastructure)
>>>>>> * Feature roadmap for 2018 (what can we do to make the project more
>>>>>> attractive to users, Beam 3.0, etc.)
>>>>>>
>>>>>> I know many passionate folks who particularly care about each of
>>>>>> these areas, but let me call on some folks from the community to get
>>>>>> things started: Ismael for culture, Gris for community, JB for
>>>>>> technical execution, and Ben for feature roadmap.
>>>>>>
>>>>>> Perhaps we can use this thread to discuss project-wide vision. To
>>>>>> seed that discussion, I’d start somewhat provocatively -- we aren’t
>>>>>> doing so well on the diversity of users across runners, which is
>>>>>> very important to the realization of the project’s vision. Would you
>>>>>> agree, and would you be willing to make it the project’s #1 priority
>>>>>> for the next 1-2 years?
>>>>>>
>>>>>> Thanks -- and please join us in what would hopefully be a productive
>>>>>> and informative discussion that shapes the future of this project!
>>>>>>
>>>>>> Davor