Etienne, regarding cross-runner coherence: the portability framework is
attempting to create APIs across all runners for job management and job
execution. A lot of work still needs to be done to define and implement
these APIs and to migrate runners and SDKs to support them, since the
current set of Java APIs is ad hoc in usage and purpose. In my opinion,
development should really be focused on migrating runners and SDKs to use
these APIs to get developer coherence. Work is slowly progressing on
integrating them into the Java, Python, and Go SDKs and there are several
JIRA issues in this regard, but involvement from more people would help.
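
To make this concrete, here is a minimal sketch of the kind of unified job
management surface the portability framework is driving toward. Every name
in it is hypothetical, invented for illustration; the real designs live in
the documents linked below:

    import java.util.Map;

    // Hypothetical sketch of a cross-runner job management API; the
    // actual Runner API / Fn API designs are in the linked docs.
    interface JobManager {
      // Submit a portable (language-agnostic) pipeline for execution.
      JobHandle submit(byte[] portablePipeline, Map<String, String> options);
    }

    interface JobHandle {
      enum JobState { RUNNING, DONE, FAILED, CANCELLED }
      JobState getState();   // uniform state reporting across runners
      void cancel();         // uniform job control across runners
    }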

Some helpful pointers are:
https://s.apache.org/beam-runner-api
https://s.apache.org/beam-fn-api
https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability

On Fri, Jan 26, 2018 at 7:21 AM, Etienne Chauchot <echauc...@apache.org>
wrote:

> Hi all,
>
> Does anyone have comments on my point about dev coherence across the
> runners?
>
> Thanks
> Etienne
>
>
> On 22/01/2018 at 16:16, Etienne Chauchot wrote:
>
>> Thanks Davor for bringing this discussion up!
>>
>> I particularly like that you listed the different areas of improvement
>> and proposed assigning people based on their interests.
>>
>> I wanted to add a point about consistency across runners, but from the
>> dev point of view: I've been working on a cross-runner feature lately
>> (runner-agnostic metrics push) for which I compared the behavior of the
>> runners and wired the feature into the Flink and Spark runners
>> themselves. I must admit that I had a hard time figuring out how to wire
>> it up, and that the wiring was completely different from one runner to
>> the next. Their use (or non-use) of runner-core facilities also varies.
>> The same goes for the architecture of the tests: some runners, like
>> Spark, own their ValidatesRunner tests in the runner module, while other
>> runners run ValidatesRunner tests owned by the sdk-core module. I also
>> noticed differences in the way streaming tests are done: for some
>> runners, triggering streaming mode requires using an equivalent of the
>> direct runner's TestStream in the pipeline, while for others setting
>> streaming=true in the pipeline options is enough.
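>>
>> As a minimal illustration of that divergence (assuming the current Java
>> SDK test classes), here are the two ways a test can end up in streaming
>> mode today:
>>
>>     import org.apache.beam.sdk.coders.StringUtf8Coder;
>>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>     import org.apache.beam.sdk.options.StreamingOptions;
>>     import org.apache.beam.sdk.testing.TestPipeline;
>>     import org.apache.beam.sdk.testing.TestStream;
>>
>>     TestPipeline pipeline = TestPipeline.create();
>>
>>     // Option 1: some runners need a TestStream source in the pipeline
>>     // to be driven into streaming mode.
>>     TestStream<String> events =
>>         TestStream.create(StringUtf8Coder.of())
>>             .addElements("a", "b")
>>             .advanceWatermarkToInfinity();
>>     pipeline.apply(events);
>>
>>     // Option 2: for other runners, flipping the streaming flag in the
>>     // pipeline options is enough.
>>     StreamingOptions options = PipelineOptionsFactory
>>         .fromArgs("--streaming=true").as(StreamingOptions.class);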
>>
>> => Long story short, IMHO it could be interesting to enhance the runner
>> API to contain more than run(). This would help increase the coherence
>> between runners. That said, we would need to find the right balance
>> between too many methods in the runner API, which would reduce the
>> flexibility of the runner implementations, and too few methods, which
>> would reduce the coherence between the runners.
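>>
>> Purely as a sketch of what I mean (all the extra method names are
>> invented for illustration, not an existing proposal):
>>
>>     import org.apache.beam.sdk.Pipeline;
>>     import org.apache.beam.sdk.PipelineResult;
>>
>>     // Hypothetical richer runner contract, for illustration only.
>>     interface RicherRunner {
>>       PipelineResult run(Pipeline pipeline);  // the one method we have today
>>       void validate(Pipeline pipeline);       // uniform pre-flight validation
>>       boolean supportsStreaming();            // programmatic capability probing
>>       void wireMetricsPush(String sinkUrl);   // one hook for runner-agnostic metrics push
>>     }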
>>
>> => In addition, to enhance coherence between the runners (from the dev
>> point of view), having all the runners run the exact same ValidatesRunner
>> tests in both batch and streaming modes would be awesome!
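>>
>> For reference, this is roughly the shape of a ValidatesRunner test today
>> (a minimal sketch using the Java SDK's testing utilities); the point is
>> that the same @Category is meant to run unchanged on every runner:
>>
>>     import org.apache.beam.sdk.testing.PAssert;
>>     import org.apache.beam.sdk.testing.TestPipeline;
>>     import org.apache.beam.sdk.testing.ValidatesRunner;
>>     import org.apache.beam.sdk.transforms.Create;
>>     import org.junit.Rule;
>>     import org.junit.Test;
>>     import org.junit.experimental.categories.Category;
>>
>>     public class RoundTripTest {
>>       @Rule public final transient TestPipeline p = TestPipeline.create();
>>
>>       @Test
>>       @Category(ValidatesRunner.class)
>>       public void elementsSurviveRoundTrip() {
>>         // The same assertion should hold on any runner that claims
>>         // support for this category.
>>         PAssert.that(p.apply(Create.of("a", "b"))).containsInAnyOrder("a", "b");
>>         p.run();
>>       }
>>     }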
>>
>> Another thing: a big +1 to having a programmatic way of defining the
>> capability matrix, as Romain suggested.
>>
>> Also agree with Ismaël's point about overly flexible concepts across
>> runners (termination, bundling, ...).
>>
>> Also a big +1 to what Jesse wrote. I was myself in the user/architect
>> position in the past, and I can confirm that all the points he mentioned
>> are accurate.
>>
>> Best,
>>
>> Etienne
>>
>>
>> On 16/01/2018 at 17:39, Ismaël Mejía wrote:
>>
>>> Thanks Davor for opening this discussion and a HUGE +1 to doing this
>>> every year or in cycles. I will fork this thread into a new one for the
>>> culture / project-management issues, as suggested.
>>>
>>> On the subject of the diversity of users across runners, I think this
>>> requires more attention to unification and implies work in at least the
>>> following areas:
>>>
>>> * Automated validation and consistent semantics among runners
>>>
>>> Users should be confident that moving their code from one runner to
>>> another just works, and the only way to ensure this is to have a runner
>>> pass the ValidatesRunner/TCK tests and with this 'graduate' its support,
>>> as Romain suggested. The capability matrix is really nice, but it is not
>>> a programmatic way to do this. Also, individual features usually do
>>> work, but feature combinations produce issues, so we need more exact
>>> semantics to avoid these.
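>>>
>>> A programmatic capability matrix could be as simple as the following
>>> sketch (all names invented for illustration): each runner declares the
>>> model features it supports, and the test harness or a user can query
>>> that instead of reading an HTML table.
>>>
>>>     import java.util.EnumSet;
>>>     import java.util.Set;
>>>
>>>     // Hypothetical programmatic capability declaration.
>>>     enum ModelFeature { STATEFUL_DOFN, TIMERS, SPLITTABLE_DOFN, METRICS, TEST_STREAM }
>>>
>>>     interface RunnerCapabilities {
>>>       Set<ModelFeature> supported();
>>>     }
>>>
>>>     class ExampleRunnerCapabilities implements RunnerCapabilities {
>>>       @Override
>>>       public Set<ModelFeature> supported() {
>>>         // A runner would 'graduate' a feature by adding it here,
>>>         // gated on passing the corresponding ValidatesRunner tests.
>>>         return EnumSet.of(ModelFeature.METRICS, ModelFeature.TIMERS);
>>>       }
>>>     }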
>>>
>>> Some parts of Beam's semantics are loose (e.g. bundle partitioning,
>>> pipeline termination, etc.). I suppose this was a design decision to
>>> allow flexibility in the runner implementations, but it becomes
>>> inconvenient when users move among runners and get different results.
>>> I am not sure the current tradeoff is worth the usability sacrifice for
>>> the end user.
>>>
>>> * Make user experience across runners a priority
>>>
>>> Today the runners not only behave in different ways, but the way users
>>> publish and package their applications also differs. Of course this is
>>> not a trivial problem because deployment is normally an end-user
>>> concern, but we can improve in this area, e.g. by guaranteeing a
>>> consistent deployment mechanism across runners and by making IO
>>> integration easier. For example, when using multiple IOs and switching
>>> runners it is easy to run into dependency conflicts; we should try to
>>> minimize this for end users.
>>>
>>> * Simplify operational tasks among runners
>>>
>>> We need to add a minimum degree of consistent observability across
>>> runners. Of course Beam has metrics for this, but it is not enough: an
>>> end user who starts on one runner and moves to another has to deal with
>>> a totally different set of logs and operational issues. We can try to
>>> improve this too, without attempting to cover the full spectrum, but at
>>> least bringing some minimum set of observability. I hope that the
>>> current work on portability will bring some improvements in this area.
>>> This is crucial for users, who probably spend more time running (and
>>> debugging) their jobs than writing them.
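>>>
>>> To be clear about what is and isn't consistent today: declaring a
>>> metric in user code is already uniform (this is the real Metrics API);
>>> what differs is where and how each runner surfaces the results and the
>>> logs around them. A minimal example:
>>>
>>>     import org.apache.beam.sdk.metrics.Counter;
>>>     import org.apache.beam.sdk.metrics.Metrics;
>>>     import org.apache.beam.sdk.transforms.DoFn;
>>>
>>>     class CountingFn extends DoFn<String, String> {
>>>       private final Counter processed =
>>>           Metrics.counter(CountingFn.class, "processed");
>>>
>>>       @ProcessElement
>>>       public void processElement(ProcessContext c) {
>>>         processed.inc();        // identical in every runner...
>>>         c.output(c.element());  // ...but how the counter is exposed differs
>>>       }
>>>     }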
>>>
>>> We also need integration tests that simulate common user scenarios and
>>> some distribution use cases. For example, probably the most common data
>>> store used for streaming is Kafka (at least in open source). We should
>>> have an IT that tests the common issues that can arise when you use
>>> Kafka: what happens if a Kafka broker goes down, does Beam continue to
>>> read without issue? What about a new leader election, do we continue to
>>> work as expected? Few projects have something like this, and it would
>>> send a clear message that Beam cares about reliability too.
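>>>
>>> A sketch of the kind of IT I mean, using the real KafkaIO read API; the
>>> failure-injection steps are hypothetical and would depend on the test
>>> harness (e.g. stopping a broker container), as are the broker addresses
>>> and topic name:
>>>
>>>     import org.apache.beam.sdk.Pipeline;
>>>     import org.apache.beam.sdk.io.kafka.KafkaIO;
>>>     import org.apache.kafka.common.serialization.StringDeserializer;
>>>
>>>     Pipeline p = Pipeline.create();
>>>     p.apply(KafkaIO.<String, String>read()
>>>         .withBootstrapServers("broker-1:9092,broker-2:9092") // assumed test cluster
>>>         .withTopic("it-topic")                               // assumed test topic
>>>         .withKeyDeserializer(StringDeserializer.class)
>>>         .withValueDeserializer(StringDeserializer.class));
>>>     // Start the pipeline, kill broker-1, then assert that reading
>>>     // resumes after leader election with no data loss (harness-specific).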
>>>
>>> Apart from these, I think we also need to work on:
>>>
>>> * Simpler APIs + user-friendly libraries.
>>>
>>> I want to add a big thanks to Jesse for his list of criteria that
>>> people weigh when they choose a framework for data processing. The
>>> first point, 'Will this dramatically improve the problems I'm trying to
>>> solve?', is super important. Of course Beam has portability and a rich
>>> model as its biggest assets, but I have been consistently asked at
>>> conferences whether Beam has libraries for graph processing, CEP,
>>> machine learning, or a Scala API.
>>>
>>> Of course we have had some progress with the recent addition of SQL,
>>> and hopefully schema-aware PCollections will help there too, but there
>>> is still some way to go. This cannot be the top priority given the
>>> portability goals of Beam, but these libraries are sometimes what makes
>>> users decide whether to adopt a tool, so better to have them than not.
>>>
>>> These are the most important issues from my point of view. My apologies
>>> for the long email, but this was the perfect moment to discuss these.
>>>
>>> One extra point: I think we should write and agree on a concise roadmap
>>> and review our progress against it mid-year and at the end of the year,
>>> as other communities do.
>>>
>>> Regards,
>>> Ismaël
>>>
>>> On Mon, Jan 15, 2018 at 7:49 PM, Jesse Anderson
>>> <je...@bigdatainstitute.io> wrote:
>>>
>>>> I think a focus on the runners is key to Beam's adoption. The runners
>>>> are the foundation on which Beam sits. If the runners don't work
>>>> properly, Beam won't work.
>>>>
>>>> A focus on improved unit tests is a good start, but isn't all that's
>>>> needed. Compatibility matrices will help you see how your runner of
>>>> choice stacks up, but they require too much knowledge of Beam's
>>>> internals to be interpretable.
>>>>
>>>> Imagine you're an (enterprise) architect looking at adopting Beam.
>>>> What do you look at, or what do you look for, before going deeper?
>>>> What would make you stick your neck out to adopt Beam? In my
>>>> experience, there are several pass/fails along the way.
>>>>
>>>> Here are a few of the common ones I've seen:
>>>>
>>>> Will this dramatically improve the problems I'm trying to solve? (not
>>>> writing APIs/better programming model/Beam's better handling of
>>>> windowing)
>>>> Can I get commercial support for Beam? (This is changing soon)
>>>> Are other people using Beam with the same configuration and use case
>>>> as me? (e.g. I'm using Spark with Beam to process imagery. Are others
>>>> doing this in production?)
>>>> Is there good documentation, and are there books, on the subject?
>>>> (Tyler's and others' book will improve this)
>>>> Can I get my team trained on this new technology? (I have Beam training
>>>> and
>>>> Google has some cursory training)
>>>>
>>>> I think the area the community can improve on the most is the social
>>>> proof of Beam. I've tried to do this
>>>> (http://www.jesse-anderson.com/2017/06/beam-2-0-q-and-a/ and
>>>> http://www.jesse-anderson.com/2016/07/question-and-answers-with-the-apache-beam-team/).
>>>> We need to get the message out more about people using Beam in
>>>> production, which configurations they use, and what their results
>>>> were. I think we have the social proof on Dataflow, but not as much on
>>>> Spark/Flink/Apex.
>>>>
>>>> It's important to note that these checks don't look at the hardcore
>>>> language or API semantics we're working on. Those are much later-stage
>>>> concerns, if they come up at all.
>>>>
>>>> In my experience with the adoption of other open-source projects at
>>>> enterprises, it starts with architects and works its way around the
>>>> organization from there.
>>>>
>>>> Thanks,
>>>>
>>>> Jesse
>>>>
>>>> On Mon, Jan 15, 2018 at 8:14 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> bq. are hard to detect in our unit-test framework
>>>>>
>>>>> Looks like more integration tests would help discover bugs and
>>>>> regressions more quickly. If the committer reviewing a PR has
>>>>> concerns in this regard, they should be stated on the PR so that the
>>>>> contributor (and reviewer) can spend more time solidifying the
>>>>> solution.
>>>>>
>>>>> bq. I've gone and fixed these issues myself when merging
>>>>>
>>>>> We can adopt stricter checkstyle rules so that code wouldn't pass
>>>>> the build without addressing commonly known issues.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Jan 14, 2018 at 12:37 PM, Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> I agree with the sentiment, but I don't completely agree with the
>>>>>> criteria.
>>>>>>
>>>>>> I think we need to be much better about reviewing PRs. Some PRs
>>>>>> languish
>>>>>> for too long before the reviewer gets to it (and I've been guilty of
>>>>>> this
>>>>>> too), which does not send a good message. Also new PRs sometimes
>>>>>> languish
>>>>>> because there is no reviewer assigned; maybe we could write a gitbot
>>>>>> to
>>>>>> automatically assign a reviewer to every new PR?
>>>>>>
>>>>>> Also, I think that the bar for merging a PR from a contributor should
>>>>>> not
>>>>>> be "the PR is perfect." It's perfectly fine to merge a PR that still
>>>>>> has
>>>>>> some issues (especially if the issues are stylistic). In the past
>>>>>> when I've
>>>>>> done this, I've gone and fixed these issues myself when merging. It
>>>>>> was a
>>>>>> bit more work for me to fix these things myself, but it was a small
>>>>>> price to
>>>>>> pay in order to portray Beam as a welcoming place for contributions.
>>>>>>
>>>>>> On the other hand, "the build does not break" is - in my opinion - too
>>>>>> weak of a criterion for merging. A few reasons for this:
>>>>>>
>>>>>>    * Beam is a data-processing framework, and data integrity is
>>>>>> paramount.
>>>>>> If a reviewer sees an issue that could lead to data loss (or
>>>>>> duplication, or
>>>>>> corruption), I don't think that PR should be merged. Historically
>>>>>> many such
>>>>>> issues only actually manifest at scale and are hard to detect in our
>>>>>> unit-test framework. (We also need to invest in more at-scale tests
>>>>>> to catch such issues.)
>>>>>>
>>>>>>    * Beam guarantees backwards compatibility for users (except across
>>>>>> major versions). If a bad API gets merged and released (and the
>>>>>> chances of "forgetting" about it before the release is cut are
>>>>>> unfortunately high), we are stuck with it. This is less of an issue
>>>>>> for many other open-source projects that do not make such a
>>>>>> compatibility guarantee, as they are able to simply remove or fix
>>>>>> the API in the next version.
>>>>>>
>>>>>> I think we still need honest review of PRs, with the criteria being
>>>>>> stronger than "the build doesn't break." However, reviewers also
>>>>>> need to be reasonable about what they ask for.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Sun, Jan 14, 2018 at 11:19 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> bq. if a PR is basically right (it does what it should) without
>>>>>>> breaking
>>>>>>> the build, then it has to be merged fast
>>>>>>>
>>>>>>> +1 on above.
>>>>>>> This would give contributors positive feedback.
>>>>>>>
>>>>>>> On Sun, Jan 14, 2018 at 8:13 AM, Jean-Baptiste Onofré <
>>>>>>> j...@nanthrax.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Davor,
>>>>>>>>
>>>>>>>> Thanks a lot for this e-mail.
>>>>>>>>
>>>>>>>> I would like to emphasize two areas where we have to improve:
>>>>>>>>
>>>>>>>> 1. The Apache Way and community. We still have to focus on and be
>>>>>>>> dedicated to our communities (both user & dev). Helping,
>>>>>>>> encouraging, and growing our communities is key for the project.
>>>>>>>> Building bridges between communities is also very important. We
>>>>>>>> have to be more "accessible": sometimes simplifying our
>>>>>>>> discussions and showing more interest and open-mindedness toward
>>>>>>>> proposals would help as well. I think we do a good job already; we
>>>>>>>> just have to improve.
>>>>>>>>
>>>>>>>> 2. Execution: a successful project is a project with regular
>>>>>>>> activity in terms of releases, fixes, and improvements.
>>>>>>>> Regarding PRs, I think we have PRs that stay open too long today,
>>>>>>>> for three reasons:
>>>>>>>> - some are not ready or not good enough; no question about these
>>>>>>>> - some need a reviewer and a speed-up: we have to keep an eye on
>>>>>>>> open PRs and review them ASAP
>>>>>>>> - some are under review, but with a lot of "ping pong" and long
>>>>>>>> discussion, not always justified. I have already said this on the
>>>>>>>> mailing list, but, as in other Apache projects, if a PR is
>>>>>>>> basically right (it does what it should) without breaking the
>>>>>>>> build, then it should be merged fast. If it requires additional
>>>>>>>> changes (tests, polishing, improvements, ...), these can be
>>>>>>>> addressed in new PRs.
>>>>>>>> As already mentioned in the Beam 2.3.0 thread, we have to adopt a
>>>>>>>> regular release schedule. We should make a best effort to have a
>>>>>>>> release every 2 months, whatever the release contains. That's
>>>>>>>> essential to maintain good activity in the project and for the
>>>>>>>> third-party projects using Beam.
>>>>>>>>
>>>>>>>> Again, don't get me wrong: we already do a good job! These are
>>>>>>>> just areas where I think we have to improve.
>>>>>>>>
>>>>>>>> Anyway, thanks for all the hard work we are doing all together!
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>>
>>>>>>>> On 13/01/2018 05:12, Davor Bonaci wrote:
>>>>>>>>
>>>>>>>>> Hi everyone --
>>>>>>>>> Apache Beam was established as a top-level project a year ago (on
>>>>>>>>> December 21, to be exact). This first anniversary is a great
>>>>>>>>> opportunity for
>>>>>>>>> us to look back at the past year, celebrate its successes, learn
>>>>>>>>> from any
>>>>>>>>> mistakes we have made, and plan for the next 1+ years.
>>>>>>>>>
>>>>>>>>> I’d like to invite everyone in the community, particularly users
>>>>>>>>> and
>>>>>>>>> observers on this mailing list, to participate in this discussion.
>>>>>>>>> Apache
>>>>>>>>> Beam is your project and I, for one, would much appreciate your
>>>>>>>>> candid
>>>>>>>>> thoughts and comments. Just as some other projects do, I’d like to
>>>>>>>>> make this
>>>>>>>>> “state of the project” discussion an annual tradition in this
>>>>>>>>> community.
>>>>>>>>>
>>>>>>>>> In terms of successes, the availability of the first stable
>>>>>>>>> release,
>>>>>>>>> version 2.0.0, was the biggest and most important milestone last
>>>>>>>>> year.
>>>>>>>>> Additionally, we have expanded the project’s breadth with new
>>>>>>>>> components,
>>>>>>>>> including several new runners, SDKs, and DSLs, and interconnected
>>>>>>>>> a large
>>>>>>>>> number of storage/messaging systems with new Beam IOs. In terms of
>>>>>>>>> community
>>>>>>>>> growth, crossing 200 lifetime individual contributors and
>>>>>>>>> achieving 76
>>>>>>>>> contributors to a single release were other highlights. We have
>>>>>>>>> doubled the
>>>>>>>>> number of committers, and invited a handful of new PMC members.
>>>>>>>>> Thanks to
>>>>>>>>> each and every one of you for making all of this possible in our
>>>>>>>>> first year.
>>>>>>>>>
>>>>>>>>> On the other hand, in such a young project as Beam, there are
>>>>>>>>> naturally many areas for improvement. This is the principal
>>>>>>>>> purpose of this
>>>>>>>>> thread (and any of its forks). To organize the separate
>>>>>>>>> discussions, I'd suggest forking separate threads for different
>>>>>>>>> discussion areas:
>>>>>>>>> * Culture and governance (anything related to people and their
>>>>>>>>> behavior)
>>>>>>>>> * Community growth (what can we do to further grow a diverse and
>>>>>>>>> vibrant community)
>>>>>>>>> * Technical execution (anything related to releases, their
>>>>>>>>> frequency,
>>>>>>>>> website, infrastructure)
>>>>>>>>> * Feature roadmap for 2018 (what can we do to make the project more
>>>>>>>>> attractive to users, Beam 3.0, etc.).
>>>>>>>>>
>>>>>>>>> I know many passionate folks who particularly care about each of
>>>>>>>>> these
>>>>>>>>> areas, but let me call on some folks from the community to get
>>>>>>>>> things
>>>>>>>>> started: Ismael for culture, Gris for community, JB for technical
>>>>>>>>> execution,
>>>>>>>>> and Ben for feature roadmap.
>>>>>>>>>
>>>>>>>>> Perhaps we can use this thread to discuss project-wide vision. To
>>>>>>>>> seed
>>>>>>>>> that discussion, I’d start somewhat provocatively -- we aren’t
>>>>>>>>> doing so well
>>>>>>>>> on the diversity of users across runners, which is very important
>>>>>>>>> to the
>>>>>>>>> realization of the project’s vision. Would you agree, and would
>>>>>>>>> you be
>>>>>>>>> willing to make it the project’s #1 priority for the next 1-2
>>>>>>>>> years?
>>>>>>>>>
>>>>>>>>> Thanks -- and please join us in what would hopefully be a
>>>>>>>>> productive
>>>>>>>>> and informative discussion that shapes the future of this project!
>>>>>>>>>
>>>>>>>>> Davor
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>
>
