I do support all the efforts to get Dataflow, Flink, and Spark to (3) (Fn API). But I disagree with it as a requirement; the whole point of PTransforms with URNs is that if the runner can figure out how to execute a transform according to its semantics, then that is fine. A runner that meets (1) and (2) but can run only a certain subset of DoFns is allowed by design (whether the subset is based on language, state/timer support, etc.).
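To make the "runs it or says it cannot run it" idea concrete, here is a rough sketch, not actual Beam code. The `Transform` class, `SUPPORTED_URNS`, `SUPPORTED_DOFN_LANGUAGES`, and `submit` are all hypothetical names invented for illustration; the URN strings are modeled on Beam's standard transform URNs, and the restrictions (Java-only DoFns, no state/timers) are assumptions about an imaginary runner.

```python
from dataclasses import dataclass

PARDO_URN = "beam:transform:pardo:v1"

# Assumption: the primitive URNs this hypothetical runner can execute.
SUPPORTED_URNS = {
    "beam:transform:impulse:v1",
    "beam:transform:flatten:v1",
    "beam:transform:group_by_key:v1",
    PARDO_URN,
}

# Assumption: this runner executes only Java DoFns, natively (no Fn API).
SUPPORTED_DOFN_LANGUAGES = {"java"}


@dataclass(frozen=True)
class Transform:
    """Hypothetical stand-in for one transform in a pipeline proto."""
    name: str
    urn: str
    language: str = ""                # only meaningful for ParDo
    uses_state_or_timers: bool = False


def unsupported_reasons(transforms):
    """Return a reason for each transform the runner cannot execute."""
    reasons = []
    for t in transforms:
        if t.urn not in SUPPORTED_URNS:
            reasons.append(f"{t.name}: unsupported URN {t.urn}")
        elif t.urn == PARDO_URN:
            if t.language not in SUPPORTED_DOFN_LANGUAGES:
                reasons.append(f"{t.name}: unsupported DoFn language {t.language}")
            elif t.uses_state_or_timers:
                reasons.append(f"{t.name}: state/timers not supported")
    return reasons


def submit(transforms):
    """Accept the pipeline, or say up front that it cannot be run."""
    reasons = unsupported_reasons(transforms)
    return ("REJECTED", reasons) if reasons else ("ACCEPTED", [])
```

The point of the sketch is only that the check happens against URNs and declared features at submission time, so a runner that handles a subset still behaves predictably for everything else.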
Kenn

On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote:

> I would like to move away from having runners access APIs that are related to pipeline construction and other internal SDK APIs and I would like for SDKs to not inspect internal runner APIs. This would enable the community to improve each independently without needing to fix the world all the time and would enable the community to run a cluster that supports multiple Beam versions at the same time and would also allow for the cluster to be updated independently of the pipelines it runs.
>
> As a community, I believe we need to achieve 1, 2 and 3. Outside of the Apache Beam repo, anyone can do whatever they want but there should be no compatibility guarantees.
>
> 4 and 5 are extensions that enable a richer set of pipelines to run and are optional like many other parts such as if a runner supports metrics aggregation or dynamic work rebalancing.
>
> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]> wrote:
>
>> There are a lot of different meanings to "portable runner". Here are some:
>>
>> (1) A runner that accepts a pipeline proto and either runs it or says it cannot run it
>> (2) A runner that accepts jobs via the job management APIs
>> (3) A runner that executes UDFs via the Fn API
>> (4) A runner that can execute multiple languages
>> (5) A runner that can run cross-language transforms aka multiple languages in the same pipeline
>>
>> I think (1) is a very good bar, and (2) is a nice addition on top of that. Then we have a unified way to submit pipelines and understand their status.
>>
>> I think (3) is optional - a runner can run things however it likes, including with native implementations. And then (4) and (5) as well are just levels of feature capabilities.
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]> wrote:
>>>
>>>> One last thing, for any runner after this one... wouldn't it be a good acceptance criteria to only accept portable implementations anymore?
>>>>
>>>> _/
>>>> _/ Alex Van Boxel
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]> wrote:
>>>>
>>>>> Good points Kenn. I think we mostly agree on what has been discussed in this thread the pros/cons of having runners on our repository, but this is probably not the best moment in time to change any policy in that aspect.
>>>>>
>>>>> So if nobody objects I think we can proceed. I am OOO this week so with less time to continue with the code review, but I will be back to finish the review and hopefully finally get this merged with Pulasthi next week (sorry for the delay).
>>>>>
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then it is
>>>>> > good.)
>>>>>
>>>>> Thanks for your confidence. Twister2 runners looks good so far, but I will confirm 100% next week :) In the meantime if someone has some extra cycles to take a look extra feedback is always welcome.
>>>>>
>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]> wrote:
>>>>> >
>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users.
>>>>> >
>>>>> > So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!).
>>>>> >
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then it is good.)
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <[email protected]> wrote:
>>>>> >>
>>>>> >> I understand that the discussion is on a more broad level than the Twister2 runner. From my experience developing the runner the main advantage of being inside the beam project was the easy access to the wide range of tests and other core/utility code as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core API's I assume they would be easy to unmerge if the need arises.
>>>>> >>
>>>>> >> Talking specifically about Twister2 runner, we hope to continue developing the runner in the future to add both streaming capability and develop a portable runner as well. The team behind Twister2 is working towards the goal to get the project into Apache Incubator in the near future (Hopefully to submit the proposal in the next couple of months).
>>>>> >>
>>>>> >> Best Regards,
>>>>> >> Pulasthi
>>>>> >>
>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <[email protected]> wrote:
>>>>> >>>
>>>>> >>> I think we will get to a point where it makes sense for runners to live in their own repositories, with their own release cadence, but we're not at that point yet. One prerequisite is a stable API--we're closing in on that with the portability protos, but many (java) runners actually share the common runner core libraries and that is even less set in stone.
>>>>> >>>
>>>>> >>> On the other hand, taking responsibility for maintaining all runners is not a tenable or scalable position for the Beam project. If a runner is merged, it should be understood that it can be "un-merged" if it causes a maintenance burden. A completely separate project/repository makes this less messy.
>>>>> >>>
>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <[email protected]> wrote:
>>>>> >>> >
>>>>> >>> > I agree with both of you, mostly :-)
>>>>> >>> >
>>>>> >>> > The monorepo approach doesn't work/scale well for shipped libraries (name a Google library that silently just works and never causes any dependency problems) and the pain we feel has been constant and increasing, but I don't think we are at the breaking point.
>>>>> >>> >
>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to what Kyle describes. In the early stages the benefit of not having to think too hard about build/test infra and share it everywhere is a big help, and it scales well. Eventually, shipping test utility libraries and compliance suites can be equivalent. And to your point - it is very helpful for users to know that they can use CassandraIO with the other Beam artifacts. This is why Google requires the whole big repo to depend on a single version of any externally-controlled artifact. But, yes, as a consequence it is preposterously difficult to stay up to date, since literally anything can block progress. You need a unified escalation chain for that policy to make sense. It is the definition of a healthy Apache project to *not* have that (PMC is different).
>>>>> >>> >
>>>>> >>> > Independent dependencies, independent git histories, and independent release cadence/process are all separate discussions.
>>>>> >>> >
>>>>> >>> > It is a broader question than this particular contribution, so let's merge this runner before changing our whole way of doing things :-)
>>>>> >>> >
>>>>> >>> > Kenn
>>>>> >>> >
>>>>> >>> > [1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext (really quite a balanced analysis)
>>>>> >>> >
>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <[email protected]> wrote:
>>>>> >>> >>
>>>>> >>> >> > Should runners, current and future, be in the same repository as Beam
>>>>> >>> >> > core?
>>>>> >>> >>
>>>>> >>> >> In the distant past, runners lived in their own repositories, and then were donated to Beam. But Beam's current uber-repo setup allows a lot of convenience. For example, a ton of code (including core functionality and tests) is shared directly between runners, which is useful for keeping runners up to date and ensuring consistent behavior between them (in other words, maintainable and reliable).
>>>>> >>> >>
>>>>> >>> >> Generally, it is up to the authors of a particular Beam related project/subproject to decide whether to host their code in Beam or in a different repo, and up to the community to decide whether to take on the donation, as discussed in previous threads on the Twister2 runner. In this case, it seems there is agreement between the Twister2 runner authors and the community that the runner can be hosted in Beam proper.
>>>>> >>> >>
>>>>> >>> >> There are examples of successful independent Beam projects, such as Spotify's Scio, but having an independent project with its own releases requires a lot of dedicated resources, and the bar for entry for extending Beam should not be that high. All that's required of subproject authors is that they keep the subproject in step with Beam. If they can't maintain it any longer, the subproject can be allowed to bitrot without getting in anyone's way. On the other hand, I'm not sure of the details with Cassandra, but in general, a subproject should not have "the ability to block progress" just because it is contained in the Beam uber-repo.
>>>>> >>> >>
>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are few enough to be handled on a case-by-case basis.
>>>>> >>> >>
>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <[email protected]> wrote:
>>>>> >>> >>>
>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>> >>> >>>
>>>>> >>> >>> Should runners, current and future, be in the same repository as Beam core? Can or should they be completely separate products with their own release cycles?
>>>>> >>> >>>
>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable projects. Specifically, Cassandra is holding back some other changes in Beam and I really wish it didn't have the ability to block progress. The more different runners we have in core, the worse this problem is likely to become.
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe <[email protected]> wrote:
>>>>> >>> >>> >
>>>>> >>> >>> > Hi
>>>>> >>> >>> >
>>>>> >>> >>> > I believe the pull request is pretty complete now with the help of Ismaël. Kenn, would you be able to take a look at it and suggest any changes if needed?. The build checks and validations tests are passing at the moment. I will start working on the documentation that you mentioned in an earlier email separately.
>>>>> >>> >>> >
>>>>> >>> >>> > Best Regards,
>>>>> >>> >>> > Pulasthi
>>>>> >>> >>> >
>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <[email protected]> wrote:
>>>>> >>> >>> >>
>>>>> >>> >>> >> Hi All,
>>>>> >>> >>> >>
>>>>> >>> >>> >> I have created the initial pull request [1] to contribute the Twister2 Beam runner to the Apache Beam codebase. More information on Twister2 can be found here[2] and the Twister2 codebase is available here[3]. At the moment only batch mode is supported in the runner, but we are planning to add stream support and implement a portable runner for Twister2 in the near future.
>>>>> >>> >>> >>
>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great to have inputs from the community regarding this contribution since it is a sizable one. I am sure there are many improvements that can be done in the contributed codebase with input from the community.
>>>>> >>> >>> >>
>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>> >>> >>> >> [2] https://twister2.org/
>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>> >>> >>> >>
>>>>> >>> >>> >> Best Regards,
>>>>> >>> >>> >> Pulasthi
>>>>> >>> >>> >> --
>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>> >>> >>> >> PhD Candidate | Research Assistant
>>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>>> >>> >>> >> Indiana University, Bloomington
>>>>> >>> >>> >> cell: 224-386-9035
>>>>> >>> >>> >
>>>>> >>> >>> > --
>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>> >>> >>> > PhD Candidate | Research Assistant
>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>> >>> >>> > Indiana University, Bloomington
>>>>> >>> >>> > cell: 224-386-9035
>>>>> >>> >>>
>>>>> >>> >>> --
>>>>> >>> >>> Elliotte Rusty Harold
>>>>> >>> >>> [email protected]
>>>>> >>
>>>>> >> --
>>>>> >> Pulasthi S. Wickramasinghe
>>>>> >> PhD Candidate | Research Assistant
>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>> >> Indiana University, Bloomington
>>>>> >> cell: 224-386-9035
>>>>>
>>>>
