Hi Ismaël Did you get some free time to perform a code review on the pull request
Best Regards Pulasthi On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <[email protected]> wrote: > I have to disagree. Allowing for runners within the Apache Beam repo and > SDKs that reach into the implementation details of each other are > usability, feature development, maintenance and complexity problems. > > The usability issue comes from our public core facing APIs exposing > methods that runners "need" so they can introspect details that shouldn't > be visible to them (e.g. setWindowingStrategyInternal on PCollection). > Getting to 1 would remove the pipeline construction time instances but not > the execution side ones and there are currently 100+ usages of > the @Internal annotation. > > The feature development and maintenance issues both stem from duplication > of work. We need to have at least two copies of how to do something, one > that is for runner -> SDK direct and one for Fn API. An example of this is > the timer family work which was started and completed for the non portable > implementation yet the portable implementation was left as future work. > > Finally, the complexity comes from how many layers we have that wrap > existing components to create variants for different use cases. I'm looking > at all the DoFnRunners and each of their variants and how those have layers > within themselves within the SDK and how additional layers have been made > to interface with runner specific internal details. > > > On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <[email protected]> wrote: > >> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn >> API). But I disagree with it as a requirement; the whole point of >> ptransforms with URNs is that if the runner can figure out how to execute >> it according to semantics, then it is fine. A runner meets (1) and (2) but >> can only run certain subset of DoFns is allowed by design (whether the >> subset is based on language, state/timer support, etc). >> >> Kenn >> >> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote: >> >>> I would like to move away from having runners access APIs that are >>> related to pipeline construction and other internal SDK APIs and I would >>> like for SDKs to not inspect internal runner APIs. This would enable the >>> community to improve each independently without needing to fix the world >>> all the time and would enable the community to run a cluster that supports >>> multiple Beam versions at the same time and would also allow for the >>> cluster to be updated independently of the pipelines it runs. >>> >>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the >>> Apache Beam repo, anyone can do whatever they want but there should be no >>> compatibility guarantees. >>> >>> 4 and 5 are extensions that enable a richer set of pipelines to run and >>> are optional like many other parts such as if a runner supports metrics >>> aggregation or dynamic work rebalancing. >>> >>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]> wrote: >>> >>>> There are a lot of different meanings to "portable runner". Here are >>>> some: >>>> >>>> (1) A runner that accepts a pipeline proto and either runs it or says >>>> it cannot run it >>>> (2) A runner that accepts jobs via the job management APIs >>>> (3) A runner that executes UDFs via the Fn API >>>> (4) A runner that can execute multiple languages >>>> (5) A runner that can run cross-language transforms aka multiple >>>> languages in the same pipeline >>>> >>>> I think (1) is a very good bar, and (2) is a nice addition on top of >>>> that. Then we have a unified way to submit pipelines and understand their >>>> status. >>>> >>>> I think (3) is optional - a runner can run things however it likes, >>>> including with native implementations. And then (4) and (5) as well are >>>> just levels of feature capabilities. >>>> >>>> Kenn >>>> >>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote: >>>> >>>>> +1 >>>>> >>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]> >>>>> wrote: >>>>> >>>>>> One last thing, for any runner after this one... wouldn't it be a >>>>>> good acceptance criteria to only accept portable implementations anymore? >>>>>> >>>>>> _/ >>>>>> _/ Alex Van Boxel >>>>>> >>>>>> >>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Good points Kenn. I think we mostly agree on what has been discussed >>>>>>> in this >>>>>>> thread the pros/cons of having runners on our repository, but this >>>>>>> is probably >>>>>>> not the best moment in time to change any policy in that aspect. >>>>>>> >>>>>>> So if nobody objects I think we can proceed. I am OOO this week so >>>>>>> with less >>>>>>> time to continue with the code review, but I will be back to finish >>>>>>> the review >>>>>>> and hopefully finally get this merged with Pulasthi next week (sorry >>>>>>> for the >>>>>>> delay). >>>>>>> >>>>>>> > (don't wait for me on code review - if Ismaël said it is good, >>>>>>> then it is >>>>>>> > good.) >>>>>>> >>>>>>> Thanks for your confidence. Twister2 runners looks good so far, but >>>>>>> I will >>>>>>> confirm 100% next week :) In the meantime if someone has some extra >>>>>>> cycles to >>>>>>> take a look extra feedback is always welcome. >>>>>>> >>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]> >>>>>>> wrote: >>>>>>> > >>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't >>>>>>> heard anyone object to this being merged to master. Some time ago, we >>>>>>> mostly decided to favor master instead of branches, because it is so >>>>>>> much >>>>>>> smoother for contributors and users. >>>>>>> > >>>>>>> > So I am poking this thread one last time and otherwise I would >>>>>>> consider it consensus that once code review is done the runner is a >>>>>>> part of >>>>>>> Beam (experimental!). >>>>>>> > >>>>>>> > (don't wait for me on code review - if Ismaël said it is good, >>>>>>> then it is good.) >>>>>>> > >>>>>>> > Kenn >>>>>>> > >>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < >>>>>>> [email protected]> wrote: >>>>>>> >> >>>>>>> >> I understand that the discussion is on a more broad level than >>>>>>> the Twister2 runner. From my experience developing the runner the main >>>>>>> advantage of being inside the beam project was the easy access to the >>>>>>> wide >>>>>>> range of tests and other core/utility code as Kyle pointed out. >>>>>>> Unmerging >>>>>>> runners that are not properly maintained and updated would be the most >>>>>>> logical path to follow since the internals of the runners are only well >>>>>>> understood by developers of that particular project. It would be >>>>>>> unreasonable to expect the Beam community to maintain them. And since >>>>>>> the >>>>>>> runners do not alter the core API's I assume they would be easy to >>>>>>> unmerge >>>>>>> if the need arises. >>>>>>> >> >>>>>>> >> Talking specifically about Twister2 runner, we hope to continue >>>>>>> developing the runner in the future to add both streaming capability and >>>>>>> develop a portable runner as well. The team behind Twister2 is working >>>>>>> towards the goal to get the project into Apache Incubator in the near >>>>>>> future (Hopefully to submit the proposal in the next couple of months). >>>>>>> >> >>>>>>> >> Best Regards, >>>>>>> >> Pulasthi >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw < >>>>>>> [email protected]> wrote: >>>>>>> >>> >>>>>>> >>> I think we will get to a point where it makes sense for runners >>>>>>> to >>>>>>> >>> live in their own repositories, with their own release cadence, >>>>>>> but >>>>>>> >>> we're not at that point yet. One prerequisite is a stable >>>>>>> API--we're >>>>>>> >>> closing in on that with the portability protos, but many (java) >>>>>>> >>> runners actually share the common runner core libraries and that >>>>>>> is >>>>>>> >>> even less set in stone. >>>>>>> >>> >>>>>>> >>> On the other hand, taking responsibility for maintaining all >>>>>>> runners >>>>>>> >>> is not a tenable or scalable position for the Beam project. If a >>>>>>> >>> runner is merged, it should be understood that it can be >>>>>>> "un-merged" >>>>>>> >>> if it causes a maintenance burden. A completely separate >>>>>>> >>> project/repository makes this less messy. >>>>>>> >>> >>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <[email protected]> >>>>>>> wrote: >>>>>>> >>> > >>>>>>> >>> > I agree with both of you, mostly :-) >>>>>>> >>> > >>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped >>>>>>> libraries (name a Google library that silently just works and never >>>>>>> causes >>>>>>> any dependency problems) and the pain we feel has been constant and >>>>>>> increasing, but I don't think we are at the breaking point. >>>>>>> >>> > >>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to >>>>>>> what Kyle describes. In the early stages the benefit of not having to >>>>>>> think >>>>>>> too hard about build/test infra and share it everywhere is a big help, >>>>>>> and >>>>>>> it scales well. Eventually, shipping test utility libraries and >>>>>>> compliance >>>>>>> suites can be equivalent. And to your point - it is very helpful for >>>>>>> users >>>>>>> to know that they can use CassandraIO with the other Beam artifacts. >>>>>>> This >>>>>>> is why Google requires the whole big repo to depend on a single version >>>>>>> of >>>>>>> any externally-controlled artifact. But, yes, as a consequence it is >>>>>>> preposterously difficult to stay up to date, since literally anything >>>>>>> can >>>>>>> block progress. You need a unified escalation chain for that policy to >>>>>>> make >>>>>>> sense. It is the definition of a healthy Apache project to *not* have >>>>>>> that >>>>>>> (PMC is different). >>>>>>> >>> > >>>>>>> >>> > Independent dependencies, independent git histories, and >>>>>>> independent release cadence/process are all separate discussions. >>>>>>> >>> > >>>>>>> >>> > It is a broader question than this particular contribution, so >>>>>>> let's merge this runner before changing our whole way of doing things >>>>>>> :-) >>>>>>> >>> > >>>>>>> >>> > Kenn >>>>>>> >>> > >>>>>>> >>> > [1] >>>>>>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext >>>>>>> (really quite a balanced analysis) >>>>>>> >>> > >>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver < >>>>>>> [email protected]> wrote: >>>>>>> >>> >> >>>>>>> >>> >> > Should runners, current and future, be in the same >>>>>>> repository as Beam >>>>>>> >>> >> > core? >>>>>>> >>> >> >>>>>>> >>> >> In the distant past, runners lived in their own repositories, >>>>>>> and then were donated to Beam. But Beam's current uber-repo setup >>>>>>> allows a >>>>>>> lot of convenience. For example, a ton of code (including core >>>>>>> functionality and tests) is shared directly between runners, which is >>>>>>> useful for keeping runners up to date and ensuring consistent behavior >>>>>>> between them (in other words, maintainable and reliable). >>>>>>> >>> >> >>>>>>> >>> >> Generally, it is up to the authors of a particular Beam >>>>>>> related project/subproject to decide whether to host their code in Beam >>>>>>> or >>>>>>> in a different repo, and up to the community to decide whether to take >>>>>>> on >>>>>>> the donation, as discussed in previous threads on the Twister2 runner. >>>>>>> In >>>>>>> this case, it seems there is agreement between the Twister2 runner >>>>>>> authors >>>>>>> and the community that the runner can be hosted in Beam proper. >>>>>>> >>> >> >>>>>>> >>> >> There are examples of successful independent Beam projects, >>>>>>> such as Spotify's Scio, but having an independent project with its own >>>>>>> releases requires a lot of dedicated resources, and the bar for entry >>>>>>> for >>>>>>> extending Beam should not be that high. All that's required of >>>>>>> subproject >>>>>>> authors is that they keep the subproject in step with Beam. If they >>>>>>> can't >>>>>>> maintain it any longer, the subproject can be allowed to bitrot without >>>>>>> getting in anyone's way. On the other hand, I'm not sure of the details >>>>>>> with Cassandra, but in general, a subproject should not have "the >>>>>>> ability >>>>>>> to block progress" just because it is contained in the Beam uber-repo. >>>>>>> >>> >> >>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam. >>>>>>> Exceptions are few enough to be handled on a case-by-case basis. >>>>>>> >>> >> >>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold < >>>>>>> [email protected]> wrote: >>>>>>> >>> >>> >>>>>>> >>> >>> Generic question without commenting on Twister2 specifically: >>>>>>> >>> >>> >>>>>>> >>> >>> Should runners, current and future, be in the same >>>>>>> repository as Beam >>>>>>> >>> >>> core? Can or should they be completely separate products >>>>>>> with their >>>>>>> >>> >>> own release cycles? >>>>>>> >>> >>> >>>>>>> >>> >>> Generally, loose coupling leads to more maintainable, >>>>>>> reliable >>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other >>>>>>> changes >>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block >>>>>>> >>> >>> progress. The more different runners we have in core, the >>>>>>> worse this >>>>>>> >>> >>> problem is likely to become. >>>>>>> >>> >>> >>>>>>> >>> >>> >>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe >>>>>>> >>> >>> <[email protected]> wrote: >>>>>>> >>> >>> > >>>>>>> >>> >>> > Hi >>>>>>> >>> >>> > >>>>>>> >>> >>> > I believe the pull request is pretty complete now with the >>>>>>> help of Ismaël. Kenn, would you be able to take a look at it and suggest >>>>>>> any changes if needed?. The build checks and validations tests are >>>>>>> passing >>>>>>> at the moment. I will start working on the documentation that you >>>>>>> mentioned in an earlier email separately. >>>>>>> >>> >>> > >>>>>>> >>> >>> > Best Regards, >>>>>>> >>> >>> > Pulasthi >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun >>>>>>> Wickramasinghe <[email protected]> wrote: >>>>>>> >>> >>> >> >>>>>>> >>> >>> >> Hi All, >>>>>>> >>> >>> >> >>>>>>> >>> >>> >> I have created the initial pull request [1] to contribute >>>>>>> the Twister2 Beam runner to the Apache Beam codebase. More information >>>>>>> on >>>>>>> Twister2 can be found here[2] and the Twister2 codebase is available >>>>>>> here[3]. At the moment only batch mode is supported in the runner, but >>>>>>> we >>>>>>> are planning to add stream support and implement a portable runner for >>>>>>> Twister2 in the near future. >>>>>>> >>> >>> >> >>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great >>>>>>> to have inputs from the community regarding this contribution since it >>>>>>> is a >>>>>>> sizable one. I am sure there are many improvements that can be done in >>>>>>> the >>>>>>> contributed codebase with input from the community. >>>>>>> >>> >>> >> >>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888 >>>>>>> >>> >>> >> [2] https://twister2.org/ >>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2 >>>>>>> >>> >>> >> >>>>>>> >>> >>> >> Best Regards, >>>>>>> >>> >>> >> Pulasthi >>>>>>> >>> >>> >> -- >>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe >>>>>>> >>> >>> >> PhD Candidate | Research Assistant >>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science >>>>>>> Center >>>>>>> >>> >>> >> Indiana University, Bloomington >>>>>>> >>> >>> >> cell: 224-386-9035 <(224)%20386-9035> >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > -- >>>>>>> >>> >>> > Pulasthi S. Wickramasinghe >>>>>>> >>> >>> > PhD Candidate | Research Assistant >>>>>>> >>> >>> > School of Informatics and Computing | Digital Science >>>>>>> Center >>>>>>> >>> >>> > Indiana University, Bloomington >>>>>>> >>> >>> > cell: 224-386-9035 <(224)%20386-9035> >>>>>>> >>> >>> >>>>>>> >>> >>> >>>>>>> >>> >>> >>>>>>> >>> >>> -- >>>>>>> >>> >>> Elliotte Rusty Harold >>>>>>> >>> >>> [email protected] >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> -- >>>>>>> >> Pulasthi S. Wickramasinghe >>>>>>> >> PhD Candidate | Research Assistant >>>>>>> >> School of Informatics and Computing | Digital Science Center >>>>>>> >> Indiana University, Bloomington >>>>>>> >> cell: 224-386-9035 <(224)%20386-9035> >>>>>>> >>>>>> -- Pulasthi S. Wickramasinghe PhD Candidate | Research Assistant School of Informatics and Computing | Digital Science Center Indiana University, Bloomington cell: 224-386-9035
