Hi Ismaël, Thanks for the update, No problem at all, please take your time and let me know if my assistance is needed, The virus has affected everyone's timetables. I hope you are safe.
Best Regards, Pulasthi On Fri, Apr 3, 2020 at 12:14 PM Ismaël Mejía <[email protected]> wrote: > Hello Pulasthi, > > Please excuse me for my delay, I have probably 1/3 of my common > available time since the coronavirus lockdown so I have not advanced > as expected. I hope to catch up rapidly and ping you. Our expected > target of merging it before the 2.21.0 release seems to be hard to get > at this point because the branch will be cut next week. I hope this is > not a problem but if it is please excuse me. > > I also profit to ask any other Beamer that could have more free cycles > at the moment in case (s)he can give me an extra hand for the review. > > Regards, > Ismaël > > > On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe > <[email protected]> wrote: > > > > Hi Ismaël > > > > Did you get some free time to perform a code review on the pull request > > > > Best Regards > > Pulasthi > > > > On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <[email protected]> wrote: > >> > >> I have to disagree. Allowing for runners within the Apache Beam repo > and SDKs that reach into the implementation details of each other are > usability, feature development, maintenance and complexity problems. > >> > >> The usability issue comes from our public core facing APIs exposing > methods that runners "need" so they can introspect details that shouldn't > be visible to them (e.g. setWindowingStrategyInternal on PCollection). > Getting to 1 would remove the pipeline construction time instances but not > the execution side ones and there are currently 100+ usages of the > @Internal annotation. > >> > >> The feature development and maintenance issues both stem from > duplication of work. We need to have at least two copies of how to do > something, one that is for runner -> SDK direct and one for Fn API. An > example of this is the timer family work which was started and completed > for the non portable implementation yet the portable implementation was > left as future work. > >> > >> Finally, the complexity comes from how many layers we have that wrap > existing components to create variants for different use cases. I'm looking > at all the DoFnRunners and each of their variants and how those have layers > within themselves within the SDK and how additional layers have been made > to interface with runner specific internal details. > >> > >> > >> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <[email protected]> > wrote: > >>> > >>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 > (Fn API). But I disagree with it as a requirement; the whole point of > ptransforms with URNs is that if the runner can figure out how to execute > it according to semantics, then it is fine. A runner meets (1) and (2) but > can only run certain subset of DoFns is allowed by design (whether the > subset is based on language, state/timer support, etc). > >>> > >>> Kenn > >>> > >>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote: > >>>> > >>>> I would like to move away from having runners access APIs that are > related to pipeline construction and other internal SDK APIs and I would > like for SDKs to not inspect internal runner APIs. This would enable the > community to improve each independently without needing to fix the world > all the time and would enable the community to run a cluster that supports > multiple Beam versions at the same time and would also allow for the > cluster to be updated independently of the pipelines it runs. > >>>> > >>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of > the Apache Beam repo, anyone can do whatever they want but there should be > no compatibility guarantees. > >>>> > >>>> 4 and 5 are extensions that enable a richer set of pipelines to run > and are optional like many other parts such as if a runner supports metrics > aggregation or dynamic work rebalancing. > >>>> > >>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]> > wrote: > >>>>> > >>>>> There are a lot of different meanings to "portable runner". Here are > some: > >>>>> > >>>>> (1) A runner that accepts a pipeline proto and either runs it or > says it cannot run it > >>>>> (2) A runner that accepts jobs via the job management APIs > >>>>> (3) A runner that executes UDFs via the Fn API > >>>>> (4) A runner that can execute multiple languages > >>>>> (5) A runner that can run cross-language transforms aka multiple > languages in the same pipeline > >>>>> > >>>>> I think (1) is a very good bar, and (2) is a nice addition on top of > that. Then we have a unified way to submit pipelines and understand their > status. > >>>>> > >>>>> I think (3) is optional - a runner can run things however it likes, > including with native implementations. And then (4) and (5) as well are > just levels of feature capabilities. > >>>>> > >>>>> Kenn > >>>>> > >>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote: > >>>>>> > >>>>>> +1 > >>>>>> > >>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]> > wrote: > >>>>>>> > >>>>>>> One last thing, for any runner after this one... wouldn't it be a > good acceptance criteria to only accept portable implementations anymore? > >>>>>>> > >>>>>>> _/ > >>>>>>> _/ Alex Van Boxel > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]> > wrote: > >>>>>>>> > >>>>>>>> Good points Kenn. I think we mostly agree on what has been > discussed in this > >>>>>>>> thread the pros/cons of having runners on our repository, but > this is probably > >>>>>>>> not the best moment in time to change any policy in that aspect. > >>>>>>>> > >>>>>>>> So if nobody objects I think we can proceed. I am OOO this week > so with less > >>>>>>>> time to continue with the code review, but I will be back to > finish the review > >>>>>>>> and hopefully finally get this merged with Pulasthi next week > (sorry for the > >>>>>>>> delay). > >>>>>>>> > >>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, > then it is > >>>>>>>> > good.) > >>>>>>>> > >>>>>>>> Thanks for your confidence. Twister2 runners looks good so far, > but I will > >>>>>>>> confirm 100% next week :) In the meantime if someone has some > extra cycles to > >>>>>>>> take a look extra feedback is always welcome. > >>>>>>>> > >>>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]> > wrote: > >>>>>>>> > > >>>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't > heard anyone object to this being merged to master. Some time ago, we > mostly decided to favor master instead of branches, because it is so much > smoother for contributors and users. > >>>>>>>> > > >>>>>>>> > So I am poking this thread one last time and otherwise I would > consider it consensus that once code review is done the runner is a part of > Beam (experimental!). > >>>>>>>> > > >>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, > then it is good.) > >>>>>>>> > > >>>>>>>> > Kenn > >>>>>>>> > > >>>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < > [email protected]> wrote: > >>>>>>>> >> > >>>>>>>> >> I understand that the discussion is on a more broad level than > the Twister2 runner. From my experience developing the runner the main > advantage of being inside the beam project was the easy access to the wide > range of tests and other core/utility code as Kyle pointed out. Unmerging > runners that are not properly maintained and updated would be the most > logical path to follow since the internals of the runners are only well > understood by developers of that particular project. It would be > unreasonable to expect the Beam community to maintain them. And since the > runners do not alter the core API's I assume they would be easy to unmerge > if the need arises. > >>>>>>>> >> > >>>>>>>> >> Talking specifically about Twister2 runner, we hope to > continue developing the runner in the future to add both streaming > capability and develop a portable runner as well. The team behind Twister2 > is working towards the goal to get the project into Apache Incubator in the > near future (Hopefully to submit the proposal in the next couple of months). > >>>>>>>> >> > >>>>>>>> >> Best Regards, > >>>>>>>> >> Pulasthi > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw < > [email protected]> wrote: > >>>>>>>> >>> > >>>>>>>> >>> I think we will get to a point where it makes sense for > runners to > >>>>>>>> >>> live in their own repositories, with their own release > cadence, but > >>>>>>>> >>> we're not at that point yet. One prerequisite is a stable > API--we're > >>>>>>>> >>> closing in on that with the portability protos, but many > (java) > >>>>>>>> >>> runners actually share the common runner core libraries and > that is > >>>>>>>> >>> even less set in stone. > >>>>>>>> >>> > >>>>>>>> >>> On the other hand, taking responsibility for maintaining all > runners > >>>>>>>> >>> is not a tenable or scalable position for the Beam project. > If a > >>>>>>>> >>> runner is merged, it should be understood that it can be > "un-merged" > >>>>>>>> >>> if it causes a maintenance burden. A completely separate > >>>>>>>> >>> project/repository makes this less messy. > >>>>>>>> >>> > >>>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles < > [email protected]> wrote: > >>>>>>>> >>> > > >>>>>>>> >>> > I agree with both of you, mostly :-) > >>>>>>>> >>> > > >>>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped > libraries (name a Google library that silently just works and never causes > any dependency problems) and the pain we feel has been constant and > increasing, but I don't think we are at the breaking point. > >>>>>>>> >>> > > >>>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits > to what Kyle describes. In the early stages the benefit of not having to > think too hard about build/test infra and share it everywhere is a big > help, and it scales well. Eventually, shipping test utility libraries and > compliance suites can be equivalent. And to your point - it is very helpful > for users to know that they can use CassandraIO with the other Beam > artifacts. This is why Google requires the whole big repo to depend on a > single version of any externally-controlled artifact. But, yes, as a > consequence it is preposterously difficult to stay up to date, since > literally anything can block progress. You need a unified escalation chain > for that policy to make sense. It is the definition of a healthy Apache > project to *not* have that (PMC is different). > >>>>>>>> >>> > > >>>>>>>> >>> > Independent dependencies, independent git histories, and > independent release cadence/process are all separate discussions. > >>>>>>>> >>> > > >>>>>>>> >>> > It is a broader question than this particular contribution, > so let's merge this runner before changing our whole way of doing things :-) > >>>>>>>> >>> > > >>>>>>>> >>> > Kenn > >>>>>>>> >>> > > >>>>>>>> >>> > [1] > https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext > (really quite a balanced analysis) > >>>>>>>> >>> > > >>>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver < > [email protected]> wrote: > >>>>>>>> >>> >> > >>>>>>>> >>> >> > Should runners, current and future, be in the same > repository as Beam > >>>>>>>> >>> >> > core? > >>>>>>>> >>> >> > >>>>>>>> >>> >> In the distant past, runners lived in their own > repositories, and then were donated to Beam. But Beam's current uber-repo > setup allows a lot of convenience. For example, a ton of code (including > core functionality and tests) is shared directly between runners, which is > useful for keeping runners up to date and ensuring consistent behavior > between them (in other words, maintainable and reliable). > >>>>>>>> >>> >> > >>>>>>>> >>> >> Generally, it is up to the authors of a particular Beam > related project/subproject to decide whether to host their code in Beam or > in a different repo, and up to the community to decide whether to take on > the donation, as discussed in previous threads on the Twister2 runner. In > this case, it seems there is agreement between the Twister2 runner authors > and the community that the runner can be hosted in Beam proper. > >>>>>>>> >>> >> > >>>>>>>> >>> >> There are examples of successful independent Beam > projects, such as Spotify's Scio, but having an independent project with > its own releases requires a lot of dedicated resources, and the bar for > entry for extending Beam should not be that high. All that's required of > subproject authors is that they keep the subproject in step with Beam. If > they can't maintain it any longer, the subproject can be allowed to bitrot > without getting in anyone's way. On the other hand, I'm not sure of the > details with Cassandra, but in general, a subproject should not have "the > ability to block progress" just because it is contained in the Beam > uber-repo. > >>>>>>>> >>> >> > >>>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for > Beam. Exceptions are few enough to be handled on a case-by-case basis. > >>>>>>>> >>> >> > >>>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold < > [email protected]> wrote: > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> Generic question without commenting on Twister2 > specifically: > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> Should runners, current and future, be in the same > repository as Beam > >>>>>>>> >>> >>> core? Can or should they be completely separate products > with their > >>>>>>>> >>> >>> own release cycles? > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> Generally, loose coupling leads to more maintainable, > reliable > >>>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some > other changes > >>>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to > block > >>>>>>>> >>> >>> progress. The more different runners we have in core, the > worse this > >>>>>>>> >>> >>> problem is likely to become. > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun > Wickramasinghe > >>>>>>>> >>> >>> <[email protected]> wrote: > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > Hi > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > I believe the pull request is pretty complete now with > the help of Ismaël. Kenn, would you be able to take a look at it and > suggest any changes if needed?. The build checks and validations tests are > passing at the moment. I will start working on the documentation that you > mentioned in an earlier email separately. > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > Best Regards, > >>>>>>>> >>> >>> > Pulasthi > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun > Wickramasinghe <[email protected]> wrote: > >>>>>>>> >>> >>> >> > >>>>>>>> >>> >>> >> Hi All, > >>>>>>>> >>> >>> >> > >>>>>>>> >>> >>> >> I have created the initial pull request [1] to > contribute the Twister2 Beam runner to the Apache Beam codebase. More > information on Twister2 can be found here[2] and the Twister2 codebase is > available here[3]. At the moment only batch mode is supported in the > runner, but we are planning to add stream support and implement a portable > runner for Twister2 in the near future. > >>>>>>>> >>> >>> >> > >>>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be > great to have inputs from the community regarding this contribution since > it is a sizable one. I am sure there are many improvements that can be done > in the contributed codebase with input from the community. > >>>>>>>> >>> >>> >> > >>>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888 > >>>>>>>> >>> >>> >> [2] https://twister2.org/ > >>>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2 > >>>>>>>> >>> >>> >> > >>>>>>>> >>> >>> >> Best Regards, > >>>>>>>> >>> >>> >> Pulasthi > >>>>>>>> >>> >>> >> -- > >>>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe > >>>>>>>> >>> >>> >> PhD Candidate | Research Assistant > >>>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science > Center > >>>>>>>> >>> >>> >> Indiana University, Bloomington > >>>>>>>> >>> >>> >> cell: 224-386-9035 > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > > >>>>>>>> >>> >>> > -- > >>>>>>>> >>> >>> > Pulasthi S. Wickramasinghe > >>>>>>>> >>> >>> > PhD Candidate | Research Assistant > >>>>>>>> >>> >>> > School of Informatics and Computing | Digital Science > Center > >>>>>>>> >>> >>> > Indiana University, Bloomington > >>>>>>>> >>> >>> > cell: 224-386-9035 > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> -- > >>>>>>>> >>> >>> Elliotte Rusty Harold > >>>>>>>> >>> >>> [email protected] > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> -- > >>>>>>>> >> Pulasthi S. Wickramasinghe > >>>>>>>> >> PhD Candidate | Research Assistant > >>>>>>>> >> School of Informatics and Computing | Digital Science Center > >>>>>>>> >> Indiana University, Bloomington > >>>>>>>> >> cell: 224-386-9035 > > > > > > > > -- > > Pulasthi S. Wickramasinghe > > PhD Candidate | Research Assistant > > School of Informatics and Computing | Digital Science Center > > Indiana University, Bloomington > > cell: 224-386-9035 > -- Pulasthi S. Wickramasinghe PhD Candidate | Research Assistant School of Informatics and Computing | Digital Science Center Indiana University, Bloomington cell: 224-386-9035
