Hello Pulasthi,

Please excuse me for my delay, I have probably 1/3 of my common
available time since the coronavirus lockdown so I have not advanced
as expected. I hope to catch up rapidly and ping you. Our expected
target of merging it before the 2.21.0 release seems to be hard to get
at this point because the branch will be cut next week. I hope this is
not a problem but if it is please excuse me.

I also profit to ask any other Beamer that could have more free cycles
at the moment in case (s)he can give me an extra hand for the review.

Regards,
Ismaël


On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe
<[email protected]> wrote:
>
> Hi Ismaël
>
> Did you get some free time to perform a code review on the pull request
>
> Best Regards
> Pulasthi
>
> On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <[email protected]> wrote:
>>
>> I have to disagree. Allowing for runners within the Apache Beam repo and 
>> SDKs that reach into the implementation details of each other are usability, 
>> feature development, maintenance and complexity problems.
>>
>> The usability issue comes from our public core facing APIs exposing methods 
>> that runners "need" so they can introspect details that shouldn't be visible 
>> to them (e.g. setWindowingStrategyInternal on PCollection). Getting to 1 
>> would remove the pipeline construction time instances but not the execution 
>> side ones and there are currently 100+ usages of the @Internal annotation.
>>
>> The feature development and maintenance issues both stem from duplication of 
>> work. We need to have at least two copies of how to do something, one that 
>> is for runner -> SDK direct and one for Fn API. An example of this is the 
>> timer family work which was started and completed for the non portable 
>> implementation yet the portable implementation was left as future work.
>>
>> Finally, the complexity comes from how many layers we have that wrap 
>> existing components to create variants for different use cases. I'm looking 
>> at all the DoFnRunners and each of their variants and how those have layers 
>> within themselves within the SDK and how additional layers have been made to 
>> interface with runner specific internal details.
>>
>>
>> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn 
>>> API). But I disagree with it as a requirement; the whole point of 
>>> ptransforms with URNs is that if the runner can figure out how to execute 
>>> it according to semantics, then it is fine. A runner meets (1) and (2) but 
>>> can only run certain subset of DoFns is allowed by design (whether the 
>>> subset is based on language, state/timer support, etc).
>>>
>>> Kenn
>>>
>>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote:
>>>>
>>>> I would like to move away from having runners access APIs that are related 
>>>> to pipeline construction and other internal SDK APIs and I would like for 
>>>> SDKs to not inspect internal runner APIs. This would enable the community 
>>>> to improve each independently without needing to fix the world all the 
>>>> time and would enable the community to run a cluster that supports 
>>>> multiple Beam versions at the same time and would also allow for the 
>>>> cluster to be updated independently of the pipelines it runs.
>>>>
>>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the 
>>>> Apache Beam repo, anyone can do whatever they want but there should be no 
>>>> compatibility guarantees.
>>>>
>>>> 4 and 5 are extensions that enable a richer set of pipelines to run and 
>>>> are optional like many other parts such as if a runner supports metrics 
>>>> aggregation or dynamic work rebalancing.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]> wrote:
>>>>>
>>>>> There are a lot of different meanings to "portable runner". Here are some:
>>>>>
>>>>> (1) A runner that accepts a pipeline proto and either runs it or says it 
>>>>> cannot run it
>>>>> (2) A runner that accepts jobs via the job management APIs
>>>>> (3) A runner that executes UDFs via the Fn API
>>>>> (4) A runner that can execute multiple languages
>>>>> (5) A runner that can run cross-language transforms aka multiple 
>>>>> languages in the same pipeline
>>>>>
>>>>> I think (1) is a very good bar, and (2) is a nice addition on top of 
>>>>> that. Then we have a unified way to submit pipelines and understand their 
>>>>> status.
>>>>>
>>>>> I think (3) is optional - a runner can run things however it likes, 
>>>>> including with native implementations. And then (4) and (5) as well are 
>>>>> just levels of feature capabilities.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]> wrote:
>>>>>>>
>>>>>>> One last thing, for any runner after this one... wouldn't it be a good 
>>>>>>> acceptance criteria to only accept portable implementations anymore?
>>>>>>>
>>>>>>>  _/
>>>>>>> _/ Alex Van Boxel
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Good points Kenn. I think we mostly agree on what has been discussed 
>>>>>>>> in this
>>>>>>>> thread the pros/cons of having runners on our repository, but this is 
>>>>>>>> probably
>>>>>>>> not the best moment in time to change any policy in that aspect.
>>>>>>>>
>>>>>>>> So if nobody objects I think we can proceed. I am OOO this week so 
>>>>>>>> with less
>>>>>>>> time to continue with the code review, but I will be back to finish 
>>>>>>>> the review
>>>>>>>> and hopefully finally get this merged with Pulasthi next week (sorry 
>>>>>>>> for the
>>>>>>>> delay).
>>>>>>>>
>>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then 
>>>>>>>> > it is
>>>>>>>> > good.)
>>>>>>>>
>>>>>>>> Thanks for your confidence. Twister2 runners looks good so far, but I 
>>>>>>>> will
>>>>>>>> confirm 100% next week :) In the meantime if someone has some extra 
>>>>>>>> cycles to
>>>>>>>> take a look extra feedback is always welcome.
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard 
>>>>>>>> > anyone object to this being merged to master. Some time ago, we 
>>>>>>>> > mostly decided to favor master instead of branches, because it is so 
>>>>>>>> > much smoother for contributors and users.
>>>>>>>> >
>>>>>>>> > So I am poking this thread one last time and otherwise I would 
>>>>>>>> > consider it consensus that once code review is done the runner is a 
>>>>>>>> > part of Beam (experimental!).
>>>>>>>> >
>>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then 
>>>>>>>> > it is good.)
>>>>>>>> >
>>>>>>>> > Kenn
>>>>>>>> >
>>>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe 
>>>>>>>> > <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> I understand that the discussion is on a more broad level than the 
>>>>>>>> >> Twister2 runner. From my experience developing the runner the main 
>>>>>>>> >> advantage of being inside the beam project was the easy access to 
>>>>>>>> >> the wide range of tests and other core/utility code as Kyle pointed 
>>>>>>>> >> out. Unmerging runners that are not properly maintained and updated 
>>>>>>>> >> would be the most logical path to follow since the internals of the 
>>>>>>>> >> runners are only well understood by developers of that particular 
>>>>>>>> >> project. It would be unreasonable to expect the Beam community to 
>>>>>>>> >> maintain them. And since the runners do not alter the core API's I 
>>>>>>>> >> assume they would be easy to unmerge if the need arises.
>>>>>>>> >>
>>>>>>>> >> Talking specifically about Twister2 runner, we hope to continue 
>>>>>>>> >> developing the runner in the future to add both streaming 
>>>>>>>> >> capability and develop a portable runner as well. The team behind 
>>>>>>>> >> Twister2 is working towards the goal to get the project into Apache 
>>>>>>>> >> Incubator in the near future (Hopefully to submit the proposal in 
>>>>>>>> >> the next couple of months).
>>>>>>>> >>
>>>>>>>> >> Best Regards,
>>>>>>>> >> Pulasthi
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw 
>>>>>>>> >> <[email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> I think we will get to a point where it makes sense for runners to
>>>>>>>> >>> live in their own repositories, with their own release cadence, but
>>>>>>>> >>> we're not at that point yet. One prerequisite is a stable 
>>>>>>>> >>> API--we're
>>>>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>>>>> >>> runners actually share the common runner core libraries and that is
>>>>>>>> >>> even less set in stone.
>>>>>>>> >>>
>>>>>>>> >>> On the other hand, taking responsibility for maintaining all 
>>>>>>>> >>> runners
>>>>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>>>>> >>> runner is merged, it should be understood that it can be 
>>>>>>>> >>> "un-merged"
>>>>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>>>>> >>> project/repository makes this less messy.
>>>>>>>> >>>
>>>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <[email protected]> 
>>>>>>>> >>> wrote:
>>>>>>>> >>> >
>>>>>>>> >>> > I agree with both of you, mostly :-)
>>>>>>>> >>> >
>>>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped 
>>>>>>>> >>> > libraries (name a Google library that silently just works and 
>>>>>>>> >>> > never causes any dependency problems) and the pain we feel has 
>>>>>>>> >>> > been constant and increasing, but I don't think we are at the 
>>>>>>>> >>> > breaking point.
>>>>>>>> >>> >
>>>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to 
>>>>>>>> >>> > what Kyle describes. In the early stages the benefit of not 
>>>>>>>> >>> > having to think too hard about build/test infra and share it 
>>>>>>>> >>> > everywhere is a big help, and it scales well. Eventually, 
>>>>>>>> >>> > shipping test utility libraries and compliance suites can be 
>>>>>>>> >>> > equivalent. And to your point - it is very helpful for users to 
>>>>>>>> >>> > know that they can use CassandraIO with the other Beam 
>>>>>>>> >>> > artifacts. This is why Google requires the whole big repo to 
>>>>>>>> >>> > depend on a single version of any externally-controlled 
>>>>>>>> >>> > artifact. But, yes, as a consequence it is preposterously 
>>>>>>>> >>> > difficult to stay up to date, since literally anything can block 
>>>>>>>> >>> > progress. You need a unified escalation chain for that policy to 
>>>>>>>> >>> > make sense. It is the definition of a healthy Apache project to 
>>>>>>>> >>> > *not* have that (PMC is different).
>>>>>>>> >>> >
>>>>>>>> >>> > Independent dependencies, independent git histories, and 
>>>>>>>> >>> > independent release cadence/process are all separate discussions.
>>>>>>>> >>> >
>>>>>>>> >>> > It is a broader question than this particular contribution, so 
>>>>>>>> >>> > let's merge this runner before changing our whole way of doing 
>>>>>>>> >>> > things :-)
>>>>>>>> >>> >
>>>>>>>> >>> > Kenn
>>>>>>>> >>> >
>>>>>>>> >>> > [1] 
>>>>>>>> >>> > https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>>>>>>> >>> >  (really quite a balanced analysis)
>>>>>>>> >>> >
>>>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver 
>>>>>>>> >>> > <[email protected]> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >> > Should runners, current and future, be in the same repository 
>>>>>>>> >>> >> > as Beam
>>>>>>>> >>> >> > core?
>>>>>>>> >>> >>
>>>>>>>> >>> >> In the distant past, runners lived in their own repositories, 
>>>>>>>> >>> >> and then were donated to Beam. But Beam's current uber-repo 
>>>>>>>> >>> >> setup allows a lot of convenience. For example, a ton of code 
>>>>>>>> >>> >> (including core functionality and tests) is shared directly 
>>>>>>>> >>> >> between runners, which is useful for keeping runners up to date 
>>>>>>>> >>> >> and ensuring consistent behavior between them (in other words, 
>>>>>>>> >>> >> maintainable and reliable).
>>>>>>>> >>> >>
>>>>>>>> >>> >> Generally, it is up to the authors of a particular Beam related 
>>>>>>>> >>> >> project/subproject to decide whether to host their code in Beam 
>>>>>>>> >>> >> or in a different repo, and up to the community to decide 
>>>>>>>> >>> >> whether to take on the donation, as discussed in previous 
>>>>>>>> >>> >> threads on the Twister2 runner. In this case, it seems there is 
>>>>>>>> >>> >> agreement between the Twister2 runner authors and the community 
>>>>>>>> >>> >> that the runner can be hosted in Beam proper.
>>>>>>>> >>> >>
>>>>>>>> >>> >> There are examples of successful independent Beam projects, 
>>>>>>>> >>> >> such as Spotify's Scio, but having an independent project with 
>>>>>>>> >>> >> its own releases requires a lot of dedicated resources, and the 
>>>>>>>> >>> >> bar for entry for extending Beam should not be that high. All 
>>>>>>>> >>> >> that's required of subproject authors is that they keep the 
>>>>>>>> >>> >> subproject in step with Beam. If they can't maintain it any 
>>>>>>>> >>> >> longer, the subproject can be allowed to bitrot without getting 
>>>>>>>> >>> >> in anyone's way. On the other hand, I'm not sure of the details 
>>>>>>>> >>> >> with Cassandra, but in general, a subproject should not have 
>>>>>>>> >>> >> "the ability to block progress" just because it is contained in 
>>>>>>>> >>> >> the Beam uber-repo.
>>>>>>>> >>> >>
>>>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam. 
>>>>>>>> >>> >> Exceptions are few enough to be handled on a case-by-case basis.
>>>>>>>> >>> >>
>>>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold 
>>>>>>>> >>> >> <[email protected]> wrote:
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> Should runners, current and future, be in the same repository 
>>>>>>>> >>> >>> as Beam
>>>>>>>> >>> >>> core? Can or should they be completely separate products with 
>>>>>>>> >>> >>> their
>>>>>>>> >>> >>> own release cycles?
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other 
>>>>>>>> >>> >>> changes
>>>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>>>>> >>> >>> progress. The more different runners we have in core, the 
>>>>>>>> >>> >>> worse this
>>>>>>>> >>> >>> problem is likely to become.
>>>>>>>> >>> >>>
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>>>>> >>> >>> <[email protected]> wrote:
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > Hi
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > I believe the pull request is pretty complete now with the 
>>>>>>>> >>> >>> > help of Ismaël. Kenn, would you be able to take a look at it 
>>>>>>>> >>> >>> > and suggest any changes if needed?. The build checks and 
>>>>>>>> >>> >>> > validations tests are passing at the moment.  I will start 
>>>>>>>> >>> >>> > working on the documentation that you mentioned in an 
>>>>>>>> >>> >>> > earlier email separately.
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > Best Regards,
>>>>>>>> >>> >>> > Pulasthi
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun 
>>>>>>>> >>> >>> > Wickramasinghe <[email protected]> wrote:
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> Hi All,
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> I have created the initial pull request [1] to contribute 
>>>>>>>> >>> >>> >> the Twister2 Beam runner to the Apache Beam codebase. More 
>>>>>>>> >>> >>> >> information on Twister2 can be found here[2] and the 
>>>>>>>> >>> >>> >> Twister2 codebase is available here[3]. At the moment only 
>>>>>>>> >>> >>> >> batch mode is supported in the runner, but we are planning 
>>>>>>>> >>> >>> >> to add stream support and implement a portable runner for 
>>>>>>>> >>> >>> >> Twister2 in the near future.
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great 
>>>>>>>> >>> >>> >> to have inputs from the community regarding this 
>>>>>>>> >>> >>> >> contribution since it is a sizable one. I am sure there are 
>>>>>>>> >>> >>> >> many improvements that can be done in the contributed 
>>>>>>>> >>> >>> >> codebase with input from the community.
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>>>>> >>> >>> >> [2] https://twister2.org/
>>>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> Best Regards,
>>>>>>>> >>> >>> >> Pulasthi
>>>>>>>> >>> >>> >> --
>>>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>>>>>> >>> >>> >> Indiana University, Bloomington
>>>>>>>> >>> >>> >> cell: 224-386-9035
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > --
>>>>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>>>>> >>> >>> > Indiana University, Bloomington
>>>>>>>> >>> >>> > cell: 224-386-9035
>>>>>>>> >>> >>>
>>>>>>>> >>> >>>
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> --
>>>>>>>> >>> >>> Elliotte Rusty Harold
>>>>>>>> >>> >>> [email protected]
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Pulasthi S. Wickramasinghe
>>>>>>>> >> PhD Candidate  | Research Assistant
>>>>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>>>>> >> Indiana University, Bloomington
>>>>>>>> >> cell: 224-386-9035
>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035

Reply via email to