Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.

Kenn
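A rough illustration of that toggle, assuming both runners end up in the same uber jar and the new one registers under a name like SparkStructuredStreamingRunner (the runner names here are assumptions for the sake of the example, not something this thread confirms):

    // Minimal sketch: one uber jar, the runner chosen purely by a CLI flag,
    // e.g. --runner=SparkRunner (classical) or
    // --runner=SparkStructuredStreamingRunner (new); names assumed above.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerToggle {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        // ... build the user pipeline here ...
        pipeline.run().waitUntilFinish();
      }
    }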
On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko....@gmail.com>
wrote:

> Hmm, I don’t think that jar size should play a big role compared to the
> whole size of the shaded jar of a user's job. Even more, I think it will
> be quite confusing for users to choose which jar to use if we have 3
> different ones for similar purposes. Though, let’s see what others think.
>
> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauc...@apache.org> wrote:
>
> Hi Alexey,
>
> Thanks for your opinion!
>
> Comments inline
>
> Etienne
>
> On 28/10/2019 17:34, Alexey Romanenko wrote:
>
> Let me share some of my thoughts on this.
>
> - shall we filter out the package name from the release?
>
> Until the new runner is ready to be used in production (or, at least, to
> be used for beta testing, in which case users should be clearly warned
> about that), I believe we need to filter out its classes from the
> published jar to avoid confusion.
>
> Yes, that is what I think also.
>
> - should we release 2 jars: one for the old and one for the new?
>
> - should we release 3 jars: one for the old, one for the new and one for
> both?
>
> Once the new runner is released, I think we need to provide only one
> single jar and allow users to switch between the different Spark runners
> with a CLI option.
>
> I would vote for 3 jars: one for the new, one for the old, and one for
> both. Indeed, in some cases, users look very closely at the size of jars.
> This solution meets all use cases.
>
> - should we create a special entry in the capability matrix?
>
> Sure, since it has its own unique characteristics and implementation, but
> again, only once the new runner is "officially released".
>
> +1
>
> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:
>
> Hi guys,
>
> Any opinions on point 2, the communication to users?
>
> Etienne
>
> On 24/10/2019 15:44, Etienne Chauchot wrote:
>
> Hi guys,
>
> I'm glad to announce that the PR for the merge to master of the new runner
> based on the Spark Structured Streaming framework is submitted:
>
> https://github.com/apache/beam/pull/9866
>
> 1. Regarding the status of the runner:
>
> - the runner passes 93% of the validates runner tests in batch mode.
>
> - Streaming mode is barely started (waiting for the multi-aggregations
> support in the Spark Structured Streaming framework from the Spark
> community)
>
> - Runner can execute Nexmark
>
> - Some things are not wired up yet
>
> - Beam Schemas not wired with Spark Schemas
>
> - Optional features of the model not implemented: state API, timer API,
> splittable DoFn API, …
>
> 2. Regarding the communication to users:
>
> - for reasons explained by Ismaël: the runner is in the same module as the
> "older" one. But it is in a different sub-package and both runners share
> the same build.
>
> - How should we communicate to users:
>
> - shall we filter out the package name from the release?
>
> - should we release 2 jars: one for the old and one for the new?
>
> - should we release 3 jars: one for the old, one for the new and one for
> both?
>
> - should we create a special entry in the capability matrix?
>
> WDYT?
>
> Best
>
> Etienne
>
> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>
> +1 to merge.
>
> It is worth keeping things in master with an explicitly marked status. It
> will make the effort more visible to users and easier to get feedback
> upon.
>
> --Mikhail
>
> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org>
> wrote:
>
>> Hi guys,
>>
>> The new Spark runner now supports Beam coders and passes 93% of the
>> batch validates runner tests (+4%). I think it is time to merge it to
>> master. I will submit a PR in the coming days.
>>
>> Next steps: support schemas and thus better leverage the Catalyst
>> optimizer (among other things, optimizations based on data), and port
>> the performance optimizations that were done in the current runner.
>>
>> Best
>>
>> Etienne
>>
>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>
>> +1 for merging : )
>>
>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Sounds like a good plan to me.
>>>
>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauc...@apache.org>
>>> wrote:
>>>
>>>> Comments inline
>>>>
>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>
>>>> +1
>>>>
>>>> The earlier we get to master the better, to encourage not only code
>>>> contributions but, just as importantly, early user feedback.
>>>>
>>>> Question is: do we keep the "old" Spark runner for a while or not (or
>>>> just keep the previous version/tag on git)?
>>>>
>>>> It is still too early to even start discussing when to remove the
>>>> classical runner, given that the new runner is still a WIP. However,
>>>> the overall goal is that this runner becomes the de facto one once the
>>>> VR tests and the performance become at least equal to the classical
>>>> runner; in the meantime, the best for users is that they co-exist.
>>>> Let’s not forget that the other runner has already been battle tested
>>>> for more than 3 years and has had lots of improvements in the last
>>>> year.
>>>>
>>>> +1 on what Ismaël says: no removal soon.
>>>>
>>>> The plan I had in mind at first (that I showed at ApacheCon) was this,
>>>> but I'm proposing to move the first gray label to before the red box.
>>>>
>>>> <beogijnhpieapoll.png>
>>>>
>>>> I don't think the number of commits should be an issue--we shouldn't
>>>> just squash years' worth of history away. (OTOH, if this is a case of
>>>> this branch containing lots of little, irrelevant commits that would
>>>> have normally been squashed away in the normal review process we do
>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>
>>>> About the commits, we should encourage a clear history, but we also
>>>> have to remove useless commits that are still present in the branch:
>>>> commits of the “Fix errorprone” / “Cleaning” kind, and even commits
>>>> that make better narrative sense together should probably be squashed,
>>>> because they do not bring much to the history. It is not about more or
>>>> fewer commits, it is about their relevance, as Robert mentions.
>>>>
>>>> I think our experiences with things that go to master early have been
>>>> very good. So I am in favor ASAP. We can exclude it from releases
>>>> easily until it is ready for end users.
>>>>
>>>> I have the same question as Robert - how much is modifications and how
>>>> much is new? I notice it is in a subdirectory of the beam-runners-spark
>>>> module.
>>>>
>>>> In its current form we cannot exclude it, but this relates to the
>>>> other question, so better to explain a bit of history: the new runner
>>>> used to live in its own module and subdirectory because it is a full
>>>> blank-page rewrite, and the decision was not to use any of the
>>>> classical runner classes so as not to be constrained by their
>>>> evolution.
>>>>
>>>> However, the reason to put it back in the same module as a
>>>> subdirectory was to encourage early use. In more detail: the way you
>>>> deploy Spark jobs today is usually by packaging and staging an uber
>>>> jar (~200MB of pure dependency joy) that contains the user pipeline
>>>> classes, the Spark runner module and its dependencies. If we had two
>>>> Spark runners in separate modules, the user would need to repackage
>>>> and redeploy their pipelines every time they want to switch from the
>>>> classical Spark runner to the structured streaming runner, which is
>>>> painful and time- and space-consuming compared with the one-module
>>>> approach, where they just change the name of the runner class and
>>>> that’s it (see the sketch at the end of this thread). The idea here
>>>> is to make it easy for users to test the new runner, but at the same
>>>> time to make it easy to come back to the classical runner in case of
>>>> any issue.
>>>>
>>>> Ismaël
>>>>
>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <k...@apache.org>
>>>> wrote:
>>>>
>>>> +1
>>>>
>>>> I think our experiences with things that go to master early have been
>>>> very good. So I am in favor ASAP. We can exclude it from releases
>>>> easily until it is ready for end users.
>>>>
>>>> I have the same question as Robert - how much is modifications and how
>>>> much is new? I notice it is in a subdirectory of the beam-runners-spark
>>>> module.
>>>>
>>>> I did not see any major changes to dependencies, but I will also ask
>>>> if it has major version differences such that you might want a
>>>> separate artifact?
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot
>>>> <echauc...@apache.org> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> You probably know that there has been, for several months, work on
>>>> developing a new Spark runner based on the Spark Structured Streaming
>>>> framework. This work is located in a feature branch here:
>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>
>>>> To attract more contributors and get some user feedback, we think it
>>>> is time to merge it to master. Before doing so, some steps need to be
>>>> achieved:
>>>>
>>>> - finish the work on Spark Encoders (that allow calling Beam coders)
>>>> because, right now, the runner is in an unstable state (some
>>>> transforms use the new way of doing ser/de and some use the old one,
>>>> making a pipeline incoherent with regard to serialization)
>>>>
>>>> - clean history: the history contains commits from November 2018, so
>>>> there is a good amount of work, and thus a considerable number of
>>>> commits. They were already squashed, but not those from September 2019
>>>> on.
>>>>
>>>> I don't think the number of commits should be an issue--we shouldn't
>>>> just squash years' worth of history away. (OTOH, if this is a case of
>>>> this branch containing lots of little, irrelevant commits that would
>>>> have normally been squashed away in the normal review process we do
>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>
>>>> Regarding status:
>>>>
>>>> - the runner passes 89% of the validates runner tests in batch mode.
>>>> We hope to pass more with the new Encoders.
>>>>
>>>> - Streaming mode is barely started (waiting for the multi-aggregations
>>>> support in the Spark Structured Streaming framework from the Spark
>>>> community)
>>>>
>>>> - Runner can execute Nexmark
>>>>
>>>> - Some things are not wired up yet
>>>>
>>>> - Beam Schemas not wired with Spark Schemas
>>>>
>>>> - Optional features of the model not implemented: state API, timer
>>>> API, splittable DoFn API, …
>>>>
>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>
>>>> I think that as long as it sits parallel to the existing runner, and
>>>> is clearly marked with its status, it makes sense to me. How many
>>>> changes does it make to the existing codebase (as opposed to adding
>>>> new code)?
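As a concrete sketch of the one-module switch Ismaël describes above, assuming the new runner is exposed as SparkStructuredStreamingRunner next to the classical SparkRunner in the single beam-runners-spark artifact (the class names are assumptions here, not something this thread confirms), switching in code would look roughly like this:

    // A minimal sketch under the assumptions above: both runner classes live
    // in one artifact, so going back and forth needs no repackaging of the
    // ~200MB uber jar; only the runner class choice changes.
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SwitchRunner {
      public static Pipeline create(boolean useNewRunner) {
        PipelineOptions options = PipelineOptionsFactory.create();
        // Pick the runner class; everything else in the job stays the same.
        options.setRunner(useNewRunner
            ? SparkStructuredStreamingRunner.class
            : SparkRunner.class);
        return Pipeline.create(options);
      }
    }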