Hmm, I don’t think that jar size should play a big role compared to the whole size of the shaded jar of the user's job. Even more, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let’s see what others think.
> On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]> wrote:
>
> Hi Alexey,
>
> Thanks for your opinion!
>
> Comments inline
>
> Etienne
>
> On 28/10/2019 17:34, Alexey Romanenko wrote:
>> Let me share some of my thoughts on this.
>>
>>>> - shall we filter out the package name from the release?
>> Until the new runner is ready to be used in production (or, at least, to be used for beta testing, but in that case users should be clearly warned about it), I believe we need to filter out its classes from the published jar to avoid confusion.
> Yes, that is what I think also.
>>>> - should we release 2 jars: one for the old and one for the new?
>>>> - should we release 3 jars: one for the old, one for the new and one for both?
>>
>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option (sketched below, after the quoted thread).
> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases, users look very closely at the size of jars. This solution meets all use cases.
>>>> - should we create a special entry in the capability matrix?
>>
>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
> +1
>>
>>> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>>>
>>> Hi guys,
>>>
>>> Any opinions on point 2, the communication to users?
>>>
>>> Etienne
>>>
>>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>> Hi guys,
>>>>
>>>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>>>>
>>>> https://github.com/apache/beam/pull/9866
>>>>
>>>> 1. Regarding the status of the runner:
>>>>
>>>> - the runner passes 93% of the validates runner tests in batch mode.
>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>>>> - The runner can execute Nexmark.
>>>> - Some things are not wired up yet:
>>>>   - Beam schemas are not wired with Spark schemas.
>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>
>>>> 2. Regarding the communication to users:
>>>>
>>>> - for reasons explained by Ismaël, the runner is in the same module as the "older" one, but it is in a different sub-package and both runners share the same build.
>>>> - How should we communicate to users:
>>>>   - shall we filter out the package name from the release?
>>>>   - should we release 2 jars: one for the old and one for the new?
>>>>   - should we release 3 jars: one for the old, one for the new and one for both?
>>>>   - should we create a special entry in the capability matrix?
>>>>
>>>> WDYT?
>>>>
>>>> Best
>>>>
>>>> Etienne
>>>>
>>>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>>> +1 to merge.
>>>>>
>>>>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback upon.
>>>>>
>>>>> --Mikhail
>>>>>
>>>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>>>>> Hi guys,
>>>>>
>>>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master.
>>>>> I will submit a PR in the coming days.
>>>>>
>>>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
>>>>>
>>>>> Best
>>>>>
>>>>> Etienne
>>>>>
>>>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>>>> +1 for merging :)
>>>>>>
>>>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>>>>>> Sounds like a good plan to me.
>>>>>>
>>>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>>>>>> Comments inline
>>>>>>
>>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>>> +1
>>>>>>>
>>>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as important, to get early user feedback.
>>>>>>>
>>>>>>>> Question is: do we keep the "old" spark runner for a while or not (or just keep the previous version/tag on git)?
>>>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance become at least equal to those of the classical runner. In the meantime, the best for users is that they co-exist; let’s not forget that the other runner has already been battle-tested for more than 3 years and has had lots of improvements in the last year.
>>>>>> +1 on what Ismaël says: no early removal.
>>>>>> The plan I had in mind at first (which I showed at ApacheCon) was this, but I'm proposing moving the first gray label to before the red box.
>>>>>> [attached image: beogijnhpieapoll.png]
>>>>>>
>>>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>>> About the commits, we should encourage a clear history, but we also have to remove useless commits that are still present in the branch: commits of the “Fix errorprone” / “Cleaning” kind, and even commits that make a better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits, it is about their relevance, as Robert mentions.
>>>>>>>
>>>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>> In its current form we cannot exclude it, but this relates to the other question, so better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner classes so as not to be constrained by their evolution.
>>>>>>>
>>>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies. If we had two Spark runners in separate modules, the user would need to repackage and redeploy their pipelines every time they want to switch from the classical Spark runner to the structured streaming runner, which is painful and both time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that’s it (sketched below). The idea here is to make it easy for users to test the new runner, but at the same time to make it easy to come back to the classical runner in case of any issue.
>>>>>>>
>>>>>>> Ismaël
>>>>>>>
>>>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>> +1
>>>>>>>>
>>>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>>
>>>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>>>
>>>>>>>> I did not see any major changes to dependencies, but I will also ask: does it have major version differences such that you might want a separate artifact?
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> You probably know that there has been, for several months, an effort to develop a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here:
>>>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>>>
>>>>>>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>>>>
>>>>>>>>>> - finish the work on Spark Encoders (which allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline incoherent with respect to serialization)
>>>>>>>>>>
>>>>>>>>>> - clean the history: the history contains commits from November 2018, so there is a good amount of work and thus a substantial number of commits. They were already squashed, but not those from September 2019 onward.
>>>>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>>
>>>>>>>>>> Regarding status:
>>>>>>>>>>
>>>>>>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders.
>>>>>>>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>>>>>>>>>> - The runner can execute Nexmark.
>>>>>>>>>> - Some things are not wired up yet:
>>>>>>>>>>   - Beam schemas are not wired with Spark schemas.
>>>>>>>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>>>>>>>
>>>>>>>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
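
For concreteness, a minimal sketch of the single-jar, CLI-switch usage Alexey describes above. This is hypothetical: it assumes both runners end up published in one beam-runners-spark artifact and that the new runner is exposed under a name like SparkStructuredStreamingRunner; only the --runner flag would differ between invocations.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSwitchSketch {
  public static void main(String[] args) {
    // Run with:  --runner=SparkRunner                     (classical runner)
    // or with:   --runner=SparkStructuredStreamingRunner  (new runner, assumed name)
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply the same transforms in either case; only the flag changes.
    pipeline.run().waitUntilFinish();
  }
}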

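And the same switch done programmatically, to illustrate Ismaël's one-module argument: the staged uber jar never changes, only one line of code (or a flag) does. Again a sketch, under the assumption that the new runner ships as a class like SparkStructuredStreamingRunner in the same artifact.

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OneJarTwoRunnersSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Both runner classes ship in the same beam-runners-spark artifact, so the
    // uber jar already staged on the cluster contains both; switching back to
    // the classical runner is equally cheap.
    options.setRunner(SparkRunner.class);
    // options.setRunner(SparkStructuredStreamingRunner.class);  // assumed class name
    // Pipeline.create(options) ... build and run as usual.
  }
}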