On Thu, Oct 10, 2019 at 2:44 PM Ismaël Mejía <[email protected]> wrote:
> +1
>
> The earlier we get to master the better, to encourage not only code
> contributions but, just as importantly, early user feedback.
>
> > Question is: do we keep the "old" spark runner for a while or not (or
> > just keep on previous version/tag on git)?
>
> It is still too early to even start discussing when to remove the
> classical runner, given that the new runner is still a WIP. However,
> the overall goal is that this runner becomes the de-facto one once the
> VR tests and the performance become at least equal to the classical
> runner. In the meantime, the best for users is that they co-exist;
> let’s not forget that the other runner has already been battle tested
> for more than 3 years and has had lots of improvements in the last
> year.
>
> > I don't think the number of commits should be an issue--we shouldn't
> > just squash years worth of history away. (OTOH, if this is a case of
> > this branch containing lots of little, irrelevant commits that would
> > have normally been squashed away in the normal review process we do
> > for the main branch, then, yes, some cleanup could be nice.)
>
> About the commits: we should encourage a clear history, but we also
> have to remove useless commits that are still present in the branch.
> Commits of the “Fix errorprone” / “Cleaning” kind, and even commits
> that make better narrative sense together, should probably be
> squashed, because they do not bring much to the history. It is not
> about more or fewer commits; it is about their relevance, as Robert
> mentions.
>
> > I think our experiences with things that go to master early have been
> > very good. So I am in favor ASAP. We can exclude it from releases
> > easily until it is ready for end users.
> >
> > I have the same question as Robert - how much is modifications and
> > how much is new? I notice it is in a subdirectory of the
> > beam-runners-spark module.
>
> In its current form we cannot exclude it, but this relates to the
> other question, so it is better to explain a bit of history: the new
> runner used to live in its own module and subdirectory because it is a
> full blank-page rewrite, and the decision was not to use any of the
> classical runner classes so as not to be constrained by their
> evolution.
>
> However, the reason to put it back in the same module, as a
> subdirectory, was to encourage early use. In more detail: the way you
> deploy Spark jobs today is usually by packaging and staging an uber
> jar (~200MB of pure dependency joy) that contains the user pipeline
> classes, the Spark runner module and its dependencies. If we have two
> Spark runners in separate modules, the user would need to repackage
> and redeploy their pipelines every time they want to switch from the
> classical Spark runner to the structured streaming runner, which is
> painful and time- and space-consuming compared with the one-module
> approach, where they just change the name of the runner class and
> that’s it. The idea here is to make it easy for users to test the new
> runner, but at the same time to make it easy to come back to the
> classical runner in case of any issue.

That makes sense. And it is a good reason to release it: it can just be
activated by adventurous users.

Kenn

> Ismaël
>
> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
> >
> > +1
> >
> > I think our experiences with things that go to master early have been
> > very good. So I am in favor ASAP. We can exclude it from releases
> > easily until it is ready for end users.
> >
> > I have the same question as Robert - how much is modifications and
> > how much is new? I notice it is in a subdirectory of the
> > beam-runners-spark module.
> >
> > I did not see any major changes to dependencies, but I will also ask
> > if it has major version differences such that you might want a
> > separate artifact?
> >
> > Kenn
> >
> > On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]>
> > wrote:
> >>
> >> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]>
> >> wrote:
> >> >
> >> > Hi guys,
> >> >
> >> > You probably know that for several months there has been work on
> >> > developing a new Spark runner based on the Spark Structured
> >> > Streaming framework. This work is located in a feature branch here:
> >> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >> >
> >> > To attract more contributors and get some user feedback, we think
> >> > it is time to merge it to master. Before doing so, some steps need
> >> > to be achieved:
> >> >
> >> > - finish the work on Spark Encoders (which allow calling Beam
> >> > coders), because right now the runner is in an unstable state (some
> >> > transforms use the new way of doing ser/de and some use the old
> >> > one, making a pipeline inconsistent with respect to serialization)
> >> >
> >> > - clean the history: the history contains commits from November
> >> > 2018, so there is a good amount of work and consequently a large
> >> > number of commits. They were already squashed, but not from
> >> > September 2019 on
> >>
> >> I don't think the number of commits should be an issue--we shouldn't
> >> just squash years worth of history away. (OTOH, if this is a case of
> >> this branch containing lots of little, irrelevant commits that would
> >> have normally been squashed away in the normal review process we do
> >> for the main branch, then, yes, some cleanup could be nice.)
> >>
> >> > Regarding status:
> >> >
> >> > - the runner passes 89% of the validates runner tests in batch mode.
> >> > We hope to pass more with the new Encoders
> >> >
> >> > - streaming mode is barely started (waiting for the
> >> > multi-aggregations support in the Spark SS framework from the Spark
> >> > community)
> >> >
> >> > - the runner can execute Nexmark
> >> >
> >> > - some things are not wired up yet:
> >> >
> >> >     - Beam schemas are not wired up with Spark schemas
> >> >
> >> >     - optional features of the model are not implemented: state
> >> >       API, timer API, splittable DoFn API, …
> >> >
> >> > WDYT, can we merge it to master once the 2 steps are done?
> >>
> >> I think that as long as it sits parallel to the existing runner, and
> >> is clearly marked with its status, it makes sense to me. How many
> >> changes does it make to the existing codebase (as opposed to adding
> >> new code)?
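The one-module packaging Ismaël argues for means switching between the two runners is a one-flag change against the same staged uber jar: in Beam, the `--runner=` pipeline option names the runner class to use. A minimal standalone sketch of that flag-based selection (no Beam dependency; the class and parsing here are illustrative, mirroring but not reusing Beam's `PipelineOptionsFactory`):

```java
// Standalone sketch (not Beam code): with both Spark runners in one
// module/uber jar, switching runners is just a change of the --runner flag.
public class RunnerSwitchSketch {

    // Pick the runner name from command-line args, defaulting to the
    // classical runner, the way --runner=... selects a runner class in Beam.
    static String selectRunner(String[] args) {
        for (String arg : args) {
            if (arg.startsWith("--runner=")) {
                return arg.substring("--runner=".length());
            }
        }
        return "SparkRunner"; // classical runner as the default
    }

    public static void main(String[] args) {
        // Same jar, two invocations; only the flag differs.
        System.out.println(selectRunner(new String[] {"--runner=SparkRunner"}));
        System.out.println(selectRunner(
            new String[] {"--runner=SparkStructuredStreamingRunner"}));
    }
}
```

With real Beam pipelines the same idea applies: redeploy nothing, and pass `--runner=SparkStructuredStreamingRunner` (or the classical `SparkRunner`) when submitting the already-staged jar.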
