On Thu, Oct 10, 2019 at 2:44 PM Ismaël Mejía <[email protected]> wrote:
> +1
>
> The earlier we get to master the better, to encourage not only code
> contributions but, just as importantly, early user feedback.
>
> > Question is: do we keep the "old" spark runner for a while or not (or
> > just keep on previous version/tag on git)?
>
> It is still too early to even start discussing when to remove the
> classical runner, given that the new runner is still a WIP. However,
> the overall goal is that this runner becomes the de-facto one once the
> VR tests and the performance become at least equal to the classical
> runner. In the meantime, the best for users is that they co-exist;
> let’s not forget that the other runner has already been battle tested
> for more than 3 years and has had lots of improvements in the last
> year.
>
> > I don't think the number of commits should be an issue--we shouldn't
> > just squash years worth of history away. (OTOH, if this is a case of
> > this branch containing lots of little, irrelevant commits that would
> > have normally been squashed away in the normal review process we do
> > for the main branch, then, yes, some cleanup could be nice.)
>
> About the commits: we should encourage a clear history, but we also
> have to remove useless commits that are still present in the branch.
> Commits of the “Fix errorprone” / “Cleaning” kind, and even commits
> that make better narrative sense together, should probably be
> squashed, because they do not bring much to the history. It is not
> about more or fewer commits; it is about their relevance, as Robert
> mentions.
>
> > I think our experiences with things that go to master early have been
> > very good. So I am in favor ASAP. We can exclude it from releases
> > easily until it is ready for end users.
> >
> > I have the same question as Robert - how much is modifications and
> > how much is new? I notice it is in a subdirectory of the
> > beam-runners-spark module.
>
> In its current form we cannot exclude it, but this relates to the
> other question, so it is better to explain a bit of history: the new
> runner used to live in its own module and subdirectory because it is a
> full blank-page rewrite, and the decision was not to use any of the
> classical runner classes so as not to be constrained by their
> evolution.
>
> However, the reason to put it back in the same module, as a
> subdirectory, was to encourage early use. In more detail: the way you
> deploy Spark jobs today is usually by packaging and staging an uber
> jar (~200MB of pure dependency joy) that contains the user pipeline
> classes, the Spark runner module and its dependencies. If we have two
> Spark runners in separate modules, the user would need to repackage
> and redeploy their pipelines every time they want to switch from the
> classical Spark runner to the structured streaming runner, which is
> painful and time- and space-consuming compared with the one-module
> approach, where they just change the name of the runner class and
> that’s it. The idea here is to make it easy for users to test the new
> runner, but at the same time to make it easy to come back to the
> classical runner in case of any issue.

That makes sense. And it is a good reason to release it: it can just be
activated by adventurous users.

Kenn

> Ismaël
>
> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
> >
> > +1
> >
> > I think our experiences with things that go to master early have been
> > very good. So I am in favor ASAP. We can exclude it from releases
> > easily until it is ready for end users.
> >
> > I have the same question as Robert - how much is modifications and
> > how much is new? I notice it is in a subdirectory of the
> > beam-runners-spark module.
> >
> > I did not see any major changes to dependencies, but I will also ask
> > if it has major version differences such that you might want a
> > separate artifact?
> >
> > Kenn
> >
> > On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]>
> > wrote:
> >>
> >> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]>
> >> wrote:
> >> >
> >> > Hi guys,
> >> >
> >> > You probably know that for several months there has been work on
> >> > developing a new Spark runner based on the Spark Structured
> >> > Streaming framework. This work is located in a feature branch here:
> >> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >> >
> >> > To attract more contributors and get some user feedback, we think
> >> > it is time to merge it to master. Before doing so, some steps need
> >> > to be achieved:
> >> >
> >> > - finish the work on Spark Encoders (which allow calling Beam
> >> > coders), because right now the runner is in an unstable state (some
> >> > transforms use the new way of doing ser/de and some use the old
> >> > one, making a pipeline inconsistent with respect to serialization)
> >> >
> >> > - clean the history: the history contains commits from November
> >> > 2018, so there is a good amount of work and consequently a large
> >> > number of commits. They were already squashed, but not from
> >> > September 2019 on
> >>
> >> I don't think the number of commits should be an issue--we shouldn't
> >> just squash years worth of history away. (OTOH, if this is a case of
> >> this branch containing lots of little, irrelevant commits that would
> >> have normally been squashed away in the normal review process we do
> >> for the main branch, then, yes, some cleanup could be nice.)
> >>
> >> > Regarding status:
> >> >
> >> > - the runner passes 89% of the validates runner tests in batch mode.
> >> > We hope to pass more with the new Encoders
> >> >
> >> > - streaming mode is barely started (waiting for the
> >> > multi-aggregations support in the Spark SS framework from the Spark
> >> > community)
> >> >
> >> > - the runner can execute Nexmark
> >> >
> >> > - some things are not wired up yet:
> >> >
> >> >     - Beam schemas are not wired up with Spark schemas
> >> >
> >> >     - optional features of the model are not implemented: state
> >> >       API, timer API, splittable DoFn API, …
> >> >
> >> > WDYT, can we merge it to master once the 2 steps are done?
> >>
> >> I think that as long as it sits parallel to the existing runner, and
> >> is clearly marked with its status, it makes sense to me. How many
> >> changes does it make to the existing codebase (as opposed to adding
> >> new code)?
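The one-module packaging Ismaël argues for means switching between the two runners is a one-flag change against the same staged uber jar: in Beam, the `--runner=` pipeline option names the runner class to use. A minimal standalone sketch of that flag-based selection (no Beam dependency; the class and parsing here are illustrative, mirroring but not reusing Beam's `PipelineOptionsFactory`):

```java
// Standalone sketch (not Beam code): with both Spark runners in one
// module/uber jar, switching runners is just a change of the --runner flag.
public class RunnerSwitchSketch {

    // Pick the runner name from command-line args, defaulting to the
    // classical runner, the way --runner=... selects a runner class in Beam.
    static String selectRunner(String[] args) {
        for (String arg : args) {
            if (arg.startsWith("--runner=")) {
                return arg.substring("--runner=".length());
            }
        }
        return "SparkRunner"; // classical runner as the default
    }

    public static void main(String[] args) {
        // Same jar, two invocations; only the flag differs.
        System.out.println(selectRunner(new String[] {"--runner=SparkRunner"}));
        System.out.println(selectRunner(
            new String[] {"--runner=SparkStructuredStreamingRunner"}));
    }
}
```

With real Beam pipelines the same idea applies: redeploy nothing, and pass `--runner=SparkStructuredStreamingRunner` (or the classical `SparkRunner`) when submitting the already-staged jar.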
