+1

I think our experiences with things that go to master early have been
very good, so I am in favor of merging ASAP. We can easily exclude it
from releases until it is ready for end users.
I have the same question as Robert: how much of this is modifications
and how much is new code? I notice it is in a subdirectory of the
beam-runners-spark module. I did not see any major changes to
dependencies, but I will also ask: does it have major version
differences that would warrant a separate artifact?

Kenn

On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <rober...@google.com> wrote:
> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <echauc...@apache.org> wrote:
> >
> > Hi guys,
> >
> > You probably know that there has been, for several months, work on a
> > new Spark runner based on the Spark Structured Streaming framework.
> > This work lives in a feature branch here:
> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >
> > To attract more contributors and get some user feedback, we think it
> > is time to merge it to master. Before doing so, two steps need to be
> > completed:
> >
> > - Finish the work on Spark Encoders (which allow calling Beam coders),
> > because right now the runner is in an unstable state: some transforms
> > use the new way of doing ser/de and some use the old one, making a
> > pipeline inconsistent with respect to serialization.
> >
> > - Clean up the history. The history contains commits going back to
> > November 2018, so there is a good amount of work and, consequently, a
> > large number of commits. They have already been squashed, but not
> > since September 2019.
>
> I don't think the number of commits should be an issue -- we shouldn't
> just squash years' worth of history away. (On the other hand, if this
> branch contains lots of little, irrelevant commits that would normally
> have been squashed away in the review process we do for the main
> branch, then yes, some cleanup could be nice.)
>
> > Regarding status:
> >
> > - The runner passes 89% of the ValidatesRunner tests in batch mode.
> > We hope to pass more with the new Encoders.
> >
> > - Streaming mode is barely started (waiting for multi-aggregation
> > support in the Spark Structured Streaming framework from the Spark
> > community).
> >
> > - The runner can execute Nexmark.
> >
> > - Some things are not wired up yet:
> >
> >   - Beam schemas are not wired to Spark schemas.
> >
> >   - Optional features of the model are not implemented: state API,
> >     timer API, splittable DoFn API, ...
> >
> > WDYT, can we merge it to master once the two steps are done?
>
> I think that as long as it sits parallel to the existing runner, and
> is clearly marked with its status, it makes sense to me. How many
> changes does it make to the existing codebase (as opposed to adding
> new code)?
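[Editor's note] The "clean history" step discussed in this thread is one
of the standard git workflows. As an illustration only, here is a minimal
sketch of squashing a long-lived feature branch into a single commit with
`git merge --squash`, run in a throwaway repository. The repo, branch
name, and commit messages are illustrative stand-ins, not the actual Beam
branch; the thread also notes reasons not to squash real history away.

```shell
set -e
repo=$(mktemp -d)                 # throwaway repo for the demonstration
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "initial"
base=$(git symbolic-ref --short HEAD)   # default branch name varies by git config

# Stand-in for months of WIP commits on the feature branch.
git checkout -q -b spark-runner_structured-streaming
for i in 1 2 3; do
  echo "$i" >> work.txt
  git add work.txt
  git commit -q -m "wip $i"
done

# --squash stages the branch's combined changes without committing,
# so the base branch ends up with a single commit for the whole branch.
git checkout -q "$base"
git merge --squash -q spark-runner_structured-streaming
git commit -q -m "Add Spark Structured Streaming runner (squashed)"

git rev-list --count HEAD         # 2: the initial commit plus one squashed commit
```

An alternative that preserves more granularity is `git rebase -i` on the
feature branch, squashing only the small fixup commits and keeping the
meaningful ones, which matches Robert's suggestion in the thread.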