Re: [spark structured streaming runner] merge to master?

Etienne Chauchot Fri, 11 Oct 2019 05:57:05 -0700

I think it is also important for the build to provide a jar that contains only the old runner, for people who want to ship only one.


Etienne


On 10/10/2019 15:40, Alexey Romanenko wrote:

+1 for merging this new runner too (even if it’s not 100% ready for the moment),
provided that it doesn’t break, fail, or affect any other tests or Jenkins jobs. I
mean, it should be transparent for other Beam components.

Also, since it won’t be officially “released” right after merging, we need to 
clearly warn users that it’s not ready to use in production.

On 10 Oct 2019, at 15:25, Ryan Skraba <[email protected]> wrote:

Merging to master sounds like a really good idea, even if it is not
feature-complete yet.

It's already a pretty big accomplishment getting it to the current
state (great job all!).  Merging it into master would give it a pretty
good boost for visibility and encouraging some discussion about where
it's going.

I don't think there's any question about removing the RDD-based
(a.k.a. old/legacy/stable) spark runner yet!

All my best, Ryan


On Thu, Oct 10, 2019 at 2:47 PM Jean-Baptiste Onofré <[email protected]> wrote:

+1

As the runner seems almost "equivalent" to the one we have, it makes sense.

The question is: do we keep the "old" spark runner for a while or not (or
just keep it in a previous version/tag on git)?

Regards
JB

On 10/10/2019 09:39, Etienne Chauchot wrote:

Hi guys,

You probably know that work has been under way for several months on
a new Spark runner based on the Spark Structured Streaming
framework. This work is located in a feature branch here:
https://github.com/apache/beam/tree/spark-runner_structured-streaming

To attract more contributors and get some user feedback, we think it is
time to merge it to master. Before doing so, some steps need to be
completed:

- finish the work on Spark Encoders (which allow calling Beam coders)
because, right now, the runner is in an unstable state (some transforms
use the new way of doing ser/de and some use the old one, making a
pipeline inconsistent with respect to serialization)

- clean history: The history contains commits from November 2018, so
there is a good amount of work and hence a substantial number of commits.
They were already squashed, but not those from September 2019 onward

Regarding status:

- the runner passes 89% of the ValidatesRunner tests in batch mode. We
hope to pass more with the new Encoders

- Streaming mode is barely started (waiting for multi-aggregation
support in the Spark SS framework from the Spark community)

- Runner can execute Nexmark

- Some things are not wired up yet

    - Beam Schemas not wired with Spark Schemas

    - Optional features of the model not implemented: State API, Timer
API, splittable DoFn API, …

WDYT, can we merge it to master once these two steps are done?

Best

Etienne

--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
