+1

The earlier we get to master the better, to encourage not only code
contributions but, just as important, early user feedback.

> Question is: do we keep the "old" spark runner for a while or not (or just
> keep it at a previous version/tag in git)?

It is still too early to even start discussing when to remove the
classical runner, given that the new runner is still a WIP. However,
the overall goal is for this runner to become the de facto one once
its VR test results and performance are at least equal to the
classical runner's. In the meantime, the best thing for users is that
the two co-exist; let's not forget that the other runner has already
been battle tested for more than 3 years and has had lots of
improvements in the last year.

> I don't think the number of commits should be an issue--we shouldn't
> just squash years worth of history away. (OTOH, if this is a case of
> this branch containing lots of little, irrelevant commits that would
> have normally been squashed away in the normal review process we do
> for the main branch, then, yes, some cleanup could be nice.)

About the commits: we should encourage a clear history, but we also
have to remove the useless commits that are still present in the
branch, commits of the "Fix errorprone" / "Cleaning" kind. Even
commits that make better narrative sense together should probably be
squashed, because they do not bring much to the history. It is not
about more or fewer commits; it is about their relevance, as Robert
mentions.
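
For illustration, the squashing could be done with an interactive
rebase along these lines (the branch name is the real feature branch;
the commit ids and messages below are made up):

    git checkout spark-runner_structured-streaming
    git rebase -i master
    # In the editor's todo list, mark the noise commits as "fixup" so
    # they fold into the preceding meaningful commit, e.g.:
    #   pick   1a2b3c4 Add batch translation of GroupByKey
    #   fixup  5d6e7f8 Fix errorprone
    #   fixup  9a0b1c2 Cleaning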

> I think our experiences with things that go to master early have been very 
> good. So I am in favor ASAP. We can exclude it from releases easily until it 
> is ready for end users.
> I have the same question as Robert - how much is modifications and how much 
> is new? I notice it is in a subdirectory of the beam-runners-spark module.

In its current form we cannot exclude it, but this relates to the
other question, so it is better to explain a bit of history: the new
runner used to live in its own module and subdirectory because it is
a full blank-page rewrite, and the decision was not to reuse any of
the classical runner's classes so as not to be constrained by their
evolution.

However, the reason to put it back in the same module, as a
subdirectory, was to encourage early use. In more detail: the way you
deploy Spark jobs today is usually by packaging and staging an uber
jar (~200MB of pure dependency joy) that contains the user pipeline
classes, the Spark runner module and its dependencies. If we had two
Spark runners in separate modules, users would need to repackage and
redeploy their pipelines every time they wanted to switch from the
classical Spark runner to the structured streaming runner, which is
painful and time- and space-consuming compared with the one-module
approach, where they just change the name of the runner class and
that's it. The idea here is to make it easy for users to test the new
runner, but at the same time to make it easy to come back to the
classical runner in case of any issue.
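
To make the switch concrete, here is a minimal sketch of the user
side, assuming the new runner ends up registered under the name
SparkStructuredStreamingRunner alongside the classical SparkRunner in
the same artifact (take the names as an assumption until the merge):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSwitchExample {
      public static void main(String[] args) {
        // Same uber jar, same pipeline code; only the runner flag changes:
        //   --runner=SparkRunner                      (classical runner)
        //   --runner=SparkStructuredStreamingRunner   (new runner, assumed name)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        // ... build the same pipeline as usual ...
        pipeline.run().waitUntilFinish();
      }
    }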

Ismaël

On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>
> +1
>
> I think our experiences with things that go to master early have been very 
> good. So I am in favor ASAP. We can exclude it from releases easily until it 
> is ready for end users.
>
> I have the same question as Robert - how much is modifications and how much 
> is new? I notice it is in a subdirectory of the beam-runners-spark module.
>
> I did not see any major changes to dependencies but I will also ask if it has 
> major version differences so that you might want a separate artifact?
>
> Kenn
>
> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>>
>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> 
>> wrote:
>> >
>> > Hi guys,
>> >
>> > You probably know that for several months there has been work on
>> > developing a new Spark runner based on the Spark Structured Streaming
>> > framework. This work is located in a feature branch here:
>> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
>> >
>> > To attract more contributors and get some user feedback, we think it is
>> > time to merge it to master. Before doing so, some steps need to be 
>> > achieved:
>> >
>> > - finish the work on Spark Encoders (which allow calling Beam coders)
>> > because, right now, the runner is in an unstable state (some transforms
>> > use the new way of doing ser/de and some use the old one, making a
>> > pipeline inconsistent with respect to serialization)
>> >
>> > - clean history: The history contains commits from November 2018, so
>> > there is a good amount of work, thus a considerable number of commits.
>> > They were already squashed, but not those from September 2019 onward
>>
>> I don't think the number of commits should be an issue--we shouldn't
>> just squash years worth of history away. (OTOH, if this is a case of
>> this branch containing lots of little, irrelevant commits that would
>> have normally been squashed away in the normal review process we do
>> for the main branch, then, yes, some cleanup could be nice.)
>>
>> > Regarding status:
>> >
>> > - the runner passes 89% of the ValidatesRunner tests in batch mode. We
>> > hope to pass more with the new Encoders
>> >
>> > - Streaming mode is barely started (waiting for the multi-aggregations
>> > support in the Spark Structured Streaming framework from the Spark community)
>> >
>> > - Runner can execute Nexmark
>> >
>> > - Some things are not wired up yet
>> >
>> >      - Beam Schemas not wired with Spark Schemas
>> >
>> >      - Optional features of the model not implemented:  state api, timer
>> > api, splittable doFn api, …
>> >
>> > WDYT, can we merge it to master once the 2 steps are done?
>>
>> I think that as long as it sits parallel to the existing runner, and
>> is clearly marked with its status, it makes sense to me. How many
>> changes does it make to the existing codebase (as opposed to adding new
>> code)?
