Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.

Kenn
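A rough illustration of that toggle, assuming both runners end up in the same uber jar and the new one registers under a name like SparkStructuredStreamingRunner (the runner names here are assumptions for the sake of the example, not something this thread confirms):

    // Minimal sketch: one uber jar, the runner chosen purely by a CLI flag,
    // e.g. --runner=SparkRunner (classical) or
    // --runner=SparkStructuredStreamingRunner (new); names assumed above.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerToggle {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        // ... build the user pipeline here ...
        pipeline.run().waitUntilFinish();
      }
    }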
On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko....@gmail.com>
wrote:

> Hmm, I don’t think that jar size should play a big role compared to the
> whole size of the shaded jar of a user's job. Even more, I think it will
> be quite confusing for users to choose which jar to use if we have 3
> different ones for similar purposes. Though, let’s see what others think.
>
> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauc...@apache.org> wrote:
>
> Hi Alexey,
>
> Thanks for your opinion!
>
> Comments inline
>
> Etienne
>
> On 28/10/2019 17:34, Alexey Romanenko wrote:
>
> Let me share some of my thoughts on this.
>
> - shall we filter out the package name from the release?
>
> Until the new runner is ready to be used in production (or, at least, to
> be used for beta testing, in which case users should be clearly warned
> about that), I believe we need to filter out its classes from the
> published jar to avoid confusion.
>
> Yes, that is what I think also.
>
> - should we release 2 jars: one for the old and one for the new?
>
> - should we release 3 jars: one for the old, one for the new and one for
> both?
>
> Once the new runner is released, I think we need to provide only one
> single jar and allow users to switch between the different Spark runners
> with a CLI option.
>
> I would vote for 3 jars: one for the new, one for the old, and one for
> both. Indeed, in some cases, users look very closely at the size of jars.
> This solution meets all use cases.
>
> - should we create a special entry in the capability matrix?
>
> Sure, since it has its own unique characteristics and implementation, but
> again, only once the new runner is "officially released".
>
> +1
>
> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:
>
> Hi guys,
>
> Any opinions on point 2, the communication to users?
>
> Etienne
>
> On 24/10/2019 15:44, Etienne Chauchot wrote:
>
> Hi guys,
>
> I'm glad to announce that the PR for the merge to master of the new runner
> based on the Spark Structured Streaming framework is submitted:
>
> https://github.com/apache/beam/pull/9866
>
> 1. Regarding the status of the runner:
>
> - the runner passes 93% of the validates runner tests in batch mode.
>
> - Streaming mode is barely started (waiting for the multi-aggregations
> support in the Spark Structured Streaming framework from the Spark
> community)
>
> - Runner can execute Nexmark
>
> - Some things are not wired up yet
>
> - Beam Schemas not wired with Spark Schemas
>
> - Optional features of the model not implemented: state API, timer API,
> splittable DoFn API, …
>
> 2. Regarding the communication to users:
>
> - for reasons explained by Ismaël: the runner is in the same module as the
> "older" one. But it is in a different sub-package and both runners share
> the same build.
>
> - How should we communicate to users:
>
> - shall we filter out the package name from the release?
>
> - should we release 2 jars: one for the old and one for the new?
>
> - should we release 3 jars: one for the old, one for the new and one for
> both?
>
> - should we create a special entry in the capability matrix?
>
> WDYT?
>
> Best
>
> Etienne
>
> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>
> +1 to merge.
>
> It is worth keeping things in master with an explicitly marked status. It
> will make the effort more visible to users and easier to get feedback
> upon.
>
> --Mikhail
>
> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org>
> wrote:
>
>> Hi guys,
>>
>> The new Spark runner now supports Beam coders and passes 93% of the
>> batch validates runner tests (+4%). I think it is time to merge it to
>> master. I will submit a PR in the coming days.
>>
>> Next steps: support schemas and thus better leverage the Catalyst
>> optimizer (among other things, optimizations based on data), and port
>> the performance optimizations that were done in the current runner.
>>
>> Best
>>
>> Etienne
>>
>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>
>> +1 for merging : )
>>
>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Sounds like a good plan to me.
>>>
>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauc...@apache.org>
>>> wrote:
>>>
>>>> Comments inline
>>>>
>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>
>>>> +1
>>>>
>>>> The earlier we get to master the better, to encourage not only code
>>>> contributions but, just as importantly, early user feedback.
>>>>
>>>> Question is: do we keep the "old" Spark runner for a while or not (or
>>>> just keep the previous version/tag on git)?
>>>>
>>>> It is still too early to even start discussing when to remove the
>>>> classical runner, given that the new runner is still a WIP. However,
>>>> the overall goal is that this runner becomes the de facto one once the
>>>> VR tests and the performance become at least equal to the classical
>>>> runner; in the meantime, the best for users is that they co-exist.
>>>> Let’s not forget that the other runner has already been battle tested
>>>> for more than 3 years and has had lots of improvements in the last
>>>> year.
>>>>
>>>> +1 on what Ismaël says: no removal soon.
>>>>
>>>> The plan I had in mind at first (that I showed at ApacheCon) was this,
>>>> but I'm proposing to move the first gray label to before the red box.
>>>>
>>>> <beogijnhpieapoll.png>
>>>>
>>>> I don't think the number of commits should be an issue--we shouldn't
>>>> just squash years' worth of history away. (OTOH, if this is a case of
>>>> this branch containing lots of little, irrelevant commits that would
>>>> have normally been squashed away in the normal review process we do
>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>
>>>> About the commits, we should encourage a clear history, but we also
>>>> have to remove useless commits that are still present in the branch:
>>>> commits of the “Fix errorprone” / “Cleaning” kind, and even commits
>>>> that make better narrative sense together should probably be squashed,
>>>> because they do not bring much to the history. It is not about more or
>>>> fewer commits, it is about their relevance, as Robert mentions.
>>>>
>>>> I think our experiences with things that go to master early have been
>>>> very good. So I am in favor ASAP. We can exclude it from releases
>>>> easily until it is ready for end users.
>>>>
>>>> I have the same question as Robert - how much is modifications and how
>>>> much is new? I notice it is in a subdirectory of the beam-runners-spark
>>>> module.
>>>>
>>>> In its current form we cannot exclude it, but this relates to the
>>>> other question, so better to explain a bit of history: the new runner
>>>> used to live in its own module and subdirectory because it is a full
>>>> blank-page rewrite, and the decision was not to use any of the
>>>> classical runner classes so as not to be constrained by their
>>>> evolution.
>>>>
>>>> However, the reason to put it back in the same module as a
>>>> subdirectory was to encourage early use. In more detail: the way you
>>>> deploy Spark jobs today is usually by packaging and staging an uber
>>>> jar (~200MB of pure dependency joy) that contains the user pipeline
>>>> classes, the Spark runner module and its dependencies. If we had two
>>>> Spark runners in separate modules, the user would need to repackage
>>>> and redeploy their pipelines every time they want to switch from the
>>>> classical Spark runner to the structured streaming runner, which is
>>>> painful and time- and space-consuming compared with the one-module
>>>> approach, where they just change the name of the runner class and
>>>> that’s it (see the sketch at the end of this thread). The idea here
>>>> is to make it easy for users to test the new runner, but at the same
>>>> time to make it easy to come back to the classical runner in case of
>>>> any issue.
>>>>
>>>> Ismaël
>>>>
>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <k...@apache.org>
>>>> wrote:
>>>>
>>>> +1
>>>>
>>>> I think our experiences with things that go to master early have been
>>>> very good. So I am in favor ASAP. We can exclude it from releases
>>>> easily until it is ready for end users.
>>>>
>>>> I have the same question as Robert - how much is modifications and how
>>>> much is new? I notice it is in a subdirectory of the beam-runners-spark
>>>> module.
>>>>
>>>> I did not see any major changes to dependencies, but I will also ask
>>>> if it has major version differences such that you might want a
>>>> separate artifact?
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot
>>>> <echauc...@apache.org> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> You probably know that there has been, for several months, work on
>>>> developing a new Spark runner based on the Spark Structured Streaming
>>>> framework. This work is located in a feature branch here:
>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>
>>>> To attract more contributors and get some user feedback, we think it
>>>> is time to merge it to master. Before doing so, some steps need to be
>>>> achieved:
>>>>
>>>> - finish the work on Spark Encoders (that allow calling Beam coders)
>>>> because, right now, the runner is in an unstable state (some
>>>> transforms use the new way of doing ser/de and some use the old one,
>>>> making a pipeline incoherent with regard to serialization)
>>>>
>>>> - clean history: the history contains commits from November 2018, so
>>>> there is a good amount of work, and thus a considerable number of
>>>> commits. They were already squashed, but not those from September 2019
>>>> on.
>>>>
>>>> I don't think the number of commits should be an issue--we shouldn't
>>>> just squash years' worth of history away. (OTOH, if this is a case of
>>>> this branch containing lots of little, irrelevant commits that would
>>>> have normally been squashed away in the normal review process we do
>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>
>>>> Regarding status:
>>>>
>>>> - the runner passes 89% of the validates runner tests in batch mode.
>>>> We hope to pass more with the new Encoders.
>>>>
>>>> - Streaming mode is barely started (waiting for the multi-aggregations
>>>> support in the Spark Structured Streaming framework from the Spark
>>>> community)
>>>>
>>>> - Runner can execute Nexmark
>>>>
>>>> - Some things are not wired up yet
>>>>
>>>> - Beam Schemas not wired with Spark Schemas
>>>>
>>>> - Optional features of the model not implemented: state API, timer
>>>> API, splittable DoFn API, …
>>>>
>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>
>>>> I think that as long as it sits parallel to the existing runner, and
>>>> is clearly marked with its status, it makes sense to me. How many
>>>> changes does it make to the existing codebase (as opposed to adding
>>>> new code)?
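As a concrete sketch of the one-module switch Ismaël describes above, assuming the new runner is exposed as SparkStructuredStreamingRunner next to the classical SparkRunner in the single beam-runners-spark artifact (the class names are assumptions here, not something this thread confirms), switching in code would look roughly like this:

    // A minimal sketch under the assumptions above: both runner classes live
    // in one artifact, so going back and forth needs no repackaging of the
    // ~200MB uber jar; only the runner class choice changes.
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SwitchRunner {
      public static Pipeline create(boolean useNewRunner) {
        PipelineOptions options = PipelineOptionsFactory.create();
        // Pick the runner class; everything else in the job stays the same.
        options.setRunner(useNewRunner
            ? SparkStructuredStreamingRunner.class
            : SparkRunner.class);
        return Pipeline.create(options);
      }
    }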