Hmm, I don’t think that jar size should play a big role compared to the whole size of the shaded jar of the user's job. Even more, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let’s see what others think.
> On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]> wrote:
>
> Hi Alexey,
>
> Thanks for your opinion!
>
> Comments inline
>
> Etienne
>
> On 28/10/2019 17:34, Alexey Romanenko wrote:
>> Let me share some of my thoughts on this.
>>
>>>> - shall we filter out the package name from the release?
>> Until the new runner is ready to be used in production (or, at least, to be used for beta testing, but in that case users should be clearly warned about it), I believe we need to filter out its classes from the published jar to avoid confusion.
> Yes, that is what I think also.
>>>> - should we release 2 jars: one for the old and one for the new?
>>>> - should we release 3 jars: one for the old, one for the new and one for both?
>>
>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option (sketched below, after the quoted thread).
> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases, users look very closely at the size of jars. This solution meets all use cases.
>>>> - should we create a special entry in the capability matrix?
>>
>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
> +1
>>
>>> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>>>
>>> Hi guys,
>>>
>>> Any opinions on point 2, the communication to users?
>>>
>>> Etienne
>>>
>>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>> Hi guys,
>>>>
>>>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>>>>
>>>> https://github.com/apache/beam/pull/9866
>>>>
>>>> 1. Regarding the status of the runner:
>>>>
>>>> - the runner passes 93% of the validates runner tests in batch mode.
>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>>>> - The runner can execute Nexmark.
>>>> - Some things are not wired up yet:
>>>>   - Beam schemas are not wired with Spark schemas.
>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>
>>>> 2. Regarding the communication to users:
>>>>
>>>> - for reasons explained by Ismaël, the runner is in the same module as the "older" one, but it is in a different sub-package and both runners share the same build.
>>>> - How should we communicate to users:
>>>>   - shall we filter out the package name from the release?
>>>>   - should we release 2 jars: one for the old and one for the new?
>>>>   - should we release 3 jars: one for the old, one for the new and one for both?
>>>>   - should we create a special entry in the capability matrix?
>>>>
>>>> WDYT?
>>>>
>>>> Best
>>>>
>>>> Etienne
>>>>
>>>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>>> +1 to merge.
>>>>>
>>>>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback upon.
>>>>>
>>>>> --Mikhail
>>>>>
>>>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>>>>> Hi guys,
>>>>>
>>>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master.
>>>>> I will submit a PR in the coming days.
>>>>>
>>>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
>>>>>
>>>>> Best
>>>>>
>>>>> Etienne
>>>>>
>>>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>>>> +1 for merging :)
>>>>>>
>>>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>>>>>> Sounds like a good plan to me.
>>>>>>
>>>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>>>>>> Comments inline
>>>>>>
>>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>>> +1
>>>>>>>
>>>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as important, to get early user feedback.
>>>>>>>
>>>>>>>> Question is: do we keep the "old" spark runner for a while or not (or just keep the previous version/tag on git)?
>>>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance become at least equal to those of the classical runner. In the meantime, the best for users is that they co-exist; let’s not forget that the other runner has already been battle-tested for more than 3 years and has had lots of improvements in the last year.
>>>>>> +1 on what Ismaël says: no early removal.
>>>>>> The plan I had in mind at first (which I showed at ApacheCon) was this, but I'm proposing moving the first gray label to before the red box.
>>>>>> [attached image: beogijnhpieapoll.png]
>>>>>>
>>>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>>> About the commits, we should encourage a clear history, but we also have to remove useless commits that are still present in the branch: commits of the “Fix errorprone” / “Cleaning” kind, and even commits that make a better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits, it is about their relevance, as Robert mentions.
>>>>>>>
>>>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>> In its current form we cannot exclude it, but this relates to the other question, so better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner classes so as not to be constrained by their evolution.
>>>>>>>
>>>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies. If we had two Spark runners in separate modules, the user would need to repackage and redeploy their pipelines every time they want to switch from the classical Spark runner to the structured streaming runner, which is painful and both time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that’s it (sketched below). The idea here is to make it easy for users to test the new runner, but at the same time to make it easy to come back to the classical runner in case of any issue.
>>>>>>>
>>>>>>> Ismaël
>>>>>>>
>>>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>> +1
>>>>>>>>
>>>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>>
>>>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>>>
>>>>>>>> I did not see any major changes to dependencies, but I will also ask: does it have major version differences such that you might want a separate artifact?
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> You probably know that there has been, for several months, an effort to develop a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here:
>>>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>>>
>>>>>>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>>>>
>>>>>>>>>> - finish the work on Spark Encoders (which allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline incoherent with respect to serialization)
>>>>>>>>>>
>>>>>>>>>> - clean the history: the history contains commits from November 2018, so there is a good amount of work and thus a substantial number of commits. They were already squashed, but not those from September 2019 onward.
>>>>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>>
>>>>>>>>>> Regarding status:
>>>>>>>>>>
>>>>>>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders.
>>>>>>>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>>>>>>>>>> - The runner can execute Nexmark.
>>>>>>>>>> - Some things are not wired up yet:
>>>>>>>>>>   - Beam schemas are not wired with Spark schemas.
>>>>>>>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>>>>>>>
>>>>>>>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
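
For concreteness, a minimal sketch of the single-jar, CLI-switch usage Alexey describes above. This is hypothetical: it assumes both runners end up published in one beam-runners-spark artifact and that the new runner is exposed under a name like SparkStructuredStreamingRunner; only the --runner flag would differ between invocations.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSwitchSketch {
  public static void main(String[] args) {
    // Run with:  --runner=SparkRunner                     (classical runner)
    // or with:   --runner=SparkStructuredStreamingRunner  (new runner, assumed name)
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply the same transforms in either case; only the flag changes.
    pipeline.run().waitUntilFinish();
  }
}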

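And the same switch done programmatically, to illustrate Ismaël's one-module argument: the staged uber jar never changes, only one line of code (or a flag) does. Again a sketch, under the assumption that the new runner ships as a class like SparkStructuredStreamingRunner in the same artifact.

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OneJarTwoRunnersSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Both runner classes ship in the same beam-runners-spark artifact, so the
    // uber jar already staged on the cluster contains both; switching back to
    // the classical runner is equally cheap.
    options.setRunner(SparkRunner.class);
    // options.setRunner(SparkStructuredStreamingRunner.class);  // assumed class name
    // Pipeline.create(options) ... build and run as usual.
  }
}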