Hello,

Thanks Etienne for opening the Pull Request and starting the discussion for the review process. I also want to publicly thank all the people who contributed to this in some way:

- Mark Shields and the original people at Google who worked on Nexmark, for contributing this in the first place.
- Etienne, because his work and constant help really improved the status of the queries (your work on query 3 was really nice), and also for the hard work of helping me test all the queries with all the runners and pinging the runner maintainers for fixes.
- Aviem/Amit for all the help solving the issues with the Spark runner, whose support is now almost feature complete (even in streaming!).
- Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for quickly adding the support for metrics.
- Thomas Groh and Kenneth for fixing some needed parts in the Direct Runner and answering our questions on the State/Timer API.
- JB and the Talend crew for all the feedback and help to run in our benchmark cluster.
- And of course the rest of the Beam community :)

Some comments:

- This does not need a feature branch, since we have been working on this in a fork for months now, and with the stable API we can simply do a traditional PR review. Of course the review is a bit bigger so we expect it to take some time, but I hope we can make quick progress once FSR is out.
- We need a hand from the Google guys. For the moment we have tested all the queries in all the runners except the Dataflow runner, because we don't have access to it (well, we have, but not with the freedom that you guys have to run the benchmark at will). So if we can get some access that would be nice, or if this is not possible, it would be nice if some of you could help us test and report any issue on this runner.
- We also have to decide the future of some features. This is probably independent of the current PR and part of the evolution of Nexmark on Beam:
-- There are still some pending things that can be improved even after the review, once in master, e.g.
we have for the moment only synthetic sources, but the original version also took data from Pubsub. We have to define the correct scope for this and, if applicable, also add other sources, e.g. Kafka, HDFS.
-- Query 10 is really oriented to testing Google runner/IO-specific features, so we have to decide what to do with this one, maybe mirroring it with Kafka/HDFS to have something equivalent in the Apache world.

This is all for now. I am really glad that this is finally happening and I hope this gets merged soon.

Ismaël

On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <[email protected]> wrote:
> I think these are valuable enough that we should get them into apache/master
>
> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi,
>>
>> PR or even a feature branch could work. Up to you.
>>
>> Regards
>> JB
>>
>>
>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
>>
>>> Hi guys,
>>>
>>> I wanted to let you know that I have just submitted a PR around NexMark.
>>> This is
>>> a port of the NexMark queries to Beam, to be used as integration tests.
>>> This can also be used as A-B testing (no-regression or performance
>>> comparison
>>> between 2 versions of the same engine or of the same runner)
>>>
>>> This is a continuation of the previous PR (#99) from Mark Shields.
>>> The code has changed quite a bit: some queries have changed to use new
>>> Beam APIs
>>> and there were some big refactorings. More important, we can now run all
>>> the
>>> queries in all the runners.
>>>
>>> Nevertheless, there are still some open issues in Nexmark
>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue
>>> links
>>> in https://issues.apache.org/jira/browse/BEAM-160)
>>>
>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
>>> ApacheCon. The PR is not perfect but it is in a good shape to share it.
>>>
>>> Best,
>>>
>>> Etienne
>>>
>>>
>>>
>>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
>>>
>>>> This is great! Having a variety of realistic-ish pipelines running on all
>>>> runners complements the validation suite and IO IT work.
>>>>
>>>> If I recall, some of these involve heavy and esoteric uses of state, so
>>>> definitely give me a ping if you hit any trouble.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <[email protected]>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>>
>>>>> Ismael and I are working on upgrading the Nexmark implementation for
>>>>> Beam.
>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>>>> for the original PR.
>>>>>
>>>>> The PR contains queries that have a wide coverage of the Beam model and
>>>>> that represent a realistic end user use case (some come from client
>>>>> experience on Google Cloud Dataflow).
>>>>>
>>>>> So far, we have upgraded the implementation to the latest Beam snapshot.
>>>>> And we are able to execute a good subset of the queries in the different
>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>> (upgraded
>>>>> from inProcessDriver) and flink driver and we added a new one for spark.
>>>>>
>>>>> There is still a good amount of work to do and we would like to know if
>>>>> you think that this contribution can have its place into Beam
>>>>> eventually.
>>>>>
>>>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>>>
>>>>> - Rich batch/streaming test
>>>>>
>>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>>> comparison between versions ...)
>>>>>
>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>
>>>>> - Validate beam capability matrix
>>>>>
>>>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>>>> interest).
>>>>>
>>>>> As a final note, we are tracking the issues in the same repo. If someone
>>>>> is interested in contributing, or has more ideas, you are welcome :)
>>>>>
>>>>> Etienne
>>>>>
>>>>>
>>>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
