Roman,

Probably you are right that the SQL benchmarks need to be incorporated into the existing 'ignite-benchmarks' module. Please disregard my suggestion of setting up a dedicated repository.
- Denis

On Sat, May 23, 2020 at 6:12 AM Roman Kondakov <kondako...@mail.ru.invalid> wrote:

> Hi Denis,
>
> I'm not sure we need a separate repository for it. What would be the
> benefit of using a separate repo?
>
> BTW I noticed that Ignite has an `ignite-benchmarks` module. It contains
> JMH/JOL benchmarks for now. We can also put the SQL benchmarks into this
> module. What do you think?
>
> --
> Kind Regards
> Roman Kondakov
>
> On 22.05.2020 22:36, Denis Magda wrote:
> > Hi Roman,
> >
> > +1 for sure. On a side note, should we create a separate ASF/Git
> > repository for the project? Not sure we need to put the suite in the
> > main Ignite repo.
> >
> > -
> > Denis
> >
> > On Fri, May 22, 2020 at 8:54 AM Roman Kondakov
> > <kondako...@mail.ru.invalid> wrote:
> >
> >> Hi everybody!
> >>
> >> Currently Ignite doesn't have the ability to detect SQL performance
> >> regressions between different versions. We have a Yardstick benchmark
> >> module, but it has several drawbacks:
> >> - it doesn't compare different Ignite versions
> >> - it doesn't check the query result
> >> - it doesn't have the ability to execute randomized SQL queries (aka
> >> fuzzy testing)
> >>
> >> So, Yardstick is not very helpful for detecting SQL performance
> >> regressions.
> >>
> >> I think we need a brand-new framework for this task, and I propose to
> >> implement it by adopting the ideas from the Apollo tool paper [1].
> >> The Apollo pipeline works like this:
> >>
> >> 1. Apollo starts two different versions of the database simultaneously.
> >> 2. Apollo then populates both with the same dataset.
> >> 3. Apollo generates random SQL queries using an external library (e.g.
> >> SQLSmith [2]).
> >> 4. Each query is executed in both database versions. Execution time is
> >> measured by the framework.
> >> 5. If the execution time difference for the same query exceeds some
> >> threshold (say, 2x slower), the query is logged.
> >> 6. Apollo then tries to simplify the problematic queries in order to
> >> obtain a minimal reproducer.
> >> 7. Apollo can also automatically perform a binary search over the git
> >> history to find the bad commit.
> >> 8. It can also localize the root cause of a regression by carrying out
> >> statistical debugging.
> >>
> >> I don't think we have to implement all of these Apollo steps. The
> >> first 4 steps will be enough for our needs.
> >>
> >> My proposal is to create a new module called 'sql-testing'. We need a
> >> separate module because it should be suitable for both query engines:
> >> the H2-based one and the upcoming Calcite-based one. This module will
> >> contain a test suite which works in the following way:
> >> 1. It starts two Ignite clusters with different versions (the current
> >> version and the previous release version).
> >> 2. The framework then runs randomly generated queries in both clusters
> >> and checks the execution time in each cluster. We need to port the
> >> SQLSmith [2] library from C++ to Java for this step, but initially we
> >> can start with a set of hardcoded queries and postpone the SQLSmith
> >> port. Randomized queries can be added later.
> >> 3. All problematic queries are then reported as performance issues, so
> >> that we can examine the problems manually.
> >>
> >> This tool will bring a certain amount of robustness to our SQL layer,
> >> as well as some confidence in the absence of SQL query regressions.
> >>
> >> What do you think?
> >>
> >> [1] http://www.vldb.org/pvldb/vol13/p57-jung.pdf
> >> [2] https://github.com/anse1/sqlsmith
> >>
> >> --
> >> Kind Regards
> >> Roman Kondakov
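The core comparison described in the thread (time the same query on two versions, flag it when the newer version exceeds a slowdown threshold such as 2x) could be sketched in Java roughly as below. This is only an illustration of the idea, not an existing Ignite API: the `RegressionCheck` class, the `Runnable`-based query runners, and the 2.0 threshold are all assumptions made up for this sketch.

```java
/**
 * Minimal sketch of the differential timing check from the proposal
 * (Apollo steps 4-5): run a query on a baseline version and a candidate
 * version, and flag it as a potential regression when the candidate is
 * more than THRESHOLD times slower. All names here are illustrative.
 */
public class RegressionCheck {
    /** Factor by which the candidate may be slower before the query is flagged. */
    static final double THRESHOLD = 2.0;

    /** Measures the wall-clock time of a single query execution, in nanoseconds. */
    static long timeNanos(Runnable query) {
        long start = System.nanoTime();
        query.run();
        return System.nanoTime() - start;
    }

    /** Returns true if candidateNanos exceeds THRESHOLD * baselineNanos. */
    static boolean isRegression(long baselineNanos, long candidateNanos) {
        return candidateNanos > THRESHOLD * baselineNanos;
    }
}
```

In a real suite the two `Runnable`s would wrap the same SQL text submitted to the two Ignite clusters, and each query would be timed over several warmed-up iterations (taking, say, the median) rather than a single run, since a single `System.nanoTime` sample is too noisy to compare versions reliably.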