[web2py] Re: DAL Limitations with aggregate and union queries

Niphlod Sun, 05 Feb 2012 07:51:31 -0800

well, same old, same old.... sorry for the introduction, but I work as
a DBA and query tuning/optimization is in front of my eyes for more
than 8 hours a day ;-)


The first thing you learn in this job is try. The second is try. The
third is try. The fourth is "understand the complexity". The fifth is
retry ;-)

Actually this is a pretty common request, and, first things first,
relational databases sucks at this kind of queries.

Just to clear out some doubts, I'll put some key points on this
requirement:
- you have a table
- you want only one record for max and one record for min salaries
(e.g. : if jobs and gates have the same salary, extract only one of
them, no matter what)

In addition, I'll clear some more doubts to you: when you use the DAL,
every query is made in the same connection. If you use multiple
queries, they use the same connection (no additional roundtrips for
connections, that have some impact on timings). Once the connection is
established, one can argue that making two distinct queries is anyway
slower that making two queries, and this is generally considered true,
but watch out the tests!!

Let's break some method to achieve you resultset:

method1 (you method):
select name, salary from (select min(salary) as minsal from emp) as x
inner join emp as t on t.salary = x.minsal union select name, salary
from (select max(salary) as maxsal from emp) as x inner join emp as t
on t.salary = x.maxsal

it has a problem... if your table is populated with common salaries
(for example, jobs and gates have a salary of 5), you have more than 2
rows, so the final print statements report incorrect data.

Let's say, for the sake of the discussion, that you have distinct
salaries in your table.

method2:
SELECT * FROM (SELECT  emp.name, emp.salary FROM emp ORDER BY
emp.salary LIMIT 1 OFFSET 0) AS a UNION ALL SELECT * FROM (SELECT
emp.name, emp.salary FROM emp ORDER BY emp.salary DESC LIMIT 1 OFFSET
0) AS b

PS: UNION ALL is used to be sure to return exactly two rows, UNION
"filters out" the same rows

which can be composed "dinamically" as:
    minquery = db()._select(db.emp.name, db.emp.salary,
orderby=db.emp.salary, limitby=(0,1))
    maxquery = db()._select(db.emp.name, db.emp.salary,
orderby=~db.emp.salary, limitby=(0,1))
    final_statement = 'SELECT * FROM (%s) AS a UNION ALL SELECT * FROM
(%s) AS b' % (minquery[:-1], maxquery[:-1])

method3:
two distinct queries:
SELECT  emp.name, emp.salary FROM emp ORDER BY emp.salary LIMIT 1
OFFSET 0
SELECT  emp.name, emp.salary FROM emp ORDER BY emp.salary DESC LIMIT 1
OFFSET 0
they can be composed as
minquery = db()._select(db.emp.name, db.emp.salary,
orderby=db.emp.salary, limitby=(0,1))
maxquery = db()._select(db.emp.name, db.emp.salary,
orderby=~db.emp.salary, limitby=(0,1))

So, method2 and method3 are "fullfilling" your request to have some
way to write this kind of queries abstracted in DAL, in some ways.
method3 pisses you off because of the "multiple queries" involved.

Always for the sake of discussion, let me test this three methods.
Let's say you want to measure actual db performance, using
db.executesql() for all of the three methods with the plain text query
used for every method (we want to measure db performance without our
results being "compromised" by the DAL actually abstracting your
queries)

All tests were done initializing the db, and timing only the
db.executesql() function, repeated 10000 times. Results, as always
with timeit, are in seconds.

test1: 5 records in the db, sqlite
method1 : 0.867677211761
method2 : 1.0282201767
method3 : 1.17870497704

test2: 5 records in the db, sqlite, column salary indexed
method1 : 0.867677211761
method2 : 0.810430049896
method3 : 0.806806087494

test3: 5000 records in the db, sqlite
method1 : 59.733631134
method2 : 100.970735073
method3 : 100.889001131

test4: 5000 records in the db, sqlite, column salary indexed
method1 : 31.1111619473
method2 : 0.830200910568
method3 : 0.837182998657

test5: 5 records in the db, postgresql
method1 : 7.11347198486
method2 : 5.63948893547
method3 : 7.27239108086

test6: 5 records in the db, postgresql, column salary indexed
method1 : 7.69179606438
method2 : 5.09459590912
method3 : 6.26593804359

test7: 5000 record in the db, postgresql
method1 : 39.8379509449
method2 : 29.3324680328
method3 : 30.5438249111

test8: 5000 records in the db, postgresql, column salary indexed
method1 : 22.3900408745
method2 : 4.97471690178
method3 : 6.57953190804

So, what do the tests show in general?
A) different engines respond in different ways (i.e. if you are
superoptimizing you have to test and consider the best way associated
with your db engine). I won't be surprised if Mysql or Oracle show
totally different timings....
B) different number of records in a table matter. Again, if you are
superoptimizing, you have to consider the "expected" number of rows in
a table

What do the tests show in this particular case?
A) putting an index in the column you are ordering on "always wins"
B) scaling the number of records, method2 and method3 seems to deliver
best performances than method1
C) making two distinct queries takes more time than one, but the
difference is very little

And don't forget, we are 'seeing' differences that matters in a
repetition of 10000 times. In real world scenarios, the whole "why
doesn't the DAL allow to construct a query that fires one time only to
save time" isn't that important ;-D

Last but not least, I don't know how much salaries varies during
time....if you're actually superoptimizing, there are alternative ways
to scale out... just a bunch of them:
- let's say that salaries varies one time a week, you could query the
first time and cache that value for a week
- let's say that salaries table is "read" intensive but not "write"
intensive, you can prepare a trigger (on insert and on update) that
updates another table with two rows only, and you query this table for
min and max

Sorry for the long post, feel free to ask if something is wrong.

[web2py] Re: DAL Limitations with aggregate and union queries

Reply via email to