Re: [web2py] Re: DAL Limitations with aggregate and union queries

2012-02-10 Thread Michele Comitini
Niphlod,

Just out of curiosity.
Is the dal connection pooled?
If in postgres you CLUSTER on the salary index do figures change?

mic

2012/2/10 BrendanC bren...@gmail.com:
 Niphlod,
 Thanks (belatedly) for your valuable  contribution and  your test results.
  The only thing missing from your test is contention for database
 resources/connections and that is what prompted my initial question. In the
 past I worked on a client server application where we had performance issues
 re  getting and releasing database connections (this was against a heavily
 used clinical trials Oracle database).

 I'm very clear on the value of adding indexes to the database - however that
 is a bit tangential to my initial question - which was primarily focused on
 understanding the limitations/boundaries of using the DAL and connection
 management.

 I'm curious about your statement that the DAL will always use the same
 connection. My understanding is that the DAL closes and reopens the database
 connection for each request.  Whether it's the same connection or not may
 not matter -  once you close a db connection it goes back into the pool and
 you may/may not incur a delay in opening new connection. (In a lightly
 loaded system you may never see this issue however).

 I was under the impression that minimizing db connections was desirable to
 prevent thrashing. In the real world this may no longer be an issue. I do
 however recall this being an issue in the past and it's something that is
 seldom mentioned in Web development.

 Finally, just so there is no confusion here - I'm not criticizing web2py or
 the DAL here - my question is equally valid for Rails, Django etc - afaik
 all these frameworks close and reopen connections for each request.

 Again thanks for your thoughtful response to my initial question.

 BrendanC




[web2py] Re: DAL Limitations with aggregate and union queries

2012-02-10 Thread Niphlod
just take a look to 
http://web2py.com/book/default/chapter/06#Connection-pooling for 
understanding how DAL handles connections.

Your initial question was a concern on speed on aggregate and unions, and 
the script you provided clearly use one and only one db connection.

I'm sorry if I misunderstood, but having to aggregate or union involved 
in a query to fetch/show results involves anyway one and only one db 
connections.

I stated (and tested) that the difference in making 2 distinct queries and 
retrieve the results against making one query that returns similar results 
is negligible in this case.

The two queries method involves clearly two roundtrips to the database, 
but on the same connection.

When you use web2py, you can rely on pool_size parameter (as explained 
into the book) to avoid making and closing a connection for every request. 
Here *request* is a user requesting a page, not your application making a 
query to the database. You could do

def index():
for a in range(1000):
result = db(db.emp.id0).select()

in one controller. What will be the behaviour ? User request the index 
page, DAL istantiate a new connection, DAL make 1000 queries to the db 
using the same connection.

If you use no pool_size, then every time a user request the index page, 
the DAL will instantiate a new connection, make 1000 queries, and close the 
connection.
If you use the pool_size parameter, then the DAL fetch a connection from 
the pool, make 1000 queries, and put the connection back into the pool, 
ready to be used in subsequent requests of users going to the index page.

I hope I cleared all your doubts, if not, please ask :D 




Re: [web2py] Re: DAL Limitations with aggregate and union queries

2012-02-10 Thread Michele Comitini
Niphold,

Thanks for the reply.

My first question is about this:
http://www.postgresql.org/docs/8.1/static/sql-cluster.html.  Sorry if
I did not explicit about what I was referring, as it is postgresql
specific.  I was wondering if you would see different figures or not,
because order by is influenced by index clustering (i.e. a presorted
table around that index).

My interest is about the high initial step in postgresql response,
i.e. the response seems a function of  the number of records + an
initial offset.
I was wondering if it was something related to connection and remote
backend initialization.

How the dal pooling works is known to me in depth, due to the fact
that I contributed fix some little bugs there.
Roundtrip of queries are clear to me.
Before psycopg2 there was a psycopg1:  the base of code was written by
me, so I suppose I have done my homework understanding how postgresql
connections work.

mic




2012/2/10 Niphlod niph...@gmail.com:
 just take a look to
 http://web2py.com/book/default/chapter/06#Connection-pooling for
 understanding how DAL handles connections.

 Your initial question was a concern on speed on aggregate and unions, and
 the script you provided clearly use one and only one db connection.

 I'm sorry if I misunderstood, but having to aggregate or union involved
 in a query to fetch/show results involves anyway one and only one db
 connections.

 I stated (and tested) that the difference in making 2 distinct queries and
 retrieve the results against making one query that returns similar results
 is negligible in this case.

 The two queries method involves clearly two roundtrips to the database,
 but on the same connection.

 When you use web2py, you can rely on pool_size parameter (as explained
 into the book) to avoid making and closing a connection for every request.
 Here request is a user requesting a page, not your application making a
 query to the database. You could do

 def index():
     for a in range(1000):
     result = db(db.emp.id0).select()

 in one controller. What will be the behaviour ? User request the index
 page, DAL istantiate a new connection, DAL make 1000 queries to the db using
 the same connection.

 If you use no pool_size, then every time a user request the index page,
 the DAL will instantiate a new connection, make 1000 queries, and close the
 connection.
 If you use the pool_size parameter, then the DAL fetch a connection from
 the pool, make 1000 queries, and put the connection back into the pool,
 ready to be used in subsequent requests of users going to the index page.

 I hope I cleared all your doubts, if not, please ask :D




Re: [web2py] Re: DAL Limitations with aggregate and union queries

2012-02-10 Thread Niphlod
@mcm : I was replying to BrendanC.

On cluster indexes: retried, getting slighly better resultsthis timings 
don't get into account the actual CLUSTER operation, only the queries 
fired after having the table reindexed.

method1 : 20.4900407235 
method2 : 4.55434324348 
method3 : 5.32198321083 

Short version of normal vs clustered indexes debate... they are present 
on a number of production ready engines ... all implementions are a 
little different, but the principles remains the same. Please don't come up 
with but db x instead does that, I'm trying to explain only the key 
points.

When you create an index on a column (let's take the case of a single 
column, the same thing goes for multiple columns in the same index) an 
external resource (take that as an external file) is created, sorted, 
with pointers to the original row in the table. 
When you insert a record, the table gets updated and the column value along 
with the row pointer is inserted into the external resource. The 
external resource should be immediately resorted, but in real world that 
operation is fired looking at db stats and various tuning parameters (e.g. 
re-sort the index only if the order of the resultset vs the phisical order 
of records is inconsistent for the 20% part)

if you create an index with all the table columns you'll end up with a 
bigger external file than the one actually storing the table. Obviously 
this is useless, but there are situations where you have a table with 10 
columns, and you need to access them filtering or ordering on 2-3 columns 
differently: you then create several indexes ( index1 on columns 1,2,3 , 
index2 on columns 4,5,6, etc) and you end up having external indexes for 
that tables that summed up take more space than the table 
itself...Generally the single index is smaller than the table, yet 
sorted and then faster to scan. 
In advanced setups, indexes files and table files are stored in different 
disks, preferring to place the indexes on the faster device available.

Clustered indexes instead force the records in the table to be phisically 
rearranged in the table file. Queries on clustered indexes are generally 
faster, because the db can access the table sequentially and doesn't have 
to check first the external file and then the table one.
However, they were created for two main reasons:
- save space (clustered indexes don't require additional space or very 
little)
- to speed up access when the table is read-intensive but not 
write-intensive

For the intrinsic nature of the clustered index, when a row is updated on 
the column that carries a clustered index, the entire table should be 
rearranged, and as my knowledge in all the dbs this is a blocking 
operation. DBs normally don't do that, for performance reasons: you issue a 
CLUSTER index, table gets blocked, phisically rearranged, and that's it. 
When you update or insert new records, they are normally appended to the 
table. If you want your new records to be phisically rearranged, you 
issue another CLUSTER operation.

Also, you can't get consistent timings on a CLUSTER operation: the length 
of the operation, given that there are no other queries currently involved 
that block the process, take a proportional time to the actual 
adiacency of the records in the table:

table1
idnamesalary
1 jim4
2 john  5
3 ed 6

table2
idnamesalary
1 jim4
2 john  6
3 ed 5

issuing a clustered index on salary on table1 will take less time than 
issuing the same cluster index on table2 it has little meaning 
measuring the creation of the index.

You can't then think to issue a CLUSTER before every query: usually you 
plan to rearrange indexes at specific times or when a certain % of sorted 
records are no more phisically adiacent, basing your decision on the 
degrading statistics of your query: the more records are changed since 
the CLUSTER operation, the more the queries will be slower. You should 
balance the time that takes to issue a CLUSTER operation vs the degrading 
timings of your queries, minding that when you issue a CLUSTER index, noone 
will access that table for the entire time of the phisical rearrangement.

Clustered indexes are a great thing for gigantic tables with millions of 
records accessed for the 90% of the time with ORDER BY queries, tables 
that are not write-intensive.

Consider also that, for example, you want to cluster on a created_on 
field (datetime), but there is a column, let's say active (boolean) you 
filter with, with no other indexes.

e.g. SELECT * from table where active = True ORDER BY created_on

That query is going to perform badly if active=True records are a small 
part of the table anyway. A full scan of the table will be required anyway, 
and the corresponding sort will benefit just a small amount with the 
cluster index in place on created_on.

Again, for any further explanations, just ask 



[web2py] Re: DAL Limitations with aggregate and union queries

2012-02-10 Thread BrendanC

 Here *request* is a user requesting a page, not your application making 
a query to the database.


OK - now that clears up my confusion and it all makes sense- guess I was 
having a YASM (yet another senior momentg) here. (So used to thinking in 
terms of http requests it's  easy to forget the important distinction here).

Thx again for your help.

BrendanC


[web2py] Re: DAL Limitations with aggregate and union queries

2012-02-09 Thread BrendanC
Niphlod,
Thanks (belatedly) for your valuable  contribution and  your test results. 
 The only thing missing from your test is contention for database 
resources/connections and that is what prompted my initial question. In the 
past I worked on a client server application where we had performance 
issues re  getting and releasing database connections (this was against a 
heavily used clinical trials Oracle database). 

I'm very clear on the value of adding indexes to the database - however 
that is a bit tangential to my initial question - which was primarily 
focused on understanding the limitations/boundaries of using the DAL and 
connection management. 

I'm curious about your statement that the DAL will always use the same 
connection. My understanding is that the DAL closes and reopens the 
database connection for each request.  Whether it's the same connection or 
not may not matter -  once you close a db connection it goes back into the 
pool and you may/may not incur a delay in opening new connection. (In a 
lightly loaded system you may never see this issue however).

I was under the impression that minimizing db connections was desirable to 
prevent thrashing. In the real world this may no longer be an issue. I do 
however recall this being an issue in the past and it's something that is 
seldom mentioned in Web development. 

Finally, just so there is no confusion here - I'm not criticizing web2py or 
the DAL here - my question is equally valid for Rails, Django etc - afaik 
all these frameworks close and reopen connections for each request. 

Again thanks for your thoughtful response to my initial question. 

BrendanC




[web2py] Re: DAL Limitations with aggregate and union queries

2012-02-05 Thread Niphlod
well, same old, same old sorry for the introduction, but I work as
a DBA and query tuning/optimization is in front of my eyes for more
than 8 hours a day ;-)

The first thing you learn in this job is try. The second is try. The
third is try. The fourth is understand the complexity. The fifth is
retry ;-)

Actually this is a pretty common request, and, first things first,
relational databases sucks at this kind of queries.

Just to clear out some doubts, I'll put some key points on this
requirement:
- you have a table
- you want only one record for max and one record for min salaries
(e.g. : if jobs and gates have the same salary, extract only one of
them, no matter what)

In addition, I'll clear some more doubts to you: when you use the DAL,
every query is made in the same connection. If you use multiple
queries, they use the same connection (no additional roundtrips for
connections, that have some impact on timings). Once the connection is
established, one can argue that making two distinct queries is anyway
slower that making two queries, and this is generally considered true,
but watch out the tests!!

Let's break some method to achieve you resultset:

method1 (you method):
select name, salary from (select min(salary) as minsal from emp) as x
inner join emp as t on t.salary = x.minsal union select name, salary
from (select max(salary) as maxsal from emp) as x inner join emp as t
on t.salary = x.maxsal

it has a problem... if your table is populated with common salaries
(for example, jobs and gates have a salary of 5), you have more than 2
rows, so the final print statements report incorrect data.

Let's say, for the sake of the discussion, that you have distinct
salaries in your table.

method2:
SELECT * FROM (SELECT  emp.name, emp.salary FROM emp ORDER BY
emp.salary LIMIT 1 OFFSET 0) AS a UNION ALL SELECT * FROM (SELECT
emp.name, emp.salary FROM emp ORDER BY emp.salary DESC LIMIT 1 OFFSET
0) AS b

PS: UNION ALL is used to be sure to return exactly two rows, UNION
filters out the same rows

which can be composed dinamically as:
minquery = db()._select(db.emp.name, db.emp.salary,
orderby=db.emp.salary, limitby=(0,1))
maxquery = db()._select(db.emp.name, db.emp.salary,
orderby=~db.emp.salary, limitby=(0,1))
final_statement = 'SELECT * FROM (%s) AS a UNION ALL SELECT * FROM
(%s) AS b' % (minquery[:-1], maxquery[:-1])

method3:
two distinct queries:
SELECT  emp.name, emp.salary FROM emp ORDER BY emp.salary LIMIT 1
OFFSET 0
SELECT  emp.name, emp.salary FROM emp ORDER BY emp.salary DESC LIMIT 1
OFFSET 0
they can be composed as
minquery = db()._select(db.emp.name, db.emp.salary,
orderby=db.emp.salary, limitby=(0,1))
maxquery = db()._select(db.emp.name, db.emp.salary,
orderby=~db.emp.salary, limitby=(0,1))

So, method2 and method3 are fullfilling your request to have some
way to write this kind of queries abstracted in DAL, in some ways.
method3 pisses you off because of the multiple queries involved.

Always for the sake of discussion, let me test this three methods.
Let's say you want to measure actual db performance, using
db.executesql() for all of the three methods with the plain text query
used for every method (we want to measure db performance without our
results being compromised by the DAL actually abstracting your
queries)

All tests were done initializing the db, and timing only the
db.executesql() function, repeated 1 times. Results, as always
with timeit, are in seconds.

test1: 5 records in the db, sqlite
method1 : 0.867677211761
method2 : 1.0282201767
method3 : 1.17870497704

test2: 5 records in the db, sqlite, column salary indexed
method1 : 0.867677211761
method2 : 0.810430049896
method3 : 0.806806087494

test3: 5000 records in the db, sqlite
method1 : 59.733631134
method2 : 100.970735073
method3 : 100.889001131

test4: 5000 records in the db, sqlite, column salary indexed
method1 : 31.619473
method2 : 0.830200910568
method3 : 0.837182998657

test5: 5 records in the db, postgresql
method1 : 7.11347198486
method2 : 5.63948893547
method3 : 7.27239108086

test6: 5 records in the db, postgresql, column salary indexed
method1 : 7.69179606438
method2 : 5.09459590912
method3 : 6.26593804359

test7: 5000 record in the db, postgresql
method1 : 39.8379509449
method2 : 29.3324680328
method3 : 30.5438249111

test8: 5000 records in the db, postgresql, column salary indexed
method1 : 22.3900408745
method2 : 4.97471690178
method3 : 6.57953190804

So, what do the tests show in general?
A) different engines respond in different ways (i.e. if you are
superoptimizing you have to test and consider the best way associated
with your db engine). I won't be surprised if Mysql or Oracle show
totally different timings
B) different number of records in a table matter. Again, if you are
superoptimizing, you have to consider the expected number of rows in
a table

What do the tests show in this particular case?
A) putting an index in the column you are ordering on always wins
B) scaling the 

[web2py] Re: DAL Limitations with aggregate and union queries

2012-02-05 Thread Simon Lukell
As a 'bystander', I personally think that Niphlod's response is of
such good quality that the gist of it deserves inclusion in the book.


[web2py] Re: DAL Limitations with aggregate and union queries

2012-02-05 Thread Massimo Di Pierro
+1

On Feb 5, 3:20 pm, Simon Lukell sluk...@gmail.com wrote:
 As a 'bystander', I personally think that Niphlod's response is of
 such good quality that the gist of it deserves inclusion in the book.