Re: [web2py] Re: DAL Limitations with aggregate and union queries
Niphlod, Just out of curiosity: is the DAL connection pooled? If in postgres you CLUSTER on the salary index, do the figures change? mic 2012/2/10 BrendanC bren...@gmail.com:
[web2py] Re: DAL Limitations with aggregate and union queries
Just take a look at http://web2py.com/book/default/chapter/06#Connection-pooling to understand how the DAL handles connections. Your initial question was a concern about the speed of aggregates and unions, and the script you provided clearly uses one and only one db connection. I'm sorry if I misunderstood, but having an aggregate or union involved in a query to fetch/show results still involves one and only one db connection. I stated (and tested) that the difference between making two distinct queries to retrieve the results and making one query that returns similar results is negligible in this case. The two-queries method clearly involves two roundtrips to the database, but on the same connection. When you use web2py, you can rely on the pool_size parameter (as explained in the book) to avoid opening and closing a connection for every request. Here *request* is a user requesting a page, not your application making a query to the database. You could put

def index():
    for a in range(1000):
        result = db(db.emp.id > 0).select()

in one controller. What will the behaviour be? The user requests the index page, the DAL instantiates a new connection, and the DAL makes 1000 queries to the db using that same connection. If you use no pool_size, then every time a user requests the index page, the DAL will instantiate a new connection, make 1000 queries, and close the connection. If you use the pool_size parameter, then the DAL fetches a connection from the pool, makes 1000 queries, and puts the connection back into the pool, ready to be used in subsequent requests from users going to the index page. I hope I cleared all your doubts; if not, please ask :D
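The pooling behaviour described above can be sketched outside web2py with a toy pool built on sqlite3. To be clear, `SimplePool` and `handle_request` are made-up names for illustration, not the DAL's actual implementation:

```python
import sqlite3

class SimplePool:
    """Toy connection pool: an illustration only, not the DAL's real pool."""
    def __init__(self, dbname, size=5):
        self.dbname = dbname
        self.size = size
        self.free = []  # idle connections, up to `size`

    def get(self):
        # Reuse an idle connection if one exists, else open a new one
        return self.free.pop() if self.free else sqlite3.connect(self.dbname)

    def put(self, conn):
        # Return the connection to the pool instead of closing it
        if len(self.free) < self.size:
            self.free.append(conn)
        else:
            conn.close()

pool = SimplePool(":memory:")

def handle_request(pool):
    conn = pool.get()        # one connection per *page request*...
    cur = conn.cursor()
    for _ in range(1000):    # ...shared by all 1000 queries in that request
        cur.execute("SELECT 1")
    pool.put(conn)           # back to the pool for the next request

handle_request(pool)
```

The point the sketch makes is the same as above: the 1000 queries share one connection, and with pooling the connection survives across requests instead of being reopened each time.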
Re: [web2py] Re: DAL Limitations with aggregate and union queries
Niphlod, Thanks for the reply. My first question was about this: http://www.postgresql.org/docs/8.1/static/sql-cluster.html. Sorry I was not explicit about what I was referring to, as it is postgresql specific. I was wondering whether you would see different figures or not, because ORDER BY is influenced by index clustering (i.e. a table presorted around that index). My interest is in the high initial step in the postgresql response, i.e. the response seems to be a function of the number of records plus an initial offset. I was wondering if it was something related to connection and remote backend initialization. How the DAL pooling works is known to me in depth, due to the fact that I contributed fixes for some little bugs there. Query roundtrips are clear to me. Before psycopg2 there was psycopg1: its code base was written by me, so I suppose I have done my homework understanding how postgresql connections work. mic 2012/2/10 Niphlod niph...@gmail.com:
Re: [web2py] Re: DAL Limitations with aggregate and union queries
@mcm: I was replying to BrendanC. On clustered indexes: I retried, getting slightly better results. These timings don't take into account the actual CLUSTER operation, only the queries fired after the table had been reindexed.

method1 : 20.4900407235
method2 : 4.55434324348
method3 : 5.32198321083

Short version of the normal vs. clustered indexes debate... they are present in a number of production-ready engines. All implementations are a little different, but the principles remain the same. Please don't come up with "but db x instead does that"; I'm trying to explain only the key points. When you create an index on a column (let's take the case of a single column; the same goes for multiple columns in the same index), an external resource (think of it as an external file) is created, sorted, with pointers to the original rows in the table. When you insert a record, the table gets updated and the column value, along with the row pointer, is inserted into the external resource. The external resource should be immediately re-sorted, but in the real world that operation is fired based on db stats and various tuning parameters (e.g. re-sort the index only if the order of the resultset vs. the physical order of records is inconsistent for more than 20% of it). If you create an index with all the table's columns, you'll end up with an external file bigger than the one actually storing the table. Obviously this is useless, but there are situations where you have a table with 10 columns and you need to access it filtering or ordering on 2-3 columns differently: you then create several indexes (index1 on columns 1,2,3; index2 on columns 4,5,6; etc.) and you end up having external indexes for that table that, summed up, take more space than the table itself. Generally a single index is smaller than the table, yet sorted and therefore faster to scan. In advanced setups, index files and table files are stored on different disks, preferring to place the indexes on the fastest device available.
Clustered indexes instead force the records in the table to be physically rearranged in the table file. Queries on clustered indexes are generally faster, because the db can access the table sequentially and doesn't have to check the external file first and then the table file. However, they were created for two main reasons:
- to save space (clustered indexes require no additional space, or very little)
- to speed up access when the table is read-intensive but not write-intensive
Due to the intrinsic nature of the clustered index, when a row is updated on the column that carries a clustered index, the entire table should be rearranged, and to my knowledge this is a blocking operation in all dbs. DBs normally don't do that, for performance reasons: you issue a CLUSTER on an index, the table gets blocked and physically rearranged, and that's it. When you update or insert new records, they are simply appended to the table. If you want your new records to be physically rearranged, you issue another CLUSTER operation. Also, you can't get consistent timings on a CLUSTER operation: its length, assuming no other queries are currently blocking the process, is proportional to the actual adjacency of the records in the table:

table1
id  name  salary
1   jim   4
2   john  5
3   ed    6

table2
id  name  salary
1   jim   4
2   john  6
3   ed    5

Issuing a clustered index on salary on table1 will take less time than issuing the same clustered index on table2, so there is little point in measuring the creation of the index. You can't then think of issuing a CLUSTER before every query: usually you plan to rearrange indexes at specific times, or when a certain % of sorted records are no longer physically adjacent, basing your decision on the degrading statistics of your queries: the more records have changed since the CLUSTER operation, the slower the queries will be.
You should balance the time it takes to issue a CLUSTER operation against the degrading timings of your queries, minding that while a CLUSTER on an index is running, no one can access that table for the entire duration of the physical rearrangement. Clustered indexes are a great thing for gigantic tables with millions of records that are accessed 90% of the time with ORDER BY queries and are not write-intensive. Consider also that, for example, you may want to cluster on a created_on field (datetime), but there is a column, let's say active (boolean), that you filter on, with no other indexes, e.g.:

SELECT * FROM table WHERE active = True ORDER BY created_on

That query is going to perform badly even if active=True records are a small part of the table. A full scan of the table will be required anyway, and the corresponding sort will benefit only a small amount from the cluster index in place on created_on. Again, for any further explanations, just ask
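The last point can be checked against a query planner. A minimal sketch with sqlite3 (whose planner differs from PostgreSQL's, so this is only a rough analogue; the table and column names are made up for the demo) shows that with an index only on created_on, the engine still has to scan and test every row for the active filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, active BOOLEAN, created_on TEXT)")
cur.execute("CREATE INDEX t_created_on ON t (created_on)")  # the ordering column

# No index on `active`: the planner may walk t_created_on to avoid a sort,
# but it still visits every row to evaluate active = 1, i.e. a full scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE active = 1 ORDER BY created_on"
).fetchall()
for row in plan:
    print(row[3])  # the human-readable detail column of the plan
```

Seeing SCAN (rather than SEARCH on an index over the filter column) in the detail output is the planner's way of saying the whole table is read, which is the degradation described above.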
[web2py] Re: DAL Limitations with aggregate and union queries
Here *request* is a user requesting a page, not your application making a query to the database. OK - now that clears up my confusion and it all makes sense - guess I was having a YASM (yet another senior moment) here. (So used to thinking in terms of HTTP requests that it's easy to forget the important distinction here.) Thx again for your help. BrendanC
[web2py] Re: DAL Limitations with aggregate and union queries
Niphlod, Thanks (belatedly) for your valuable contribution and your test results. The only thing missing from your test is contention for database resources/connections, and that is what prompted my initial question. In the past I worked on a client-server application where we had performance issues getting and releasing database connections (this was against a heavily used clinical-trials Oracle database). I'm very clear on the value of adding indexes to the database - however, that is a bit tangential to my initial question, which was primarily focused on understanding the limitations/boundaries of using the DAL and connection management. I'm curious about your statement that the DAL will always use the same connection. My understanding is that the DAL closes and reopens the database connection for each request. Whether it's the same connection or not may not matter - once you close a db connection it goes back into the pool, and you may or may not incur a delay in opening a new connection. (In a lightly loaded system you may never see this issue, however.) I was under the impression that minimizing db connections was desirable to prevent thrashing. In the real world this may no longer be an issue. I do, however, recall this being an issue in the past, and it's something that is seldom mentioned in Web development. Finally, just so there is no confusion here - I'm not criticizing web2py or the DAL here - my question is equally valid for Rails, Django etc. - afaik all these frameworks close and reopen connections for each request. Again, thanks for your thoughtful response to my initial question. BrendanC
[web2py] Re: DAL Limitations with aggregate and union queries
Well, same old, same old. Sorry for the introduction, but I work as a DBA and query tuning/optimization is in front of my eyes for more than 8 hours a day ;-) The first thing you learn in this job is: try. The second is: try. The third is: try. The fourth is: understand the complexity. The fifth is: retry ;-) Actually this is a pretty common request and, first things first, relational databases suck at this kind of query. Just to clear out some doubts, I'll put down some key points on this requirement:
- you have a table
- you want only one record for the max and one record for the min salary (e.g.: if jobs and gates have the same salary, extract only one of them, no matter which)
In addition, I'll clear up some more doubts for you: when you use the DAL, every query is made on the same connection. If you use multiple queries, they use the same connection (no additional roundtrips for connections, which have some impact on timings). Once the connection is established, one can argue that making two distinct queries is anyway slower than making one query, and this is generally considered true, but watch out for the tests!! Let's break down some methods to achieve your resultset.

method1 (your method):

select name, salary from (select min(salary) as minsal from emp) as x inner join emp as t on t.salary = x.minsal
union
select name, salary from (select max(salary) as maxsal from emp) as x inner join emp as t on t.salary = x.maxsal

It has a problem... if your table is populated with common salaries (for example, jobs and gates both have a salary of 5), you get more than 2 rows, so the final print statements report incorrect data. Let's say, for the sake of the discussion, that you have distinct salaries in your table.
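The duplicate-salary caveat is easy to reproduce. A sketch with plain sqlite3 and made-up sample data (standing in for the DAL and the real emp table): when two employees share the minimum salary, method1 returns more than two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
# 'jobs' and 'gates' share the minimum salary of 5
cur.executemany("INSERT INTO emp VALUES (?, ?)",
                [("jobs", 5), ("gates", 5), ("ellison", 9)])

# method1: join back on the min/max aggregate values
rows = cur.execute("""
    SELECT name, salary FROM (SELECT min(salary) AS minsal FROM emp) AS x
    INNER JOIN emp AS t ON t.salary = x.minsal
    UNION
    SELECT name, salary FROM (SELECT max(salary) AS maxsal FROM emp) AS x
    INNER JOIN emp AS t ON t.salary = x.maxsal
""").fetchall()
print(rows)  # three rows, not two: both minimum-salary employees appear
```

Code that assumes exactly one min row and one max row in the resultset will therefore misreport on data like this.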
method2:

SELECT * FROM (SELECT emp.name, emp.salary FROM emp ORDER BY emp.salary LIMIT 1 OFFSET 0) AS a
UNION ALL
SELECT * FROM (SELECT emp.name, emp.salary FROM emp ORDER BY emp.salary DESC LIMIT 1 OFFSET 0) AS b

PS: UNION ALL is used to be sure to return exactly two rows; UNION filters out identical rows. It can be composed dynamically as:

minquery = db()._select(db.emp.name, db.emp.salary, orderby=db.emp.salary, limitby=(0,1))
maxquery = db()._select(db.emp.name, db.emp.salary, orderby=~db.emp.salary, limitby=(0,1))
final_statement = 'SELECT * FROM (%s) AS a UNION ALL SELECT * FROM (%s) AS b' % (minquery[:-1], maxquery[:-1])

method3: two distinct queries:

SELECT emp.name, emp.salary FROM emp ORDER BY emp.salary LIMIT 1 OFFSET 0
SELECT emp.name, emp.salary FROM emp ORDER BY emp.salary DESC LIMIT 1 OFFSET 0

They can be composed as:

minquery = db()._select(db.emp.name, db.emp.salary, orderby=db.emp.salary, limitby=(0,1))
maxquery = db()._select(db.emp.name, db.emp.salary, orderby=~db.emp.salary, limitby=(0,1))

So, method2 and method3 fulfil your request to have some way to write this kind of query abstracted in the DAL, in some ways. method3 annoys you because of the multiple queries involved. Always for the sake of discussion, let me test these three methods. Let's say you want to measure actual db performance, using db.executesql() for all three methods with the plain-text query used for each method (we want to measure db performance without our results being compromised by the DAL abstracting your queries). All tests were done initializing the db and timing only the db.executesql() function, repeated 1 times. Results, as always with timeit, are in seconds.
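Before the timings, methods 2 and 3 can be checked side by side for correctness. A sketch with plain sqlite3 and toy data (rather than the DAL or PostgreSQL), using the same LIMIT/OFFSET shape as the generated SQL above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
cur.executemany("INSERT INTO emp VALUES (?, ?)",
                [("jim", 4), ("john", 5), ("ed", 6)])

# method2: one statement, UNION ALL of the two ordered single-row subselects
method2 = cur.execute("""
    SELECT * FROM (SELECT emp.name, emp.salary FROM emp
                   ORDER BY emp.salary LIMIT 1 OFFSET 0) AS a
    UNION ALL
    SELECT * FROM (SELECT emp.name, emp.salary FROM emp
                   ORDER BY emp.salary DESC LIMIT 1 OFFSET 0) AS b
""").fetchall()

# method3: two separate queries on the same connection (two roundtrips)
low = cur.execute(
    "SELECT name, salary FROM emp ORDER BY salary LIMIT 1 OFFSET 0").fetchone()
high = cur.execute(
    "SELECT name, salary FROM emp ORDER BY salary DESC LIMIT 1 OFFSET 0").fetchone()

print(method2)      # [('jim', 4), ('ed', 6)]
print([low, high])  # [('jim', 4), ('ed', 6)]
```

Both methods yield the same min/max pair; the difference the timings below measure is purely one statement versus two roundtrips on one connection.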
test1: 5 records in the db, sqlite
method1 : 0.867677211761
method2 : 1.0282201767
method3 : 1.17870497704

test2: 5 records in the db, sqlite, column salary indexed
method1 : 0.867677211761
method2 : 0.810430049896
method3 : 0.806806087494

test3: 5000 records in the db, sqlite
method1 : 59.733631134
method2 : 100.970735073
method3 : 100.889001131

test4: 5000 records in the db, sqlite, column salary indexed
method1 : 31.619473
method2 : 0.830200910568
method3 : 0.837182998657

test5: 5 records in the db, postgresql
method1 : 7.11347198486
method2 : 5.63948893547
method3 : 7.27239108086

test6: 5 records in the db, postgresql, column salary indexed
method1 : 7.69179606438
method2 : 5.09459590912
method3 : 6.26593804359

test7: 5000 records in the db, postgresql
method1 : 39.8379509449
method2 : 29.3324680328
method3 : 30.5438249111

test8: 5000 records in the db, postgresql, column salary indexed
method1 : 22.3900408745
method2 : 4.97471690178
method3 : 6.57953190804

So, what do the tests show in general?
A) Different engines respond in different ways (i.e. if you are super-optimizing, you have to test and consider the best way for your db engine). I wouldn't be surprised if MySQL or Oracle showed totally different timings.
B) The number of records in a table matters. Again, if you are super-optimizing, you have to consider the expected number of rows in a table.

What do the tests show in this particular case?
A) Putting an index on the column you are ordering on always wins.
B) scaling the
[web2py] Re: DAL Limitations with aggregate and union queries
As a 'bystander', I personally think that Niphlod's response is of such good quality that the gist of it deserves inclusion in the book.
[web2py] Re: DAL Limitations with aggregate and union queries
+1 On Feb 5, 3:20 pm, Simon Lukell sluk...@gmail.com wrote: As a 'bystander', I personally think that Niphlod's response is of such good quality that the gist of it deserves inclusion in the book.