Re: [web2py] Re: DAL Limitations with aggregate and union queries

Niphlod Fri, 10 Feb 2012 09:43:16 -0800

@mcm : I was replying to BrendanC.

On cluster indexes: retried, getting slighly better results....this timings 
don't get into account the actual "CLUSTER" operation, only the queries 
fired after having the table reindexed.


method1 : 20.4900407235 
method2 : 4.55434324348 
method3 : 5.32198321083 

Short version of "normal vs clustered indexes" debate... they are present 
on a number of "production ready" engines ... all implementions are a 
little different, but the principles remains the same. Please don't come up 
with "but db x instead does that", I'm trying to explain only the key 
points.

When you create an index on a column (let's take the case of a single 
column, the same thing goes for multiple columns in the same index) an 
"external" resource (take that as an external "file") is created, sorted, 
with pointers to the original row in the table. 
When you insert a record, the table gets updated and the column value along 
with the row pointer is "inserted" into the "external" resource. The 
external resource should be immediately resorted, but in real world that 
operation is fired looking at db stats and various tuning parameters (e.g. 
re-sort the index only if the order of the resultset vs the phisical order 
of records is inconsistent for the 20% part)

if you create an index with all the table columns you'll end up with a 
bigger "external" file than the one actually storing the table. Obviously 
this is useless, but there are situations where you have a table with 10 
columns, and you need to access them filtering or ordering on 2-3 columns 
differently: you then create several indexes ( index1 on columns 1,2,3 , 
index2 on columns 4,5,6, etc) and you end up having external "indexes" for 
that tables that summed up take more space than the table 
itself.......Generally the single index is smaller than the table, yet 
sorted and then faster to "scan". 
In advanced setups, indexes files and table files are stored in different 
disks, preferring to place the indexes on the faster device available.

Clustered indexes instead "force" the records in the table to be phisically 
rearranged in the "table file". Queries on clustered indexes are generally 
faster, because the db can access the table sequentially and doesn't have 
to check first the "external file" and then the table one.
However, they were created for two main reasons:
- save space (clustered indexes don't require additional space or very 
little)
- to speed up access when the table is read-intensive but not 
write-intensive

For the intrinsic nature of the clustered index, when a row is updated on 
the column that carries a clustered index, the entire table should be 
rearranged, and as my knowledge in all the dbs this is a blocking 
operation. DBs normally don't do that, for performance reasons: you issue a 
CLUSTER index, table gets blocked, phisically rearranged, and that's it. 
When you update or insert new records, they are normally appended to the 
table. If you want your "new" records to be phisically rearranged, you 
issue another CLUSTER operation.

Also, you can't get consistent timings on a CLUSTER operation: the length 
of the operation, given that there are no other queries currently involved 
that block the process, take a "proportional time" to the actual 
"adiacency" of the records in the table:

table1
id    name    salary
1     jim        4
2     john      5
3     ed         6

table2
id    name    salary
1     jim        4
2     john      6
3     ed         5

issuing a clustered index on salary on table1 will take less time than 
issuing the same cluster index on table2.... it has little meaning 
measuring the creation of the index.

You can't then think to issue a CLUSTER before every query: usually you 
"plan" to rearrange indexes at specific times or when a certain % of sorted 
records are no more phisically adiacent, basing your decision on the 
"degrading" statistics of your query: the more records are changed since 
the CLUSTER operation, the more the queries will be slower. You should 
balance the time that takes to issue a CLUSTER operation vs the degrading 
timings of your queries, minding that when you issue a CLUSTER index, noone 
will access that table for the entire time of the phisical rearrangement.

Clustered indexes are a great thing for gigantic tables with millions of 
records accessed for the 90% of the time with "ORDER BY" queries, tables 
that are not write-intensive.

Consider also that, for example, you want to cluster on a "created_on" 
field (datetime), but there is a column, let's say "active" (boolean) you 
filter with, with no other indexes.

e.g. SELECT * from table where active = True ORDER BY created_on

That query is going to perform badly if "active=True" records are a small 
part of the table anyway. A full scan of the table will be required anyway, 
and the corresponding sort will benefit just a small amount with the 
cluster index in place on created_on.

Again, for any further explanations, just ask

Re: [web2py] Re: DAL Limitations with aggregate and union queries

Reply via email to