@mcm : I was replying to BrendanC. On cluster indexes: retried, getting slighly better results....this timings don't get into account the actual "CLUSTER" operation, only the queries fired after having the table reindexed.
method1 : 20.4900407235 method2 : 4.55434324348 method3 : 5.32198321083 Short version of "normal vs clustered indexes" debate... they are present on a number of "production ready" engines ... all implementions are a little different, but the principles remains the same. Please don't come up with "but db x instead does that", I'm trying to explain only the key points. When you create an index on a column (let's take the case of a single column, the same thing goes for multiple columns in the same index) an "external" resource (take that as an external "file") is created, sorted, with pointers to the original row in the table. When you insert a record, the table gets updated and the column value along with the row pointer is "inserted" into the "external" resource. The external resource should be immediately resorted, but in real world that operation is fired looking at db stats and various tuning parameters (e.g. re-sort the index only if the order of the resultset vs the phisical order of records is inconsistent for the 20% part) if you create an index with all the table columns you'll end up with a bigger "external" file than the one actually storing the table. Obviously this is useless, but there are situations where you have a table with 10 columns, and you need to access them filtering or ordering on 2-3 columns differently: you then create several indexes ( index1 on columns 1,2,3 , index2 on columns 4,5,6, etc) and you end up having external "indexes" for that tables that summed up take more space than the table itself.......Generally the single index is smaller than the table, yet sorted and then faster to "scan". In advanced setups, indexes files and table files are stored in different disks, preferring to place the indexes on the faster device available. Clustered indexes instead "force" the records in the table to be phisically rearranged in the "table file". Queries on clustered indexes are generally faster, because the db can access the table sequentially and doesn't have to check first the "external file" and then the table one. However, they were created for two main reasons: - save space (clustered indexes don't require additional space or very little) - to speed up access when the table is read-intensive but not write-intensive For the intrinsic nature of the clustered index, when a row is updated on the column that carries a clustered index, the entire table should be rearranged, and as my knowledge in all the dbs this is a blocking operation. DBs normally don't do that, for performance reasons: you issue a CLUSTER index, table gets blocked, phisically rearranged, and that's it. When you update or insert new records, they are normally appended to the table. If you want your "new" records to be phisically rearranged, you issue another CLUSTER operation. Also, you can't get consistent timings on a CLUSTER operation: the length of the operation, given that there are no other queries currently involved that block the process, take a "proportional time" to the actual "adiacency" of the records in the table: table1 id name salary 1 jim 4 2 john 5 3 ed 6 table2 id name salary 1 jim 4 2 john 6 3 ed 5 issuing a clustered index on salary on table1 will take less time than issuing the same cluster index on table2.... it has little meaning measuring the creation of the index. You can't then think to issue a CLUSTER before every query: usually you "plan" to rearrange indexes at specific times or when a certain % of sorted records are no more phisically adiacent, basing your decision on the "degrading" statistics of your query: the more records are changed since the CLUSTER operation, the more the queries will be slower. You should balance the time that takes to issue a CLUSTER operation vs the degrading timings of your queries, minding that when you issue a CLUSTER index, noone will access that table for the entire time of the phisical rearrangement. Clustered indexes are a great thing for gigantic tables with millions of records accessed for the 90% of the time with "ORDER BY" queries, tables that are not write-intensive. Consider also that, for example, you want to cluster on a "created_on" field (datetime), but there is a column, let's say "active" (boolean) you filter with, with no other indexes. e.g. SELECT * from table where active = True ORDER BY created_on That query is going to perform badly if "active=True" records are a small part of the table anyway. A full scan of the table will be required anyway, and the corresponding sort will benefit just a small amount with the cluster index in place on created_on. Again, for any further explanations, just ask