Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
Hello Adam

Adam Wolff wrote:
> Actually runs through the table four times instead of twice, and maybe
> can't even use the index for the whole query.

Assuming my results are not typical of MySQL query times, this would explain the sqrt() relationship of returned rows to query time.

I have tried your suggestion of using a sub-query and have had trouble getting the syntax right. On using EXPLAIN, it seems that 4 bytes of the index (either lat or lon) are being used, with a brute-force search on the index for the other constraint. If the query is returning 25600 points from a 100M dataset, it is brute-force searching through 1.6M records in the second part of the index. If the alternative is creating two 1.6M lists and then looking for commonalities, it may be faster to instead use one 1.6M-item list and brute-force the remaining constraint.

I have received suggestions to use spatial indexes, which I am looking into. Alternatively, I could optimise queries by creating multiple slices of the data set across one axis, then use a key on the other axis. The MySQL 5.1 partitioning scheme may help.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
Well, I hadn't known about the spatial features of MySQL. If you're OK using vendor extensions then that definitely looks like the way to go: http://dev.mysql.com/doc/refman/5.0/en/gis-introduction.html

A
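For reference, a minimal sketch of what the GIS route mentioned above might look like. This assumes a MyISAM table (SPATIAL INDEX requires MyISAM in MySQL 5.0) and the 5.0-era GeomFromText/MBRContains functions; the table and column names are illustrative, not from the actual openstreetmap schema:

```sql
-- Sketch only: hypothetical table storing each point as a POINT geometry
-- with an R-tree spatial index instead of separate lat/lon b-trees.
CREATE TABLE point_test (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    pt POINT NOT NULL,
    SPATIAL INDEX (pt)
) ENGINE=MyISAM;

INSERT INTO point_test VALUES
    (1, GeomFromText('POINT(-1.9 52.5)'));

-- Bounding-box query answered through the spatial index: MySQL 5.0
-- only uses the spatial index when an MBR function appears in WHERE.
SELECT id, AsText(pt)
FROM point_test
WHERE MBRContains(
    GeomFromText('POLYGON((-2 52, -1 52, -1 53, -2 53, -2 52))'),
    pt);
```

The point of the sketch is that one R-tree lookup replaces the two independent b-tree range scans whose intersection dominates the query times discussed in this thread.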
Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
Hello

I have been looking at planning the database strategy for openstreetmap (http://www.openstreetmap.org). There are several data types stored in tables with longitude and latitude columns. Select statements work by selecting where lat > $lat1 AND lat < $lat2 AND lon > $lon1 AND lon < $lon2.

I have made many empirical tests and have concluded:

1) I can improve performance by a factor of 2-2.5 by changing the double lat/lon to an integer, then selecting on an integer.

2) For each 10-fold increase in the number of records, select queries take twice as long. For each doubling of the number of returned records, there is a sqrt(2) increase in select query time. All this assumes all relevant database information is in memory.

As the database grows, it would likely improve database performance to split an individual table into several thousand tables, using the file system directory b-tree algorithm to effectively pre-select the data before the query is handed to the MySQL engine. This is not a neat solution. A much better way would be to improve the MySQL index performance on very large numbers of records.

Given that there is such a strong relationship between the number of records returned and query time, I conclude that the whole index tree is matched for every given number of root x records returned. If all the records we are matching are under a single node, or under a small number of nodes in the index tree, perhaps there is some way of telling the database engine to ignore the rest of the index tree. Could this work, or am I misunderstanding how the index tree works?

Are there existing optimisations which can de-couple the relationship between the number of records and query time where the records I am selecting are within a small range?
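The double-to-integer conversion in conclusion (1) could be sketched as follows. The table name `integer_test` appears later in the thread; the scale factor and column definitions are assumptions for illustration, not the actual openstreetmap schema:

```sql
-- Sketch only: lat/lon scaled by an assumed fixed factor (here 100000)
-- and stored as signed INTs, so range predicates compare 4-byte
-- integers instead of 8-byte doubles.
CREATE TABLE integer_test (
    id  INT UNSIGNED NOT NULL PRIMARY KEY,
    lat INT NOT NULL,   -- latitude  * 100000
    lon INT NOT NULL,   -- longitude * 100000
    INDEX (lat),
    INDEX (lon)
);

-- Bounding-box select, with the constraints pre-scaled the same way:
SELECT lat, lon
FROM integer_test
WHERE lat > 5150000 AND lat < 5160000
  AND lon > -20000  AND lon < -10000;
```

Whatever scale factor is chosen must be applied identically on insert and in every query constraint, or the bounding box silently shifts.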
Background information: We can boil all this down to a mathematical relationship, where query1 selects s records from an r-record dataset and query2 selects b records from a c-record dataset. Tquery1 is the time to execute query1 and Tquery2 is the time to execute query2.

Tquery2 = Tquery1 * sqrt(b/s) * (2^log(r/c)) + (b-s*CONST/15000) + CONST

where, for my processor, CONST is 0.03. This can be simplified (losing some accuracy) to:

Tquery2 = Tquery1 * sqrt(b/s) * (2^log(r/c))

Raw data for selects:

Creating a plan with 10 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.118
25600           0.119
25600           0.119
25600           0.119
12800           0.069
6400            0.042
3200            0.026
1600            0.017
800             0.011
400             0.008
200             0.005
100             0.004
50              0.003

Creating a plan with 100 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.224
25600           0.223
25600           0.222
25600           0.223
12800           0.145
6400            0.093
3200            0.062
1600            0.043
800             0.029
400             0.020
200             0.015
100             0.011
50              0.008

Creating a plan with 1000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.558
25600           0.548
25600           0.551
25600           0.551
12800           0.376
6400            0.257
3200            0.181
1600            0.125
800             0.087
400             0.062
200             0.044
100             0.031

Creating a plan with 1 points and averaging over 25 queries
Points_per_tile Query_Time
25600           2.422
25600           2.332
25600           2.493
25600           2.446
12800           1.769
6400            1.295
3200            0.866
1600            0.657
800             0.456
400             0.328
200             0.233
100             0.159
50              0.118

Source code for the above test:

#!/usr/bin/perl -w
# Program creates random point fields equivalent to bitfieldtest.pl except the data is stored
# as regular signed integers. To represent the globe as closely as possible, extents between
# -180 and +179.99 will be used. Therefore, adding 180 normalises for international date line 0.
# Prime Meridian 180.
use DBI;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );

$DBHOST = 'localhost';
$DBNAME = 'nickh';
$DBUSER = 'nickh';
$DBPASS = 'xx';

# initialise database
$driver = 'mysql';
$dsn    = "DBI:$driver:database=$DBNAME;host=$DBHOST";
$dbh    = DBI->connect($dsn, $DBUSER, $DBPASS);
[EMAIL PROTECTED](1);

@plane_densities  = (10,100,1000,1);
@tile_points      = (25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50);
$query_iterations = 25;
$debug            = 0;

sub create_bitfield;
sub run_tests;

foreach $density (@plane_densities) {
    print "Creating a plan with $density points and averaging over $query_iterations queries\nPoints_per_tile Query_Time\n";
    create_bitfield($density);
    foreach $tilepoints (@tile_points) {
        my $testtime = run_tests($density, $tilepoints);
Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
I didn't look through your code very carefully. Have you confirmed (using EXPLAIN) that your select query is using the index? I don't know much about the MySQL optimizer, but it's possible that this query:

$query1 = "SELECT lat, lon FROM integer_test WHERE lat > $lat1 AND lat < $lat2 AND lon > $lon1 AND lon < $lon2";

actually runs through the table four times instead of twice, and maybe can't even use the index for the whole query. Is the performance different if you use a subquery? Something like:

SELECT integer_test.id, integer_test.lat, integer_test.lon
FROM integer_test
LEFT JOIN (SELECT id FROM integer_test
           WHERE lon > $lon1 AND lon < $lon2) AS i2
    ON integer_test.id = i2.id
WHERE integer_test.lat > $lat1 AND integer_test.lat < $lat2

assuming that you have independent indices on lat and lon. I didn't try this so the syntax may be wrong. You could also dispense with the id idea, but if so you should probably declare (lat,lon) as your primary key.

A
Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
Actually I think this should be an INNER JOIN -- not a LEFT JOIN.

A
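Putting the correction above into the earlier sketch gives the following (still untested, same hypothetical `integer_test` table with independent lat/lon indexes; `$lat1` etc. are placeholders to be interpolated by the calling script):

```sql
-- An INNER JOIN keeps only rows that satisfy both range constraints;
-- a LEFT JOIN would return every lat-matching row, with NULLs for
-- rows outside the lon range, which is not the intended result set.
SELECT integer_test.id, integer_test.lat, integer_test.lon
FROM integer_test
INNER JOIN (SELECT id FROM integer_test
            WHERE lon > $lon1 AND lon < $lon2) AS i2
    ON integer_test.id = i2.id
WHERE integer_test.lat > $lat1 AND integer_test.lat < $lat2;
```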
Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
On 4/23/06, Nick Hill [EMAIL PROTECTED] wrote:

> 1) I can improve performance by a factor of 2-2.5 by changing the
> double lat/lon to an integer then selecting on an integer.
> 2) I have concluded that for each 10-fold increase in the number of
> records, select queries take twice as long. For each doubling of the
> number of returned records, there is a sqrt(2) increase in select
> query time.

I've noticed a couple of things. 1) Right now you're emulating a spatial index. 2) In the future, you're going to emulate partitioning. Why do you think that doing this stuff manually is better than using the built-in capabilities?

> As the database grows, it would likely improve database performance by
> splitting an individual table into several thousand tables using the
> file system directory b-tree algorithm to effectively pre-select the
> data before the query is handed to the MySQL engine. This is not a
> neat solution. A much better way would be to improve the MySQL index
> performance on very large numbers of records.

Selects against a table use b-trees too. Splitting data into a lot of tables won't help with selects at all (well, it may help on scans with concurrent large data sets if the data is spread across different physical drives, but not with the regular range lookups you're doing). It will only help with inserts.

> Given that there is such a strong relationship between the number of
> records returned, and query time, I conclude that the whole index tree
> is matched for every given number of root x records returned. If all
> records we are matching are under a single node or under a small
> number of nodes in the index tree, perhaps there is some way of
> telling the database engine to ignore the rest of the index tree.

What is a 'root record'? Are you speaking about the internal representation of the b-tree?

> Could this work, or am I misunderstanding how the index tree works?
> Are there existing optimisations which can de-couple the relationship
> between number of records and query time where the records I am
> selecting are within a small range?

For studying select query performance issues it's better to think of an index simply as a sorted array with random access, where each random access costs O(lg N) and accesses to adjacent data cost O(1). If your points are spread uniformly in space, the cost of the select query you've shown is O(N*lg N).

--
Alexey Polyakov
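The MySQL 5.1 partitioning mentioned above could replace the hand-rolled "several thousand tables" scheme discussed in this thread. A sketch, assuming 5.1+ syntax and range partitions on an integer lat column; the table name, scale factor, and partition boundaries are all illustrative:

```sql
-- Sketch only: lat assumed scaled by 100000, so -9000000..9000000
-- covers the globe. The optimizer prunes to the partitions whose
-- range overlaps the WHERE clause, so a narrow lat range touches
-- only one (or a few) partitions' indexes.
CREATE TABLE integer_part (
    id  INT UNSIGNED NOT NULL,
    lat INT NOT NULL,
    lon INT NOT NULL,
    INDEX (lat),
    INDEX (lon)
)
PARTITION BY RANGE (lat) (
    PARTITION p0 VALUES LESS THAN (-4500000),
    PARTITION p1 VALUES LESS THAN (0),
    PARTITION p2 VALUES LESS THAN (4500000),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);
```

Note that in MySQL 5.1 any unique key (including a primary key) must include the partitioning column, which is why this sketch omits a primary key on id alone.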
Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)
Alexey Polyakov wrote:

> I've noticed a couple of things. 1) Right now you're emulating a
> spatial index. 2) In the future, you're going to emulate partitioning.
> Why do you think that doing this stuff manually is better than using
> the built-in capabilities?

1) I am ignorant of any performance advantages of spatial indexes. 2) I am ignorant of the built-in partitioning capabilities.

> Selects against a table use b-trees too. Splitting data into a lot of
> tables won't help with selects at all (well, it may help on scans with
> concurrent large data sets if the data is spread across different
> physical drives, but not with the regular range lookups you're doing).
> It will only help with inserts.

Assuming even distribution, selecting a table amongst 1000 will only take a few ms, while 2^log10(1000) gives an 8-fold improvement in select performance. But then, I may be running inefficient queries, as postulated by Adam Wolff.

> What is a 'root record'? Are you speaking about the internal
> representation of the b-tree?

Yes. I am suggesting that a lower node in the b-tree may have below it all the records the select query is looking for, thereby providing a short cut.
> For studying select query performance issues it's better to think of
> an index simply as a sorted array with random access, where each
> random access costs O(lg N) and accesses to adjacent data cost O(1).
> If your points are spread uniformly in space, the cost of the select
> query you've shown is O(N*lg N).

I am unfamiliar with this representation. I am not sure I understand.

-Nick Hill.