Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

Nick Hill Sun, 23 Apr 2006 08:49:07 -0700

Hello

I have been looking at planning the database strategy for openstreetmap(http://www.openstreetmap.org).

There are several data types stored in tables with longitude andlatitude columns. Select statements work by selecting


where lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2

I have made many empirical tests and have concluded:

1) I can improve performance by a factor of 2-2.5 by changing the doublelat/lon to an integer then selecting on an integer.

2) I have concluded that for each 10 fold increase in the number ofrecords, select queries take twice as long. For each doubling of thenumber of returned records, there is a sqrt(2) increase in select querytime.


All this is assuming all relevant database information is in memory.

As the database grows, it would likely improve database performance bysplitting an individual table into several thousand tables using thefile system directory btree algorithm to effectively pre-select the databefore the query is handled to the MySQL engine. This is not a neatsolution. A much better way would be to improve the mysql indexperformance on very large numbers of records.

Given that there is such a strong relationship between the number ofrecords returned, and query time, I conclude that the whole index treeis matched for every given number of root x records returned. If allrecords we are matching are under a single node or under a small numberof nodes in the index tree, perhaps there is some way of telling thedatabase engine to ignore the rest of the index tree.

Could this work, or am I misunderstanding how the index tree works? Arethere existing optimisations which can de-couple the relationshipbetween number of records and query time where the records I amselecting are within a small range?




Background information:

We can boil all this down to a mathematical relationship where
query1 selects s number of records from r records dataset
and
query2 selects b number of records from c records dataset

Tquery1 is time to execue query 1 and Tquery2 is time to execute query2.

Tquery2=Tquery1 * sqrt(b/s) * (2^log(r/c)) + (b-s*CONST/15000)+CONST
Where for my processor, CONST is 0.03


This can be simplified (loosing some accuracy) to:

Tquery2=Tquery1 * sqrt(b/s) * (2^log(r/c)




Raw data for selects:
Creating a plan with 100000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.118
25600           0.119
25600           0.119
25600           0.119
12800           0.069
6400            0.042
3200            0.026
1600            0.017
800             0.011
400             0.008
200             0.005
100             0.004
50              0.003
Creating a plan with 1000000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.224
25600           0.223
25600           0.222
25600           0.223
12800           0.145
6400            0.093
3200            0.062
1600            0.043
800             0.029
400             0.020
200             0.015
100             0.011
50              0.008
Creating a plan with 10000000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           0.558
25600           0.548
25600           0.551
25600           0.551
12800           0.376
6400            0.257
3200            0.181
1600            0.125
800             0.087
400             0.062
200             0.044
100             0.031
Creating a plan with 100000000 points and averaging over 25 queries
Points_per_tile Query_Time
25600           2.422
25600           2.332
25600           2.493
25600           2.446
12800           1.769
6400            1.295
3200            0.866
1600            0.657
800             0.456
400             0.328
200             0.233
100             0.159
50              0.118

Source code for the above test:
#!/usr/bin/perl -w

#Program creates random point fields eqyuivalent to bitfieldtest.plexcept the data is stored#as regular signed integers. To represent the globe as closely aspossible, extents between#-180 and +179.999999 will be used. Therefore, adding 180 normalises forinternational date line 0.

#Prime Meridian 180. 111111**0.000001


use DBI;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );

$DBHOST = "localhost";
$DBNAME = "nickh";
$DBUSER = "nickh";
$DBPASS = "xxxxxx";

#initialise database
$driver = "mysql";
$dsn = "DBI:$driver:database=$DBNAME;host=$DBHOST";
$dbh = DBI->connect($dsn, $DBUSER, $DBPASS);

[EMAIL PROTECTED](100000000);
@plane_densities=(100000,1000000,10000000,100000000);
@tile_points=(25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50);
$query_iterations=25;
$debug=0;

sub create_bitfield;
sub run_tests;

foreach $density(@plane_densities){

print "Creating a plan with $density points and averaging over$query_iterations queries\nPoints_per_tile Query_Time\n";

create_bitfield($density);
    foreach $tilepoints(@tile_points){
    my $testtime=run_tests($density,$tilepoints);

printf '%-14d %.3f',$tilepoints,$testtime; # prints "<1.0>"print"$density $testtime s\n";

    print "\n";
    }
}



$dbh->disconnect;
exit 0;


sub create_bitfield{
#takes number of points on the point field as argument.
#We first create table without an index, as the indes slow populating table
my $prep=q~
DROP TABLE IF EXISTS integer_test;

CREATE TABLE `integer_test` (
  `lat` integer NOT NULL default '0',
  `lon` integer NOT NULL default '0'
  ) TYPE=MyISAM
~;

#drop/create tables without index
foreach $aprepare(split(/;/,$prep)) {
#print "preparing $preparation";
    $dbh->do($aprepare);
    }


#populate table
for ($i=0;$i<$_[0];$i+=100){
#create 100 element batched inserts (value,value),(value,value) etc
my $insert='';
   for ($j=0;$j<100;$j++){
   if ($j>0) {$insert .= ','; }
   $lat=int(rand 3599999999)-1800000000;
   $lon=int(rand 3599999999)-1800000000;
   $insert .= "($lat,$lon)";
   }

my $sql="INSERT INTO `integer_test` VALUES $insert;";
#print "SQL1 is $sql1\n\n";
$dbh->do($sql);

}
#After populating table, we create indexes.
#print "Creating index... This may take some time\n";
my $sql='CREATE INDEX index1 ON integer_test (lat,lon);';
$dbh->do($sql);
}


sub run_tests{
#Parameters: Points in field; Size of tile in average number of points
my $number_of_points=$_[0];
my $target_query_return=$_[1];

my$proportion_of_each_axis=1/(sqrt($number_of_points/$target_query_return));my $max_extent=1-$proportion_of_each_axis; #Maximum extent for querywithout extending beyond bound

my $returnedrows;
$query1total=0;

for($i=0;$i<$query_iterations;$i++){
#$lat1=(0.4+(rand ($max_extent/10)));
#$lon1=(0.4+(rand ($max_extent/10)));

$lat1=int(rand ($max_extent*3599999999))-1800000000;
$lon1=int(rand ($max_extent*3599999999))-1800000000;

$lat2=int($lat1+($proportion_of_each_axis*3599999999));
$lon2=int($lon1+($proportion_of_each_axis*3599999999));

if ($debug){print "querying bounds $lat1 $lon1 $lat2 $lon2 with queries\n";}

$query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2and lon>$lon1 and lon<$lon2";



#print "Query 1: $query1\n";
$t0 = Time::HiRes::time();
my $sth = $dbh->prepare($query1);
$sth->execute();
$returnedrows+=$sth->rows();
#$sth->finish;
$elapsed=Time::HiRes::time()-$t0;
#print "Fetched $rows points in $elapsed \n";
$query1total+=$elapsed;


}
#print $returnedrows;

if ($debug){print "Each of the $query_iterations returned an average " .$returnedrows/$query_iterations . "records\n";}#print "unit test time is " . ($query1total/$query_iterations) ."seconds \n\n";

return ($query1total/$query_iterations);
}

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

Reply via email to