I didn't look through your code very carefully. Have you confirmed (using 
EXPLAIN) that your select query is using the index?
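For example (an untested sketch, using the table and index names from your
script; substitute real bound values):

```sql
-- If the "key" column shows index1 and "type" shows "range", the range
-- scan is using the index; "type: ALL" means a full table scan.
EXPLAIN SELECT lat, lon
FROM integer_test
WHERE lat > -10000000 AND lat < 10000000
  AND lon > -10000000 AND lon < 10000000;
```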

I don't know much about the MySQL optimizer, but it's possible that this query:
> $query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 
> and lon>$lon1 and lon<$lon2";
actually runs through the table four times instead of twice, and maybe
can't even use the index for the whole query.

Is the performance different if you use a subquery? Something like:
SELECT i1.lat, i1.lon FROM integer_test AS i1
 JOIN (SELECT id FROM integer_test
                   WHERE lon>$lon1 and lon<$lon2) AS i2
 ON i1.id = i2.id
 WHERE i1.lat>$lat1 and i1.lat<$lat2

assuming that you have independent indices on lat and lon. I didn't try
this so the syntax may be wrong. You could also dispense with the id idea,
but if so you should probably declare (lat,lon) as your primary key.
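For the (lat,lon) primary key variant, the table from your script would
become something like this (untested sketch):

```sql
CREATE TABLE `integer_test` (
  `lat` integer NOT NULL default '0',
  `lon` integer NOT NULL default '0',
  PRIMARY KEY (`lat`,`lon`)
) TYPE=MyISAM;
```

Note that a primary key also forces (lat,lon) pairs to be unique, which
randomly generated test data may occasionally violate.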

A

On Apr 23, Nick Hill wrote:

> Hello
> 
> I have been looking at planning the database strategy for openstreetmap
> (http://www.openstreetmap.org).
> 
> There are several data types stored in tables with longitude and latitude
> columns. Select statements work by selecting
> 
> where lat>$lat1 and lat<$lat2 and lon>$lon1 and lon<$lon2
> 
> I have made many empirical tests and have concluded:
> 
> 1) I can improve performance by a factor of 2-2.5 by changing the double
> lat/lon to an integer then selecting on an integer.
> 
> 2) I have concluded that for each 10 fold increase in the number of records,
> select queries take twice as long. For each doubling of the number of returned
> records, there is a sqrt(2) increase in select query time.
> 
> All this is assuming all relevant database information is in memory.
> 
> 
> As the database grows, performance would likely be improved by splitting an
> individual table into several thousand tables, using the file system
> directory btree algorithm to effectively pre-select the data before the
> query is handed to the MySQL engine. This is not a neat solution. A much
> better way would be to improve the MySQL index performance on very large
> numbers of records.
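The pre-selection idea can be sketched without the filesystem: map each point
to a coarse grid tile and key the split (or a leading index column) on the
tile id, so a bounding-box query only touches the tiles intersecting the box.
A hypothetical Python sketch, not part of the original proposal:

```python
def tile_id(lat, lon, tiles_per_axis=64):
    """Map scaled-integer coordinates (range -1800000000..1799999999,
    as used in the test script) to a coarse grid tile number."""
    span = 3_600_000_000
    x = (lon + 1_800_000_000) * tiles_per_axis // span
    y = (lat + 1_800_000_000) * tiles_per_axis // span
    return y * tiles_per_axis + x

# A query for a small bounding box then needs to scan only the handful
# of tiles that overlap it, instead of the whole index tree.
```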
> 
> Given that there is such a strong relationship between the number of records
> returned, and query time, I conclude that the whole index tree is matched for
> every given number of root x records returned. If all records we are matching
> are under a single node or under a small number of nodes in the index tree,
> perhaps there is some way of telling the database engine to ignore the rest of
> the index tree.
> 
> Could this work, or am I misunderstanding how the index tree works? Are there
> existing optimisations which can de-couple the relationship between number of
> records and query time where the records I am selecting are within a small
> range?
> 
> 
> 
> Background information:
> 
> We can boil all this down to a mathematical relationship where
> query1 selects s number of records from r records dataset
> and
> query2 selects b number of records from c records dataset
> 
> Tquery1 is the time to execute query1 and Tquery2 is the time to execute query2.
> 
> Tquery2=Tquery1 * sqrt(b/s) * (2^log(c/r)) + (b-s*CONST/15000)+CONST
> Where for my processor, CONST is 0.03
> 
> 
> This can be simplified (losing some accuracy) to:
> 
> Tquery2=Tquery1 * sqrt(b/s) * (2^log(c/r))
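The simplified model, as a runnable Python sketch (an assumption on my part:
log is taken as log10, and the dataset ratio is oriented so a larger table
gives a longer time, matching the "10-fold gives 2x" observation):

```python
import math

def t_query2(t_query1, s, r, b, c):
    """Empirical model from the post: query 1 returns s rows from an
    r-row table in t_query1 seconds; predict the time for a query
    returning b rows from a c-row table."""
    return t_query1 * math.sqrt(b / s) * 2 ** math.log10(c / r)

# Baseline from the raw data below: 0.118 s for 25600 rows out of 100000.
# The model predicts ~0.236 s for the same tile size on a 1000000-row
# table; the measured value was 0.224 s.
print(t_query2(0.118, 25_600, 100_000, 25_600, 1_000_000))
```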
> 
> 
> 
> 
> Raw data for selects:
> Creating a plan with 100000 points and averaging over 25 queries
> Points_per_tile Query_Time
> 25600           0.118
> 25600           0.119
> 25600           0.119
> 25600           0.119
> 12800           0.069
> 6400            0.042
> 3200            0.026
> 1600            0.017
> 800             0.011
> 400             0.008
> 200             0.005
> 100             0.004
> 50              0.003
> Creating a plan with 1000000 points and averaging over 25 queries
> Points_per_tile Query_Time
> 25600           0.224
> 25600           0.223
> 25600           0.222
> 25600           0.223
> 12800           0.145
> 6400            0.093
> 3200            0.062
> 1600            0.043
> 800             0.029
> 400             0.020
> 200             0.015
> 100             0.011
> 50              0.008
> Creating a plan with 10000000 points and averaging over 25 queries
> Points_per_tile Query_Time
> 25600           0.558
> 25600           0.548
> 25600           0.551
> 25600           0.551
> 12800           0.376
> 6400            0.257
> 3200            0.181
> 1600            0.125
> 800             0.087
> 400             0.062
> 200             0.044
> 100             0.031
> Creating a plan with 100000000 points and averaging over 25 queries
> Points_per_tile Query_Time
> 25600           2.422
> 25600           2.332
> 25600           2.493
> 25600           2.446
> 12800           1.769
> 6400            1.295
> 3200            0.866
> 1600            0.657
> 800             0.456
> 400             0.328
> 200             0.233
> 100             0.159
> 50              0.118
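Spot-checking the sqrt(2) rule against the 100000000-point run above (a quick
Python check; the times are copied from the table):

```python
import math

# Query times for the 100000000-point plane, from the table above.
times = {25600: 2.422, 12800: 1.769, 6400: 1.295, 3200: 0.866,
         1600: 0.657, 800: 0.456, 400: 0.328, 200: 0.233}

# Each halving of returned rows should divide the time by ~sqrt(2) (~1.414).
points = sorted(times, reverse=True)
ratios = [times[a] / times[b] for a, b in zip(points, points[1:])]
# The individual steps vary, but the ratios cluster around sqrt(2).
print(ratios)
```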
> 
> Source code for the above test:
> #!/usr/bin/perl -w
> 
> #Program creates random point fields equivalent to bitfieldtest.pl except
> #the data is stored as regular signed integers. To represent the globe as
> #closely as possible, extents between -180 and +179.999999 will be used.
> #Therefore, adding 180 normalises for international date line 0.
> #Prime Meridian 180. 111111**0.000001
> 
> 
> use DBI;
> use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );
> 
> $DBHOST = "localhost";
> $DBNAME = "nickh";
> $DBUSER = "nickh";
> $DBPASS = "xxxxxx";
> 
> #initialise database
> $driver = "mysql";
> $dsn = "DBI:$driver:database=$DBNAME;host=$DBHOST";
> $dbh = DBI->connect($dsn, $DBUSER, $DBPASS);
> 
> [EMAIL PROTECTED](100000000);
> @plane_densities=(100000,1000000,10000000,100000000);
> @tile_points=(25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50);
> $query_iterations=25;
> $debug=0;
> 
> sub create_bitfield;
> sub run_tests;
> 
> foreach $density(@plane_densities){
> print "Creating a plan with $density points and averaging over $query_iterations queries\nPoints_per_tile Query_Time\n";
> create_bitfield($density);
>     foreach $tilepoints(@tile_points){
>     my $testtime=run_tests($density,$tilepoints);
>     printf '%-14d  %.3f',$tilepoints,$testtime;
>     print "\n";
>     }
> }
> 
> 
> 
> $dbh->disconnect;
> exit 0;
> 
> 
> sub create_bitfield{
> #takes number of points on the point field as argument.
> #We first create the table without an index, as indexes slow down populating the table
> my $prep=q~
> DROP TABLE IF EXISTS integer_test;
> 
> CREATE TABLE `integer_test` (
>   `lat` integer NOT NULL default '0',
>   `lon` integer NOT NULL default '0'
>   ) TYPE=MyISAM
> ~;
> 
> #drop/create tables without index
> foreach $aprepare(split(/;/,$prep)) {
> #print "preparing $preparation";
>     $dbh->do($aprepare);
>     }
> 
> 
> #populate table
> for ($i=0;$i<$_[0];$i+=100){
> #create 100 element batched inserts (value,value),(value,value) etc
> my $insert='';
>    for ($j=0;$j<100;$j++){
>    if ($j>0) {$insert .= ','; }
>    $lat=int(rand 3599999999)-1800000000;
>    $lon=int(rand 3599999999)-1800000000;
>    $insert .= "($lat,$lon)";
>    }
> 
> my $sql="INSERT INTO `integer_test` VALUES $insert;";
> #print "SQL1 is $sql1\n\n";
> $dbh->do($sql);
> 
> }
> #After populating table, we create indexes.
> #print "Creating index... This may take some time\n";
> my $sql='CREATE INDEX index1 ON integer_test (lat,lon);';
> $dbh->do($sql);
> }
> 
> 
> sub run_tests{
> #Parameters: Points in field; Size of tile in average number of points
> my $number_of_points=$_[0];
> my $target_query_return=$_[1];
> my $proportion_of_each_axis=1/(sqrt($number_of_points/$target_query_return));
> my $max_extent=1-$proportion_of_each_axis;  #Maximum extent for query without extending beyond bound
> my $returnedrows;
> $query1total=0;
> 
> for($i=0;$i<$query_iterations;$i++){
> #$lat1=(0.4+(rand ($max_extent/10)));
> #$lon1=(0.4+(rand ($max_extent/10)));
> 
> $lat1=int(rand ($max_extent*3599999999))-1800000000;
> $lon1=int(rand ($max_extent*3599999999))-1800000000;
> 
> $lat2=int($lat1+($proportion_of_each_axis*3599999999));
> $lon2=int($lon1+($proportion_of_each_axis*3599999999));
> 
> if ($debug){print "querying bounds $lat1 $lon1 $lat2 $lon2 with queries \n";}
> 
> $query1="SELECT lat,lon from integer_test WHERE lat>$lat1 and lat<$lat2 and
> lon>$lon1 and lon<$lon2";
> 
> 
> #print "Query 1: $query1\n";
> $t0 = Time::HiRes::time();
> my $sth = $dbh->prepare($query1);
> $sth->execute();
> $returnedrows+=$sth->rows();
> #$sth->finish;
> $elapsed=Time::HiRes::time()-$t0;
> #print "Fetched $rows points in $elapsed \n";
> $query1total+=$elapsed;
> 
> 
> }
> #print $returnedrows;
> if ($debug){print "Each of the $query_iterations queries returned an average of " .
> $returnedrows/$query_iterations . " records\n";}
> #print "unit test time is " . ($query1total/$query_iterations) . " seconds\n\n";
> return ($query1total/$query_iterations);
> }
> 
> 

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
