Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-24 Thread Nick Hill

Hello Adam

Adam Wolff wrote:

Actually runs through the table four times instead of twice, and maybe
can't even use the index for the whole query.


Assuming my results are not typical of MySQL query times, this would 
explain the sqrt() relationship of returned rows to query time.


I have tried your suggestion of using a sub-query but have had trouble 
getting the syntax right. Using EXPLAIN, it seems that 4 bytes of the 
index (either lat or lon) are used, with a brute-force search on the 
index for the other constraint.


If the query is returning 25600 points from a 100M-record dataset, it is 
brute-searching through 1.6M records in the second part of the index.


If the alternative is creating two 1.6M-item lists and then looking for 
commonalities, it may be faster to instead use one 1.6M-item list and 
brute-force the remaining constraint.


I have received suggestions to use spatial indexes, which I am looking 
into. Alternatively, I could optimise queries by creating multiple 
slices of the data set across one axis, then use a key on the other 
axis. MySQL 5.1's partitioning scheme may help; a rough sketch follows.
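
Something like this 5.1 syntax is what I have in mind (untested; the 
table name, integer scaling and boundary values are only illustrative):

CREATE TABLE points (
    lat INT NOT NULL,
    lon INT NOT NULL,
    INDEX (lat)
)
PARTITION BY RANGE (lon) (
    PARTITION p0 VALUES LESS THAN (-900000000),
    PARTITION p1 VALUES LESS THAN (0),
    PARTITION p2 VALUES LESS THAN (900000000),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

A lon range in the WHERE clause should then prune to one or two 
partitions before the lat index is consulted.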






Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-24 Thread Adam Wolff
Well, I hadn't known about the spatial features of MySQL. If you're ok 
using vendor extensions then that definitely looks like the way to go:
http://dev.mysql.com/doc/refman/5.0/en/gis-introduction.html
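
For example (untested; assumes a MyISAM table, since 5.0 only supports 
SPATIAL indexes on MyISAM, and the names are invented):

ALTER TABLE points ADD location POINT NOT NULL;
UPDATE points
   SET location = GeomFromText(CONCAT('POINT(', lon, ' ', lat, ')'));
CREATE SPATIAL INDEX sp_location ON points (location);

SELECT lat, lon
  FROM points
 WHERE MBRContains(GeomFromText('POLYGON((0 50, 1 50, 1 51, 0 51, 0 50))'),
                   location);

The MBRContains form lets the optimizer use the R-tree index for the 
whole bounding box at once, instead of one axis at a time.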

A

On Apr 24, Nick Hill wrote:

 Hello Adam
 
 [...]
 
 I have received suggestions to use spatial indexes, which I am looking
 into. Alternatively, I could optimise queries by creating multiple
 slices of the data set across one axis, then use a key on the other
 axis. MySQL 5.1's partitioning scheme may help.




Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-23 Thread Nick Hill

Hello

I have been looking at planning the database strategy for openstreetmap 
(http://www.openstreetmap.org).


There are several data types stored in tables with longitude and 
latitude columns. Select statements work by selecting


where lat > $lat1 and lat < $lat2 and lon > $lon1 and lon < $lon2

I have made many empirical tests and have concluded:

1) I can improve performance by a factor of 2-2.5 by changing the double 
lat/lon columns to integers and then selecting on the integers (a sketch 
of such a table is below).


2) For each 10-fold increase in the number of records, select queries 
take twice as long. For each doubling of the number of returned 
records, there is a sqrt(2) increase in select query time.


All this is assuming all relevant database information is in memory.
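
The integer scheme in point 1 amounts to something like this (the 10^7 
scale factor is only illustrative; it keeps +/-180 degrees inside a 
signed 32-bit INT):

CREATE TABLE integer_test (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    lat INT NOT NULL,   -- degrees * 10^7
    lon INT NOT NULL,   -- degrees * 10^7
    INDEX (lat),
    INDEX (lon)
);

-- e.g. 51.5N 0.1W stored as (515000000, -1000000)
INSERT INTO integer_test (lat, lon) VALUES (515000000, -1000000);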


As the database grows, performance would likely be improved by 
splitting the individual table into several thousand tables, using the 
file system's directory btree algorithm to effectively pre-select the 
data before the query is handed to the MySQL engine. This is not a neat 
solution. A much better way would be to improve MySQL's index 
performance on very large numbers of records.


Given that there is such a strong relationship between the number of 
records returned, and query time, I conclude that the whole index tree 
is matched for every given number of root x records returned. If all 
records we are matching are under a single node or under a small number 
of nodes in the index tree, perhaps there is some way of telling the 
database engine to ignore the rest of the index tree.


Could this work, or am I misunderstanding how the index tree works? Are 
there existing optimisations which can de-couple the relationship 
between number of records and query time where the records I am 
selecting are within a small range?




Background information:

We can boil all this down to a mathematical relationship where
query1 selects s records from an r-record dataset
and
query2 selects b records from a c-record dataset.

Tquery1 is the time to execute query1 and Tquery2 is the time to execute query2.

Tquery2 = Tquery1 * sqrt(b/s) * (2^log(r/c)) + (b-s*CONST/15000) + CONST
where for my processor, CONST is 0.03.


This can be simplified (losing some accuracy) to:

Tquery2 = Tquery1 * sqrt(b/s) * (2^log(r/c))
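
As a rough check against the data below: going from the 10-point plane 
to the 100-point plane at 25600 points per tile roughly doubles the 
query time (0.118s to 0.224s), and halving the returned rows at density 
100 (25600 to 12800) scales the time by about 1.5, close to the 
predicted sqrt(2) (0.224s to 0.145s).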




Raw data for selects:
Creating a plan with 10 points and averaging over 25 queries
Points_per_tile Query_Time
25600   0.118
25600   0.119
25600   0.119
25600   0.119
12800   0.069
6400    0.042
3200    0.026
1600    0.017
800     0.011
400     0.008
200     0.005
100     0.004
50      0.003
Creating a plan with 100 points and averaging over 25 queries
Points_per_tile Query_Time
25600   0.224
25600   0.223
25600   0.222
25600   0.223
12800   0.145
6400    0.093
3200    0.062
1600    0.043
800     0.029
400     0.020
200     0.015
100     0.011
50      0.008
Creating a plan with 1000 points and averaging over 25 queries
Points_per_tile Query_Time
25600   0.558
25600   0.548
25600   0.551
25600   0.551
12800   0.376
6400    0.257
3200    0.181
1600    0.125
800     0.087
400     0.062
200     0.044
100     0.031
Creating a plan with 10000 points and averaging over 25 queries
Points_per_tile Query_Time
25600   2.422
25600   2.332
25600   2.493
25600   2.446
12800   1.769
6400    1.295
3200    0.866
1600    0.657
800     0.456
400     0.328
200     0.233
100     0.159
50      0.118

Source code for the above test:
#!/usr/bin/perl -w

# Program creates random point fields equivalent to bitfieldtest.pl
# except the data is stored as regular signed integers. To represent the
# globe as closely as possible, extents between -180 and +179.99 will be
# used. Therefore, adding 180 normalises for the international date line 0.

# Prime Meridian 180. 11**0.01

use DBI;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );

$DBHOST = "localhost";
$DBNAME = "nickh";
$DBUSER = "nickh";
$DBPASS = "xx";

# initialise database
$driver = "mysql";
$dsn    = "DBI:$driver:database=$DBNAME;host=$DBHOST";
$dbh    = DBI->connect($dsn, $DBUSER, $DBPASS);

[EMAIL PROTECTED](1);  # this line was mangled by the archive's address obfuscation

@plane_densities  = (10, 100, 1000, 10000);
@tile_points      = (25600,25600,25600,25600,12800,6400,3200,1600,800,400,200,100,50);
$query_iterations = 25;
$debug            = 0;

sub create_bitfield;
sub run_tests;

foreach $density (@plane_densities) {
    print "Creating a plan with $density points and averaging over " .
          "$query_iterations queries\nPoints_per_tile Query_Time\n";

    create_bitfield($density);
    foreach $tilepoints (@tile_points) {
        my $testtime = run_tests($density, $tilepoints);

Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-23 Thread Adam Wolff
I didn't look through your code very carefully. Have you confirmed (using 
EXPLAIN) that your select query is using the index?
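
Something like (values invented):

EXPLAIN SELECT lat, lon
FROM integer_test
WHERE lat > 515000000 AND lat < 516000000
  AND lon > -1000000  AND lon <  1000000;

and then check the key, key_len and rows columns of the output.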

I don't know much about the mysql optimizer, but it's possible that this query:
 $query1 = "SELECT lat,lon FROM integer_test WHERE lat > $lat1 AND lat < $lat2
            AND lon > $lon1 AND lon < $lon2";
actually runs through the table four times instead of twice, and maybe
can't even use the index for the whole query.

Is the performance different if you use a subquery? Something like:
SELECT integer_test.id, integer_test.lat, integer_test.lon
FROM integer_test
LEFT JOIN (SELECT id FROM integer_test
           WHERE lon > $lon1 AND lon < $lon2) AS i2
  ON integer_test.id = i2.id
WHERE integer_test.lat > $lat1 AND integer_test.lat < $lat2;

assuming that you have independent indices on lat and lon. I didn't try
this so the syntax may be wrong. You could also dispense with the id idea,
but if so you should probably declare (lat,lon) as your primary key.
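
That is, something like (untested):

CREATE TABLE integer_test (
    lat INT NOT NULL,
    lon INT NOT NULL,
    PRIMARY KEY (lat, lon)
);

though I haven't checked how well a range on the first column of a
composite key is optimized.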

A


On Apr 23, Nick Hill wrote:

 Hello
 
 I have been looking at planning the database strategy for openstreetmap
 (http://www.openstreetmap.org).
 
 [...]

Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-23 Thread Adam Wolff
Actually I think this should be an INNER JOIN -- not a LEFT JOIN.

A

On Apr 23, Adam Wolff wrote:

 I didn't look through your code very carefully. Have you confirmed (using 
 EXPLAIN) that your select query is using the index?
 
 I don't know much about the mysql optimizer, but it's possible that this query:
  $query1 = "SELECT lat,lon FROM integer_test WHERE lat > $lat1 AND lat < $lat2
             AND lon > $lon1 AND lon < $lon2";
 actually runs through the table four times instead of twice, and maybe
 can't even use the index for the whole query.
 
 Is the performance different if you use a subquery? Something like:
 SELECT integer_test.id, integer_test.lat, integer_test.lon
 FROM integer_test
 LEFT JOIN (SELECT id FROM integer_test
            WHERE lon > $lon1 AND lon < $lon2) AS i2
   ON integer_test.id = i2.id
 WHERE integer_test.lat > $lat1 AND integer_test.lat < $lat2;
 
 assuming that you have independent indices on lat and lon. I didn't try
 this so the syntax may be wrong. You could also dispense with the id idea,
 but if so you should probably declare (lat,lon) as your primary key.
 
 A
 
 
 On Apr 23, Nick Hill wrote:
 
  [...]

Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-23 Thread Alexey Polyakov
On 4/23/06, Nick Hill [EMAIL PROTECTED] wrote:

 1) I can improve performance by a factor of 2-2.5 by changing the double
 lat/lon to an integer then selecting on an integer.

 2) I have concluded that for each 10 fold increase in the number of
 records, select queries take twice as long. For each doubling of the
 number of returned records, there is a sqrt(2) increase in select query
 time.

I've noticed a couple of things.
1) Right now you're emulating a spatial index.
2) In future, you're going to emulate partitioning.

Why do you think that doing this stuff manually is better than using
builtin capabilities?

 As the database grows, it would likely improve database performance by
 splitting an individual table into several thousand tables using the
 file system directory btree algorithm to effectively pre-select the data
 before the query is handled to the MySQL engine. This is not a neat
 solution. A much better way would be to improve the mysql index
 performance on very large numbers of records.

Selects against a table use b-trees too. Splitting the data into a lot of
tables won't help with selects at all (well, it may help on scans of
concurrent large data sets if the data is spread across different
physical drives, but not with the regular range lookups that you're
doing). It will only help with inserts.

 Given that there is such a strong relationship between the number of
 records returned, and query time, I conclude that the whole index tree
 is matched for every given number of root x records returned. If all
 records we are matching are under a single node or under a small number
 of nodes in the index tree, perhaps there is some way of telling the
 database engine to ignore the rest of the index tree.

What is a 'root record'? Are you speaking about the internal
representation of the b-tree?

 Could this work, or am I misunderstanding how the index tree works? Are
 there existing optimisations which can de-couple the relationship
 between number of records and query time where the records I am
 selecting are within a small range?

For studying select query performance issues it's better to think of an
index as simply a sorted array with random access, where each random
access costs O(lg N) and accesses to adjacent data cost O(1). If your
points are spread uniformly in space, the cost of the select query
you've shown is O(N*lg N).
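
To put rough numbers on that model, using the 100M-record / 1.6M-match
figures from earlier in the thread: finding the start of the lon range
costs about lg(10^8) ~ 27 random accesses, but walking the ~1.6M
adjacent entries and testing each against the lat constraint costs
~1.6M O(1) steps, so the walk dominates.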


--
Alexey Polyakov




Re: Optimising for many rows and returned records (de-coupling query time to record set size for range queries)

2006-04-23 Thread Nick Hill

Alexey Polyakov wrote:

On 4/23/06, Nick Hill [EMAIL PROTECTED] wrote:
I've noticed a couple things.
1) Right now you're emulating spatial index.
2) In future, you're going to emulate partitioning.

Why do you think that doing this stuff manually is better than using
builtin capabilities?


1) I am ignorant of any performance advantages of spatial indexes.
2) I am ignorant of the built-in partitioning capabilities.



Selects against a table use b-trees too. Splitting the data into a lot of
tables won't help with selects at all (well, it may help on scans of
concurrent large data sets if the data is spread across different
physical drives, but not with the regular range lookups that you're
doing). It will only help with inserts.


Assuming an even distribution, selecting a table from amongst 1000 will 
only take a few ms, while 2^log10(1000) gives an 8-fold improvement in 
select performance. But then, I may be running inefficient queries, as 
postulated by Adam Wolff.



Given that there is such a strong relationship between the number of
records returned, and query time, I conclude that the whole index tree
is matched for every given number of root x records returned. If all
records we are matching are under a single node or under a small number
of nodes in the index tree, perhaps there is some way of telling the
database engine to ignore the rest of the index tree.



What is a 'root record'? Are you speaking about the internal
representation of the b-tree?


Yes. I am suggesting that a lower node in the B-tree may have below it 
all the records the select query is looking for, thereby providing a 
short-cut.



Could this work, or am I misunderstanding how the index tree works? Are
there existing optimisations which can de-couple the relationship
between number of records and query time where the records I am
selecting are within a small range?



For studying select query performance issues it's better to think of an
index as simply a sorted array with random access, where each random
access costs O(lg N) and accesses to adjacent data cost O(1). If your
points are spread uniformly in space, the cost of the select query
you've shown is O(N*lg N).


I am unfamiliar with this representation. I am not sure I understand.

-Nick Hill.
