Re: Query Complexity (big 'O')

2005-06-21 Thread Bastian Balthazar Bux
Dan Bolser wrote:
 Hello, 
 
 I am interested in the theoretical time / space complexity of SQL queries
 on indexed / non-indexed data. 
 
 I think I read somewhere that a JOIN on an indexed column is something
 like O[mn*log(mn)] (m rows joined to n).
 
 I assume without an index it is just O[m*n]
 
 Specifically I want to know the complexity of a query that does a
 'cross tabulation'
 
 SELECT
   X, 
   SUM(if(Y=1,Z,0)) AS s1,
   SUM(if(Y=2,Z,0)) AS s2,
   SUM(if(Y=3,Z,0)) AS s3,
   ...
 FROM 
   T1
 GROUP BY 
   X;
 
 Assuming both X and Y are indexed, how does the complexity grow with
 increasing 's' (more IF clauses)? More basically, what is the complexity of
 the GROUP BY statement?
 
 Can anyone point me to a good online guide to complexity of SQL?
 
 Thanks very much for any suggestions :)
 
 Dan.
 

It's a bit more complex than that. I'm not a mathematician, so I'll try
to explain things as I know them, and hope to give you all the elements
you need to work out the complexity yourself.

First, the query above doesn't use any indexes at all; you can see this
from the output of:
EXPLAIN SELECT ... GROUP BY X;
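
As a sketch of what to expect (the column values below are illustrative,
not copied from a real run, and the exact text varies by MySQL version):

EXPLAIN SELECT X, SUM(IF(Y=1,Z,0)) AS s1 FROM T1 GROUP BY X;
-- type:  ALL        (full scan of T1)
-- key:   NULL       (no index chosen)
-- Extra: Using temporary; Using filesort  (GROUP BY via temp table + sort)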

To take advantage of indexes, the query could be rewritten as:

SELECT X, 1 AS Y, SUM(Z) AS s FROM T1 WHERE Y=1 GROUP BY X
UNION
SELECT X, 2 AS Y, SUM(Z) AS s FROM T1 WHERE Y=2 GROUP BY X
UNION
SELECT X, 3 AS Y, SUM(Z) AS s FROM T1 WHERE Y=3 GROUP BY X;

This way, whatever the complexity of a single sub-query is, the total
ends up as the sum over all the sub-queries, in this case 3 * complexity.
Note that the result shape changes too: you get up to three rows per X
instead of one wide row per X, and since the three branches are disjoint,
UNION ALL would also skip the duplicate-elimination pass that plain UNION
performs.

Now, how to build the indexes:
The first place to look is the WHERE clause; it's the first thing used to
cut away unwanted data.
The index must therefore contain Y; to be more exact, the index *must*
have Y as its first column in order to be used.

Then examine the GROUP BY clause: to group, the database must sort by the
grouping columns, so in this case we also want an index on X.
There is a problem here: generally databases can't use two indexes on a
single table (not totally true, but take it as given for now).
As a result we *must* create a single index containing the ordered pair
(Y, X).
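
For example (the index name here is just an illustration):

CREATE INDEX idx_y_x ON T1 (Y, X);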

The analysis could finish here, but there is still room for another
optimization. This one must be evaluated case by case, knowing the shape
of the table, the amount of data it contains, and so on.
The only other column the query touches is Z.
Indexes are kept separate from the table data on disk, so if all the data
the query needs is contained in the index, we can avoid a second disk read
for the row data.
Having an index on (Y, X, Z), in that order, lets the query read only the
index and never the table data (a so-called covering index).
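
A sketch of such a covering index (again, the name is only an
illustration); with it in place, EXPLAIN should report "Using index" in
the Extra column:

CREATE INDEX idx_y_x_z ON T1 (Y, X, Z);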

Play with EXPLAIN to learn more about how indexes are used; it's very
informative.

HTH
Francesco Riosa



-- 
 
. These pages are best viewed by coming to my house and looking at   .
. my monitor. [S. Lucas Bergman (on his website)].
 

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]



Re: Query Complexity (big 'O')

2005-06-21 Thread Jochem van Dieten
On 6/21/05, Dan Bolser wrote:
 
 I am interested in the theoretical time / space complexity of SQL queries
 on indexed / non-indexed data.

I doubt this is the right list for theory.


 Specifically I want to know the complexity of a query that does a
 'cross tabulation'
 
 SELECT
   X,
   SUM(if(Y=1,Z,0)) AS s1,
   SUM(if(Y=2,Z,0)) AS s2,
   SUM(if(Y=3,Z,0)) AS s3,
   ...
 FROM
   T1
 GROUP BY
   X;
 
 Assuming both X and Y are indexed, how does the complexity grow with
 increasing 's' (more if clauses).

In MySQL: I bet the indexes don't matter and the complexity grows less
than linearly. The EXPLAIN output will tell you why.


 Can anyone point me to a good online guide to complexity of SQL?

The language SQL, or some implementation of it?

Consider looking at PostgreSQL instead of MySQL as your test system; I
find its tools for looking inside much better:
http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg17592.html

Jochem




Re: Query Complexity (big 'O')

2005-06-21 Thread SGreen
Dan Bolser [EMAIL PROTECTED] wrote on 06/21/2005 09:51:06 AM:

 Hello,

 I am interested in the theoretical time / space complexity of SQL queries
 on indexed / non-indexed data.

 I think I read somewhere that a JOIN on an indexed column is something
 like O[mn*log(mn)] (m rows joined to n).

 I assume without an index it is just O[m*n]

 Specifically I want to know the complexity of a query that does a
 'cross tabulation'

 SELECT
   X,
   SUM(if(Y=1,Z,0)) AS s1,
   SUM(if(Y=2,Z,0)) AS s2,
   SUM(if(Y=3,Z,0)) AS s3,
   ...
 FROM
   T1
 GROUP BY
   X;
 
 Assuming both X and Y are indexed, how does the complexity grow with
 increasing 's' (more IF clauses)? More basically, what is the complexity of
 the GROUP BY statement?

 Can anyone point me to a good online guide to complexity of SQL?

 Thanks very much for any suggestions :)

 Dan.

I believe you will see a nearly linear growth as you add terms to your 
query (but only to a point). My reasoning:

I think I can accurately say that the total time spent on any query is
the sum of the times spent on each phase of query processing:

Tp  - time required to parse the query
Tj  - time to process any table JOINs (initial column identifications, too)
Tw  - time to process the WHERE clause against your source tables
Tgb - time to process the GROUP BY clause
Th  - time to process the HAVING clause
Tob - time to perform the final sort
=
Tquery = Tp + Tj + Tw + Tgb + Th + Tob  (total time for the query)

Each one of those phases has a BIG O expression associated with it and 
each phase has at least one opportunity for query optimization. Some are 
optimized by table design (field definitions + indexes), others are 
optimized through efficient SQL (proper JOIN definitions, multistage 
queries, etc.)

In your case you are doing at least four things when you add another term
to your query:
1) you add another term that needs parsing
2) you add another IF() evaluation that must occur for each row's
evaluation during the Tw phase
3) you increase how much data must be transferred from the output of the
Tw phase, and each phase after that, by one column
4) you add another SUM() evaluation to the Tgb phase of the query

So each evaluated column you add incrementally increases the time required
to perform the query, but estimating by how much depends on both your
hardware and your settings. If, for example, the next column you added
forced the engine to evaluate the Tj portion of this query using paged
memory, you would have changed the slope of your equation by hitting a
physical limitation, and your time-estimate formula would now have a
different set of coefficients. The only way to honestly know how many
summarized columns it takes to reach that point is to benchmark the query
on your equipment, with your settings, under your normal production load.
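
If you want to measure that, here is a minimal sketch. SQL_NO_CACHE tells
MySQL to bypass the query cache so repeated runs measure real work; the
elapsed time comes from your client (the mysql command-line tool prints it
after each statement). Re-run while adding one more SUM(IF(...)) column
each time and record the timings:

SELECT SQL_NO_CACHE
  X,
  SUM(IF(Y=1,Z,0)) AS s1,
  SUM(IF(Y=2,Z,0)) AS s2
FROM T1
GROUP BY X;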

I know I really didn't answer your theory question (not exactly) but I 
believe that evaluating the BIG O of a SQL query is a bit more complex 
than you presented in your post. I encourage you to keep asking questions 
and I am sure that you will either tap the list dry or find out what you 
want to know.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



RE: Query Complexity (big 'O')

2005-06-21 Thread Dathan Pattishall
It's a Big O of N.


DVP

Dathan Vance Pattishall http://www.friendster.com

 

