Dan Bolser <[EMAIL PROTECTED]> wrote on 06/21/2005 09:51:06 AM:

> Hello,

> I am interested in the theoretical time / space complexity of SQL queries
> on indexed / non-indexed data.

> I think I read somewhere that a JOIN on an indexed column is something
> like O[mn*log(mn)] (m rows joined to n).

> I assume without an index it is just O[m*n].

> Specifically I want to know the complexity of a query that does a
> 'cross tabulation'

> SELECT
> X,
> SUM(if(Y=1,Z,0)) AS s1,
> SUM(if(Y=2,Z,0)) AS s2,
> SUM(if(Y=3,Z,0)) AS s3,
> ...
> FROM
> T1
> GROUP BY
> X;
> 
> Assuming both X and Y are indexed, how does the complexity grow with
> increasing 's' (more IF clauses)? More basically, what is the complexity of
> the GROUP BY statement?

> Can anyone point me to a good online guide to complexity of SQL?

> Thanks very much for any suggestions :)

> Dan.

I believe you will see a nearly linear growth as you add terms to your 
query (but only to a point). My reasoning:

I think I can accurately say that the total time spent on any query would 
be the sum of the times spent on each phase of processing the query:

Tp - time required to parse the query
Tj - time to process any table JOINS (initial column identifications, too)
Tw - time to process the WHERE clause against your source tables.
Tgb - time to process the GROUP BY clause
Th - time to process the HAVING clause
Tob - time to perform the final sort (ORDER BY)
=================================================================
Tquery - total time for query (sum of all terms above)

Each one of those phases has a BIG O expression associated with it and 
each phase has at least one opportunity for query optimization. Some are 
optimized by table design (field definitions + indexes), others are 
optimized through efficient SQL (proper JOIN definitions, multistage 
queries, etc.)
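
One practical way to see which of those phases will dominate for this 
particular query is to ask the optimizer directly. A minimal sketch, 
assuming MySQL and the table/column names from Dan's post (T1, X, Y, Z):

EXPLAIN
SELECT
X,
SUM(IF(Y=1,Z,0)) AS s1,
SUM(IF(Y=2,Z,0)) AS s2
FROM
T1
GROUP BY
X;

In the output, the key column shows whether an index can satisfy the 
GROUP BY (the Tgb phase), the rows column gives a rough count of how many 
rows the Tw phase must touch, and "Using temporary" or "Using filesort" in 
the Extra column flags additional work in the Tgb and Tob phases.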

In your case you are doing at least four things when you add another term 
to your query (see the sketch after this list):
1) you add another term that needs parsing (Tp).
2) you add another IF() evaluation that must occur for each row's 
evaluation during the Tw phase.
3) you increase how much data must be transferred from the output of the 
Tw phase, and each phase after that, by one column.
4) you add another SUM() evaluation to the Tgb phase of the query.
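
To make that concrete, here is Dan's query with one more term added (s4 is 
just my label, following the s1/s2/s3 pattern); the comments mark where the 
extra work lands:

SELECT
X,
SUM(IF(Y=1,Z,0)) AS s1,
SUM(IF(Y=2,Z,0)) AS s2,
SUM(IF(Y=3,Z,0)) AS s3,
SUM(IF(Y=4,Z,0)) AS s4  -- the new term: one more token to parse (Tp),
                        -- one more IF() per row (Tw), one more SUM()
                        -- per group (Tgb), and one more column carried
                        -- through every later phase
FROM
T1
GROUP BY
X;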

So each additional evaluated column incrementally increases the time 
required to perform the query, but estimating by how much depends on both 
your hardware and your settings. If, for example, the next column you 
added forced the engine to evaluate the Tj portion of this query using 
paged memory, you would have just changed the slope of your equation by 
hitting a physical limitation, and your "time estimate formula" would now 
have a different set of coefficients. The only way to honestly know how 
many summarized columns it takes to reach that point is to benchmark the 
query on your equipment, with your settings, under your normal production 
load.
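
A simple way to run that benchmark (just a sketch; how many variants you 
need depends on your data) is to execute each version of the query from 
the mysql command-line client and compare the elapsed times it prints 
after every statement:

-- Run each variant a few times and compare the "(x.xx sec)" timings
-- the client reports; throw away the first run so the caches are warm.
SELECT X, SUM(IF(Y=1,Z,0)) AS s1
FROM T1 GROUP BY X;

SELECT X, SUM(IF(Y=1,Z,0)) AS s1, SUM(IF(Y=2,Z,0)) AS s2
FROM T1 GROUP BY X;

-- ...and so on, adding one summed column per variant. Plotting the
-- timings against the number of columns shows where (and whether) the
-- growth stops being roughly linear on your hardware.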

I know I really didn't answer your theory question (not exactly) but I 
believe that evaluating the "BIG O" of a SQL query is a bit more complex 
than you presented in your post. I encourage you to keep asking questions 
and I am sure that you will either tap the list dry or find out what you 
want to know.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
