Dear Shawn,
First off, apologies for the delay in reply to this email.
Secondly, thanks a lot for a very illuminating dicussion on composite keys
and the way MySQL handles them. Reading through the whole discussion, I have a
minor question is popping up in my heads...it is as follows:
If I have a table with composite key fld1,fld2,fld3,fld4. My normal way of
handling the situation is to create a unique primary key on
(fld1,fld2,fld3,fld4) and then create "single" non-primary indices on each of
the remaining fields (so essentially three indices - fld2-idx, fld3-idx and
fld4-idx).
Based on your experience, Is it more effective (in terms of speed of query
and cost of insert) to create a composite primary index like
(fld1-fld2-fld3-fld4), (fld2-fld1-fld-3-fld4) , (fld3-fld1-fld2-fld4) and
(fld4-fld1-fld2-fld3) thereby bringing all fields (fld1 to fld 4) on the
leftmost side.....OR Is it better to create one composite primary key index
(fld1-fld2-fld3-fld4) and three single non-primary indices on fld2,fld3 and
fld4 respectively ? Any particular preference one way or another?
Thanks!
Cheers
Manoj
----- Original Message -----
From: [EMAIL PROTECTED]
To: ManojW
Cc: MySQL List
Sent: Thursday, April 07, 2005 11:44 PM
Subject: Re: Question on Composite Index
"ManojW" <[EMAIL PROTECTED]> wrote on 04/06/2005 10:09:31 PM:
> Dear All,
> Just to get a better understanding of how indices work in MySQL - If I
> have a Innodb table with a composite primary key (fld1,fld2,fld3,fld4,fld5),
> then my understanding is that MySQL optimizes just the leftmost primary key
> (fld1 in this case).
>
> Hence a query like select * from tbl1 where fld2 > 900 would result in a
> full table scan even though it's part of the composite key but select *
> from tbl1 where fld1 > 900 would be extremely quicker since it would search
> based on Index pages.
>
> Is my understanding correct? If so, how can we get around this issue ? In
> real-life databases you will always run in cases where you end up making a
> composite key on table. One possible solution would be to create non-unique,
> non-primary index on each of fld2,fld3,fld4,fld5 but then the inserts would
> be horribly slow hence was wondering if I am totally missing a very clean
> solution to the whole issue.
>
> Your kind help would be greatly appreciated!
>
> Regards
>
> Manoj
>
>
I think you have the basics down. I can show you something similar to what
happens in an index when it is built. Maybe this will explain why you can only
use an index to resolve the left most columns of a multi-column index.
Imagine you have a table, Example1, with columns A, B, C, D, and E. For the
purposes of this demonstration, the table will consist of data representing
every possible combination of only 5 different values for each column (column A
will only contain the values a1, a2, a3, a4, and a5. The same goes for each of
the other columns). This means that a small section of the table could look
like (this represents data rows 1470-1480)
Example1
+--------------+
| A| B| C| D| E|
+--------------+
| ... |
|a3|b2|c4|d5|e1|
|a3|b2|c4|d5|e2|
|a3|b2|c4|d5|e3|
|a3|b2|c4|d5|e4|
|a3|b2|c4|d5|e5|
|a3|b2|c5|d1|e1|
|a3|b2|c5|d1|e2|
|a3|b2|c5|d1|e3|
|a3|b2|c5|d1|e4|
|a3|b2|c5|d1|e5|
|a3|b2|c5|d2|e1|
| ... |
+--------------+
Now let's create an index on the columns A, B, and C. Each index row will
contain the values of those columns for each row plus an offset into the
datafile of where to locate that row (so that you can retrieve values from
columns D or E). If each row's offset is an "o" value (o1 is where data row 1
starts, o2 is where data row 2 starts, etc...) then the index file looks
something like this
KEY(A,B,C)
+----------+-------------+
|Key values|Offset values|
+----------+-------------+
| ... |
| a3-b2-c4| o1470|
| a3-b2-c4| o1471|
| a3-b2-c4| o1472|
| a3-b2-c4| o1473|
| a3-b2-c4| o1474|
| a3-b2-c5| o1475|
| a3-b2-c5| o1476|
| a3-b2-c5| o1477|
| a3-b2-c5| o1478|
| a3-b2-c5| o1479|
| a3-b2-c5| o1480|
| ... |
+------------------------+
If you declare an index on 3 columns, all 3 columns are "hashed" together to
form the equivalent of a single value (this is not the ONLY way to hash values
together but it works as an illustration for the purposes of answering your
question). The index files are sorted according to their hashed values. That
means that it is very easy to find where the a3 values are (they are all
together) or the a3-b6 values (as they are also all together) but to find just
the b5 values in the index, you end up searching the whole thing because there
could be b5 values associated with ANY of the a* values.
So to answer your questions, in order for you to be able to use an index to
locate a row where a column has a particular value (or range of values) you
must be able to easily navigate to those values within the index. To do that in
a reasonable amount of time you have to be able to resolve the values for every
column that comes before it in the index as well. If the column you are trying
to match is the leftmost (first listed) column in an index, you only need to
match on the leftmost portion of the index's key (hash). If the column you want
to match is the second or later column in the hash, this index can't help you
make things work any faster. Yes, it may make INSERTs much slower if you put
each column at the beginning of their own index but you save time during SELECT
queries because you will probably be able to use an index to find the data you
seek.
Another advantage to indexes is that under the right circumstances, all of
the data you need to answer a query may be contained within the index itself.
If this is the case, the database engine doesn't even try to read the actual
data file, it just gets the data straight from the index file and finishes a
few steps sooner. Fewer trips to the disk = much faster response = happy
database users. Sure you are trading INSERT speed for SELECT speed (and
potentially lots of free space on your disks) but you can't get it both ways.
Tables with fewer indexes on them are less likely to use an index for any
random query so they will generally respond more slowly. The upside is, they
have a smaller footprint on the disk. If you can give up some INSERT speed and
some free space on your hard drives, then the extra indexes are usually worth
it for the performance boost they provide.
One way to keep indexes small is to only index the columns you need. How many
queries actually specify all 16 columns of a table in their WHERE clause?
Probably not too many. The engine can use the partial results of say a 3 column
match from an index (it narrows down the search area to a "short list" of rows)
and perform what is called a "range scan" on just those rows to evaluate the
other values in your WHERE clause.
There is a great section in the manual on optimization:
http://dev.mysql.com/doc/mysql/en/mysql-optimization.html
And here is what it says about optimizing WHERE evaluation:
http://dev.mysql.com/doc/mysql/en/where-optimizations.html
And this is about how to avoid table scans:
http://dev.mysql.com/doc/mysql/en/how-to-avoid-table-scan.html
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine