This is a lengthy rant about a MySQL situation I was in lately. It is not a
question, just a recounting of some events I went through and some of my
conclusions. Feel free to ignore at will :).

While working on a project over the past two weeks, I had to deal with a
rather large database to import and query. What follows is my experience on
this project and some interesting trivia about how the MySQL server works.
My machine is a bit ancient: a P4 2.8 HT CPU, 512 MB DDR RAM, and a SATA
hard disk. My operating system is Ubuntu 7.04, and the only other process
running was Firefox (to pass the time reading bash :-p). The database
server is MySQL 5.0.22, a default installation (except that I changed the
default charset to utf8). All queries were run using the MySQL terminal
client.

Here's the database structure:
Table A:
    id  int(11) primary key not null
    num int(15) unique
contains ~1 million rows

Table B:
    id        int(11) primary key not null
    a_id      int(11) not null, references Table A(id)
    attribute int(5)  (around 130 distinct values)
contains ~13.5 million rows
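For anyone who wants to play along at home, here is a toy version of the
schema, using SQLite from Python as a stand-in for MySQL (the table and
column names match the ones above; the sample data is invented):

```python
import sqlite3

# Toy sketch of the two-table schema from the post, on SQLite.
# Table and column names follow the post; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (
        id  INTEGER PRIMARY KEY NOT NULL,
        num INTEGER UNIQUE
    );
    CREATE TABLE b (
        id        INTEGER PRIMARY KEY NOT NULL,
        a_id      INTEGER NOT NULL REFERENCES a(id),
        attribute INTEGER
    );
""")
conn.executemany("INSERT INTO a (id, num) VALUES (?, ?)",
                 [(i, 1000 + i) for i in range(1, 6)])
conn.executemany("INSERT INTO b (id, a_id, attribute) VALUES (?, ?, ?)",
                 [(1, 1, 7), (2, 1, 7), (3, 2, 9), (4, 3, 7), (5, 5, 2)])
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM b").fetchone()[0])  # 5 rows in b
```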

Now, the idea is that based on user input I need to get Table A.num,
filtered by whatever the user selects from Table B.attribute. My first
reaction was to do a join between the tables:

SELECT DISTINCT a.num FROM a, b WHERE a.id = b.a_id AND (b.attribute = x OR
b.attribute = y OR ...) ORDER BY rand() LIMIT z;

After two and a half hours of hogging the CPU at 100%, I killed the process
and decided it wasn't such a good idea. The problem, as I understood it, is
that a Cartesian product was calculated first, before any filtering, which
would generate a temporary table with 13,381,664 * 991,483 rows, followed
by a pass over the entire temporary result set to apply my filter. After
that, the result would be sorted on a.num so duplicate values could be
omitted according to the DISTINCT rule, followed by re-ordering the result
set randomly, and finally selecting LIMIT rows to display.

So I decided to find another way. After asking some friends, I learned that
rewriting it as a nested query would give me better results, so I changed
my query to the following:

SELECT a.num FROM a WHERE a.id IN (SELECT DISTINCT a_id FROM b WHERE
b.attribute = x or ... ORDER BY rand() LIMIT z);

The idea here was to work on table B by itself: filter the data, get a
subset that is z rows long, randomly ordered, and filtered according to my
criteria, and use it to get the data I required.
I was first surprised to find out that my database server version does not
support LIMIT inside an IN subquery (so my ORDER BY rand() ... LIMIT trick
could not live there), which forced me to move those clauses to the outer
query. This makes the inner query generate a much larger result set, which
is then used to filter the outer table before randomizing it and taking the
number of rows I wanted. Needless to say, this is slower than what I
wanted, but since it simply wasn't possible to do it the way I wanted, I
rewrote it like so:

SELECT a.num FROM a WHERE a.id IN (SELECT DISTINCT a_id FROM b WHERE
b.attribute = x or ...) ORDER BY rand() LIMIT z;
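The two query shapes can be compared side by side on a toy dataset. Here is
a sketch using SQLite from Python as a stand-in (the data is invented, and
SQLite's optimizer differs from MySQL 5.0's — it happily accepts LIMIT in
subqueries, for one — so this only demonstrates that both forms return the
same rows):

```python
import sqlite3

# Compare the join form and the nested IN form of the query on toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, num INTEGER UNIQUE);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER NOT NULL,
                    attribute INTEGER);
    INSERT INTO a VALUES (1, 101), (2, 102), (3, 103), (4, 104);
    INSERT INTO b VALUES (1, 1, 7), (2, 1, 7), (3, 2, 9), (4, 4, 7);
""")

# Join form: DISTINCT collapses the duplicate match for a.id = 1.
join_rows = conn.execute("""
    SELECT DISTINCT a.num FROM a, b
    WHERE a.id = b.a_id AND (b.attribute = 7 OR b.attribute = 9)
""").fetchall()

# Nested form: filter b on its own first, then pull matching rows from a.
nested_rows = conn.execute("""
    SELECT a.num FROM a WHERE a.id IN
        (SELECT DISTINCT a_id FROM b WHERE attribute = 7 OR attribute = 9)
""").fetchall()

print(sorted(join_rows) == sorted(nested_rows))  # True: same result set
```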

After about an hour and twenty minutes, the query finished and a result set
was presented. I did a little victory dance, but I knew this huge delay
wouldn't be tolerated, so I had to think about how to speed it up further.

Now, I've heard a lot about indices, but never actually had to use them.
Other than my primary keys and unique constraints, I had never created an
index. So I decided to try this index thing, thinking that it wouldn't help
much... but what the heck. I created an index on the attribute field in
table B. The process took about 3 hours.
Re-running both of my queries, I was extremely surprised to see that the
first query (the join) was actually FASTER than my second query (the nested
one). The best explanation I could come up with is that MySQL 5.0 rewrites
an IN subquery into a correlated EXISTS that is re-executed for each outer
row, which largely defeats the index (not sure; if you have a better
explanation, please do share). The nested query took about the same time to
finish (a few minutes faster, but since I only ran it once it might have
been a fluke), while the join statement was reduced to a whopping 20-minute
execution time.
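To see what an index changes without waiting hours, here is a small sketch
using SQLite's EXPLAIN QUERY PLAN from Python (MySQL's EXPLAIN plays the
same role; the plan wording and the data here are illustrative only):

```python
import sqlite3

# Show how the query plan changes once an index on b.attribute exists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER NOT NULL,
                    attribute INTEGER);
    INSERT INTO b VALUES (1, 1, 7), (2, 2, 9), (3, 3, 7);
""")
query = "SELECT DISTINCT a_id FROM b WHERE attribute = 7"

# Without an index the plan is a full scan of b.
before = " ".join(row[-1] for row in
                  conn.execute("EXPLAIN QUERY PLAN " + query))

conn.execute("CREATE INDEX idx_b_attribute ON b (attribute)")

# With the index the plan becomes a search using idx_b_attribute.
after = " ".join(row[-1] for row in
                 conn.execute("EXPLAIN QUERY PLAN " + query))

print(before)
print(after)
```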

High on the results, and having gained some newfound respect for indices, I
decided to create a second index, this time on the pair of fields
(attribute, a_id) in table B. Creating this index took a LOT of time, like
6-7 hours, but the results were worth it. Now my join query finishes in
about 40-50 seconds. Including a_id in the index makes a difference because
both the filter column (attribute) and the join column (a_id) can now be
read straight out of the index, so by the time the join with table A
happens we don't have anything like a Cartesian product — only the rows
that are actually related.
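This effect can be seen in miniature with SQLite again: once the index
contains both attribute and a_id, the plan reports a "covering" index,
meaning the part of the query that touches table B is answered from the
index alone, without ever reading the table's rows (SQLite stand-in,
invented data):

```python
import sqlite3

# Rough illustration of the covering-index effect: the composite index
# (attribute, a_id) satisfies the filter and the selected column by itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER NOT NULL,
                    attribute INTEGER);
    INSERT INTO b VALUES (1, 1, 7), (2, 2, 9), (3, 3, 7);
    CREATE INDEX idx_b_attr_aid ON b (attribute, a_id);
""")
plan = " ".join(row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT a_id FROM b WHERE attribute = 7"))
print("COVERING INDEX" in plan)  # True: table B's rows are never touched
```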

I'm at this stage now, still looking for ways to make it even faster, but I
have gained a lot of respect for indices. I still have no idea how Google
manages to finish our search queries in a fraction of a second; I know they
have WAY better hardware, but they also have WAY more information on their
servers. One thing, though, is that they have the ability to tune their DB
server, which I don't have access to. Also, when you have a read-only
database (like in my case), indices are a VERY good investment. Finally,
understanding how your DB server works, and how a query gets evaluated, can
give you a huge advantage.

Special thanks go to my project manager, Ghassan Noubani, for rewriting my
join query into its present form and suggesting the second index.

-- 
Al-Faisal El-Dajani
Phone: +962-7-77 799 781
P.O Box: 140056
11814 Amman, Jordan

