Index length, mate — first thing to look at; second would be my.cnf optimizations.
Zaid

On 6/18/07, Al-Faisal El-Dajani <[EMAIL PROTECTED]> wrote:
> This is a lengthy rant about a MySQL situation I was in lately. It is not a
> question, just a recount of some events I went through and some of my
> conclusions. Feel free to ignore at will :).
>
> While working on a project the past two weeks, I had to deal with a rather
> large database to import and query. What follows is my experience on this
> project and some interesting trivia about how the MySQL server works. My
> machine is a bit ancient: a P4 2.8HT CPU, 512 MB DDR RAM, and a SATA hard
> disk. My operating system is Ubuntu 7.04, and the only other process
> running was Firefox (to pass the time reading bash :-p). The database
> server is MySQL 5.0.22, a default installation (except that I changed the
> default charset to utf8). All queries were run from the MySQL terminal
> client.
>
> Here's the database structure:
>
> Table A:
>   id int(11) primary key not null
>   num int(15) unique
>   contains ~1 million rows
>
> Table B:
>   id int(11) primary key not null
>   A_id int(11) references Table A (id), not null
>   attribute int(5) (contains around 130 distinct values)
>   contains ~13.5 million rows
>
> The idea is that, based on user input, I need to get Table A.num, filtered
> by whatever the user selects in Table B.attribute. My first reaction was
> to do a join between the tables:
>
>   SELECT DISTINCT a.num FROM a, b
>   WHERE a.id = b.a_id AND (b.attribute = x OR b.attribute = y OR ...)
>   ORDER BY rand() LIMIT z;
>
> After two and a half hours of hogging the CPU at 100%, I killed the
> process and decided it wasn't such a good idea. The problem here is that a
> Cartesian product was calculated first, before any filtering, which would
> generate a temporary table with 13,381,664 * 991,483 rows, followed by a
> pass over the entire temporary result set to apply my filter.
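For reference, the schema he describes would look roughly like the sketch below. The column definitions and row counts are taken from the post; everything else (statement form, comments) is my reconstruction.

```sql
-- Sketch of the schema as described in the post.
CREATE TABLE a (
    id  INT(11) NOT NULL PRIMARY KEY,
    num INT(15) UNIQUE
);                                 -- ~1 million rows

CREATE TABLE b (
    id        INT(11) NOT NULL PRIMARY KEY,
    a_id      INT(11) NOT NULL,    -- references a(id)
    attribute INT(5)               -- ~130 distinct values
);                                 -- ~13.5 million rows
```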
> After which, the result would be ordered according to a.num, duplicate
> values would be eliminated according to the DISTINCT rule, the result set
> would be re-ordered randomly, and finally LIMIT rows would be selected for
> display.
>
> So I decided to find another way. Asking some friends, I learned that
> rewriting it as a nested query would give me better results, so I changed
> my query to the following:
>
>   SELECT a.num FROM a WHERE a.id IN
>     (SELECT DISTINCT a_id FROM b WHERE b.attribute = x OR ...
>      ORDER BY rand() LIMIT z);
>
> The idea here was to work on table B by itself: filter the data, get a
> subset of rows that is z rows long, randomly ordered and filtered
> according to my criteria, and use it to fetch the data I required.
> First, I was surprised to find out that my database server version does
> not support ORDER BY ... LIMIT clauses in the nested query, which forced
> me to move them to the outer query. This makes the inner query generate a
> much larger result set, which is used to filter the outer table before
> randomizing the rows and taking the number I wanted. Needless to say, this
> is slower than what I wanted, but since it simply wasn't possible to do it
> the way I wanted, I rewrote it like so:
>
>   SELECT a.num FROM a WHERE a.id IN
>     (SELECT DISTINCT a_id FROM b WHERE b.attribute = x OR ...)
>   ORDER BY rand() LIMIT z;
>
> After about an hour and twenty minutes, the query finished and a result
> set was presented. I did a little victory dance, but I knew this huge
> delay wouldn't be tolerated, so I had to think about how to speed it up
> even further.
>
> Now, I've heard a lot about indices, but never actually had to use them.
> Other than my primary key or unique constraints, I never created an index.
> So I decided to try this index thing, thinking it wouldn't help much...
> but what the heck. I created an index on the attribute field in table B.
> The process took about 3 hours.
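For what it's worth, while MySQL 5.0 rejects LIMIT inside an IN subquery, it does accept it inside a derived table in the FROM clause. So a rewrite along these lines (a sketch only; x, y, and z stand in for the user's values, and "picked" is a name I made up) might have kept the LIMIT where the original nested query wanted it:

```sql
-- Sketch: push the filter, randomization, and LIMIT into a derived
-- table, then join the small result back against table a.
SELECT a.num
FROM a
JOIN (SELECT DISTINCT a_id
      FROM b
      WHERE b.attribute IN (x, y)   -- placeholder filter values
      ORDER BY RAND()
      LIMIT z) AS picked            -- z = number of rows wanted
  ON a.id = picked.a_id;
```

This way the expensive randomize-and-limit step runs over table B's filtered rows only, and the outer join touches just z rows of table a.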
> Re-running both of my queries, I was extremely surprised to see that the
> first query (the join) was actually FASTER than my second query (the
> nested one). The only thing I managed to conclude was that MySQL doesn't
> use indices in inner statements (not sure; if you have a better
> explanation, please do share). The nested query took about the same time
> to finish (a few minutes faster, but since I only ran it once it might
> have been a fluke), while the join statement was reduced to a whopping
> 20-minute execution time.
>
> Getting high on the results, and gaining some new-found respect for
> indices, I decided to create a second index, this time on the fields
> (attribute, a_id) in table B. Creating this index took a LOT of time,
> like 6-7 hours, but the results were worth it. Now my join query finishes
> in about 40-50 seconds. Including a_id in the index actually makes a
> difference because in my query the first condition is "a.id = b.a_id",
> and it gets evaluated first, so by the time we get to the filtering part
> we don't have a Cartesian product, only the rows that are related.
>
> I'm at this stage now, still looking for ways to make it even faster;
> however, I've gained a lot of respect for indices. I still have no idea
> how Google manages to finish our search queries in a fraction of a
> second. I know they have WAY better hardware, but they also have WAY more
> information on their servers. One thing, though, is that they have the
> ability to tune their DB server, something I don't have access to. Also,
> when you have a read-only database (like in my case), indices are a VERY
> good investment. And understanding how your DB server works, and how a
> query gets evaluated, can give you a huge advantage.
>
> Special thanks go to my project manager, Ghassan Noubani, for rewriting
> my join query to its present form and suggesting the second index.
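The two indexes described above would be created roughly as below (the index names are mine). The likely reason the composite one helped so much is that it covers both the filter column and the join column, so the matching a_id values can be read from the index alone without touching the 13.5 million table rows. EXPLAIN is the usual way to confirm the optimizer actually picks it:

```sql
-- The single-column index tried first:
CREATE INDEX idx_b_attribute ON b (attribute);

-- The composite index that brought the join down to ~40-50 seconds;
-- it serves both the WHERE clause and the join column from the index:
CREATE INDEX idx_b_attr_aid ON b (attribute, a_id);

-- Check which index the optimizer chooses (x, y are placeholders):
EXPLAIN SELECT DISTINCT a.num
FROM a, b
WHERE a.id = b.a_id AND b.attribute IN (x, y);
```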
>
> --
> Al-Faisal El-Dajani
> Phone: +962-7-77 799 781
> P.O. Box: 140056
> 11814 Amman, Jordan

--
---------------------------
Netiquette -> http://www.dtcc.edu/cs/rfc1855.html
Netiquette Nazi -> http://redwing.hutman.net/%7Emreed/warriorshtm/netiquettenazi.htm
---------------------------

You received this message because you are subscribed to the Google Groups "Jolug" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/Jolug?hl=en-GB

