suggestion for more efficient inner join algorithm

Tim Samshuijzen Wed, 02 Apr 2003 05:47:25 -0800

Hello,

We run a large on-line books database which is searched a lot. We are using MySQL but we are seriously running into optimisation problems because the searches are becoming slower and slower as the database grows. Simplifying our situation, visitors need to search for author and/or title and whether the book is available. We have created our own form of full-text-search by extracting all words from the author and title fields and inserting them in a keywords table. To save space, we reference the keywords as numbers, as each unique word has an entry in a separate dictionary table.

Oversimplified, the database looks like this:

CREATE TABLE books (
   BookNumber int(11) NOT NULL auto_increment,
   Author varchar(60) NOT NULL,
   Titel text NOT NULL,
   Available tinyint(4) NOT NULL,
   PRIMARY KEY (BookNumber),
   KEY FirstBook (FirstEdition, BookNumber)
);

CREATE TABLE dictionary (
   WordNumber int(11) NOT NULL auto_increment,
   Keyword varchar(50) NOT NULL,
   PRIMARY KEY (WordNumber),
   KEY Keyword (Keyword)
);

CREATE TABLE keywords (
   WordNumber int(11) NOT NULL,
   BookNumber int(11) NOT NULL,
   KEY WordBook (WordNumber, BookNumber),
   KEY BookNumber (BookNumber)
);

When someone does a search, first our PHP script locates the WordNumber values of the keywords entered for searching. The eventual main query is basically something like:

select
  b.BookID
from
  books as b, keywords as k
where
  k.BookID = b.BookID AND
  b.Available = 1 AND
  k.WordNumber = 1234
LIMIT 0,50;

Basically, this query does the following:
  - select the books which are available
  - select the keyword records where the WordNumber field equals 1234
  - return only the BookID values which appear in both these result sets

The way MySQL seems to do the query is:
 - find the smallest result set of the two sub-selects
 - for each record in the smallest result set return the BookID values that
   appear in the other result.

This method is fine if one of the sub-selects is a small result. But in our case all the sub-selects are often all very large (>1000000). The above query takes about 5 seconds on our machine, which is much too slow.

Mathematically speaking the above method does not provide the fastest algorithm.

Because of the indexing, each sub-select is (can be) sorted by BookID. This allows for the following very fast algorithm:

1. set a reading position to the beginning of the first sub-result 2. set a reading position to the beginning of the second sub-result 3. if an EOF is reached then exit 4. read the BookID value of the current reading position in the first sub-result 5. read the BookID value of the current reading position in the second sub-result 6. if each BookID is equal, then store the BookID value in the final result set, increment both reading positions, go to 3. 7. see which BookID is the highest value. 4. within the sub-select that holds the lowest value, move the reading position to the next value that is at least greater than the other current BookID value. 5. go to 3

The above algorithm is very fast because the sub-results are sorted by BookID.
One can imagine the algorithm to be very similar for selects which perform more
than two sub-selects.

I have tested this algorithm from a PHP script. The script terminates after finding 50 results (similar to LIMIT 0,50). The script finishes after executing about 5000 queries, all of which are performed very fast. The eventual search time is about 5 seconds. Therefore, this takes the same amount of time as the single query which we use now!!!! I can imagine that if this fast algorithm is built into the source code of MySQL then these searches become incredibly fast. Especially because each of the 5000 queries are single index reads.

I have heard more complaints from other MySQL users/developers who use similar types of queries. Can anyone tell me why this algorithm is not handled in MySQL? Is the development team planning to use such a search method? If not, then I will build this search algorithm into MySQL myself. Is the MySQL development team interested in such a search algorithm?

Finally, we find it a major problem that the SORT BY ... DESC is not optimised. It almost always does a file-sort. We want to use this because we want to show the most recently added records first. If this is also not on the immediate to-do list of the MySQL development team, then I may aswell build this into the source code myself too.

I look forward to any reactions

Best regards,

Tim Samshuijzen
Software & database engineer
mailto:[EMAIL PROTECTED]


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

suggestion for more efficient inner join algorithm

Reply via email to