Hello,

I'm designing a tool to archive mailing list and I'm using a built-in
search engine with a database structure similar to the structure
use by udmsearch (crc32 multi).

For a query where you want to match all keywords, I think that
a single query joining all tables is faster than several queries
using a temporary database.

For udmsearch, this would mean replacing these queries :

CREATE TABLE t9690554262079472 ( url_id INT DEFAULT '0' NOT NULL, word_id
INT DEFAULT '0' NOT NULL, intag TINYINT DEFAULT '0' NOT NULL, KEY
i1t9690554262079472(url_id), KEY i2t9690554262079472(word_id))
----------------------------------------------------------------------------
----
INSERT INTO t9690554262079472 SELECT url_id,word_id,intag FROM ndict6 WHERE
word_id = 1739405965
----------------------------------------------------------------------------
----
INSERT INTO t9690554262079472 SELECT url_id,word_id,intag FROM ndict6 WHERE
word_id = -1532716185
----------------------------------------------------------------------------
----
INSERT INTO t9690554262079472 SELECT url_id,word_id,intag FROM ndict10 WHERE
word_id = 1201902955
----------------------------------------------------------------------------
----
SELECT SQL_SMALL_RESULT url_id, sum(intag) as r FROM t9690554262079472 GROUP
BY url_id HAVING (( ( sum(word_id=1739405965)>0 ) AND
 sum(word_id=-1532716185)>0 ) AND ( sum(word_id=1201902955)>0 ) )) ORDER BY
r DESC
----------------------------------------------------------------------------
----


by this one :

SELECT SQL_SMALL_RESULT
d1.url_id, d1.intag+d2.intag+d3.intag as r
FROM ndict6 d1,ndict6 d2,ndict10 d3
where d1.word_id = -1532716185 and
d2.word_id=-1532716185 and d2.url_id=d1.url_id and
d3.word_id=1201902955 and d3.url_id=d1.url_id
ORDER BY r DESC

Theorically, the idea it that it is not optimal to search all words
independently, and then "join" the results.

here is the optimizer output :

+-------+------+--------------------+-----------+---------+-----------+-----
-+-------+
| table | type | possible_keys      | key       | key_len | ref       | rows
| Extra |
+-------+------+--------------------+-----------+---------+-----------+-----
-+-------+
| d3    | ref  | url_id10,word_id10 | word_id10 |       4 | ???       |  100
|       |
| d1    | ref  | url_id6,word_id6   | url_id6   |       4 | d3.url_id |   10
|       |
| d2    | ref  | url_id6,word_id6   | url_id6   |       4 | d1.url_id |   10
|       |
+-------+------+--------------------+-----------+---------+-----------+-----
-+-------+

It search the urls matching the third keyword first,
then for each url found checks if it matches the first
keyword, then the same goes for the second keyword.

Someone proposed recently to use subqueries for such
a task, the query I propose eliminates this problem. I don't like
subqueries and they aren't compatible with all dbms, and I
prefer to let the system decide about which strategy to use.

If your dbms optimizer is smart enough, it can decide like in the
example above that it's better to start with the third keyword which
is the most restrictive, then check for each url if it matches the second
keyword (as key on line 2 is url_il6 and not word_id6, it would have
searched the documents matching the first keyword and joined both
if it was better)

I don't have a big enough database for significative tests.
If you have a big database with a slow query, please try to
use the new version, or give me the debug output and I'll
send you back a modified query to test, I'd like to see if
it's really faster. Or provide me a shell access to your database
where I can test it myself (unix only, I only know about
mysql,postgresql, sybase !)

I'm interest to know for various dbms if the optimizer is indeed
clever enough, for example to start with the most significative
keyword. If not, you could force the order of the query using
stats for keywords (word_id,number of urls)

Alain



______________
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]

Reply via email to