On Tuesday 27 Nov 2001 20:17, Mike Wexler wrote:
> > The query
> >
> >   SELECT * FROM table
> >   WHERE MATCH (description) AGAINST ("+A +really +nice +toy" IN BOOLEAN
> > MODE) ORDER BY MATCH (description) AGAINST ("+A +really +nice +toy" IN
> > BOOLEAN MODE) LIMIT 10
> >
> > will abort the search when it retrieves the 500th document with all the
> > three words. It's impossible to say, how many index entries it will look
> > at.

I take it the "IN BOOLEAN MODE" part of the AGAINST() is going to be new to 
4.0.1.

Incidentally, how are the WHERE clauses handled when MATCH/AGAINST is used 
for FTS? Given that I am seeing a fairly linear increase in query time with 
the increase in number of matched terms, I would guess that the FTS is 
performed first. Especially since limiting other constraints in the WHERE 
clause produces no noticeable reduction in query time. This seems to be 
wasteful.

Considering that FTS is likely the slowest part of the query, it would 
probably be beneficial in terms of performance to have it execute last, with 
all other "simpler" constraints being satisfied first, so fewer records need 
to be searched.

Another question - is there a way to acquire a list of words in the FTS 
index? Someting like

SELECT          Word,
                count(*) AS Frequency
FROM            FTSIndex
GROUP BY                Word
ORDER BY                FREQUENCY ASC
LIMIT           100;

This would allow for easier overview of what "dead" words are being indexed, 
and therefore allow for easier isolation of new "stop words", and reduction 
in unnecessary searching that FTS would have to perform, thus increasing 
performance. Considering that I'm really after SELECT speed, would more 
careful tuning of stop words be likeky to yield signifficant performance 
improvements?

It would also be REALLY nice to have a "dynamic" list of stop words. I know 
you said that this is definitely planned, but it would be nice to know how 
soon...

Another thing - it would probably be useful to gather some statistics about 
FTS queries performed. For example, say the terms 'perl' or 'linux' appear 
10000 times in the queries (since the last checkpoint interval, or whatever). 
But the words 'nt' and 'vb' appear 0 times. It might be beneficial to tune 
the FTS in a slightly dynamic way, so that the frequently searched terms have 
a sort of "keep-alive", while the words that are never searched for (but are 
indexed) get added to the stop-word list.

OTOH, if a stop word does end up being searched for a number of times that 
exceeds some threshold (say, 0.1%), it could be removed from the stop word 
list. The FTS index would then need to be updated, but if the database 
structure is dynamic (for example a search engine of some sort) where the 
records get deleted and inserted all the time, the database would be able to 
"learn" what are "start words" and what are "stop words".

These statistics could be analyzed and acted upon when, say, ANALYZE is 
issued. It would make the FTS "self-tuning". Of course, the FTS feature in 
the TODO that allows specifiying explicitly strings that get indexed "as they 
are" becomes much more useful, then, as it can prevent things from being
automatically added to the "stop word" list during ANALYZE/OPTIMIZE stage.

Has any of this been at least thought about? I've just checked the TODO, and 
it doesn't appear to be there...

Looking forward to 4.0.1.

BTW, will the file formats be compatible? Or will it require a dump + restore 
of the database, when going from 4.0.0 to 4.0.1?

Regards.

Gordan


---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Reply via email to