Hi!

On Oct 23, Matt W wrote:
> Hi Sergei,
> 
> More full-text questions from me since I just noticed your code and doc
> changes. :-)
> 
> What does this new WITH QUERY EXPANSION syntax do? More relevant
> results? More flexible? Faster? Is it for NLQ, boolean, or both (since
> both ft_[nlq | boolean]_search.c are changed)? Does it have something to
> do with 2 level indexes, or aren't they being used yet? Sorry for all
> the questions!

First - it's not pushed yet :)
Then - no, it does not have anything to do with 2-level indexing; it's
for NLQ only, slower, unrelated, yes.

The idea - well known in Information Retrieval science - basically is to
perform a search, take top N documents, add them to the query, and redo
the search.
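
To make the idea concrete, here is a minimal sketch of that two-pass
search in Python. It is only an illustration, not the code in
ft_nlq_search.c: the scoring function is a toy term count, and all names
(score, search, search_with_expansion, top_n) are made up for this
example.

```python
from collections import Counter

def score(query_terms, doc_terms):
    """Toy relevance: how often the distinct query terms occur in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))

def search(query_terms, docs):
    """Return (doc_id, score) pairs with score > 0, best first."""
    ranked = [(doc_id, score(query_terms, terms)) for doc_id, terms in docs.items()]
    ranked = [(d, s) for d, s in ranked if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

def search_with_expansion(query_terms, docs, top_n=1):
    """Search, add the words of the top_n documents to the query, search again."""
    first_pass = search(query_terms, docs)
    expanded = list(query_terms)
    for doc_id, _ in first_pass[:top_n]:
        expanded.extend(docs[doc_id])
    return search(expanded, docs)

# Hypothetical four-document collection.
docs = {
    1: "mysql is a database".split(),
    2: "mysql database server tuning".split(),
    3: "postgres database server".split(),
    4: "cooking recipes".split(),
}
plain = [d for d, _ in search(["mysql"], docs)]
expanded = [d for d, _ in search_with_expansion(["mysql"], docs)]
```

With this toy data the plain search finds documents 1 and 2, while the
expanded search also returns document 3, which never mentions "mysql".
That is the recall gain, and also the way less relevant rows can get in.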

It is expected to improve results for short queries (short query text in
AGAINST). I said "expected" because all test collections that I have use
very long queries, so though query expansion did increase recall
significantly, overall results were worse.

So, I need to get test collections with short queries and tune the
algorithm somewhat.

And yes, it makes the search slower - sometimes noticeably
slower. This will also be fixed when I implement the so-called "unsafe"
optimization for NL search. It is this optimization that relies on the
2-level index structure, but, again, I need to get new test collections
to do it, to adjust thresholds for best results/speed.

> Also noticed that ft_max_word_len_for_sort has become a constant,
> instead of run-time definable, and ft_query_expansion_limit "replaces"
> it, though they don't sound related. I'm wondering about
> max_..._for_sort because, at least in 4.0, I need to lower it to 10-12
> to keep the temp files smaller when building the index. :-( Are the temp
> files going to get too big in 4.1 when I can't adjust
> ft_max_word_len_for_sort or is the algorithm different when indexing? If
> the temp files are the same size as 4.0, I wish ft_max_word_len_for_sort
> would be restored or I'm going to have problems. :-(

I removed it because I thought it was too complex and never used - so
it's better to remove it, so as not to confuse users and to keep the
number of variables manageable. If I'm wrong here - I'll put it back,
of course :)

As for making it 10-12 to keep temp files smaller - you will not need
that in 4.1, as 4.1 uses strlen(word)+const bytes per word in the temp
file, not ft_max_word_len_for_sort bytes. (By the way, this nice feature
applies to normal indexes on VARCHAR/CHAR columns too :).

So, the main reason for ft_max_word_len_for_sort (reducing i/o
significantly - from 255 to 20 bytes/word) no longer applies.
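
A rough back-of-the-envelope comparison, as a Python sketch. The value of
the constant is not given here, so the 8 bytes below is purely an
assumed placeholder, and any other per-record overhead in the real temp
file is ignored; the function names are made up for the example.

```python
def fixed_width_bytes(words, width):
    """4.0-style estimate: every word takes `width` bytes in the temp file."""
    return len(words) * width

def variable_width_bytes(words, overhead):
    """4.1-style estimate: strlen(word) + const bytes per word."""
    return sum(len(w.encode("utf-8")) + overhead for w in words)

words = ["full", "text", "search", "index"]

at_255 = fixed_width_bytes(words, 255)    # 4 words * 255 bytes
at_20 = fixed_width_bytes(words, 20)      # 4 words * 20 bytes
# ASSUMPTION: const = 8 is a placeholder, not the real value.
variable = variable_width_bytes(words, 8)
```

For these four words that gives 1020 bytes at 255 bytes/word, 80 bytes at
20 bytes/word, and 51 bytes with the variable-length layout under the
assumed constant.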

Another reason still applies (that's why it's defined to be 20, and not
255) - in memory each word occupies ft_max_word_len_for_sort bytes, so
the smaller this value the more words will fit in one "sort chunk".
But it doesn't impact performance as much as i/o.

Words that are longer than ft_max_word_len_for_sort will not be added to
the index during repair_by_sort - so they have to be added later using
the old and slow method (simply inserted into the b-tree).

The way to choose this value properly is to run "ft_dump -l"
and set the threshold so that almost all the words are shorter,
but no memory is wasted on a few extra-long words. 20 is almost always
an ok value here.
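
One way to turn a word-length distribution into a threshold, as a Python
sketch. The input is assumed to be a plain {length: word count} mapping
that you build yourself from the dump; this does not parse real
"ft_dump -l" output, and the 99% coverage target and the name
pick_sort_length are arbitrary choices for the example.

```python
def pick_sort_length(length_counts, coverage=0.99):
    """Smallest length L such that at least `coverage` of all words
    have length <= L."""
    total = sum(length_counts.values())
    if total == 0:
        raise ValueError("empty length distribution")
    seen = 0
    for length in sorted(length_counts):
        seen += length_counts[length]
        if seen >= coverage * total:
            return length
    return max(length_counts)

# Hypothetical distribution: 1000 words, five of them very long.
histogram = {3: 500, 5: 300, 8: 150, 12: 45, 40: 5}

threshold = pick_sort_length(histogram)           # covers 99% of the words
cover_all = pick_sort_length(histogram, 1.0)      # covers every word
```

Here 12 covers 995 of the 1000 words, while covering all of them would
need 40 bytes per word in memory just for the five long ones.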

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/  www.mysql.com
