Hi!

On Oct 23, Matt W wrote:
> Hi Sergei,
>
> More full-text questions from me since I just noticed your code and doc
> changes. :-)
>
> What does this new WITH QUERY EXPANSION syntax do? More relevant
> results? More flexible? Faster? Is it for NLQ, boolean, or both (since
> both ft_[nlq | boolean]_search.c are changed)? Does it have something to
> do with 2 level indexes, or aren't they being used yet? Sorry for all
> the questions!

First - it's not pushed yet :)

Then, taking your questions in reverse order: no, it does not have
anything to do with 2-level indexing; it's for NLQ only; it's slower, not
faster; flexibility is unrelated; and yes, it is meant to give more
relevant results.

The idea, well known in Information Retrieval, is basically to perform a
search, take the top N documents, add them to the query, and redo the
search. It is expected to improve results for short queries (short query
text in AGAINST). I said "expected" because all the test collections that
I have use very long queries, so although query expansion did increase
recall significantly, overall results were worse. So I need to get test
collections with short queries and tune the algorithm somewhat.

And yes, it makes the search slower - sometimes noticeably slower. This
will also be fixed when I implement the so-called "unsafe" optimization
for NL search. It is this optimization that relies on the 2-level index
structure but, again, I need to get new test collections to do it, to
adjust the thresholds for the best results/speed.

> Also noticed that ft_max_word_len_for_sort has become a constant,
> instead of run-time definable, and ft_query_expansion_limit "replaces"
> it, though they don't sound related. I'm wondering about
> max_..._for_sort because, at least in 4.0, I need to lower it to 10-12
> to keep the temp files smaller when building the index. :-( Are the temp
> files going to get too big in 4.1 when I can't adjust
> ft_max_word_len_for_sort or is the algorithm different when indexing? If
> the temp files are the same size as 4.0, I wish ft_max_word_len_for_sort
> would be restored or I'm going to have problems. :-(

I removed it because I thought it was too complex and never used - so it
seemed better to remove it, so as not to confuse users and to keep the
number of variables manageable. If I'm wrong here - I'll put it back, of
course :)

As for making it 10-12 to keep temp files smaller - you will not need
that in 4.1, as 4.1 uses strlen(word)+const bytes per word in the temp
file, not ft_max_word_len_for_sort bytes.
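
The query-expansion idea described earlier in this reply (search, take the top N documents, add them to the query, redo the search) can be sketched in a few lines of Python. This is only a toy illustration, not the MySQL implementation: the term-count scoring, the whitespace tokenizer, and the names `search`, `search_with_expansion` and `top_n` are all assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real parser."""
    return text.lower().split()

def search(docs, query_terms):
    """Rank documents by how often they contain the query terms.

    Returns (doc_index, score) pairs, best first, skipping zero scores.
    The scoring is a deliberately naive assumption, not MySQL's ranking.
    """
    weights = Counter(query_terms)
    scored = []
    for i, doc in enumerate(docs):
        counts = Counter(tokenize(doc))
        score = sum(counts[term] * w for term, w in weights.items())
        if score > 0:
            scored.append((i, score))
    # Sort by descending score; the sort is stable, so ties keep doc order.
    scored.sort(key=lambda pair: -pair[1])
    return scored

def search_with_expansion(docs, query, top_n=2):
    """Search, append the text of the top_n hits to the query, search again."""
    terms = tokenize(query)
    first_pass = search(docs, terms)
    expanded = list(terms)
    for i, _score in first_pass[:top_n]:
        expanded.extend(tokenize(docs[i]))
    return first_pass, search(docs, expanded)

docs = [
    "mysql database server",
    "mysql fulltext index",
    "fulltext search engine",
    "cooking with garlic",
]
first, second = search_with_expansion(docs, "mysql", top_n=2)
print([i for i, _ in first])   # documents matching the short query
print([i for i, _ in second])  # documents matching the expanded query
```

With the short query "mysql", the first pass finds only the two documents that contain that word. The second pass also finds "fulltext search engine", because "fulltext" entered the query from one of the top hits. That is the recall gain mentioned above, and it also shows why the second search costs extra time.
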
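
To make the temp-file point concrete, here is a rough size comparison between a fixed slot per word and strlen(word)+const bytes per word. It is a back-of-the-envelope sketch: the value of the constant (`OVERHEAD = 4`) and the sample word list are assumptions, and real temp-file records may carry other fields that are ignored here.

```python
def fixed_width_bytes(words, max_word_len_for_sort):
    """Estimate with one fixed-size slot per word (the 4.0-style layout)."""
    return len(words) * max_word_len_for_sort

def variable_width_bytes(words, per_word_overhead):
    """Estimate with strlen(word) + const bytes per word (the 4.1-style layout)."""
    return sum(len(w.encode("utf-8")) + per_word_overhead for w in words)

words = ["mysql", "index", "fulltext", "query", "expansion"]
OVERHEAD = 4  # hypothetical value for "const"; the real constant is not given here

fixed_255 = fixed_width_bytes(words, 255)
fixed_20 = fixed_width_bytes(words, 20)
variable = variable_width_bytes(words, OVERHEAD)

print(fixed_255, fixed_20, variable)
```

Under these assumptions the variable-length layout is smaller than even the 20-byte fixed slot, which is why lowering the limit to 10-12 should no longer be needed to keep the temp files small.
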
(By the way, this nice feature - variable-length records in the temp
file - applies to normal indexes on VARCHAR/CHAR columns too :)

So, the main reason for ft_max_word_len_for_sort (reducing i/o
significantly - from 255 to 20 bytes/word) is gone. Another reason still
applies (that's why it's defined to be 20, and not 255) - in memory each
word occupies ft_max_word_len_for_sort bytes, so the smaller this value,
the more words fit in one "sort chunk". But it doesn't impact
performance as much as i/o does.

Words that are longer than ft_max_word_len_for_sort are not added to the
index during repair_by_sort - so they have to be added later using the
old and slow method (simply inserted into the b-tree). The way to choose
this value properly is to run "ft_dump -l" and set the threshold so that
almost all the words are shorter, but no memory is wasted on a few
extra-long words. 20 is almost always an ok value here.

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/                  www.mysql.com
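
The rule of thumb for choosing the word-length threshold (almost all words shorter, no memory wasted on a few very long ones) can be illustrated as follows. This sketch assumes you have already turned the word-length statistics into a `{length: count}` mapping; it does not parse ft_dump output. The 99% coverage target, the sample counts, and the 1 MiB chunk size are all hypothetical.

```python
def choose_threshold(length_counts, coverage=0.99):
    """Return the smallest length L such that at least `coverage`
    of all words are no longer than L."""
    total = sum(length_counts.values())
    covered = 0
    for length in sorted(length_counts):
        covered += length_counts[length]
        if covered >= coverage * total:
            return length
    raise ValueError("length_counts is empty")

def words_per_chunk(chunk_bytes, threshold):
    """Words that fit in one sort chunk if each takes `threshold` bytes
    in memory (any per-word overhead is ignored in this estimate)."""
    return chunk_bytes // threshold

# Hypothetical word-length histogram: {word length: number of words}.
length_counts = {3: 300, 5: 400, 8: 200, 12: 80, 18: 15, 40: 4, 120: 1}

threshold = choose_threshold(length_counts, coverage=0.99)
chunk = 1 << 20  # hypothetical 1 MiB sort chunk

print(threshold)
print(words_per_chunk(chunk, threshold))
print(words_per_chunk(chunk, max(length_counts)))
```

In this made-up distribution, covering the last few very long words would cost several times the per-word memory, so far fewer words would fit in each chunk. Leaving those few words to the slower b-tree insertion path is the trade-off described above.
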