Hi!

On Dec 19, Matt W wrote:
> > > Just have a couple more full-text search inquiries here. :-)
> > >
> > > I'm not exactly clear on how matching rows are found when searching
> > > for 2 or more required words: '+word1 +word2'. I understand that it
> > > can't currently know which word occurs less, so that it can be
> > > searched first -- this optimization will come with 4.1's 2-level
> > > indexes. :-)
> > >
> > > I just want to know, when it finds a match for whichever word is
> > > tried first, how does it check if the other required word(s) are
> > > present in the same row? Say that word1 and word2 are each present
> > > in 100,000 rows.
> > >
> > > 2) It *seems* the best way would be to do a lookup for
> > > (word2 + <rowid for word1>) and see if there's a match. Is this
> > > what's done? I'm not sure it's possible though with the way the
> > > index is structured...
> >
> > It is possible, but it is only sensible if word1 is much more rare
> > than word1. This could be done with 2-level indexes :)
>
> I assume that should say "if word2 is much more rare than word1." I
> guess that's because it would need too many [random?] index lookups
> otherwise?
No, if word1 is more rare.

In your approach, as far as I understand it, one does an index scan for
all occurrences of word1 and an index lookup to "word2+<rowid for word1>"
for each found word1. Thus the total number of index lookups is
2*(number_of_word1s), and for the best performance word1 should be more
rare than word2.

> > > 3) Or, and I'm thinking *maybe* this is how it's done from what I've
> > > heard, does it get all the matches for word1, then for word2, and
> > > then "intersect" them to find ones which are present in the same
> > > row? If so, how will the 2-level index optimization change things?
> > > Will it do #2?
> >
> > Yes to both questions, without the word "then".
> > First, one match is found for each word. Then read_next is called for
> > the word with the lowest rowid, etc.
>
> Not completely clear on this. :-) I get that one match is found for
> each word... then whichever word occurred first in the table (lowest
> rowid) is... what? :-/
>
> Oh, wait, I get what you mean! ;-) You're saying that read_next is
> called for the word with the lowest rowid until you see if the rowid
> matches the rowid from the other word(s)? Then if the rowid gets
> greater than what you're looking for, you know that there's no matching
> row? (Since you say that each word is sorted by rowid -- see below
> about that, though.)
>
> Then I'm not sure what happens to find the next matching row. Find a
> match again for each word starting after the last found rowid? I'm not
> familiar enough with the MySQL code (or C) to understand what's going
> on in ft_boolean_search.c. :-(

I'm using a "priority heap" - a data structure that can efficiently find
the smallest (or largest - depending on the comparison function :)
element of a list. See http://www.nist.gov/dads/HTML/priorityque.html

In our implementation of heaps the top element is the smallest one -
think of it as lighter elements floating up while heavier ones sink down.
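To make the heap-merge idea concrete before the pseudocode, here is a
minimal runnable sketch in Python using the standard heapq module. This
is an illustration only, not MySQL's actual code; the posting lists and
the function name match_all_words are made up for the example. Each
word's postings are sorted by rowid, and a row matches a "+word1 +word2"
query when every required word surfaces at the same rowid:

```python
# Sketch of the boolean-search merge (NOT the real MyISAM code).
# Each word's postings are already sorted by rowid, as in a MyISAM
# fulltext index (ORDER BY word, rowid).  A min-heap keeps the smallest
# current rowid on top; when every required word sits on the same
# rowid, that row matches.
import heapq

def match_all_words(postings):
    """postings: dict word -> sorted list of rowids.
    Yields rowids containing every word (the '+word1 +word2' case)."""
    iters = {w: iter(rows) for w, rows in postings.items()}
    heap = []
    for w, it in iters.items():
        rowid = next(it, None)
        if rowid is None:
            return                       # a required word never occurs
        heap.append((rowid, w))
    heapq.heapify(heap)
    nwords = len(postings)
    while True:
        current = heap[0][0]             # smallest rowid on the heap
        count = 0
        # advance every word whose current posting equals `current`
        while heap and heap[0][0] == current:
            _, w = heap[0]
            count += 1
            nxt = next(iters[w], None)
            if nxt is None:
                heapq.heappop(heap)      # this word is exhausted
            else:
                heapq.heapreplace(heap, (nxt, w))
        if count == nwords:              # all required words hit this row
            yield current
        if len(heap) < nwords:           # some word exhausted: no more
            return                       # full matches are possible

# word1 and word2 share rowids 7 and 42 only
demo = {"word1": [3, 7, 20, 42], "word2": [7, 9, 42, 50]}
print(list(match_all_words(demo)))       # -> [7, 42]
```

Note how matches are yielded as soon as they are found, which is why a
LIMIT can stop the scan early instead of waiting for it to complete.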
The pseudocode is like this:

  for each word[i]
  {
    rowid = index_read(word[i]);
    heap->add( <word[i], rowid> );   // adding a struct of two elements;
                                     // rowid is the sorting key
  }

  while heap->is_not_empty()
  {
    current = heap->top->rowid;      // get the smallest rowid
    ... init matching tree and counters ...
    while heap->top->rowid == current
    {
      word  = heap->top->word;
      rowid = index_next(word);
      heap->replace_top(word, rowid); // after this replace(), this word
                                      // went down; another word is now
                                      // on the top
      ... update matching tree and counters accordingly ...
    }
    // no more words for the rowid == current
    ... check matching tree and counters ...
    if (query is satisfied)
      return;
  }

> > The advantage is that matches are found and returned earlier - a user
> > doesn't have to wait for the index scan to complete. Also LIMIT, if
> > used, cuts off more work; that is, LIMIT is more effective.
>
> Right, I know that LIMIT helps -- as long as there's no ORDER BY, etc.
> that needs to find all rows to sort. :-)
>
> That brings me to the main reason for asking these questions: for
> searching on my forum system. There could be 5-10+ million posts, which
> would put upwards of 250+ million entries in the full-text index. I'll
> probably use a LIMIT of 10-20,000 (w/o ORDER BY) to prevent searches
> from taking too long (and returning way too many matches!).
>
> However, with that many posts, I think it's possible that a search
> could match 100k+ posts. Then the search would have to be narrowed down
> to a particular forum or date range. If I add "AND forumid=123" etc. to
> the WHERE clause, that will make the search hit the data file for
> who-knows-how-many posts that aren't in the desired forum -- thus
> scanning more data rows than specified by LIMIT.
>
> But you told list member Steven Roussey one time that he could add the
> forum id to a character column and include it in the full-text index.
Actually, it was he who wrote about this trick, not me :)

> I thought I'd try that too -- along with some text for the month and
> year for date range searches. Although now I'm not sure that's a good
> idea, because what if there's 1 million posts in a single forum (is it
> bad to have the same "word" for a forum id in 1 million rows??) and a
> search is done that would actually find < LIMIT rows *across all
> forums*? If I include the forum id in the search, the current full-text
> code will look at all 1M "words" for that forum id. Much slower than
> just doing the search and manually checking forum id. :-( Of course, if
> the situation is reversed (common search in a *small* forum), this
> method would be faster than manually checking.
>
> Any other ideas for improving performance when searching a small subset
> of the full-text index? Too bad we can't include numeric columns in the
> full-text index (which would work as usual in the WHERE, not in
> MATCH ()).

I thought about this feature - but all the ideas I've got make natural
language search almost impossible. It relies on global, collection-wide
statistics, and a combined index would need to know these statistics for
any subset of the tree, as in

  SELECT ... WHERE forumid=123 AND MATCH ...
  SELECT ... WHERE forumid=123 OR forumid=124 AND MATCH ...
  SELECT ... WHERE forumid>15 AND forumid<25 AND MATCH ...
  SELECT ... WHERE forumid>23 AND MATCH ...

So - I'm still thinking about it...

As for other ideas: if the number of forums is reasonably small, you can
create one table per forum and combine them with a MERGE table.

> > Looking at the description above of how "+word1 +word2" works, you
> > can see that it relies on the fact that key entries for each word are
> > sorted by rowid. But that is how MyISAM indexes work: they are always
> > ordered by all keyparts, and rowid is always the last keypart, so a
> > fulltext index is always ORDER BY word, rowid.
>
> What about weight?
> I thought the index was <word, weight, rowid>. So I assumed each word
> was sorted by weight (even though boolean doesn't use it, right? It's
> still in the index), then by rowid...

Right. But I found that I needed ORDER BY word, rowid, so I introduced a
new flag to the keyseg flags in MyISAM - "ignore me when sorting".
That's why the FULLTEXT index is not sorted by weight.

> > But when the search is done for a word prefix only, this order
> > breaks; the index is, of course, not ORDER BY LEFT(word,5), rowid.
> > Consider the index
> >
> >   aaaaaa, 5
> >   aaabbb, 10
> >   aaaccc, 3
> >
> > The workaround was to exclude prefixes from the index scan whenever
> > possible (that is, in "+word +pref1* +pref2* ..." prefixes are not
> > looked up in the index at all, and in "+pref1* +pref2* ..." only the
> > first prefix is looked up in the index) and to add a row scan (in the
> > LIKE sense, as you described in #1) to remove "half-matched" rows.
>
> OK, I see. :-) I'm surprised they're not looked up in the index. But
> after seeing the code, it looks like it has something to do with
> preventing a row from being matched twice. Not sure though.

Yes, this too.

> Are you saying that for "+word +pref1* +pref2*", word and pref1 are
> looked up in the index, or just word? It sounds like you meant word and
> pref1, but from a test I just did, it seems to be only word.

No, only +word. Because if the query is "+word +pref1* +pref2*" then
+pref1* and +pref2* cannot really ADD new rows to the result, they can
only REMOVE rows from what was found by +word. And as the row is scanned
anyway when trunc* operators are used, there's no need to search for
+pref1* in the index.
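The ordering problem is easy to demonstrate. Here is a tiny Python
sketch (an illustration, not MySQL code; the index entries are the
aaaaaa/aaabbb/aaaccc example quoted above, and prefix_rowids is a
hypothetical helper) showing that postings collected for a prefix do not
come back sorted by rowid, so the heap-merge precondition breaks and a
row scan is needed instead:

```python
# A fulltext index is ordered by (word, rowid).  For an exact word the
# rowids come out sorted; for a prefix like 'aaa*' they do not, so the
# heap-merge cannot rely on them.
index = [              # (word, rowid), as stored: ORDER BY word, rowid
    ("aaaaaa", 5),
    ("aaabbb", 10),
    ("aaaccc", 3),
]

def prefix_rowids(index, prefix):
    """Collect rowids of all entries matching the word prefix,
    in index order."""
    return [rowid for word, rowid in index if word.startswith(prefix)]

rows = prefix_rowids(index, "aaa")
print(rows)                     # -> [5, 10, 3]
print(rows == sorted(rows))     # -> False: not ORDER BY LEFT(word,3), rowid
```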
Of course this trick doesn't work when the word actually CAN add new
rows to the result, as in "word pref1* pref2".

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/   Osnabrueck, Germany
       <___/   www.mysql.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]