Re: How boolean full-text search finds matches?
Hi Sergei! Thanks for replying again. I hope I'm not wasting too much of your time with my questions! :-) More below... - Original Message - From: Sergei Golubchik Sent: Thursday, December 18, 2003 7:17 AM Subject: Re: How boolean full-text search finds matches? Hi! On Dec 17, Matt W wrote: Hi, Just have a couple more full-text search inquiries here. :-) I'm not exactly clear on how matching rows are found when searching for 2 or more required words: '+word1 +word2'. I understand that it can't currently know which word occurs less, so that it can be searched first -- this optimization will come with 4.1's 2-level indexes. :-) I just want to know, when it finds a match for whichever word is tried first, how does it check if the other required word(s) are present in the same row? Say that word1 and word2 are each present in 100,000 rows. 1) Surely it doesn't check the 100,000 entries for word2 for EACH word1 match to see if they're in the same row, does it? No it does not :) Good! 2) It *seems* the best way would be to do a lookup for (word2 + rowid for word1) and see if there's a match. Is this what's done? I'm not sure it's possible though with the way the index is structured... it is possible, but it is only sensible if word1 is much more rare than word1. This could be done with 2-level indexes :) I assume that should say if word2 is much more rare than word1. I guess that's because it would need too many [random?] index lookups otherwise? 3) Or, and I'm thinking *maybe* this is how it's done from what I've heard, does it get all the matches for word1, then for word2, and then intersect them to find ones which are present in the same row? If so, how will the 2-level index optimization change things? Will it do #2? Yes to both questions, without the word then. First, one match is found for each word. Then read_next is called for the word with the lowest rowid, etc. Not completely clear on this. :-) I get that one match is found for each word... then whichever word occured first in the table (lowest rowid) is... what? :-/ Oh, wait, I get what you mean! ;-) You're saying that read_next is called for the word with the lowest rowid until you see if the rowid matches the rowid from the other word(s)? Then if the rowid gets greater than what you're looking for, you know that there's no matching row? (Since you say that each word is sorted by rowid -- see below about that, though.) Then I'm not sure what happens to find the next matching row. Find a match again for each word starting after the last found rowid? I'm not familiar enough with the MySQL code (or C) to understand what's going on in ft_boolean_search.c. :-( The advantage is that matches are found and returned earlier - a user don't have to wait for the index scan to complete. Also LIMIT, if used, cuts off more work, that is LIMIT is more effective. Right, I know that LIMIT helps -- as long as there's no ORDER BY, etc. that needs to find all rows to sort. :-) That brings me to the main reason for asking these questions: for searching on my forum system. There could be 5-10+ million posts, which would put upwards of 250+ million entries in the full-text index. I'll probably use a LIMIT of 10-20,000 (w/o ORDER BY) to prevent searches from taking too long (and returning way too many matches!). However, with that many posts, I think it's possible that a search could match 100k+ posts. Then the search would have to be narrowed down to a particular forum or date range. If I add AND forumid=123 etc. to the WHERE clause, that will make the search hit the data file for who-knows-how-many posts that aren't in the desired forum -- thus scanning more data rows than specified by LIMIT. But you told list member Steven Roussey one time that he could add the forum id to a character column and include it in the full-text index. I thought I'd try that too -- along with some text for the month and year for date range searches. Although now I'm not sure that's a good idea, because what if there's 1 million posts in a single forum (is it bad to have the same word for a forum id in 1 million rows??) and a search is done that would actually find LIMIT rows *across all forums*? If I include the forum id in the search, the current full-text code will look at all 1M words for that forum id. Much slower than just doing the search and manually checking forum id. :-( Of course, if the situation is reversed (common search in a *small* forum), this method would be faster than manually checking. Any other ideas for improving performance when searching a small subset of the full-text index? Too bad we can't include numeric columns in the full-text index (which would work as usual in the WHERE, not in MATCH ()). But when one word is much more common than the second one, it is better to do #2, and it's what I'll probably do. Yeah, like for my example. :-) Next
Re: How boolean full-text search finds matches?
Hi! On Dec 17, Matt W wrote: Hi, Just have a couple more full-text search inquiries here. :-) I'm not exactly clear on how matching rows are found when searching for 2 or more required words: '+word1 +word2'. I understand that it can't currently know which word occurs less, so that it can be searched first -- this optimization will come with 4.1's 2-level indexes. :-) I just want to know, when it finds a match for whichever word is tried first, how does it check if the other required word(s) are present in the same row? Say that word1 and word2 are each present in 100,000 rows. 1) Surely it doesn't check the 100,000 entries for word2 for EACH word1 match to see if they're in the same row, does it? No it does not :) 2) It *seems* the best way would be to do a lookup for (word2 + rowid for word1) and see if there's a match. Is this what's done? I'm not sure it's possible though with the way the index is structured... it is possible, but it is only sensible if word1 is much more rare than word1. This could be done with 2-level indexes :) 3) Or, and I'm thinking *maybe* this is how it's done from what I've heard, does it get all the matches for word1, then for word2, and then intersect them to find ones which are present in the same row? If so, how will the 2-level index optimization change things? Will it do #2? Yes to both questions, without the word then. First, one match is found for each word. Then read_next is called for the word with the lowest rowid, etc. The advantage is that matches are found and returned earlier - a user don't have to wait for the index scan to complete. Also LIMIT, if used, cuts off more work, that is LIMIT is more effective. But when one word is much more common than the second one, it is better to do #2, and it's what I'll probably do. Next question is... a few weeks ago I was doing some test searches like '+word1 +word2'. Actually, maybe I was only using 1 word, I can't remember, but I don't think it matters. Anyway, I happened to try changing the query to '+word1* +word2*' -- e.g. adding a wild-card to the end of the same word(s) -- and I was amazed at how much faster the query was! (And no, there's no query cache; and they were both run many times so the index was cached. :-)) Can't remember how much faster, but it wasn't insignificant. Then I tried adding a wild-card to the end of words in another search (the wild-card did not make more rows match as far as I know), but that made it a little slower (I'd expect that, if anything). Is there any explanation for why adding wild-cards would make a search faster? Yes :) Looking on the description above how +word1 +word2 works, you can see that it relies on the fact that key entries for each word are sorted by rowid. But it is how MyISAM indexes work, they are always ordered by all keyparts, and rowid is always the last keypart, so fulltext index is always ORDER BY word, rowid. But when the search is done for the word prefix only, this order breaks, the index is, of course, not ORDER BY LEFT(word,5), rowid. Consider the index aa, 5 aaabbb, 10 aaaccc, 3 The workaround was to exclude prefixes from index scan whenever possible (that is in +word +pref1* +pref2* ... prefixes are not looked up in the index at all, in +pref1* +pref2* ... only the first prefix is looked up in the index) and to add a row scan (in the LIKE sense and as you described in #1), to remove half-matched rows. Thus it is sometimes slower - as more row data are read from the disk, and sometimes faster - as there are less index scans performed. Obvious optimization is to choose the most rare prefix for the scan, not the first one. It is in the todo, of course :) P.S. Sergei, if you see this, in one of your replies to my full-text suggestions back in September ( http://lists.mysql.com/mysql/149644 ), you said Another reply will follow... I never saw another reply though. :-/ It's OK, I was just wondering what other interesting things you were going to say! :-) Oops, checking... Yes, sorry. But I have to admit, I absolutely do not remember what I had in mind when I wrote it and what should've followed :( Sorry for this. Regards, Sergei -- __ ___ ___ __ / |/ /_ __/ __/ __ \/ / Sergei Golubchik [EMAIL PROTECTED] / /|_/ / // /\ \/ /_/ / /__ MySQL AB, Senior Software Developer /_/ /_/\_, /___/\___\_\___/ Osnabrueck, Germany ___/ www.mysql.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
How boolean full-text search finds matches?
Hi, Just have a couple more full-text search inquiries here. :-) I'm not exactly clear on how matching rows are found when searching for 2 or more required words: '+word1 +word2'. I understand that it can't currently know which word occurs less, so that it can be searched first -- this optimization will come with 4.1's 2-level indexes. :-) I just want to know, when it finds a match for whichever word is tried first, how does it check if the other required word(s) are present in the same row? Say that word1 and word2 are each present in 100,000 rows. 1) Surely it doesn't check the 100,000 entries for word2 for EACH word1 match to see if they're in the same row, does it? 2) It *seems* the best way would be to do a lookup for (word2 + rowid for word1) and see if there's a match. Is this what's done? I'm not sure it's possible though with the way the index is structured... 3) Or, and I'm thinking *maybe* this is how it's done from what I've heard, does it get all the matches for word1, then for word2, and then intersect them to find ones which are present in the same row? If so, how will the 2-level index optimization change things? Will it do #2? Next question is... a few weeks ago I was doing some test searches like '+word1 +word2'. Actually, maybe I was only using 1 word, I can't remember, but I don't think it matters. Anyway, I happened to try changing the query to '+word1* +word2*' -- e.g. adding a wild-card to the end of the same word(s) -- and I was amazed at how much faster the query was! (And no, there's no query cache; and they were both run many times so the index was cached. :-)) Can't remember how much faster, but it wasn't insignificant. Then I tried adding a wild-card to the end of words in another search (the wild-card did not make more rows match as far as I know), but that made it a little slower (I'd expect that, if anything). Is there any explanation for why adding wild-cards would make a search faster? Thanks in advance! Matt P.S. Sergei, if you see this, in one of your replies to my full-text suggestions back in September ( http://lists.mysql.com/mysql/149644 ), you said Another reply will follow... I never saw another reply though. :-/ It's OK, I was just wondering what other interesting things you were going to say! :-) -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
How boolean full-text search finds matches?
Hi, Just have a couple more full-text search inquiries here. :-) I'm not exactly clear on how matching rows are found when searching for 2 or more required words: '+word1 +word2'. I understand that it can't currently know which word occurs less, so that it can be searched first -- this optimization will come with 4.1's 2-level indexes. :-) I just want to know, when it finds a match for whichever word is tried first, how does it check if the other required word(s) are present in the same row? Say that word1 and word2 are each present in 100,000 rows. 1) Surely it doesn't check the 100,000 entries for word2 for EACH word1 match to see if they're in the same row, does it? 2) It *seems* the best way would be to do a lookup for (word2 + rowid for word1) and see if there's a match. Is this what's done? I'm not sure it's possible though with the way the index is structured... 3) Or, and I'm thinking *maybe* this is how it's done from what I've heard, does it get all the matches for word1, then for word2, and then intersect them to find ones which are present in the same row? If so, how will the 2-level index optimization change things? Will it do #2? Next question is... a few weeks ago I was doing some test searches like '+word1 +word2'. Actually, maybe I was only using 1 word, I can't remember, but I don't think it matters. Anyway, I happened to try changing the query to '+word1* +word2*' -- e.g. adding a wild-card to the end of the same word(s) -- and I was amazed at how much faster the query was! (And no, there's no query cache; and they were both run many times so the index was cached. :-)) Can't remember how much faster, but it wasn't insignificant. Then I tried adding a wild-card to the end of words in another search (the wild-card did not make more rows match as far as I know), but that made it a little slower (I'd expect that, if anything). Is there any explanation for why adding wild-cards would make a search faster? Thanks in advance! Matt P.S. Sergei, if you see this, in one of your replies to my full-text suggestions back in September ( http://lists.mysql.com/mysql/149644 ), you said Another reply will follow... I never saw another reply though. :-/ It's OK, I was just wondering what other interesting things you were going to say! :-) -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]