Re: Lots of FULLTEXT stuff (suggestions)
Hi! On Sep 11, Matt W wrote: Hi Sergei! I'll try to keep my observations/ideas below as short and simple to understand as possible. :-) Thanks :) Here I reply very quickly to some questions. Another reply will follow...

Sure, boolean mode is faster in *some* cases, since, as you said, it doesn't need the list of all matched documents. From my experience, it's [only] faster in searches like ' some words ' (no boolean operators; and yes, I know some is a stopword ;-)), especially with LIMIT, but it just returns the first documents it finds with any single word.

No, it's MUCH faster when the search string contains popular words. E.g. 1,000,000 rows, and a few (say, five) words that are present in 400,000 rows each, but whose intersection is small (numbers similar to my benchmarks). This forces the natural-language fulltext engine to keep a list of 1,000,000 documents (w/o actual row data, of course) in memory - and search through it - which is hellishly slow.

... the ones that contain more of the words will usually be ranked highest from my experience.

Normally, yes. But this is not guaranteed. E.g. a document that contains only one of the words, but is short, can have higher relevance than another one that contains both words but also has many other words, so that the relative weight of these two words in the document is low. This effect is only noticeable when the texts' lengths are significantly different.

It would be great if these disk seeks could be optimized to read a chunk of rows at a time *in row order*, as it seems right now that each row is read one-at-a-time in relevance order. Like if you could take a chunk of, say, 1000 row pointers which are in relevance order, sort them in datafile row order, and then read them like that. Wouldn't this cause fewer random seeks since you keep moving in the same direction in the file?

Yes, it would. I can say that for sure, as this optimization is already implemented in MySQL. It is used in filesort.
I should've known it can be used here too :) Thanks.

Right now if I want all words in my application, I'm favoring using a natural language query with LIMIT 1000 or so and then running another query with LIKE to check those 1000 document IDs to see if they contain all words.

You can simply do: AND MATCH ... AGAINST (... IN BOOLEAN MODE)+0. The +0 guarantees that an index will not be used in this case, so the index will be used for your MATCH in natural-language mode. (I cannot say offhand which of the two MATCHes the optimizer would prefer, so this simple trick will do.)

I was even thinking that the IN BOOLEAN MODE syntax should be removed (ignored, actually, to maintain backwards compatibility) and whether or not boolean operators are used in the query determines whether or not boolean mode is used (relevance sorting in both cases). This seems logical. :-)

Yes, that's how it worked in the very first version. Then I thought to make it explicit, to avoid problems with boolean operators that can occasionally be present in a natural language query (especially when it is a big chunk of text - a very typical application for NL search).

To the developers: any word on if and when any of these things would be implemented? I know from the TODO and other list messages that some will. Any *estimates* (in months or MySQL version) on when would be great. Just any info on new full-text features, even ones that I didn't mention, would be awesome to hear. :-) And like how they would be implemented and used by us (syntax, etc.).

As I said - it's very difficult to predict this :( Anyway, I doubt anything that requires changing the .frm file structure will get into 4.1.

How about changing the default min/max (or just min if you want) word length? I think everyone *really* wishes ft_min_word_len was 3. Seems like that and indexing numbers shorter than min_word_len could be easily done. Please? :-)

Yes, it's safe enough for 4.1.

Sorry, I don't know what this means.
:-) You mean ft_min_word_len will be 3 by default in 4.1? And what is "safe enough"?

"Safe enough" means that no big features can be added to 4.1 anymore - only small local changes. Most fulltext-related changes are local :)

P.S. Is there a document somewhere that has information about the internals of full-text search or MySQL in general? I noticed this bk commit - mysqldoc tree (1.790) message on the Internals list the other day: http://lists.mysql.com/list.php?3:mss:9961:200309:bpfbpgphemknogaidjep and bk commit - mysqldoc tree (1.799) http://lists.mysql.com/list.php?3:mss:9996:200309:lejkmpinlmdninacgcpl They appear to be updating a document about internals (inc. full-text search). However, I can't find this document anywhere (source (for Win32), MySQL site/docs). Could you tell me where to get it? :-) Thanks!

It's in the bk tree only. But you may use the http bk interface: http://mysql.bkbits.net:8080/mysqldoc http://mysql.bkbits.net:8080/mysqldoc/anno/Docs/[EMAIL
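Sergei's +0 trick from earlier in this exchange, sketched as a complete statement (the `posts` table and `body` column here are illustrative, not from the thread):

```sql
-- Natural-language MATCH does the search and provides the relevance
-- ordering; the boolean MATCH ... + 0 acts as a plain row filter
-- ("all words present") that the optimizer won't pick an index for.
SELECT id, MATCH(body) AGAINST('cheap red widgets') AS score
FROM posts
WHERE MATCH(body) AGAINST('cheap red widgets')
  AND MATCH(body) AGAINST('+cheap +red +widgets' IN BOOLEAN MODE) + 0
ORDER BY score DESC
LIMIT 20;
```

A sketch only - the exact plan chosen depends on the optimizer and the MySQL version in question.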
Re: Lots of FULLTEXT stuff (suggestions)
Hi Sergei! Thanks for your reply and taking time to read and consider my suggestions. :-) I didn't reply sooner because I was deciding what to say in this message. ;-) I joined the list specifically for posting these suggestions, and, with your reply, I wanted to say that it's great to have direct contact with the developers like this! I'll try to keep my observations/ideas below as short and simple to understand as possible. :-)

- Original Message - From: Sergei Golubchik [EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 3:31 PM Subject: Re: Lots of FULLTEXT stuff (suggestions)

Hi! First: thanks for the ideas - I'm adding them to my todo :) About dates - it's very difficult to say when a particular feature will be implemented. Anyway, first I'm going to finish this 2-level index structure - to implement the optimizations that rely on it.

Any speed/optimization improvements are welcome for gigs of data, especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance like a natural language query, although this is probably difficult if a wildcard* is used?).

It's not possible - at least I don't know how to do it. In natural language mode the fulltext search is done in the Fulltext initialization stage - as you noticed - so the engine can sort documents by relevance. In boolean mode each found document is returned at once - that's why this search mode is faster: it doesn't need to keep the list of all matched documents.

Yeah, it would be really nice though if boolean searches could auto-sort by relevance (see below). I have 2 speed vs. usefulness issues -- 1 with boolean mode and another with natural language mode:

Sure, boolean mode is faster in *some* cases, since, as you said, it doesn't need the list of all matched documents. From my experience, it's [only] faster in searches like ' some words ' (no boolean operators; and yes, I know some is a stopword ;-)), especially with LIMIT, but it just returns the first documents it finds with any single word.
I think this is pretty useless, as you will get many low-relevance rows. If you use boolean mode without boolean operators, you should've just used natural language mode and, combined with LIMIT, you would fairly quickly get the most relevant documents with any of the words -- and the ones that contain more of the words will usually be ranked highest, from my experience. The results are MUCH better than boolean mode without boolean operators, even though it may take slightly longer.

And now the issue with natural language mode: if the needed parts of the datafile aren't cached by the OS, all the random disk seeks are KILLING performance! However, once the query has run once and the OS has cached the datafile parts, the search is VERY quick (with LIMIT to prevent a ton of rows). I would say faster than any but the simplest ' no boolean operator ' boolean searches. And the results are relevance sorted!

It would be great if these disk seeks could be optimized to read a chunk of rows at a time *in row order*, as it seems right now that each row is read one-at-a-time in relevance order. Like if you could take a chunk of, say, 1000 row pointers which are in relevance order, sort them in datafile row order, and then read them like that. Wouldn't this cause fewer random seeks since you keep moving in the same direction in the file? Of course you'd need to get that chunk of rows back in relevance order after they were read in row order. :-) If you can't hold those rows in memory to put back in order, maybe you could read/discard, read/discard, and so on, the rows in row order; then when you read them in random relevance order, the data would at least be cached by the OS. Just something I, who doesn't know much about file access, was thinking about! :-)

Boolean mode seems to read the datafile faster because, at least in my fresh table, the results are found/read in datafile order. Back to boolean mode.
I don't think it's faster than natural language mode (especially on subsequent identical queries) when you have a search like ' +some +words ' or ' some words ', where some and words each appear in many documents but rarely, if ever, appear in the same document. It uses lots of CPU time finding one word in the index and then failing on the boolean criteria.

Right now if I want all words in my application, I'm favoring using a natural language query with LIMIT 1000 or so and then running another query with LIKE to check those 1000 document IDs to see if they contain all words. And the documents that do will almost certainly be in the first rows returned; e.g. once they stop appearing for more than a few documents in a row, they probably won't all appear in a document again among the lower-relevance results.

Heck, even if I want to simulate (' some +words ' IN BOOLEAN MODE) with natural language mode, I can use (' some words words '). Specifying words multiple times gives it higher relevance, so I can again check with LIKE in another query.
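Matt's two-query approach might look like this (hypothetical `posts(id, body)` table with a FULLTEXT index on `body`; a sketch, not a tested recipe):

```sql
-- Step 1: natural-language search; take the 1000 most relevant ids,
-- already relevance-sorted by the fulltext engine.
SELECT id FROM posts
WHERE MATCH(body) AGAINST('some words')
LIMIT 1000;

-- Step 2: of those ids, keep only documents containing all the words.
-- Note that LIKE '%...%' also matches substrings of longer words, so
-- this check is slightly looser than a fulltext word match.
SELECT id FROM posts
WHERE id IN (1, 5, 42 /* ... ids from step 1 ... */)
  AND body LIKE '%some%'
  AND body LIKE '%words%';
```

Step 2 touches only the handful of rows from step 1, so the LIKE scans stay cheap even on a large table.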
Re: Lots of FULLTEXT stuff (suggestions)
Hi! First: thanks for the ideas - I'm adding them to my todo :) About dates - it's very difficult to say when a particular feature will be implemented. Anyway, first I'm going to finish this 2-level index structure - to implement the optimizations that rely on it.

Any speed/optimization improvements are welcome for gigs of data, especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance like a natural language query, although this is probably difficult if a wildcard* is used?).

It's not possible - at least I don't know how to do it. In natural language mode the fulltext search is done in the Fulltext initialization stage - as you noticed - so the engine can sort documents by relevance. In boolean mode each found document is returned at once - that's why this search mode is faster: it doesn't need to keep the list of all matched documents.

And the FULLTEXT index shouldn't always be chosen for non-const join types when another index would find fewer rows first. e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since I guess it's a problem auto-sorting by relevance if it's not using the FULLTEXT index. Hmm.

The logic in making the FULLTEXT index always the preferred one is that even if it's not the index reported by EXPLAIN, it is still used in Fulltext initialization. So, using it in the join to retrieve rows adds no extra cost. But now I think that there is still the cost of reading row data from disk, so using a PRIMARY/UNIQUE index can be faster in some cases. I am not sure, though, that the optimizer can take this into account properly - knowing the number of matched rows before choosing an index would mean doing the fulltext search for EXPLAIN too, and I doubt that would be appreciated :) Still, with the 2-level index some estimations can be made... Great - thanks for the idea!
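The join case discussed above, as a concrete sketch (hypothetical `posts` table; `id` is its primary key):

```sql
-- Two candidate indexes: the FULLTEXT index on body, and the PRIMARY
-- key, which here narrows the result to at most two rows. At the time
-- of this thread the FULLTEXT index always won for non-const joins,
-- which is what Matt is questioning.
EXPLAIN SELECT * FROM posts
WHERE MATCH(body) AGAINST('some words')
  AND id IN (1, 2);
```

The EXPLAIN output shows which index the optimizer actually picked; the behavior varies by MySQL version.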
Anyway, in boolean mode there's no initialization, so there are no reasons (besides historical) for it to be preferred - it'll be fixed.

To the developers: any word on if and when any of these things would be implemented? I know from the TODO and other list messages that some will. Any *estimates* (in months or MySQL version) on when would be great. Just any info on new full-text features, even ones that I didn't mention, would be awesome to hear. :-) And like how they would be implemented and used by us (syntax, etc.).

As I said - it's very difficult to predict this :( Anyway, I doubt anything that requires changing the .frm file structure will get into 4.1.

How about changing the default min/max (or just min if you want) word length? I think everyone *really* wishes ft_min_word_len was 3. Seems like that and indexing numbers shorter than min_word_len could be easily done. Please? :-)

Yes, it's safe enough for 4.1.

There Sergei is talking about a new .frm format (plain text) that will allow more of these features. Will it allow us to somehow define how to parse things or something?? Could you elaborate more on what this will bring? In November 2001, he said the new .frm format would be here this year. It's been almost 2 years since then, so when is it due?

It's now planned for 5.1 - the plain-text .frm comes together with a complete redesign of internal table structure handling, the table structure cache, etc. But even without it, the .frm format was extended in 4.1, so I don't need it for adding per-index options anymore.

Also, are the current MySQL versions using the 2-level full-text index format yet? I'm thinking not?

4.1.0 is using it. This index structure was done to make new powerful optimizations possible. It is these optimizations that are not implemented yet :( They're at the top of my todo.

Finally, in the full-text TODO, it says Generic user-suppliable UDF preparser. Could you also elaborate on this?
The generic part almost makes it sound like some sort of script to define how to parse the text. But UDF makes it sound like a separate thing that has to be loaded with CREATE FUNCTION. But UDFs won't work with your MySQL binaries, will they, since they're compiled statically?

The mysql-max binary is compiled dynamically - so it works with UDFs. And UDF in the todo item does not necessarily mean it will be a MySQL user-definable function - it could be a stored procedure, e.g. The idea is to be able to supply a function (whatever it is) that takes a column's data and returns the list of words that this data contains. It could be used e.g. to fulltext-index PDFs or XMLs or MS Word files, or whatever.

Regards, Sergei -- Sergei Golubchik [EMAIL PROTECTED] MySQL AB, Senior Software Developer, Osnabrueck, Germany www.mysql.com

-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
RE: Lots of FULLTEXT stuff (suggestions)
Thanks for replying. Your posts that I've found when searching for FULLTEXT information have had great ideas. :-) Searching millions of posts efficiently and effectively isn't easy. :-(

Heh. FULLTEXT does not scale very well once the files get bigger than your RAM. The redesign of the index where it gets normalized will help quite a bit in reducing the size of the files. For large tables, it will help immensely.

Most 1-3 letter words that you don't want indexed should be stopwords anyway, right? So why NOT index the ones that are left? Doesn't seem like it'd make the index much larger to me. BTW, what is your min_word_len value?

I haven't really thought about it, although I don't see any value in one-letter words (or numbers). I use min_word_len=3 and my own stop list, which is a merge of stopwords in many languages. I made both changes at the same time and ended up with a slightly smaller index.

--steve-
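The change Steven describes is a server-level setting; a typical procedure might look like this (a sketch - the table name is illustrative, and the stopword file path is an assumption):

```sql
-- In my.cnf (takes effect only after a server restart):
--   [mysqld]
--   ft_min_word_len=3
--   ft_stopword_file=/path/to/my_stopwords.txt
--
-- Then rebuild each FULLTEXT index so it is re-parsed with the new
-- settings; QUICK rebuilds only the index file, not the data file.
REPAIR TABLE posts QUICK;
```

Without the rebuild, existing indexes keep the word lengths and stopwords they were built with.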
Re: Lots of FULLTEXT stuff (suggestions)
Matt, I fully agree that indexing short words and numbers is a necessity sometimes. I'm processing legal text where abbreviations are widely used and people want to search for chunks like: Art. 234 Abs. 3 OR and the search should also find occurrences of Art. 234 OR. These are so common that I risk running into the 50% cutoff unless I use BOOLEAN MODE. Indexing numbers is at the top of my wish list.

I have observed users' queries for some time now and found that the ranking of the results is optimal (i.e. matches the user's expectations) when the words they typed occur close together in the text (but not necessarily close to the top). Regards, Thomas Spahni

On Sun, 24 Aug 2003, Matt W wrote: Hi all, I'm planning to use MySQL's full-text search for my forum system (possibly 5+ million posts). I've been playing with it a lot lately to see the performance and functionality and have some suggestions/questions. First, since a few of you may be wanting to know, here is a thread where I was doing some speed/optimization tests and stuff with 3 million posts: http://www.sitepointforums.com/showthread.php?threadid=69555 (From post #12)

Especially discovered that IN BOOLEAN MODE is really slow if you want to sort by relevance (with a lot of matching rows anyway). :-( For non-BOOLEAN searches, though, I can get 1000 relevance-sorted results in about 8-10 secs. for searches that match a LOT of rows and everything has to be read from disk. The full-text processing seems to be very fast (max 1-2 seconds of FULLTEXT initialization in PROCESSLIST). It's the disk seeks to read random rows from the data file (Sending data) that take the most time (7200 RPM/~8ms seek IDE drive). Searches are *MUCH* faster when the needed parts of the data file are cached by the OS!

Anyway, my suggestions: --

*) Min/Max Word Length -- This should really be able to be set on at least a per-table basis (others may want per index).
Right now, people that don't have control of the server are at the mercy of the admin to change the min/max word length. I would also suggest that ft_min_word_len be 3 and ft_max_word_len be 32 by default. I think these would be better defaults for everyone than the current 4/254. Or if we could use SET ft_min_word_len=n; etc. for the current connection, it would be nice.

*) Parser: Indexing of Any and All Numbers -- I think it would be a good idea to index any sequence of digits less than ft_min_word_len long. Anything numeric could be very relevant for searching -- software versions, ages, dates, etc. -- and shouldn't be excluded. Even anything *containing* a number (among letters) is probably relevant for searching, again, even if it's shorter than ft_min_word_len. e.g. RC1, B2, 8oz, F5, etc.

*) Parser: Other Things -- I've seen people trying to search catalog/item/part numbers with pieces of the number separated by - or / for example (making some pieces too short). How about indexing words that are on either side of a - or / (with no space) no matter their length? I don't mean including the - or / in the index -- just the usual word characters on either side (I think) as *separate* words, not a *single* word with the - or / removed. This would help with things like CD-ROM, TCP/IP, etc. Single quotes being counted as a word character is another issue I have. (I discovered that they're not counted as part of the word when on the end(s): 'quote' (thank God! :-))) Example: if someone searches for MySQL, it won't find rows with MySQL's. Since possessive's (sic) are the biggest problem, how about stripping any 's from the end of the word in the index? So MySQL's would be indexed as MySQL.

*) Always Index Words -- Like it says in the full-text TODO section of the manual. This should be able to be set on at least a per-table basis (again, others may want per index).

*) Stopword File -- I would also like to be able to define this per table somehow.
*) Miscellaneous -- Mostly functionality related, from the TODO: STEMMING! (controlled more finely than server level, I hope), multi-byte character set support, proximity operators. Anything to get it closer to Verity's full-text functionality. ;-)

Any speed/optimization improvements are welcome for gigs of data, especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance like a natural language query, although this is probably difficult if a wildcard* is used?). And the FULLTEXT index shouldn't always be chosen for non-const join types when another index would find fewer rows first. e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since I guess it's a problem auto-sorting by relevance if it's not using the FULLTEXT index. --

To other full-text users: what do you think of these suggestions? To the developers: any word on if and when any of these things would be implemented?
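Until the parser strips possessives, one workaround for the MySQL vs. MySQL's example above is the boolean truncation operator (a sketch; the table name is illustrative):

```sql
-- "MySQL's" is indexed as a single word (the inner apostrophe counts
-- as a word character), so a prefix wildcard matches both forms.
SELECT id FROM posts
WHERE MATCH(body) AGAINST('MySQL*' IN BOOLEAN MODE);
```

The trade-off, as discussed elsewhere in this thread, is that wildcard boolean searches give up the automatic relevance ordering of natural-language mode.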
Re: Lots of FULLTEXT stuff (suggestions)
Hi Steven, Thanks for replying. Your posts that I've found when searching for FULLTEXT information have had great ideas. :-) Searching millions of posts efficiently and effectively isn't easy. :-( Heh.

Thinking about ft_min_word_len some more (and how I wish they'd lower the default), I don't really see that much use/purpose for a minimum word length. Most 1-3 letter words that you don't want indexed should be stopwords anyway, right? So why NOT index the ones that are left? Doesn't seem like it'd make the index much larger to me. BTW, what is your min_word_len value?

- Original Message - From: Steven Roussey Sent: Sunday, August 24, 2003 3:39 PM Subject: RE: Lots of FULLTEXT stuff (suggestions)

And the FULLTEXT index shouldn't always be chosen for non-const join types when another index would find fewer rows first.

The short answer is that it doesn't work that way (also, I think this is why there are no composite indexes between integer and fulltext indexes). The two systems don't know anything about each other. Yeah...

Also, are the current MySQL versions using the 2-level full-text index format yet? I'm thinking not?

No. MySQL 4.1.0 has some low-level support for this, but FTS needs to be altered (quite a bit, I'd guess) to use it. So there is hope that it will come in the 4.1.x line, but no guarantee.

I'm hoping for 4.1.x! BTW, I wonder if Sergei is the [only] one who handles all the full-text coding?

In November 2001, he said the new .frm format would be here this year. It's been almost 2 years since then, so when is it due? ;-/

I think it was pushed back to version 5.1. I'd figure another two years.

Aww. :-( Yeah, now I remember seeing that in Features planned for 5.1. Since the acquisition of SAP DB, I think they're supposed to have more resources for faster development. Maybe some of these things will come sooner than expected. That's what I'm hoping anyway.
Matt
Re: Lots of FULLTEXT stuff (suggestions)
My TODO list:

1. Speed -- More speed in inserts, deletes and selects.

2. Stoplist -- One stoplist for each index: create fulltext index on x(y) stoplist 'mystop'. The stoplist could be divided into sections like my.cnf:
   [STOPWORD] ...
   [VARIABLES] ft_min_word_len=3
   [CLASSES] ...
   [SPECIAL CHARS] ...
   The CLASSES section defines word boundaries, assigning chars or groups of chars to one class (alphanumeric, word break, punctuation, etc.). The parser uses this section to extract the words to index. SPECIAL CHARS defines replacements of some characters (ASCII > 128) with an equivalent 7-bit character (e.g. é=e -- index resumé as resume).

3. Proximity.

4. TAGs -- Recognize tags to stop/restart indexing: inside the text I can insert tags to skip indexing a part of the text (stop index, start index).

5. UDF or Filter -- Cascade UDFs: the output of one function can be sent to the input of another function. Every record can have a different filter. This solution allows handling external documents (Text, Word, Excel, XML, PDF, etc.) without storing their data in a table (just store path and type), and encrypting them (useful for CD-ROM distribution). This filter could also fill other columns with data from the text.

These features, at no great cost in speed, would improve MySQL and open new opportunities for MySQL. Santino
RE: Lots of FULLTEXT stuff (suggestions)
Lots of stuff. STEMMING! (controlled more finely than server level, I hope), multi-byte character set support, proximity operators. Anything to get it closer to Verity's full-text functionality. ;-)

Yes, all these things would be nice... :)

And the FULLTEXT index shouldn't always be chosen for non-const join types when another index would find fewer rows first.

The short answer is that it doesn't work that way (also, I think this is why there are no composite indexes between integer and fulltext indexes). The two systems don't know anything about each other.

Also, are the current MySQL versions using the 2-level full-text index format yet? I'm thinking not?

No. MySQL 4.1.0 has some low-level support for this, but FTS needs to be altered (quite a bit, I'd guess) to use it. So there is hope that it will come in the 4.1.x line, but no guarantee.

In November 2001, he said the new .frm format would be here this year. It's been almost 2 years since then, so when is it due? ;-/

I think it was pushed back to version 5.1. I'd figure another two years.

--steve-