Re: Distributed Fulltext?
On Fri, 2002-02-15 at 02:44, Alex Aulbach wrote:

> Wednesday, from David Axmark:
>
> > > Your other point about exact vs. approximate answers is unclear. I expect that Google's answers are exact for their currently available indexes at any given time. But even if they are approximate, I'd be happy with that too. The scoring on a FULLTEXT search in MySQL is "exact", but based on a formula that is approximate anyway.
> >
> > No, MySQL returns all data matching a search. Web engines return what they find on one search machine. So you can get different results from Google every time you hit refresh, if you are routed to different machines. This has happened to me when I was looking at the number of matches rather than the results themselves.
> >
> > So we should try to make fulltext searches with a limit between 10 and 100 fast, to be closer to Google.
> >
> > I have also heard about some other things web search engines do, since I know some people at FAST, but I have forgotten that already.
>
> My opinion is that MySQL itself should never try to find approximate matches. This is against the definition of SQL itself. SQL is a fourth-generation language. That means, if you say "SELECT", the engine selects. And the result has to be exactly what I searched for, every time, on every machine, in any combination, with the same data.
>
> So SQL needs a new language construct to make an approximate search. But what is an approximate search? How is approximate defined?

I agree. If there is a good technical reason for doing so, I would be willing to add a keyword APPROXIMATE for such a thing. But it would have to be extremely clear, even to the uninitiated, that this is approximate. I can see such a feature being used by the people who build search engines with MySQL.

> I don't think it is a good idea to implement it in this way.
> Approximation must always be done at the application level, because what an approximate result could be is highly dependent on the application.
>
> > We will try to make every feature as good as possible. But we do have limited resources.
>
> Exactly. FTS is not as important as other features, and people who want you to include a new feature should think about supporting MySQL with money. :-)
>
> But (yes, we support MySQL! :-) I think the need is growing rapidly, because the amount of data that has to be indexed is growing over the years. And other DBs have much more experience with it. Currently we can live with the speed. Those who cannot live with it should buy better machines, think about their search-engine concept, or support MySQL.
>
> Search engine techniques are *not* trivial, so the last way is in my eyes one of the cheapest.
>
> > Well there is always the option of sponsoring further fulltext development. We have a guy who has been working on the GNU fulltext engines who is interested in working with MySQL fulltext. But for the moment we can not afford it.
>
> This was my first thought: people write about speed problems and how to cluster and so on. Things that I would reckon in weeks of work and high TCO.
>
> But it may be much cheaper to pay MySQL for this. How much do you estimate it would cost to implement inverted files? I think this is difficult, because Sergei told me that he couldn't use the MySQL index files any more.

If needed we could add a new index file (.MYT ?). But that would require some internal changes of unknown (to me) size. But if this is needed for continued development of fulltext we will do it.

> I just ask, nothing special in sight, but many questions from everyone who needs it. Because FTS is a feature which highly improves the value of a web site. And customers have no problem paying for things they think will make them money. FTS is such a thing.
> But perhaps if we know under which circumstances FTS is improved, it is easier for us to find a possible way to share the costs for it, or find a compromise. I also understand if MySQL doesn't want to speak about it here.
>
> I think it is also important for us how much it can theoretically be improved. My calculations showed me a theoretical speedup of a factor of 100 or so. This is ... wow. But in real life everything is usually slower...

We want to do things as fast as possible in most cases. We will try to do this for text search also, but the question is how long it will take before we finish.

> > So if some of you are interested in sponsoring this (or know about others who might be) write to [EMAIL PROTECTED]
>
> Or like this... maybe we find customers who need it. I think it's possible.
>
> My personal feeling is, and my stomach says, that fulltext indexing is a feature which needs to be expanded.

We do have Sergei working on it still. And we do plan to expand it during the coming months. But the way to get our attention in cases like this is simple. Give us some code or money!

/David
Re: Distributed Fulltext?
Wednesday, from David Axmark:

> > Your other point about exact vs. approximate answers is unclear. I expect that Google's answers are exact for their currently available indexes at any given time. But even if they are approximate, I'd be happy with that too. The scoring on a FULLTEXT search in MySQL is "exact", but based on a formula that is approximate anyway.
>
> No, MySQL returns all data matching a search. Web engines return what they find on one search machine. So you can get different results from Google every time you hit refresh, if you are routed to different machines. This has happened to me when I was looking at the number of matches rather than the results themselves.
>
> So we should try to make fulltext searches with a limit between 10 and 100 fast, to be closer to Google.
>
> I have also heard about some other things web search engines do, since I know some people at FAST, but I have forgotten that already.

My opinion is that MySQL itself should never try to find approximate matches. This is against the definition of SQL itself. SQL is a fourth-generation language. That means, if you say "SELECT", the engine selects. And the result has to be exactly what I searched for, every time, on every machine, in any combination, with the same data.

So SQL needs a new language construct to make an approximate search. But what is an approximate search? How is approximate defined? I don't think it is a good idea to implement it in this way. Approximation must always be done at the application level, because what an approximate result could be is highly dependent on the application.

> We will try to make every feature as good as possible. But we do have limited resources.

Exactly. FTS is not as important as other features, and people who want you to include a new feature should think about supporting MySQL with money. :-)

But (yes, we support MySQL!
:-) I think the need is growing rapidly, because the amount of data that has to be indexed is growing over the years. And other DBs have much more experience with it. Currently we can live with the speed. Those who cannot live with it should buy better machines, think about their search-engine concept, or support MySQL.

Search engine techniques are *not* trivial, so the last way is in my eyes one of the cheapest.

> Well there is always the option of sponsoring further fulltext development. We have a guy who has been working on the GNU fulltext engines who is interested in working with MySQL fulltext. But for the moment we can not afford it.

This was my first thought: people write about speed problems and how to cluster and so on. Things that I would reckon in weeks of work and high TCO.

But it may be much cheaper to pay MySQL for this. How much do you estimate it would cost to implement inverted files? I think this is difficult, because Sergei told me that he couldn't use the MySQL index files any more.

I just ask, nothing special in sight, but many questions from everyone who needs it. Because FTS is a feature which highly improves the value of a web site. And customers have no problem paying for things they think will make them money. FTS is such a thing.

But perhaps if we know under which circumstances FTS is improved, it is easier for us to find a possible way to share the costs for it, or find a compromise. I also understand if MySQL doesn't want to speak about it here.

I think it is also important for us how much it can theoretically be improved. My calculations showed me a theoretical speedup of a factor of 100 or so. This is ... wow. But in real life everything is usually slower...

> So if some of you are interested in sponsoring this (or know about others who might be) write to [EMAIL PROTECTED]

Or like this... maybe we find customers who need it. I think it's possible.
My personal feeling is, and my stomach says, that fulltext indexing is a feature which needs to be expanded.

-- SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris

- Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
Re: Distributed Fulltext?
Wednesday, from Mike Wexler:

> I don't think that would be appropriate. For example, our site (tias.com) has lots of antiques and collectibles. One popular category is jewelry. If somebody does a search for "gold jewelry" and the search engine interprets this as anything that mentions gold or jewelry, it is going to match a lot of items. It would be nice if we could use EXPLAIN or something like it to get a rough estimate of how many results a query would generate, and if it was really bad, we could tell the user to be more specific.

This is not a solution, but we approximate it with the queries

SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('gold')     -- results e.g. in 100
SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('jewelry')  -- results e.g. in 200

OR-search: the result is between 200 and 300 matches. AND-search: the result is between 0 and 100 matches.

The problem is: the count queries take nearly as long as the complete search. :)

-- SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris
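The arithmetic behind those bounds can be sketched quickly. This is a rough illustration (not MySQL code): given per-word match counts, the union (OR) is at least the largest single count and at most the sum, while the intersection (AND) is at most the smallest count.

```python
def or_and_bounds(counts):
    """Bounds on result sizes for OR- and AND-combinations of word matches.

    counts: per-word match counts, e.g. from
      SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('word')
    Returns ((or_min, or_max), (and_min, and_max)).
    """
    or_min = max(counts)   # the union is at least the biggest single set
    or_max = sum(counts)   # and at most all sets added up (disjoint case)
    and_min = 0            # the intersection can be empty
    and_max = min(counts)  # and is no bigger than the smallest set
    return (or_min, or_max), (and_min, and_max)

# With 100 matches for 'gold' and 200 for 'jewelry':
print(or_and_bounds([100, 200]))  # ((200, 300), (0, 100))
```

As noted above, the catch is that each COUNT(*) query costs nearly as much as the real search, so this only pays off if the counts can be cached.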
Re: Distributed Fulltext?
Hi, I will also explain how we made FTS "fast". (Sorry for my bad English.)

First some data: the table which has to be indexed has ~600,000 entries. These are articles, on average 3-4 KB each (which says nothing!), with about 300 words each (this number is very important!). That makes, theoretically, an index of 180 million words. But we have stopwords, and only about 1 million different words, so the index for the relation words <---> articles is about 25-30 million records/relations.

> This is the only important thing: how big is the index? Does it fit completely into memory?

Because of other experiences with FTS in MySQL we decided to put the FTS machine itself on an extra DB server. This machine has 1 GHz, 4 GB RAM, RAID 5.

The main application has a centralized write module; every article has to go through these routines. So it was easy to write each article not only into the article database, but also into the search database. Before we do this, we strip away HTML and other things to reduce the data.

Some things which make the engine "slow":

* Words that match very often. This is because MySQL has to find all the records, and FTS has to calculate how good each match is. That is very time consuming. So a SELECT * FROM fts is very bad; better is a SELECT index FROM fts, since the selected amount of data is much lower.

* At startup the machine is very slow, and also when there are many updates. But it's clear why; I think I don't need to explain it.

* Too many requests (>10 per second, but this depends very much on what is searched). I think this is also clear.

BTW: we found a trick for "AND" searches (by default FTS searches with OR!):

1. Count the matches for each word (SELECT COUNT(*) FROM fts WHERE MATCH(ftsindex) AGAINST ('word1')).

2. Sort the words by matches found, ascending.

3.
SELECT * FROM fts WHERE MATCH(ftsindex) AGAINST ('word with lowest number of matches') AND MATCH(ftsindex) AGAINST ('word with second lowest number of matches') AND ...

This speeds up the query by about 30-50% compared with the same query when the words are sorted descending! It is obvious: if the first set is the smallest set, MySQL has far fewer possibilities to compare.

So let's summarize:

- Split the search facility off from your application! This is also very important so that your application is not blocked too much while the ft-index is being generated (e.g. because the machine's CPU bites the dust).

- Depending on the size of the index: 1. use a separate table; 2. put more RAM into the machine; 3. use a separate database on another machine; 4. put more RAM into the machine; 5. split your DB into several DBs on different machines; 6. why don't you buy a real Google SE? A positive side effect of using a separate DB: you are able to index when there is time for it, e.g. when the load is low. Asynchronous search facilities can be a good idea in some situations.

- Write only the data which should be found into the FTS table. This reduces indexing time, search time, and the amount of data.

- Put other search criteria into this table too, if possible. Try to avoid joins!

- There exist some optimization tricks to speed up the search.

- Caching can also be a very good idea.

The biggest search takes about 2 seconds with this method.

-- SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris
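The AND-search trick above can be sketched as follows. This is a rough sketch: the table and column names are made up, and the counts would come from the COUNT(*) queries in step 1. The only point is the ordering: the rarest word goes first.

```python
def build_and_query(word_counts, table="fts", col="ftsindex"):
    """Build an AND-style fulltext query with the rarest word first.

    word_counts: dict mapping each search word to its match count
    (step 1 of the trick: one COUNT(*) query per word).
    """
    # Step 2: sort ascending by match count, rarest word first.
    ordered = sorted(word_counts, key=word_counts.get)
    # Step 3: chain MATCH ... AGAINST clauses with AND in that order.
    clauses = [f"MATCH({col}) AGAINST ('{w}')" for w in ordered]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)

print(build_and_query({"jewelry": 200, "gold": 100}))
# 'gold' (100 matches) is placed before 'jewelry' (200 matches)
```

(In real code the words would of course be escaped before being interpolated into SQL.)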
Re: Distributed Fulltext?
Mike Wexler wrote:

> That may be what you want. But I want a search engine. For my database. I want to be able to search the million records of inventory in our database and find the items with certain keywords that are currently available for sale, in the venue the user is shopping in, and within the price range the user specifies. This is a combination of database and fulltext search that I think is exactly in MySQL's marketplace.

Personally I use the term 'search engine' to mean Google-like text-only search. I don't have a term for the above, but FTS seems fine. I agree that MySQL users will tend to want an index over several columns, including one that is text (FTS).

As for practical advice now, I'll tell you what we did:

1. Got better hardware. Athlons, DDR memory, RAID: more is better. :)

2. We split the FTS tables into 10 tables. Since we do not need to search the text without some other constraint (yet -- I'd like FTS to work with MERGE tables for the few cases where we do), we know which table to use. Your mileage may vary. This helps for several reasons: there is less index to pull into memory for the search, and FTS does not seem to be O(n) but rather O(n^2), though I can't be sure.

3. Replicated the table to do FTS on to another computer. All it does is FTS. We made our own routine to replicate the table (actually 10 tables), since our text is compressed on the main server (smaller = less disk access = faster) and the indexes are different (obviously). In this case, the machine can hold a lot of the index in memory -- a big difference. Also, a long search does not lock the main tables. This is worthwhile in and of itself, even if you keep the copy of the table on the same server. De-normalization can produce huge speed increases depending on what you do and how you do it.

4. If you do #2 and #3 you'll notice that you can have x (10 for us) servers each partition the FTS.
We don't actually do this, but we could, and would thereby get 'Distributed Fulltext' -- the title of this thread!!!

Numbers 1 and 3 should work for everyone, I think. Only if your app can partition can you do #2 and #4.

Sincerely, Steven Roussey http://Network54.com/?pp=e
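Point 2 above (splitting into 10 tables and knowing which one to use) amounts to simple application-side routing. A minimal sketch, with invented table and column names, assuming the non-text constraint is something like a category id:

```python
def fts_table_for(category_id, num_tables=10):
    """Pick the FTS shard table from the non-text constraint.

    Because every search carries some other constraint (here a
    hypothetical category_id), the application can route each query
    to one small table instead of searching one big one: less index
    has to be pulled into memory per search.
    """
    return f"fts_{category_id % num_tables}"

def search_sql(category_id, query):
    """Build the fulltext query against the chosen shard."""
    table = fts_table_for(category_id)
    return f"SELECT id FROM {table} WHERE MATCH(body) AGAINST ('{query}')"

print(fts_table_for(42))       # fts_2
print(search_sql(42, "gold"))
```

Point 4 then follows naturally: put each shard table on its own server and the routing function doubles as a server selector.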
Re: Distributed Fulltext?
> While any speedup of a full-table fulltext search would be helpful and useful, there are instances where the search is intersected with another column; the problem is then more complex, but it also leads to potential optimizations.
>
> In our case we rarely do searches that use just the FTS index. More commonly, we do searches based on other columns plus a FTS-indexed text column. The WHERE clause constraints on the non-text columns usually eliminate 99% of the rows. Fulltext indexes cannot be combined with numeric columns in a single composite index, so as far as I can tell, MySQL performs the FTS and then filters the results to meet the WHERE conditions. I believe the amount of index that must be paged into main memory is much larger than might otherwise be needed (and since the indexes are larger than available memory, and there are a zillion other tables accessed simultaneously, this tends to affect all queries).

There's also a consideration about splitting the workload between the RDBMS and the application. I've improved the performance of applications by simplifying the SQL and making the Perl/C do more of the work. When it is successful (usually), it's because the application can be written to "understand" the implications behind the query results, and the language is more generalised than SQL.

Paul Wilson iiNet Ltd
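The workload split described here can be sketched simply: let the database do the cheap, indexed filtering, and let the application apply the more general text logic to the (already small) result set. All names below are hypothetical:

```python
def search_inventory(rows, words):
    """Application-side filtering after a cheap indexed DB query.

    rows: rows already narrowed by a SQL WHERE on indexed, non-text
    columns (the 99% elimination mentioned above), as (id, text) pairs.
    The application then applies the text logic, which can be more
    general than SQL allows (here: require all words, in any order).
    """
    words = [w.lower() for w in words]
    return [rid for rid, text in rows
            if all(w in text.lower() for w in words)]

# Pretend these three rows survived the indexed WHERE clause:
rows = [(1, "Gold ring with pearl"),
        (2, "Silver jewelry box"),
        (3, "Antique gold jewelry")]
print(search_inventory(rows, ["gold", "jewelry"]))  # [3]
```

The design trade-off is the one Paul names: the application "understands" the query, at the cost of shipping the candidate rows out of the database first.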
Re: Distributed Fulltext?
> Steve Rapaport wrote:
>
> > Someone correctly pointed out today that it's not MySQL's job to be Google, and I agree. But it seems to me that it would be fair for MySQL to be able to handle searches in under 1 second for databases 1 millionth the size of Google. All I want here is a decent flexible phone book lookup, not a search engine.
>
> That may be what you want. But I want a search engine. For my database. I want to be able to search the million records of inventory in our database and find the items with certain keywords that are currently available for sale, in the venue the user is shopping in, and within the price range the user specifies. This is a combination of database and fulltext search that I think is exactly in MySQL's marketplace.

I think that there are two different requirements here. A database able to handle fulltext searches is becoming almost mandatory, and MySQL achieves this, within limits. They appear to recognise the current limitations of the product and are working to make it more capable, and that's exactly what I'd expect from a company with good market awareness.

However, a full search engine is a different creature. MySQL is unlikely ever to equal Google. That requires tailored software and data structures, and more hardware than most of us are likely to have access to.

MySQL has a stop list; in my view it also needs a "yes list" so that indexing on a closed vocabulary can be implemented easily. The work that I've done trying to implement full text searching (before MySQL offered the functionality) shows what pretty much everyone else seems to have found: regardless of the capabilities of the underlying engine, full text retrieval only works well with very large document sets. With smaller volumes of text, you only get good results from a closed vocabulary - keywords, if you will.
If MySQL could be restricted to a list of words/phrases to be indexed, I think that a lot of the performance issues would be reduced, but it's certainly not the answer to every problem.

Paul Wilson iiNet Ltd
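The "yes list" idea can be sketched as an indexing filter: only words in a closed vocabulary make it into the inverted index. This is a sketch of the concept, not of MySQL internals (MySQL only offers the inverse, a stoplist):

```python
def build_yes_list_index(docs, vocabulary):
    """Inverted index restricted to a closed vocabulary (a 'yes list').

    docs: dict of doc_id -> text. Only words in `vocabulary` are indexed,
    so the index stays small no matter how wordy the documents are.
    """
    vocab = {w.lower() for w in vocabulary}
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            if word in vocab:  # the 'yes list' filter
                index.setdefault(word, set()).add(doc_id)
    return index

docs = {1: "antique gold jewelry",
        2: "a very long rambling description of gold"}
idx = build_yes_list_index(docs, ["gold", "jewelry"])
print(idx)  # only 'gold' and 'jewelry' are ever indexed
```

With a closed vocabulary, index size scales with the keyword list rather than the document text, which is the performance point being made above.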
Re: Distributed Fulltext?
Brian DeFeyter wrote:

> On Wed, 2002-02-13 at 16:39, Mike Wexler wrote:
>
> > Brian DeFeyter wrote:
> >
> >> I sorta like that idea. I don't know exactly what you can and can't do as far as indexing inside of HEAP tables.. but the index size would likely differ from the written index. Then you can expand the idea and use the X/(num slices) on (num slices) boxes technique.. sending the query to each, and compiling all of the results.
> >>
> >> Your comment about only having a few common words that are searched makes me wonder if a reverse-stopword function would be valuable (ie: only index words in a list you generate). Probably could be done other ways too with what's available. Maybe a pseudo-bitmap index inside a mysql table?
> >
> > I don't think that would be appropriate. For example, our site (tias.com) has lots of antiques and collectibles. One popular category is jewelry. If somebody does a search for "gold jewelry" and the search engine interprets this as anything that mentions gold or jewelry, it is going to match a lot of items. It would be nice if we could use EXPLAIN or something like it to get a rough estimate of how many results a query would generate, and if it was really bad, we could tell the user to be more specific.
>
> The FOUND_ROWS() function in mysql-4.0.0+ might help you out in this case.

That only works after you've waited 2 minutes for the query to complete. I want something (like EXPLAIN) that is fairly quick and that you can run ahead of time.
Re: Distributed Fulltext?
On Wed, 2002-02-13 at 16:39, Mike Wexler wrote:

> Brian DeFeyter wrote:
>
> > I sorta like that idea. I don't know exactly what you can and can't do as far as indexing inside of HEAP tables.. but the index size would likely differ from the written index. Then you can expand the idea and use the X/(num slices) on (num slices) boxes technique.. sending the query to each, and compiling all of the results.
> >
> > Your comment about only having a few common words that are searched makes me wonder if a reverse-stopword function would be valuable (ie: only index words in a list you generate). Probably could be done other ways too with what's available. Maybe a pseudo-bitmap index inside a mysql table?
>
> I don't think that would be appropriate. For example, our site (tias.com) has lots of antiques and collectibles. One popular category is jewelry. If somebody does a search for "gold jewelry" and the search engine interprets this as anything that mentions gold or jewelry, it is going to match a lot of items. It would be nice if we could use EXPLAIN or something like it to get a rough estimate of how many results a query would generate, and if it was really bad, we could tell the user to be more specific.

The FOUND_ROWS() function in mysql-4.0.0+ might help you out in this case.
Re: Distributed Fulltext?
Brian DeFeyter wrote:

> I sorta like that idea. I don't know exactly what you can and can't do as far as indexing inside of HEAP tables.. but the index size would likely differ from the written index. Then you can expand the idea and use the X/(num slices) on (num slices) boxes technique.. sending the query to each, and compiling all of the results.
>
> Your comment about only having a few common words that are searched makes me wonder if a reverse-stopword function would be valuable (ie: only index words in a list you generate). Probably could be done other ways too with what's available. Maybe a pseudo-bitmap index inside a mysql table?

I don't think that would be appropriate. For example, our site (tias.com) has lots of antiques and collectibles. One popular category is jewelry. If somebody does a search for "gold jewelry" and the search engine interprets this as anything that mentions gold or jewelry, it is going to match a lot of items. It would be nice if we could use EXPLAIN or something like it to get a rough estimate of how many results a query would generate, and if it was really bad, we could tell the user to be more specific.

> - bdf
>
> On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:
>
>> My understanding is that part of how Google and AltaVista get such high speeds is to keep everything in memory. Is it possible to create a HEAP table with a fulltext index? If so, does the fulltext index take advantage of being in memory? For example, I would imagine that if you were keeping the whole index in memory, details like the index page size and the format of the pointers/record numbers would be different.
>>
>> Then you could just do something roughly like (I know the syntax is a little off):
>>
>> CREATE HEAP TABLE fooFast SELECT * FROM fooSlow
>> ALTER fooFast ADD FULLTEXT(a, b, c)
>>
>> Or maybe you could just have fooSlow on one server, and then have it replicated
>> on N other servers. But on the other servers you could alter the table type so it was a HEAP table. So you would have one persistent table and a bunch of replicated HEAP tables, and all the searches could go against the HEAP tables.
>>
>> Brian Bray wrote:
>>
>>> It seems to me like the best solution that could be implemented as-is would be to keep a random int column in your table (with a range of, say, 1-100) and then have fulltext server 1 pseudo-replicate records with the random number in the range 1-10, server 2 11-20, server 3 21-30, and so on.
>>>
>>> Then run your query on all 10 servers and merge the result sets, and possibly re-sort them if you use the score column.
>>>
>>> The problem with splitting the index up by word is that it messes up all your scoring and ranking. For example, what if you search using 5 keywords, all starting with letters from different groups? You're going to get a pretty bad score for each match, and it could totally break boolean searches.
>>>
>>> -- Brian Bray
>>>
>>> Brian DeFeyter wrote:
>>>
>>>> On Thu, 2002-02-07 at 15:40, Tod Harter wrote: [snip]
>>>>
>>>>> Wouldn't be too tough to write a little query routing system if you are using Perl. Use DBD::Proxy on the web server side, and just hack the Perl proxy server so it routes the query to several places and returns a single result set. Ordering could be achieved as well. I'm sure there are commercial packages out there as well. I don't see why the individual database servers would need to do anything special.
>>>>
>>>> [snip]
>>>>
>>>> If I'm understanding you correctly, I think you're referring to routing based on the first character of the word. That would work for cases where the query is searching for a word that begins with a certain character.. however fulltext searches also return results with the term in the middle.
>>>> ie: a search for 'foo' could return: foo.txt, foobar; but also could return: thisisfoo, that_is_foolish.
>>>>
>>>> I could be wrong, but it's my understanding that MySQL stores its fulltext index based on all the 'unique words' found. For such a system as you mentioned above, you'd probably have to create your own fulltext indexing system to determine: a) where to store the data 'segments' and b) how to route queries. It seems like this could probably be done much more efficiently inside of the server.
>>>>
>>>> - Brian
Re: Distributed Fulltext?
I sorta like that idea. I don't know exactly what you can and can't do as far as indexing inside of HEAP tables.. but the index size would likely differ from the written index. Then you can expand the idea and use the X/(num slices) on (num slices) boxes technique.. sending the query to each, and compiling all of the results.

Your comment about only having a few common words that are searched makes me wonder if a reverse-stopword function would be valuable (ie: only index words in a list you generate). Probably could be done other ways too with what's available. Maybe a pseudo-bitmap index inside a mysql table?

- bdf

On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:

> My understanding is that part of how Google and AltaVista get such high speeds is to keep everything in memory. Is it possible to create a HEAP table with a fulltext index? If so, does the fulltext index take advantage of being in memory? For example, I would imagine that if you were keeping the whole index in memory, details like the index page size and the format of the pointers/record numbers would be different.
>
> Then you could just do something roughly like (I know the syntax is a little off):
>
> CREATE HEAP TABLE fooFast SELECT * FROM fooSlow
> ALTER fooFast ADD FULLTEXT(a, b, c)
>
> Or maybe you could just have fooSlow on one server, and then have it replicated on N other servers. But on the other servers you could alter the table type so it was a HEAP table. So you would have one persistent table and a bunch of replicated HEAP tables, and all the searches could go against the HEAP tables.
>
> Brian Bray wrote:
>
> > It seems to me like the best solution that could be implemented as-is would be to keep a random int column in your table (with a range of, say, 1-100) and then have fulltext server 1 pseudo-replicate records with the random number in the range 1-10, server 2 11-20, server 3 21-30, and so on.
> > Then run your query on all 10 servers and merge the result sets, and possibly re-sort them if you use the score column.
> >
> > The problem with splitting the index up by word is that it messes up all your scoring and ranking. For example, what if you search using 5 keywords, all starting with letters from different groups? You're going to get a pretty bad score for each match, and it could totally break boolean searches.
> >
> > -- Brian Bray
> >
> > Brian DeFeyter wrote:
> >
> >> On Thu, 2002-02-07 at 15:40, Tod Harter wrote: [snip]
> >>
> >>> Wouldn't be too tough to write a little query routing system if you are using Perl. Use DBD::Proxy on the web server side, and just hack the Perl proxy server so it routes the query to several places and returns a single result set. Ordering could be achieved as well. I'm sure there are commercial packages out there as well. I don't see why the individual database servers would need to do anything special.
> >>
> >> [snip]
> >>
> >> If I'm understanding you correctly, I think you're referring to routing based on the first character of the word. That would work for cases where the query is searching for a word that begins with a certain character.. however fulltext searches also return results with the term in the middle.
> >>
> >> ie: a search for 'foo' could return: foo.txt, foobar; but also could return: thisisfoo, that_is_foolish.
> >>
> >> I could be wrong, but it's my understanding that MySQL stores its fulltext index based on all the 'unique words' found. For such a system as you mentioned above, you'd probably have to create your own fulltext indexing system to determine: a) where to store the data 'segments' and b) how to route queries. It seems like this could probably be done much more efficiently inside of the server.
> >> - Brian
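bdf's "reverse-stopword" idea above — index only the words on a list you generate, instead of indexing everything minus a stopword list — can be sketched in a few lines. This is an illustration only, not how MySQL's fulltext code actually works; the row data and include list here are made up.

```python
# Sketch of a "reverse stopword" (include-list) index: only words on an
# explicit list -- e.g. the few common terms users actually search for --
# ever make it into the index, which keeps it small. Purely illustrative.

def build_include_index(rows, include):
    """Map each allowed word to the set of row ids containing it."""
    index = {}
    for row_id, text in rows:
        for word in set(text.lower().split()):
            if word in include:
                index.setdefault(word, set()).add(row_id)
    return index

rows = [
    (1, "cheap red widget"),
    (2, "blue widget sale"),
    (3, "red gadget"),
]
include = {"widget", "red"}        # the generated search-term list
idx = build_include_index(rows, include)
```

A lookup is then a plain dictionary hit, and anything off the list (like "cheap") simply never enters the index.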
Re: Distributed Fulltext?
Steve Rapaport wrote: > > Someone correctly pointed out today that it's not Mysql's job > to be Google, and I agree. But it seems to me that it would be > fair for mysql to be able to handle searches in under 1 second > for databases 1 millionth the size of Google. All I want here > is a decent flexible phone book lookup, not a search engine. That may be what you want. But I want a search engine. For my database. I want to be able to do a search of the million records of inventory in our database and find the items with certain keywords, that are currently available for sale, in the venue the user is shopping in and within the price range the user specifies. This is a combination of database and fulltext search that I think is exactly in MySQL's marketplace. And the current situation, where a user fulltext query can take between 1 ms and 100 secs with no easy way for the application to tell ahead of time, is really unacceptable. So I think several things need to be done: 1) full text search needs to be faster (they are working on it). 2) There needs to be some way (a la EXPLAIN) of figuring out roughly how long a particular full text search will take. If I could do EXPLAIN SELECT * FROM inventory WHERE MATCH (title, description) AGAINST ($userQuery) and get something useful for predicting search time, then I could at least say something to the user like: That query matches too many records. Please pick a more specific query. Rather than hosing the whole database for 3 or 4 common words. 3) Find ways to use replication and memory to speed up full-text queries. See my previous mail about heap tables for a possible example. - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
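Mike's point 2 — tell the application *before* running the query that it will be too slow — can be approximated in the application itself. A minimal sketch, assuming you keep (or periodically sample) a per-word document-frequency table; the frequency numbers and the threshold below are invented for illustration:

```python
# Gate a fulltext query by its estimated match count: an AND-style search
# can match at most as many rows as its rarest term, so if even the
# rarest term is too common, reject the query up front. WORD_FREQ would
# come from a sampled frequency table in a real system; here it is faked.

WORD_FREQ = {"the": 9_000_000, "widget": 12_000, "obsidian": 40}
MAX_MATCHES = 100_000   # made-up cutoff for "this will hose the database"

def query_is_reasonable(user_query):
    words = user_query.lower().split()
    if not words:
        return False
    rarest = min(WORD_FREQ.get(w, 0) for w in words)  # unknown word -> 0 matches
    return rarest <= MAX_MATCHES
```

When the check fails, the application can answer "That query matches too many records, please be more specific" instead of letting the query run for 100 seconds.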
Re: Distributed Fulltext?
My understanding is that part of how google and Altavista get such high speeds is to keep everything in memory. Is it possible to create a HEAP table with a full text index? If so, does the full text index take advantage of being in memory? For example, I would imagine that if you were keeping the whole index in memory, details like the index page size, and the format of the pointers/record numbers would be different. Then you could just do something roughly like (i know the syntax is a little off) CREATE HEAP TABLE fooFast SELECT * FROM fooSlow ALTER fooFast ADD fulltext(a, b, c) Or maybe you could just have fooSlow on one server. And then have it replicated on N other servers. But on the other servers you could alter the table type so it was a heap table. So you would have one persistent table and a bunch of replicated heap tables. And all the searches could go against the heap tables. Brian Bray wrote: > It seems to me like the best solution that could be implemented as-is > would be to keep a random int column in your table (with a range of say > 1-100) and then have fulltext server 1 pseudo-replicate records with > the random number in the range of 1-10, server 2 11-20 and server 3 > 21-30 and so on. > > Then run your query on all 10 servers and merge the result sets and > possibly re-sort them if you use the score column. > > The problem with splitting the index up by word is that it messes up all > your scoring and ranking. For example what if you search using 5 > keywords, all starting with letters from different groups? You're going > to get a pretty bad score for each match, and it could totally break > boolean searches. > > -- > Brian Bray > > > > > Brian DeFeyter wrote: > >>On Thu, 2002-02-07 at 15:40, Tod Harter wrote: >>[snip] >> >> >>>Wouldn't be too tough to write a little query routing system if you are using >>>perl. 
Use DBD::Proxy on the web server side, and just hack the perl proxy >>>server so it routes the query to several places and returns a single result >>>set. Ordering could be achieved as well. I'm sure there are commercial >>>packages out there as well. I don't see why the individual database servers >>>would need to do anything special. >>> >>> >>[snip] >> >>If I'm understanding you correctly, I think you're referring to routing >>based on the first character of the word. That would work for cases >>where the query is searching for a word that begins with a certain >>character.. however fulltext searches also return results with the term >>in the middle. >> >>ie: a search for 'foo' could return: >>foo.txt >>foobar >> >>but also could return: >>thisisfoo >>that_is_foolish >> >>I could be wrong, but it's my understanding that MySQL stores its >>fulltext index based on all the 'unique words' found. For such a system >>as you mentioned above, you'd probably have to create your own fulltext >>indexing system to determine: a) where to store the data 'segments' and >>b) how to route queries. It seems like this could probably be done much >>more efficiently inside of the server. >> >> - Brian
Re: Distributed Fulltext?
> [comparisons to Google...] While any speed up with a full table fulltext search would be helpful and useful, there are instances where the search is intersected with another column, and the problem of search is therefore more complex but also leads to potential optimizations. In our case we rarely do searches that use just the FTS index. More commonly, we do searches based on other columns and a FTS indexed text column. The WHERE clause constraints on the non-text column usually eliminate 99% of the rows. Fulltext indexes cannot be combined with numeric columns in a single composite index, so as far as I can tell, MySQL performs the FTS then filters the results to meet the WHERE conditions. I believe the required amount of the index to be paged into main memory is much larger than might otherwise be needed (since the indexes are larger than available memory, and there are a zillion other tables accessed simultaneously, this tends to affect all queries). So besides the obvious desire to lower the size of indexes, it would be beneficial if a combined index could lower the number of matching rows and the amount of the index needed to be in memory. I actually find it odd for people not to be using secondary columns in a fulltext search, other than to do something like Google (which does not lend itself to using SQL in the first place). Of course, I have a myopic view stemming from my own requirements and experience. All that said, I'm glad to have the FTS we have. Thanks Sergei for such a wonderful improvement to LIKE '%word%'! Sincerely, Steven Roussey http://Network54.com/?pp=e PS - is FTS search and update O(n^2)?
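Steven's observation — that a selective non-text condition eliminates 99% of the rows before the text even matters — is easy to demonstrate with toy data. The sketch below only illustrates the difference in candidate-set sizes between "FTS first, then filter" and "filter first, then FTS"; it says nothing about MySQL internals, and the row counts are invented:

```python
# 10,000 toy rows; venue is a selective column (100 distinct values),
# text is the column a fulltext search would scan.
rows = [
    {"id": i, "venue": i % 100, "text": "widget" if i % 2 else "gadget"}
    for i in range(10_000)
]

def fts_then_filter(word, venue):
    hits = [r for r in rows if word in r["text"]]      # big intermediate set
    return [r for r in hits if r["venue"] == venue], len(hits)

def filter_then_fts(word, venue):
    cand = [r for r in rows if r["venue"] == venue]    # 1% of the table
    return [r for r in cand if word in r["text"]], len(cand)

a, fts_examined = fts_then_filter("widget", venue=7)
b, filt_examined = filter_then_fts("widget", venue=7)
```

Both orders return the same rows, but the first one drags 5,000 fulltext hits through the filter while the second touches only the 100 rows in the venue — the effect a combined index would give without doing the big scan at all.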
Re: Distributed Fulltext?
On Tue, 2002-02-12 at 15:38, Steve Rapaport wrote: > David Axmark writes: > > > So the standard answer with Apples and Oranges certainly apply here! > > More like Äpplen och Apelsiner, that is, different but similar. You Swedish > guys should know. Thanks for answering, David, I appreciate the attention > from a founder. > > I also appreciate your point that Google is updating continuously and > therefore not always caught up to the current state of the web. > But isn't that a problem with indexing speed, not with search speed? > Their search speed is still amazing, as is Altavista, and most of the > other engines. > > Your other point about exact vs. approximate answers is unclear, I expect > that Google's answers are exact for their currently available indexes at any > given time. But even if they are approximate, I'd be happy with that too. > The scoring on a FULLTEXT search in Mysql is "exact" but based on a > formula that is approximate anyway. No, MySQL returns all data matching a search. Web engines return what they find on one search machine. So you can get different results with Google every time you hit refresh if you are routed to different machines. This has happened to me when I was looking for the number of matches and not the result itself. So we should try to make fulltext searches with a limit between 10 and 100 be fast, to be closer to Google. I have also heard about some other things web search engines do, since I know some people at FAST, but I have forgotten the details. > I'll summarize this thread as best I can. > > From the math I used, we started with my estimate of 10^9, > which was mistaken. The real figure was 10^6, that is, Google > searches fulltext about a million times faster than Mysql. > Then we used Google's 10,000 machines + DRAM indexing to reduce the > gap to 10^2, or 100 times faster. I would say we should reduce it even further but that could be discussed. 
> It turns out that 100 times is about the factor that is causing > my application problems. If it just ran 100 times faster it would be > about as fast as a regular indexed search, and I'd > be happy. > > A few people suggested that Mysql shouldn't try to be faster, > I (and some high-support customers like Mike Wexler) disagreed. > And Alex Aulbach, bless him, actually did his homework and showed that > things could be much improved with smart index techniques like > inverted files. We will try to make every feature as good as possible. But we do have limited resources. > Then Sergei Golubchik wrote back to say he had taken some of the good ideas > and inserted them into the TODO list, although he had higher priorities > at the time. > > And I was satisfied for now, although my application still isn't working > satisfactorily due to a really slow and CPU-hungry FULLTEXT search. Well there is always the option of sponsoring further fulltext development. We have a guy who has been working on the GNU fulltext engines who is interested in working with MySQL fulltext. But for the moment we cannot afford it. So if some of you are interested in sponsoring this (or know about others who might be) write to [EMAIL PROTECTED] > I think that's our story so far. > > Steve Rapaport > Director, Technical Services > A-Tono
Re: Distributed Fulltext?
I said: > > Why is it that Altavista can index terabytes overnight and return > > a fulltext boolean for the WHOLE WEB > > within a second, and Mysql takes so long? On Friday 08 February 2002 08:56, Vincent Stoessel wrote: > Apples and oranges. Yeah, I know. But let's see if we can make some distinctions. If, say, Google, can search 2 trillion web pages, averaging say 70k bytes each, in 1 second, and Mysql can search 22 million records, with an index on 40 bytes each, in 3 seconds (my experience) on a good day, what's the order of magnitude difference? Roughly 10^9. > Have you seen the /hardware/ run that enterprise with? Irrelevant, you're unlikely to get 9 orders of magnitude difference with faster hardware or even with clustering. > Also, their software is optimized for full text searches and that > is /all/ they do. Mysql is an SQL database and is optimized as such. Absolutely granted. You are completely right. And I don't expect the data format to change. BUT: thought experiment: When Mysql decides to generate a FULLTEXT index, it is using an index file, the .MYI file, which can have ANY FORMAT it wants to. If the .MYI format is poorly optimized for fulltext searches, they can improve the format. They can even introduce a new index file type .MYF solely for optimizing fulltext searches. None of this need have any impact on the data file format or the SQL search optimizations, and yet it could still improve the search speed for fulltext. It might not help as much for the slow indexing, but it could certainly improve the performance of the search. Thinking out loud... Steve
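The structure Steve is gesturing at with a hypothetical .MYF file — and that Alex's "inverted files" suggestion elsewhere in the thread names explicitly — is an inverted index: one postings list of row ids per unique word, so a search touches only the lists for its query terms instead of scanning. A minimal in-memory sketch (a real index file would store these lists compressed on disk):

```python
# Minimal inverted index: word -> sorted-ish list of row ids ("postings").

def build_inverted(rows):
    inv = {}
    for row_id, text in rows:
        for word in set(text.lower().split()):
            inv.setdefault(word, []).append(row_id)
    return inv

def search_and(inv, query):
    """Boolean AND: intersect the postings lists of every query word."""
    lists = [set(inv.get(w, ())) for w in query.lower().split()]
    return sorted(set.intersection(*lists)) if lists else []

inv = build_inverted([
    (1, "fast fulltext search"),
    (2, "fast index"),
    (3, "fulltext index"),
])
```

A boolean query then costs roughly the size of the shortest postings list involved, not the size of the table — which is the whole point of the format.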
Re: Distributed Fulltext?
Oops, factual error: > > If, say, Google, can search 2 trillion web pages, averaging say 70k > > bytes each, in 1 second, and Mysql can search 22 million records, with > > an index on 40 bytes each, in 3 seconds (my experience) on a good day, > > what's the order of magnitude difference? Roughly 10^9. Google has only 2 Billion with a B web pages. So the order of magnitude difference is only 10^6. But that still leaves a mysql Fulltext search a MILLION times slower than Google. Someone correctly pointed out today that it's not Mysql's job to be Google, and I agree. But it seems to me that it would be fair for mysql to be able to handle searches in under 1 second for databases 1 millionth the size of Google. All I want here is a decent flexible phone book lookup, not a search engine. -steve
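Steve's arithmetic here compares bytes searched per second, using the thread's own rough figures (70k bytes per page, a 40-byte indexed field, his 3-second measurement). Redone explicitly, the corrected 2-billion-page figure indeed comes out near 10^6, and the original 2-trillion version near 10^9:

```python
# Order-of-magnitude check of the thread's numbers, in bytes/second.
import math

GOOGLE_PAGES = 2e9        # corrected figure: "2 Billion with a B"
PAGE_BYTES = 70e3         # Steve's rough 70k bytes per page
MYSQL_ROWS = 22e6
ROW_INDEX_BYTES = 40
MYSQL_SECONDS = 3.0

google_rate = GOOGLE_PAGES * PAGE_BYTES / 1.0                   # ~1.4e14 B/s
mysql_rate = MYSQL_ROWS * ROW_INDEX_BYTES / MYSQL_SECONDS       # ~2.9e8 B/s

magnitude = math.log10(google_rate / mysql_rate)                # rounds to 6
magnitude_2e12 = math.log10(2e12 * PAGE_BYTES / mysql_rate)     # rounds to 9
```

So the "million times slower" headline is consistent with the rest of the thread's math, given those assumed per-page and per-row sizes.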
Re: Distributed Fulltext?
> Why is it that Altavista can index terabytes overnight and return > a fulltext boolean for the WHOLE WEB > within a second, and Mysql takes so long? I don't know about Altavista, but if you read up on Google, they do indeed do some sort of spreading of keywords across multiple machines - last I saw they had over 10,000 machines. But you have to ask the question as to whether this would be an appropriate development for MySQL. This would seem to me to be a very complex development of interest to a very small subset of MySQL customers. MySQL, it seems to me, specialises in doing the mainstream database manipulations used by many people simply, quickly, and (my most important criterion) reliably. Such a development would, IMO, be away from the whole trend of MySQL, and shouldn't happen before triggers, stored procedures and the like - if ever. A multi-machine distributed database is a wholly different animal - IMO again. If I was going there I wouldn't start from here - it would need a "clean brain" look at the whole problem. If you really need such searches, Google survives by licensing that technology - speak to them. Alec Cawley
Re: Distributed Fulltext?
Steve Rapaport wrote: > On Friday 08 February 2002 06:14 pm, James Montebello wrote: > >>Distribution is how Google gets its speed. You say clustering won't >>solve the problem, but distributing the indices across many processors >>*is* going to gain you a huge speed increase through sheer parallelism. >> > > True, but not enough. The BEST parallelism can do in a compute-bound > application is divide the time by the number of processors. That's assuming > a PERFECT routing system. (Correct me if I'm wrong here) There are actually some exceptions. Specifically, if you have a very large problem set and parallelism allows you to move the problem set into a faster storage medium, you can sometimes see greater performance increases. Let's say you are doing a full text search. And you have a 20 GB full text index. It may not be feasible to build a machine with 20GB of RAM for storing the index in RAM. But it might be more feasible to store 1/20th of the index in each of 20 1GB machines. And with RAM being >1000 times as fast as hard disk, you could get a huge win. > > So to make the routing system + parallelism add up to > a MILLION times better performance, you would need at > least a MILLION processors. I doubt that even Google is > doing that. > > >>Google uses thousands of processors to handle its index, and any given >>search is going to be spread over 100s of processors. >> > > Right, so we can expect Google to do, say, 10,000 times (10^4) better > than Mysql at a Fulltext search. But in fact we're seeing > a million, 10^6, being generous. It's that extra factor > of a hundred (more likely a thousand, my estimates > were very generous) that I'm getting all fussy about. > > >>Asking >>a general purpose RDBMS to be really good at 10 different things is asking >>a bit much. Fulltext searches are well down the list. >> > > Here we get out of the realm of hard numbers and into opinions. 
But here's > mine: If mysql bothers to support fulltext searches, it's presumably because > there's some demand for them in some circumstances. The level of scaling and > optimization that could reasonably be expected: What it takes to make the > feature useful (i.e. perform similarly) in similar cases to other Mysql > features. With a regular index, I can do a hash lookup on 23 million records > subsecond. With a fulltext index (on a small field, only 40 characters) my > time slips to 3 to 180 seconds. That extra little factor of 100 is my > problem. Distributing over 4 processors wouldn't really help much. And > because people don't always type a company name the exact same way, > FULLTEXT really is the best way to do this. > > So Monty et al, my request is this: please put on the wish list, enhancement > of FULLTEXT search to approximately match the performance of an indexed > search on 25 million records, on the same hardware and with other things > held equal. > > Steve Rapaport
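Mike's "exception to linear speedup" can be put into numbers with a toy latency model: splitting a 20 GB index across 20 machines is not just 20-way parallelism, it also moves every lookup from disk to RAM. The latencies below are round illustrative figures, not measurements:

```python
# Superlinear speedup from sharding an index into RAM: 20-way parallelism
# multiplied by the disk-to-RAM latency ratio. All constants are
# illustrative round numbers.

DISK_LOOKUP_S = 0.010      # ~10 ms per random disk seek
RAM_LOOKUP_S = 0.000001    # ~1 us per in-memory probe
LOOKUPS = 5_000            # index probes needed by one big fulltext query

one_box_disk = LOOKUPS * DISK_LOOKUP_S            # all on one disk-bound box
twenty_box_ram = (LOOKUPS / 20) * RAM_LOOKUP_S    # sharded and in RAM

speedup = one_box_disk / twenty_box_ram           # 20 x 10,000 = 200,000x
```

With these (made-up but reasonable-order) numbers, 20 machines buy a factor of 200,000, not 20 — which is exactly the "huge win" Mike describes.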
Re: Distributed Fulltext?
On Thursday 07 February 2002 14:53, Brian DeFeyter wrote: > Has anyone made a suggestion or thought about ways to distribute > databases which focus on fulltext indexes? > > fulltext indexes do a good job of indexing a moderate amount of data, > but when you get a lot of data to be indexed, the queries slow down > significantly. > > I have an example table, with about 90 million rows.. and has a fulltext > index on a varchar(100) field. A single-word query which would return > approx 300k results takes an average of 15 seconds. A query with smaller > results (~ 10k) can be as quick as 1 sec.. which I would consider > acceptable. > > Has anyone thought about splitting the data into distributed files or even > machines? ie: something as simple as 'words' starting with 'X' are split > into a-h, i-p, q-z... or something more advanced? (maybe mysqld could > automatically split results based on (#results per unique 'word' / > desired # of 'split files/machines')) Would such a system give any > advantages to searching speed and concurrent query scalability? I > haven't looked at the fulltext internals.. so I don't know if such > "query routing" could take place or not. > > If nothing else, does anyone else have experience with a table of this > size or even larger? What kind of tuning have you done? > > Thanks, > > - Brian Wouldn't be too tough to write a little query routing system if you are using perl. Use DBD::Proxy on the web server side, and just hack the perl proxy server so it routes the query to several places and returns a single result set. Ordering could be achieved as well. I'm sure there are commercial packages out there as well. I don't see why the individual database servers would need to do anything special. 
Re: Distributed Fulltext?
It seems to me like the best solution that could be implemented as-is would be to keep a random int column in your table (with a range of say 1-100) and then have fulltext server 1 pseudo-replicate records with the random number in the range of 1-10, server 2 11-20 and server 3 21-30 and so on. Then run your query on all 10 servers and merge the result sets and possibly re-sort them if you use the score column. The problem with splitting the index up by word is that it messes up all your scoring and ranking. For example what if you search using 5 keywords, all starting with letters from different groups? You're going to get a pretty bad score for each match, and it could totally break boolean searches. -- Brian Bray Brian DeFeyter wrote: > On Thu, 2002-02-07 at 15:40, Tod Harter wrote: > [snip] > >>Wouldn't be too tough to write a little query routing system if you are using >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy >>server so it routes the query to several places and returns a single result >>set. Ordering could be achieved as well. I'm sure there are commercial >>packages out there as well. I don't see why the individual database servers >>would need to do anything special. >> > [snip] > > If I'm understanding you correctly, I think you're refering to routing > based on the first character of the word. That would work for cases > where the query is searching for a word that begins with a certain > character.. however fulltext searches also return results with the term > in the middle. > > ie: a search for 'foo' could return: > foo.txt > foobar > > but also could return: > thisisfoo > that_is_foolish > > I could be wrong, but it's my understanding that MySQL stores it's > fulltext index based on all the 'unique words' found. For such a system > as you mentioned above, you'd probably have to create your own fulltext > indexing system to determine: a) where to store the data 'segments' and > b) how to route queries. 
It seems like this could probably be done much > more efficiently inside of the server. > > - Brian
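Brian Bray's scheme — a random bucket column 1-100, each server replicating one band of buckets, the same query fanned out to all servers and the results re-sorted by score — can be sketched directly. Servers are simulated here as plain lists of (score, row_id) results; nothing below is real MySQL machinery:

```python
# Random-int partitioning: which server owns a bucket, and how the client
# merges the per-server result sets back into one score-ordered list.

def assign_server(bucket, num_servers=10, num_buckets=100):
    """Buckets 1-10 -> server 0, 11-20 -> server 1, ..., 91-100 -> server 9."""
    return (bucket - 1) * num_servers // num_buckets

def merge_results(per_server_results, limit=10):
    """Concatenate each server's (score, row_id) hits and re-sort by score."""
    merged = [hit for results in per_server_results for hit in results]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:limit]
```

Because every server sees a random sample of the whole table rather than a slice of the vocabulary, per-document scores stay meaningful — the property Brian points out that splitting by word would destroy.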
Re: Distributed Fulltext?
I did this at a previous job, and we split the data up more or less this way (we used a pre-existing item number for the split which was essentially random in relation to the text data), with an aggregator that did the query X ways, each to a separate box holding 1/X of the data. The results from each unit were paged and sorted, so all the aggregator did was do a simple merge sort on a "page" of the set, which was fast. On a 6M record dataset, it produced millisecond-range search results. Not exactly Google-class, but pretty good for 12 Linux boxes, two programmers, and about six weeks of effort. james montebello On Thu, 7 Feb 2002, Brian Bray wrote: > > It seems to me like the best solution that could be implemented as-is > would be to keep a random int column in your table (with a range of say > 1-100) and then have fulltext server 1 psudo-replicate records with a > the random number in the range of 1-10, server 2 11-20 and server 3 > 21-30 and so on. > > Then run your query on all 10 servers and merge the result sets and > possibly re-sort them if you use the score column. > > The problem with splitting the index up by word is that is messes up all > your scoring and ranking. For example what if you search using 5 > keywords, all starting with letters from different groups? Your going > to get pretty bad score for each match, and it could totally break > boolean searches. > > -- > Brian Bray > > > > > Brian DeFeyter wrote: > > On Thu, 2002-02-07 at 15:40, Tod Harter wrote: > > [snip] > > > >>Wouldn't be too tough to write a little query routing system if you are using > >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy > >>server so it routes the query to several places and returns a single result > >>set. Ordering could be achieved as well. I'm sure there are commercial > >>packages out there as well. I don't see why the individual database servers > >>would need to do anything special. 
> >> > > [snip] > > > > If I'm understanding you correctly, I think you're refering to routing > > based on the first character of the word. That would work for cases > > where the query is searching for a word that begins with a certain > > character.. however fulltext searches also return results with the term > > in the middle. > > > > ie: a search for 'foo' could return: > > foo.txt > > foobar > > > > but also could return: > > thisisfoo > > that_is_foolish > > > > I could be wrong, but it's my understanding that MySQL stores it's > > fulltext index based on all the 'unique words' found. For such a system > > as you mentioned above, you'd probably have to create your own fulltext > > indexing system to determine: a) where to store the data 'segments' and > > b) how to route queries. It seems like this could probably be done much > > more efficiently inside of the server. > > > > - Brian
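James's aggregator is cheap precisely because each box returns an already sorted "page": the aggregator only merge-sorts X small sorted lists instead of re-sorting everything. A miniature of that step, using the standard-library k-way merge (the page contents are invented):

```python
# Merge per-box result pages, each already sorted descending by score,
# into one page, using heapq.merge (a lazy k-way merge of sorted inputs).
import heapq
import itertools

def merge_pages(pages, page_size=10):
    # heapq.merge wants ascending inputs, so merge on negated score.
    keyed = [[(-score, row_id) for score, row_id in page] for page in pages]
    top = itertools.islice(heapq.merge(*keyed), page_size)
    return [(-neg_score, row_id) for neg_score, row_id in top]
```

The work is O(page_size x log X) regardless of how many total hits the boxes found, which is why the aggregator stayed in the millisecond range on a 6M-record dataset.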
Re: Distributed Fulltext?
David Axmark writes: > So the standard answer with Apples and Oranges certainly apply here! More like Äpplen och Apelsiner, that is, different but similar. You Swedish guys should know. Thanks for answering, David, I appreciate the attention from a founder. I also appreciate your point that Google is updating continuously and therefore not always caught up to the current state of the web. But isn't that a problem with indexing speed, not with search speed? Their search speed is still amazing, as is Altavista, and most of the other engines. Your other point about exact vs. approximate answers is unclear, I expect that Google's answers are exact for their currently available indexes at any given time. But even if they are approximate, I'd be happy with that too. The scoring on a FULLTEXT search in Mysql is "exact" but based on a formula that is approximate anyway. I'll summarize this thread as best I can. From the math I used, we started with my estimate of 10^9, which was mistaken. The real figure was 10^6, that is, Google searches fulltext about a million times faster than Mysql. Then we used Google's 10,000 machines + DRAM indexing to reduce the gap to 10^2, or 100 times faster. It turns out that 100 times is about the factor that is causing my application problems. If it just ran 100 times faster it would be about as fast as a regular indexed search, and I'd be happy. A few people suggested that Mysql shouldn't try to be faster, I (and some high-support customers like Mike Wexler) disagreed. And Alex Aulbach, bless him, actually did his homework and showed that things could be much improved with smart index techniques like inverted files. Then Sergei Golubchik wrote back to say he had taken some of the good ideas and inserted them into the TODO list, although he had higher priorities at the time. And I was satisfied for now, although my application still isn't working satisfactorily due to a really slow and CPU-hungry FULLTEXT search. I think that's our story so far. 
Steve Rapaport Director, Technical Services A-Tono - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
Re: Distributed Fulltext?
On Fri, 2002-02-08 at 11:11, Steve Rapaport wrote: > I said: > > > Why is it that Altavista can index terabytes overnight and return > > > a fulltext boolean for the WHOLE WEB > > > within a second, and Mysql takes so long? > > On Friday 08 February 2002 08:56, Vincent Stoessel wrote: > > > Apples and oranges. > > Yeah, I know. But let's see if we can make some distinctions. > If, say, Google, can search 2 trillion web pages, averaging say 70k bytes > each, in 1 second, and Mysql can search 22 million records, with an index > on 40 bytes each, in 3 seconds (my experience) on a good day, > what's the order of magnitude difference? Roughly 10^9. > > > Have you seen the /hardware/ run that enterprise with? > Irrelevant, you're unlikely to get 9 orders of magnitude difference with > faster hardware or even with clustering.

Actually not, since Google runs on 1000s of machines. And if you search for a while you will notice that you will get different answers (at least I have had trouble with this). Also, I have used Google to track occurrences of certain words over time, and according to Google some words (with millions of occurrences) changed by about 40% in a week. So they simply provide approximate answers while MySQL has to provide exact answers.

So the standard answer with Apples and Oranges certainly applies here!

/David
Re: Distributed Fulltext?
On Friday 08 February 2002 06:14 pm, James Montebello wrote: > Distribution is how Google gets its speed. You say clustering won't > solve the problem, but distributing the indices across many processors > *is* going to gain you a huge speed increase through sheer parallelism.

True, but not enough. The BEST parallelism can do in a compute-bound application is divide the time by the number of processors. That's assuming a PERFECT routing system. (Correct me if I'm wrong here.) So to make the routing system + parallelism add up to a MILLION times better performance, you would need at least a MILLION processors. I doubt that even Google is doing that.

> Google uses thousands of processors to handle its index, and any given > search is going to be spread over 100s of processors.

Right, so we can expect Google to do, say, 10,000 times (10^4) better than Mysql at a Fulltext search. But in fact we're seeing a million, 10^6, being generous. It's that extra factor of a hundred (more likely a thousand, my estimates were very generous) that I'm getting all fussy about.

> Asking > a general purpose RDBMS to be really good at 10 different things is asking > a bit much. Fulltext searches are well down the list.

Here we get out of the realm of hard numbers and into opinions. But here's mine: if mysql bothers to support fulltext searches, it's presumably because there's some demand for them in some circumstances. The level of scaling and optimization that could reasonably be expected: what it takes to make the feature useful (i.e. perform similarly) in similar cases to other Mysql features. With a regular index, I can do a hash lookup on 23 million records subsecond. With a fulltext index (on a small field, only 40 characters) my time slips to 3 to 180 seconds. That extra little factor of 100 is my problem. Distributing over 4 processors wouldn't really help much. And because people don't always type a company name the exact same way, FULLTEXT really is the best way to do this.
So Monty et al, my request is this: please put on the wish list, enhancement of FULLTEXT search to approximately match the performance of an indexed search on 25 million records, on the same hardware and with other things held equal.

Steve Rapaport
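For reference, the 10^6 figure the thread keeps circling back to can be reproduced from the numbers quoted above (2 billion pages at ~70k bytes in ~1 second for Google, versus 22 million records with a 40-byte index in ~3 seconds for MySQL). A back-of-envelope sketch, using the thread's own rough estimates rather than any real measurements:

```python
import math

# Figures quoted in the thread (rough estimates, not measurements).
google_bytes_per_sec = 2e9 * 70e3 / 1.0   # 2 billion pages, ~70k each, ~1 s
mysql_bytes_per_sec = 22e6 * 40 / 3.0     # 22M records, 40-byte index, ~3 s

gap = google_bytes_per_sec / mysql_bytes_per_sec
print(f"throughput gap: ~10^{round(math.log10(gap))}")  # → throughput gap: ~10^6
```

Dividing out the ~10^4 one might credit to Google's parallel hardware leaves the factor of ~100 Steve is complaining about.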
Re: Distributed Fulltext?
> Last week on Slashdot there was an article where the CEO of Google mentioned he > uses DRAM (solid state disk arrays) rather than hard drives for the indexes and > arrays because of the magnitude of difference in speed they provide. > > There's your 10^6 difference in speed (or part of it). > > G. > Google Articles on Slashdot: > http://slashdot.org/article.pl?sid=02/02/03/1339216&mode=thread
Re: Distributed Fulltext?
For the slice servers, you simply assume that if one is lost, you lose X% of the data until it is revived, which is usually not even noticeable by the end user. For the aggregators, we had four behind a load-balancer. In practice, we had nearly zero downtime over a roughly 18 month period.

james montebello

On 7 Feb 2002, Amir Aliabadi wrote: > How do you make something like this fault tolerant? > The answer is probably what I suspect, 2 of everything. > How does the aggregator handle this, or are these machines in a cluster? > > We are thinking of how to rebuild our fulltext search. Currently it is > in MS SQL 7.0 - MySQL 4.0 seems to blow the doors off the cataloging > time as compared to MS SQL 7.0 or even 8.0. > > On Thu, 2002-02-07 at 15:19, James Montebello wrote: > > I did this at a previous job, and we split the data up more or less > > this way (we used a pre-existing item number for the split which was > > essentially random in relation to the text data), with an aggregator that > > did the query X ways, each to a separate box holding 1/X of the data. > > The results from each unit were paged and sorted, so all the aggregator > > did was a simple merge sort on a "page" of the set, which was fast. > > On a 6M record dataset, it produced millisecond-range search results. > > Not exactly Google-class, but pretty good for 12 Linux boxes, two > > programmers, and about six weeks of effort. > > > > james montebello > > > > On Thu, 7 Feb 2002, Brian Bray wrote: > > > It seems to me like the best solution that could be implemented as-is > > > would be to keep a random int column in your table (with a range of say > > > 1-100) and then have fulltext server 1 pseudo-replicate records with > > > the random number in the range of 1-10, server 2 11-20, server 3 > > > 21-30 and so on. > > > > > > Then run your query on all 10 servers and merge the result sets and > > > possibly re-sort them if you use the score column.
> > > > > > The problem with splitting the index up by word is that it messes up all > > > your scoring and ranking. For example, what if you search using 5 > > > keywords, all starting with letters from different groups? You're going > > > to get a pretty bad score for each match, and it could totally break > > > boolean searches. > > > > > > -- > > > Brian Bray > > > > > > Brian DeFeyter wrote: > > > > On Thu, 2002-02-07 at 15:40, Tod Harter wrote: > > > > [snip] > > > >>Wouldn't be too tough to write a little query routing system if you are using > > > >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy > > > >>server so it routes the query to several places and returns a single result > > > >>set. Ordering could be achieved as well. I'm sure there are commercial > > > >>packages out there as well. I don't see why the individual database servers > > > >>would need to do anything special. > > > > [snip] > > > > > > > > If I'm understanding you correctly, I think you're referring to routing > > > > based on the first character of the word. That would work for cases > > > > where the query is searching for a word that begins with a certain > > > > character.. however fulltext searches also return results with the term > > > > in the middle. > > > > > > > > ie: a search for 'foo' could return: > > > > foo.txt > > > > foobar > > > > but also could return: > > > > thisisfoo > > > > that_is_foolish > > > > > > > > I could be wrong, but it's my understanding that MySQL stores its > > > > fulltext index based on all the 'unique words' found. For such a system > > > > as you mentioned above, you'd probably have to create your own fulltext > > > > indexing system to determine: a) where to store the data 'segments' and > > > > b) how to route queries. It seems like this could probably be done much > > > > more efficiently inside of the server.
> > > > - Brian
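The slice/aggregator scheme James describes can be sketched in a few lines. The following is a hypothetical toy, not the system he built: each "slice server" finds and locally sorts its own page of hits, and the aggregator only has to merge-sort one page from each shard.

```python
import heapq

def search_shard(shard_rows, term, page_size=10):
    """Stand-in for one slice server: find hits and return a locally sorted page."""
    hits = [(row_id, text) for row_id, text in shard_rows if term in text]
    hits.sort()                  # each shard hands back an already-sorted page
    return hits[:page_size]

def aggregate(shards, term, page_size=10):
    """The aggregator: merge the pre-sorted per-shard pages into one global page."""
    pages = [search_shard(rows, term, page_size) for rows in shards]
    return list(heapq.merge(*pages))[:page_size]

# Three shards holding disjoint slices of the data (illustrative rows).
shards = [
    [(1, "foo.txt"), (4, "barbaz")],
    [(2, "foobar"), (5, "thisisfoo")],
    [(3, "nothing here")],
]
print(aggregate(shards, "foo"))  # → [(1, 'foo.txt'), (2, 'foobar'), (5, 'thisisfoo')]
```

The key point is the one James makes: because each shard returns sorted pages, the aggregator's merge is cheap regardless of how large each slice is.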
Re: Distributed Fulltext?
Yesterday, from Brian DeFeyter: > Has anyone made a suggestion or thought about ways to distribute > databases which focus on fulltext indexes? > > fulltext indexes do a good job of indexing a moderate amount of data, > but when you get a lot of data to be indexed, the queries slow down > significantly.

Yeah, this is because the method used to search the index is a geometric algorithm.

> I have an example table, with about 90 million rows.. and has a fulltext > index on a varchar(100) field. A single-word query which would return > approx 300k results takes an average of 15 seconds. A query with smaller > results (~ 10k) can be as quick as 1 sec.. which I would consider > acceptable.

The only interesting things are how many words have to be indexed (how many rows is not very important), how big your index grows (does it fit in memory), and how many rows can be found for one word. Those are the factors everything depends on.

> Has anyone thought about splitting the data into distributed files or even > machines? ie: something as simple as 'words' starting with 'X' are split > into a-h, i-p, q-z... or something more advanced? (maybe mysqld could > automatically split results based on (#results per unique 'word' / > desired # of 'split files/machines') Would such a system give any > advantages to searching speed and concurrent query scalability? I > haven't looked at the fulltext internals.. so I don't know if such > "query routing" could take place or not.

Hmm, I think it's *much* cheaper to come together and pay the mySQL people to introduce a new feature into mySQL called "inverted files". In short, with this method you only store each word and the records it can be found in. This can reduce the size of indexes, and so improve speed, dramatically.

> If nothing else, does anyone else have experience with a table of this > size or even larger? What kind of tuning have you done?

We have made for example an extra table for indexing on an extra server.
This is good, because:
- mySQL can "concentrate" on only this table, so the machine doesn't need to swap
- big searches or reindexing don't bother the rest of the system
- indexing can be done via a cron job
- we optimized the write routines; for example, we stripped all tags out and applied special stop-lists before writing.

BTW: My wishlist for fulltext indexing:
- inverted files
- rules to define words
- stop-word-lists and stop-word-regex
- a function which returns a table, which can tell me what can be searched. E.g.

  SHOW MATCH(table.index) AGAINST('hugobla hugo bla');

  word     count  rows  stopword  autostopword  score  wscore
  hugobla  2      2     no        no            10     1
  hugo     10     6     no        no            8      1
  bla      0      0     no        yes           0      1

  Words can be searched depending on rules, stop-words, and what has been indexed.
- configurable scoring, e.g. very fast (and simple) scoring for speed
- special scores for special words
- a cache which stores often-used words and their rows

-- SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris
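The inverted-file idea Alex is asking for can be illustrated in miniature: store, for each word, the set of rows it appears in, so a single-word query becomes a dictionary lookup instead of a scan. This is a toy sketch with made-up data; real implementations add word positions, compression, and scoring.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each whitespace-delimited word to the set of row ids containing it."""
    index = defaultdict(set)
    for row_id, text in rows:
        for word in text.lower().split():
            index[word].add(row_id)
    return index

rows = [(1, "hugo bla"), (2, "hugobla hugo"), (3, "something else")]
idx = build_inverted_index(rows)

# A single-word query is now one lookup, independent of the table size.
print(sorted(idx["hugo"]))  # → [1, 2]
```

Note that lookup cost here depends on the number of distinct words and the posting-list length, not the row count, which is exactly Alex's point about what the "interesting things" are.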
Re: Distributed Fulltext?
> On Friday 08 February 2002 08:56, Vincent Stoessel wrote: > >> Apples and oranges. > > Yeah, I know. But let's see if we can make some distinctions. > If, say, Google, can search 2 trillion web pages, averaging say 70k > bytes each, in 1 second, and Mysql can search 22 million records, with > an index on 40 bytes each, in 3 seconds (my experience) on a good day, > what's the order of magnitude difference? Roughly 10^9. > >> Have you seen the /hardware/ run that enterprise with? > Irrelevant, you're unlikely to get 9 orders of magnitude difference with > faster hardware or even with clustering.

Google runs on cheap pc hardware with local IDE disk... nothing fancy. The majority of the speed comes from "sharding" the data to distributed indexes which are optimized for what they do, and having a frontend that knows where to route those requests.

>> Also, their software is optimized for full text searches and that >> is /all/ they do. Mysql is an SQL database and is optimized as such. > Absolutely granted. You are completely right. > And I don't expect the data format to change. > > BUT: thought experiment: > > When Mysql decides to generate a FULLTEXT index, it is using an index > file, the .MYI file, which can have ANY FORMAT it wants to. If the .MYI > format is poorly optimized for fulltext searches, they can improve the > format. They can even introduce a new index file type .MYF solely for > optimizing fulltext searches.

Agreed... however I don't know if that's the case either. Anyone (developers?) know enough about the internals to comment on such an approach? It seems if you know your goal is fulltext speed.. things could probably be done better.

> Thinking out loud... > Steve
Re: Distributed Fulltext?
Also, I have to ask the question: Why is it that Altavista can index terabytes overnight and return a fulltext boolean for the WHOLE WEB within a second, and Mysql takes so long?

On Friday 08 February 2002 11:50, Steve Rapaport wrote: > I second the question. It could also reduce the size of the > fulltext index and the time taken to update it. > > -steve > > > On Thursday 07 February 2002 20:53, Brian wrote: > > > Has anyone made a suggestion or thought about ways to distribute > > > databases which focus on fulltext indexes? > > > > > > fulltext indexes do a good job of indexing a moderate amount of data, > > > but when you get a lot of data to be indexed, the queries slow down > > > significantly. > > > > > > I have an example table, with about 90 million rows.. and has a > > > fulltext index on a varchar(100) field. A single-word query which would > > > return approx 300k results takes an average of 15 seconds. A query with > > > smaller results (~ 10k) can be as quick as 1 sec.. which I would > > > consider acceptable. > > > > > > Has anyone thought about splitting the data into distributed files or even > > > machines? ie: something as simple as 'words' starting with 'X' are > > > split into a-h, i-p, q-z... or something more advanced? (maybe mysqld > > > could automatically split results based on (#results per unique 'word' > > > / desired # of 'split files/machines') Would such a system give any > > > advantages to searching speed and concurrent query scalability? I > > > haven't looked at the fulltext internals.. so I don't know if such > > > "query routing" could take place or not. > > > > > > If nothing else, does anyone else have experience with a table of this > > > size or even larger? What kind of tuning have you done?
Re: Distributed Fulltext?
On Thu, 2002-02-07 at 15:40, Tod Harter wrote: [snip] > Wouldn't be too tough to write a little query routing system if you are using > perl. Use DBD::Proxy on the web server side, and just hack the perl proxy > server so it routes the query to several places and returns a single result > set. Ordering could be achieved as well. I'm sure there are commercial > packages out there as well. I don't see why the individual database servers > would need to do anything special. [snip]

If I'm understanding you correctly, I think you're referring to routing based on the first character of the word. That would work for cases where the query is searching for a word that begins with a certain character.. however fulltext searches also return results with the term in the middle.

ie: a search for 'foo' could return: foo.txt foobar but also could return: thisisfoo that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its fulltext index based on all the 'unique words' found. For such a system as you mentioned above, you'd probably have to create your own fulltext indexing system to determine: a) where to store the data 'segments' and b) how to route queries. It seems like this could probably be done much more efficiently inside of the server.

- Brian
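Brian's objection can be made concrete with a toy example (the shard ranges here are hypothetical, taken from the a-h / i-p / q-z split mentioned in the thread): the indexed words matching 'foo' need not start with 'f', so routing the query by its first letter cannot pick a single destination shard.

```python
# Hypothetical first-letter shard ranges: 0 = a-h, 1 = i-p, 2 = q-z.
def route_by_first_char(word):
    c = word[0]
    return 0 if c <= "h" else (1 if c <= "p" else 2)

indexed_words = ["foo.txt", "foobar", "thisisfoo", "that_is_foolish"]
matches = [w for w in indexed_words if "foo" in w]   # all four contain 'foo'
shards_hit = {route_by_first_char(w) for w in matches}
print(shards_hit)  # → {0, 2}: matches for 'foo' live on more than one shard
```

So a query router based on the leading character would either miss the infix matches or have to fan the query out to every shard anyway, which is why Brian suggests the splitting logic belongs inside the server.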
Re: Distributed Fulltext?
I second the question. It could also reduce the size of the fulltext index and the time taken to update it.

-steve

> On Thursday 07 February 2002 20:53, Brian wrote: > > Has anyone made a suggestion or thought about ways to distribute > > databases which focus on fulltext indexes? > > > > fulltext indexes do a good job of indexing a moderate amount of data, > > but when you get a lot of data to be indexed, the queries slow down > > significantly. > > > > I have an example table, with about 90 million rows.. and has a fulltext > > index on a varchar(100) field. A single-word query which would return > > approx 300k results takes an average of 15 seconds. A query with smaller > > results (~ 10k) can be as quick as 1 sec.. which I would consider > > acceptable. > > > > Has anyone thought about splitting the data into distributed files or even > > machines? ie: something as simple as 'words' starting with 'X' are split > > into a-h, i-p, q-z... or something more advanced? (maybe mysqld could > > automatically split results based on (#results per unique 'word' / > > desired # of 'split files/machines') Would such a system give any > > advantages to searching speed and concurrent query scalability? I > > haven't looked at the fulltext internals.. so I don't know if such > > "query routing" could take place or not. > > > > If nothing else, does anyone else have experience with a table of this > > size or even larger? What kind of tuning have you done?
Re: Distributed Fulltext?
Ooops, factual error: > > If, say, Google, can search 2 trillion web pages, averaging say 70k > > bytes each, in 1 second, and Mysql can search 22 million records, with > > an index on 40 bytes each, in 3 seconds (my experience) on a good day, > > what's the order of magnitude difference? Roughly 10^9.

Google has only 2 Billion with a B web pages. So the order of magnitude difference is only 10^6. But that still leaves a mysql Fulltext search a MILLION times slower than Google. Someone correctly pointed out today that it's not Mysql's job to be Google, and I agree. But it seems to me that it would be fair for mysql to be able to handle searches in under 1 second for databases 1 millionth the size of Google. All I want here is a decent flexible phone book lookup, not a search engine.

-steve
Re: Distributed Fulltext?
Yesterday, from Brian DeFeyter:

> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.

Yeah, this is because the method used to search the index is a geometric algorithm.

> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.

The only interesting things are how many words have to be indexed (how many rows is not very important), how big your index grows (does it fit in memory), and how many rows can be found for one word. These are the main determining factors.

> Has anyone thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines')) Would such a system give any
> advantages to searching speed and concurrent query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.

Hum, I think it's *much* cheaper to come together and pay the MySQL people to introduce a new feature into MySQL called "inverted files". In short, with this method you store only each word and the records in which it can be found. This can reduce the size of the indexes, and so improve speed, dramatically.

> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?

We have, for example, made an extra table for indexing on an extra server.
This is good because:

- MySQL can "concentrate" on this table alone, so the machine doesn't need to swap
- big searches or a reindex don't bother the rest of the system
- indexing can be done via a cron job
- we optimized the write routines; for example, we strip all tags out and apply special stop-lists before we write

BTW: My wishlist for fulltext indexing:
---
- inverted files
- rules to define words
- stop-word lists and stop-word regexes
- a function which returns a table telling me what can be searched, e.g.

  SHOW MATCH(table.index) AGAINST('hugobla hugo bla');

  word     count  rows  stopword  autostopword  score  wscore
  hugobla  2      2     no        no            10     1
  hugo     10     6     no        no            8      1
  bla      0      0     no        yes           0      1

  Words can be searched depending on rules, stop-words, and what has been indexed.
- configurable scoring, e.g. very fast (and simple) scoring for speed
- special scores for special words
- a cache which stores often-used words and their rows

--
SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris
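The "inverted files" idea at the top of the wishlist is simple to sketch: store each word once, mapped to the records that contain it, with stop-words dropped at index time. A toy illustration (the names and stop-list are invented; this says nothing about MySQL's actual index layout):

```python
from collections import defaultdict

# Invented stop-list for the sketch.
STOP_WORDS = {"the", "a", "is"}

def build_inverted_index(rows):
    # Map each non-stop word to the set of row ids containing it.
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(row_id)
    return index

rows = {1: "the quick brown fox", 2: "a quick search", 3: "brown bread"}
index = build_inverted_index(rows)
print(sorted(index["quick"]))  # -> [1, 2]
print("the" in index)          # -> False (stop-word never indexed)
```

A lookup is then a single dictionary probe per query word, independent of how many rows the table holds — which is the size and speed win the post describes.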
Re: Distributed Fulltext?
I said:

> > Why is it that Altavista can index terabytes overnight and return
> > a fulltext boolean for the WHOLE WEB
> > within a second, and MySQL takes so long?

On Friday 08 February 2002 08:56, Vincent Stoessel wrote:
> Apples and oranges.

Yeah, I know. But let's see if we can make some distinctions. If, say, Google can search 2 trillion web pages, averaging say 70k bytes each, in 1 second, and MySQL can search 22 million records, with an index on 40 bytes each, in 3 seconds (my experience) on a good day, what's the order of magnitude difference? Roughly 10^9.

> Have you seen the /hardware/ run that enterprise with?

Irrelevant; you're unlikely to get 9 orders of magnitude difference with faster hardware, or even with clustering.

> Also, their software is optimized for full text searches and that
> is /all/ they do. MySQL is an SQL database and is optimized as such.

Absolutely granted. You are completely right. And I don't expect the data format to change. BUT, a thought experiment: when MySQL decides to generate a FULLTEXT index, it is using an index file, the .MYI file, which can have ANY FORMAT it wants to. If the .MYI format is poorly optimized for fulltext searches, they can improve the format. They can even introduce a new index file type, .MYF, solely for optimizing fulltext searches. None of this need have any impact on the data file format or the SQL search optimizations, and yet it could still improve the search speed for fulltext. It might not help as much for the slow indexing, but it could certainly improve the performance of the search.

Thinking out loud...

Steve
Re: Distributed Fulltext?
> Why is it that Altavista can index terabytes overnight and return
> a fulltext boolean for the WHOLE WEB
> within a second, and MySQL takes so long?

I don't know about Altavista, but if you read up on Google, they do indeed do some sort of spreading of keywords across multiple machines - last I saw they had over 10,000 machines. But you have to ask the question as to whether this would be an appropriate development for MySQL. This would seem to me to be a very complex development of interest to a very small subset of MySQL customers.

MySQL, it seems to me, specialises in doing the mainstream database manipulations used by many people simply, quickly, and (my most important criterion) reliably. Such a development would, IMO, be away from the whole trend of MySQL, and shouldn't happen before triggers, stored procedures and the like - if ever. A multi-machine distributed database is a wholly different animal - IMO again. If I was going there, I wouldn't start from here - it would need a "clean brain" look at the whole problem. If you really need such searches, Google survives by licensing that technology - speak to them.

Alec Cawley
Re: Distributed Fulltext?
Also, I have to ask the question: Why is it that Altavista can index terabytes overnight and return a fulltext boolean for the WHOLE WEB within a second, and MySQL takes so long?

On Friday 08 February 2002 11:50, Steve Rapaport wrote:
> I second the question. It could also reduce the size of the
> fulltext index and the time taken to update it.
>
> -steve
>
> > On Thursday 07 February 2002 20:53, Brian wrote:
> > > Has anyone made a suggestion or thought about ways to distribute
> > > databases which focus on fulltext indexes?
> > >
> > > fulltext indexes do a good job of indexing a moderate amount of data,
> > > but when you get a lot of data to be indexed, the queries slow down
> > > significantly.
> > >
> > > I have an example table, with about 90 million rows.. and has a
> > > fulltext index on a varchar(100) field. A single-word query which would
> > > return approx 300k results takes an average of 15 seconds. A query with
> > > smaller results (~ 10k) can be as quick as 1 sec.. which I would
> > > consider acceptable.
> > >
> > > Has anyone thought about splitting the data into distributed files or even
> > > machines? ie: something as simple as 'words' starting with 'X' are
> > > split into a-h, i-p, q-z... or something more advanced? (maybe mysqld
> > > could automatically split results based on (#results per unique 'word'
> > > / desired # of 'split files/machines')) Would such a system give any
> > > advantages to searching speed and concurrent query scalability? I
> > > haven't looked at the fulltext internals.. so I don't know if such
> > > "query routing" could take place or not.
> > >
> > > If nothing else, does anyone else have experience with a table of this
> > > size or even larger? What kind of tuning have you done?
Re: Distributed Fulltext?
How do you make something like this fault tolerant? The answer is probably what I suspect: two of everything. How does the aggregator handle this, or are these machines in a cluster?

We are thinking of how to rebuild our fulltext search. Currently it is in MS SQL 7.0 - MySQL 4.0 seems to blow the doors off the cataloging time as compared to MS SQL 7.0 or even 8.0.

On Thu, 2002-02-07 at 15:19, James Montebello wrote:
> I did this at a previous job, and we split the data up more or less
> this way (we used a pre-existing item number for the split which was
> essentially random in relation to the text data), with an aggregator that
> did the query X ways, each to a separate box holding 1/X of the data.
> The results from each unit were paged and sorted, so all the aggregator
> did was do a simple merge sort on a "page" of the set, which was fast.
> On a 6M record dataset, it produced millisecond-range search results.
> Not exactly Google-class, but pretty good for 12 Linux boxes, two
> programmers, and about six weeks of effort.
>
> james montebello
>
> On Thu, 7 Feb 2002, Brian Bray wrote:
> > It seems to me like the best solution that could be implemented as-is
> > would be to keep a random int column in your table (with a range of say
> > 1-100) and then have fulltext server 1 pseudo-replicate records with
> > the random number in the range of 1-10, server 2 11-20, server 3
> > 21-30, and so on.
> >
> > Then run your query on all 10 servers and merge the result sets and
> > possibly re-sort them if you use the score column.
> >
> > The problem with splitting the index up by word is that it messes up all
> > your scoring and ranking. For example, what if you search using 5
> > keywords, all starting with letters from different groups? You're going
> > to get a pretty bad score for each match, and it could totally break
> > boolean searches.
Re: Distributed Fulltext?
I did this at a previous job, and we split the data up more or less this way (we used a pre-existing item number for the split, which was essentially random in relation to the text data), with an aggregator that did the query X ways, each to a separate box holding 1/X of the data. The results from each unit were paged and sorted, so all the aggregator did was a simple merge sort on a "page" of the set, which was fast. On a 6M record dataset, it produced millisecond-range search results. Not exactly Google-class, but pretty good for 12 Linux boxes, two programmers, and about six weeks of effort.

james montebello

On Thu, 7 Feb 2002, Brian Bray wrote:
> It seems to me like the best solution that could be implemented as-is
> would be to keep a random int column in your table (with a range of say
> 1-100) and then have fulltext server 1 pseudo-replicate records with
> the random number in the range of 1-10, server 2 11-20, server 3
> 21-30, and so on.
>
> Then run your query on all 10 servers and merge the result sets and
> possibly re-sort them if you use the score column.
>
> The problem with splitting the index up by word is that it messes up all
> your scoring and ranking. For example, what if you search using 5
> keywords, all starting with letters from different groups? You're going
> to get a pretty bad score for each match, and it could totally break
> boolean searches.
>
> --
> Brian Bray
>
> Brian DeFeyter wrote:
> > On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> > [snip]
> >
> >> Wouldn't be too tough to write a little query routing system if you are using
> >> perl. Use DBD::Proxy on the web server side, and just hack the perl proxy
> >> server so it routes the query to several places and returns a single result
> >> set. Ordering could be achieved as well. I'm sure there are commercial
> >> packages out there as well. I don't see why the individual database servers
> >> would need to do anything special.
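The aggregator James describes can be sketched in a few lines: each shard returns one already-sorted "page" of (score, id) hits, and the front end lazily merge-sorts them and keeps only one page of output. This is an illustrative reconstruction, not the original code (which was presumably Perl), and the shard contents are made up:

```python
import heapq
from itertools import islice

def merge_pages(pages, page_size):
    # Each input page is already sorted by descending score, so a k-way
    # merge yields the globally sorted order; take one page of it.
    merged = heapq.merge(*pages, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, page_size))

# Two hypothetical shards, each holding 1/X of the data:
shard_a = [(0.9, "doc3"), (0.5, "doc1")]
shard_b = [(0.8, "doc7"), (0.2, "doc9")]
print(merge_pages([shard_a, shard_b], 3))
# -> [(0.9, 'doc3'), (0.8, 'doc7'), (0.5, 'doc1')]
```

The merge touches only one page per shard per request, which is why the aggregator stays cheap even when the shards hold millions of records.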
Re: Distributed Fulltext?
It seems to me like the best solution that could be implemented as-is would be to keep a random int column in your table (with a range of say 1-100) and then have fulltext server 1 pseudo-replicate records with the random number in the range of 1-10, server 2 11-20, server 3 21-30, and so on.

Then run your query on all 10 servers and merge the result sets, possibly re-sorting them if you use the score column.

The problem with splitting the index up by word is that it messes up all your scoring and ranking. For example, what if you search using 5 keywords, all starting with letters from different groups? You're going to get a pretty bad score for each match, and it could totally break boolean searches.

--
Brian Bray

Brian DeFeyter wrote:
> On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> [snip]
>
>> Wouldn't be too tough to write a little query routing system if you are using
>> perl. Use DBD::Proxy on the web server side, and just hack the perl proxy
>> server so it routes the query to several places and returns a single result
>> set. Ordering could be achieved as well. I'm sure there are commercial
>> packages out there as well. I don't see why the individual database servers
>> would need to do anything special.
>>
> [snip]
>
> If I'm understanding you correctly, I think you're referring to routing
> based on the first character of the word. That would work for cases
> where the query is searching for a word that begins with a certain
> character.. however fulltext searches also return results with the term
> in the middle.
>
> ie: a search for 'foo' could return:
> foo.txt
> foobar
>
> but also could return:
> thisisfoo
> that_is_foolish
>
> I could be wrong, but it's my understanding that MySQL stores its
> fulltext index based on all the 'unique words' found. For such a system
> as you mentioned above, you'd probably have to create your own fulltext
> indexing system to determine: a) where to store the data 'segments' and
> b) how to route queries.
> It seems like this could probably be done much
> more efficiently inside of the server.
>
> - Brian
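Brian's random-column scheme is easy to mock up: each row carries a random 1-100 value, each of ten servers holds one ten-value band, and the client fans the query out to every server and re-sorts the merged hits by score. A sketch under those assumptions (toy in-memory "servers", not real replication):

```python
import random

NUM_SERVERS = 10

def server_for(bucket):
    # Map a row's random 1-100 value to its server: 1-10 -> 0, 11-20 -> 1, ...
    return (bucket - 1) // 10

def shard_rows(rows):
    shards = [[] for _ in range(NUM_SERVERS)]
    for row in rows:
        shards[server_for(row["bucket"])].append(row)
    return shards

def search_all(shards, word):
    # Query every server, merge the result sets, re-sort by score column.
    hits = [r for shard in shards for r in shard if word in r["text"].split()]
    return sorted(hits, key=lambda r: r["score"], reverse=True)

rows = [{"bucket": random.randint(1, 100), "text": "foo bar", "score": i}
        for i in range(5)]
shards = shard_rows(rows)
print([r["score"] for r in search_all(shards, "foo")])  # -> [4, 3, 2, 1, 0]
```

Because rows are assigned to bands at random rather than by word, every server sees a representative sample of the text, so per-shard scores stay comparable — which is exactly the property the by-word split loses.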
Re: Distributed Fulltext?
I second the question. It could also reduce the size of the fulltext index and the time taken to update it.

-steve

> On Thursday 07 February 2002 20:53, Brian wrote:
> > Has anyone made a suggestion or thought about ways to distribute
> > databases which focus on fulltext indexes?
> >
> > fulltext indexes do a good job of indexing a moderate amount of data,
> > but when you get a lot of data to be indexed, the queries slow down
> > significantly.
> >
> > I have an example table, with about 90 million rows.. and has a fulltext
> > index on a varchar(100) field. A single-word query which would return
> > approx 300k results takes an average of 15 seconds. A query with smaller
> > results (~ 10k) can be as quick as 1 sec.. which I would consider
> > acceptable.
> >
> > Has anyone thought about splitting the data into distributed files or even
> > machines? ie: something as simple as 'words' starting with 'X' are split
> > into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> > automatically split results based on (#results per unique 'word' /
> > desired # of 'split files/machines')) Would such a system give any
> > advantages to searching speed and concurrent query scalability? I
> > haven't looked at the fulltext internals.. so I don't know if such
> > "query routing" could take place or not.
> >
> > If nothing else, does anyone else have experience with a table of this
> > size or even larger? What kind of tuning have you done?
Re: Distributed Fulltext?
On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]
> Wouldn't be too tough to write a little query routing system if you are using
> perl. Use DBD::Proxy on the web server side, and just hack the perl proxy
> server so it routes the query to several places and returns a single result
> set. Ordering could be achieved as well. I'm sure there are commercial
> packages out there as well. I don't see why the individual database servers
> would need to do anything special.
[snip]

If I'm understanding you correctly, I think you're referring to routing based on the first character of the word. That would work for cases where the query is searching for a word that begins with a certain character.. however fulltext searches also return results with the term in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its fulltext index based on all the 'unique words' found. For such a system as you mentioned above, you'd probably have to create your own fulltext indexing system to determine: a) where to store the data 'segments' and b) how to route queries. It seems like this could probably be done much more efficiently inside of the server.

- Brian
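The objection above can be seen with a toy routing table: file each indexed term under its first character, and a query for 'foo' that only visits the 'f' shard misses the mid-string matches. The data is Brian's own example; the routing function is invented for illustration:

```python
# Route each indexed term to a shard keyed by its first character.
shards = {}
for term in ["foo.txt", "foobar", "thisisfoo", "that_is_foolish"]:
    shards.setdefault(term[0], []).append(term)

# Visiting only the 'f' shard finds the prefix matches...
print([t for t in shards["f"] if "foo" in t])  # -> ['foo.txt', 'foobar']
# ...but the mid-string matches live on the 't' shard and would be missed.
print([t for t in shards["t"] if "foo" in t])  # -> ['thisisfoo', 'that_is_foolish']
```

So a first-character router would have to query every shard anyway for this kind of match, which defeats the point of the split.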
Re: Distributed Fulltext?
On Thursday 07 February 2002 14:53, Brian DeFeyter wrote:
> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.
>
> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.
>
> Has anyone thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines')) Would such a system give any
> advantages to searching speed and concurrent query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.
>
> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?
>
> Thanks,
>
> - Brian

Wouldn't be too tough to write a little query routing system if you are using perl. Use DBD::Proxy on the web server side, and just hack the perl proxy server so it routes the query to several places and returns a single result set. Ordering could be achieved as well. I'm sure there are commercial packages out there as well. I don't see why the individual database servers would need to do anything special.
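The proxy-routing idea can be sketched with plain functions standing in for the per-server connections (the backend setup is invented; a real router would issue the same SQL to each server over something like DBD::Proxy):

```python
from concurrent.futures import ThreadPoolExecutor

def make_backend(rows):
    # Stand-in for one database server holding part of the data.
    def query(word):
        return [r for r in rows if word in r.split()]
    return query

backends = [make_backend(["foo one", "bar two"]),
            make_backend(["foo three"])]

def routed_query(word):
    # Fan the query out to every backend in parallel, then return one
    # merged, ordered result set - the routing layer does all the work.
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        results = pool.map(lambda q: q(word), backends)
    return sorted(r for rs in results for r in rs)

print(routed_query("foo"))  # -> ['foo one', 'foo three']
```

As Tod notes, the individual servers need nothing special: each just answers its normal query, and merging and ordering happen in the router.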