Re: Distributed Fulltext?

Mike Wexler Wed, 13 Feb 2002 14:16:11 -0800

Brian DeFeyter wrote:
> I sorta like that idea. I don't know exactly what you can and can't do
> as far as indexing inside of HEAP tables.. but the index size would
> likely differ from the written index. Then you can expand the idea and
> use the X/(num slices) on (num slices) boxes technique.. sending the
> query to each, and compiling all of the results.
> 
> Your comment about only having a few common words that are searched
> makes me wonder if a reverse-stopword function would be valuable (ie:
> only index words in a list you generate). It probably could be done other
> ways too with what's available. Maybe a pseudo-bitmap index
> inside a mysql table?

I don't think that would be appropriate. For example, our site (tias.com) has 
lots of antiques and collectibles. One popular category is jewelry. If 
somebody searches for "gold jewelry" and the search engine interprets this 
as anything that mentions gold or jewelry, it is going to match a lot of items. 
It would be nice if we could use EXPLAIN or something like it to get a rough 
estimate of how many results a query would generate, and if it was really bad, 
we could tell the user to be more specific.
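
A rough sketch of that kind of pre-check, in Python. It assumes the application 
keeps its own per-word document counts (the doc_freq dict below is hypothetical; 
it is not something MySQL exposes), and it uses the sum of those counts as an 
upper bound for an "any of these words" match. The function names and the 
threshold are illustrative only.

```python
def estimate_or_matches(terms, doc_freq, total_docs):
    """Upper bound on the number of rows matching ANY of the terms.

    doc_freq is a hypothetical word -> row-count mapping maintained by
    the application. A row containing several terms is counted once per
    term, so the sum overestimates; it is capped at total_docs.
    """
    total = sum(doc_freq.get(term.lower(), 0) for term in terms)
    return min(total, total_docs)


def should_ask_to_narrow(query, doc_freq, total_docs, max_fraction=0.25):
    """True if the query would probably match too much of the table."""
    estimate = estimate_or_matches(query.split(), doc_freq, total_docs)
    return estimate > max_fraction * total_docs
```

With counts like gold=4000 and jewelry=5000 out of 10000 rows, a search for 
"gold jewelry" is estimated at up to 9000 rows and would be flagged, while a 
rare word on its own would pass.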


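The slice-and-merge approach discussed in this thread (each server holds a 
slice of the rows, the same query goes to every slice, and the results are 
merged and re-sorted by score) could look roughly like this in Python. Each 
per-slice result is assumed to be a list of (score, row_id) tuples already 
sorted by descending score; how those lists are fetched from the servers is 
left out, and all names here are hypothetical.

```python
import heapq
from itertools import islice


def merge_slice_results(per_slice_results, limit=10):
    """Merge per-slice hit lists into one list ordered by descending score.

    per_slice_results: iterable of lists of (score, row_id) tuples, each
    list already sorted with the highest score first.
    """
    merged = heapq.merge(
        *per_slice_results,
        key=lambda hit: hit[0],  # compare hits by score only
        reverse=True,            # inputs and output are highest-first
    )
    # Only the first `limit` hits are pulled from the merged stream.
    return list(islice(merged, limit))
```

Scores computed independently on each slice are only roughly comparable. 
Assigning rows to slices at random is what makes that approximation plausible, 
since each slice should then see similar word statistics.
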
> 
>  - bdf
> 
> On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:
> 
>>My understanding is that part of how Google and AltaVista get such high speeds 
>>is to keep everything in memory. Is it possible to create a HEAP table with a 
>>full text index? If so, does the full text index take advantage of being in 
>>memory? For example, I would imagine that if you were keeping the whole index in 
>>memory, details like the index page size, and the format of the pointers/record 
>>numbers would be different.
>>
>>Then you could just do something roughly like (I know the syntax is a little off)
>>
>>CREATE TABLE fooFast TYPE=HEAP SELECT * FROM fooSlow;
>>ALTER TABLE fooFast ADD FULLTEXT (a, b, c);
>>
>>Or maybe you could just have fooSlow on one server. And then have it replicated 
>>on N other servers. But on the other servers you could alter the table type so 
>>it was a heap table. So you would have one persistent table and a bunch of 
>>replicated heap tables. And all the searches could go against the heap tables.
>>
>>
>>Brian Bray wrote:
>>
>>>It seems to me like the best solution that could be implemented as-is 
>>>would be to keep a random int column in your table (with a range of say 
>>>1-100) and then have fulltext server 1 pseudo-replicate records with 
>>>a random number in the range of 1-10, server 2 11-20, server 3 
>>>21-30, and so on.
>>>
>>>Then run your query on all 10 servers and merge the result sets and 
>>>possibly re-sort them if you use the score column.
>>>
>>>The problem with splitting the index up by word is that it messes up all 
>>>your scoring and ranking.  For example, what if you search using 5 
>>>keywords, all starting with letters from different groups?  You're going 
>>>to get a pretty bad score for each match, and it could totally break 
>>>boolean searches.
>>>
>>>--
>>>Brian Bray
>>>
>>>
>>>
>>>
>>>Brian DeFeyter wrote:
>>>
>>>
>>>>On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
>>>>[snip]
>>>>
>>>>
>>>>
>>>>>Wouldn't be too tough to write a little query routing system if you are using 
>>>>>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
>>>>>server so it routes the query to several places and returns a single result 
>>>>>set. Ordering could be achieved as well. I'm sure there are commercial 
>>>>>packages out there as well. I don't see why the individual database servers 
>>>>>would need to do anything special.
>>>>>
>>>>>
>>>>>
>>>>[snip]
>>>>
>>>>If I'm understanding you correctly, I think you're referring to routing
>>>>based on the first character of the word. That would work for cases
>>>>where the query is searching for a word that begins with a certain
>>>>character.. however fulltext searches also return results with the term
>>>>in the middle.
>>>>
>>>>ie: a search for 'foo' could return:
>>>>foo.txt
>>>>foobar
>>>>
>>>>but also could return:
>>>>thisisfoo
>>>>that_is_foolish
>>>>
>>>>I could be wrong, but it's my understanding that MySQL stores its
>>>>fulltext index based on all the 'unique words' found. For such a system
>>>>as you mentioned above, you'd probably have to create your own fulltext
>>>>indexing system to determine: a) where to store the data 'segments' and
>>>>b) how to route queries.  It seems like this could probably be done much
>>>>more efficiently inside of the server.
>>>>
>>>>- Brian
>>>>
>>>>
>>>>
>>>>---------------------------------------------------------------------
>>>>Before posting, please check:
>>>>  http://www.mysql.com/manual.php   (the manual)
>>>>  http://lists.mysql.com/           (the list archive)
>>>>
>>>>To request this thread, e-mail <[EMAIL PROTECTED]>
>>>>To unsubscribe, e-mail <[EMAIL PROTECTED]>
>>>>Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 


