Re: Distributed Fulltext?

Alex Aulbach Fri, 08 Feb 2002 02:10:54 -0800

Yesterday, from Brian DeFeyter:

> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.


Yea, this is cause the method to search the index is a geometric
algorithm.

> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.

The only interesting thing is, how many words have to be indexed (how many
rows is not very important), how big grows your index (does it go in the
memory) and how many rows can be found for one word.

These are the most depending things.

> Has any thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines') Would such a system give any
> advantages to searching speed and concurrenct query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.

Hum, I think it's *much* cheaper is to come together and pay the mySQL
people to introduce a new feature into mySQL called "inverted files". This
method is in short, that you only store the word and in which records it
can be found. This can redurce the size of indexes and so the speed
dramatically.

> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?

We have made for example an extra table for indexing on an extra server.
This is good, cause

- mySQL can "concentrate" only for this table, so the
machine dosn't need to swap

- big searches or reindex dosn't bother the rest of the system

- indexing can be done via a cron-job

- we optimized the write routines, for example we stripped all tags out
and wrote special stop-lists, before we write it.



BTW: My wishlist for fulltext indexing:
---------------------------------------

- inverted files

- rules to define words

- stop-word-lists and stop-word-regex

- a function which returns a table, which can tell me, what can be
searched.
E.g SHOW MATCH(table.index) AGAINST('hugobla hugo bla');

 word      count     rows     stopword   autostopword   score wscore
 hugobla      2        2        no         no             10      1
 hugo        10        6        no         no              8      1
 bla          0        0        no         yes             0      1

Words can be searched, depending on rules, stop-words and what has been
indexed.

- configurable scoring, e.g. very fast (and simple) scoring for speed

- special scores for special words

- a cache, which stores often used words and the rows


-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris


---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Re: Distributed Fulltext?

Reply via email to