Re: Distributed Fulltext?

Alex Aulbach Thu, 14 Feb 2002 16:16:22 -0800


Hi, I also will explain how we made FTS "fast". (sorry for my bad english)


First some DATA: The table which has to be indexed has ~600000 entries.
There are articles inside it, which are in average 3-4 kb each (which says
nothing!) with about 300 words each (this number is very important!).

This makes theoretically an index of 180 Million. But we have stopwords
and only about 1 Mio different words. So the index for this relation

                  words      <----------->     article

is about 25-30 Mio recors/relations.

----> This is the only important thing: How big is the index?
Does it go completly into memory?


Cause of other expierences with FTS in mySQL we decided to put the FTS
machine itself on an extra DB-Server. This machine has 1 GHz, 4 GB Ram,
Raid 5.

The main application has a centralistic write module. Every article has to
go throug these routines. So it was easy to write the article not only
into the article database, but also into the search database. Before we do
this, we strip away html and other things to reduce data.

Some things which makes the engine "slow":

* Words, that match very often. This is, cause mysql has to find all
records and FTS has to calculate how good the match is. That is very time
consuming. So a SELECT * FROM index is very bad. Better is a SELECT index
FROM fts. The selected amount of data is much lower.

* at startup the machine is very slow and if there are much updates. But
it's clear why, I think I don't need to explain it.

* too much requests (>10 per seconds, but this depends very much on what
is searched). Think this is also clear.


BTW: we found out a trick for "AND"-searches (by default FTS
searches with OR!)

1. Count matches for each word (SELECT COUNT(*) FROM fts WHERE
MATCH(ftsindex) AGAINST (word1))
2. Sort by found matches ascending.
3. SELECT * FROM fts WHERE MATCH(ftsindex) AGAINS(word with lowest number
of matches) AND MATCH(ftsindex) AGAINST(word with second lowest number of
matches) AND ....

This speeds up the query about 30-50% against the same query, when the
words are sorted descending! It is obvious: If the first set is the
smallest set mysql has to compare much less possibilities.



So let's summarize:

- Split the searching facilities from your application!

  This is also very important not to block your application too much, when
  the ft-index must be generated (e.g. cause machine cpu bites into dust)

- depending on the size of index:
  1. Use an own table
  2. put more RAM into the machine
  3. Use an own database on another machine
  4. Put more RAM into the machine
  5. Split you DB into several DB's on different machines.
  6. Why don't you by a real google SE?

  Positive side effect of using an own DB: You are able to index when
  there is time to, e.g. when the load is low. Asynchronous Searching
  facilities can be a good idea in some reasons.

- write only those data into FTS-Table, which should be found. This
  reduces indexing and search time and amount of data.

- put also other searching criterias into this table, if it is possible.
  Try to avoid joins!

- There exists some optimization tricks to speed up search.

- Caching can also be a very good idea.


The biggest searching times takes about 2 seconds with this method.



-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris


---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Re: Distributed Fulltext?

Reply via email to