If you want to build a LARGE search engine (millions of documents, gigabytes
of data), you'd better know something about search engine algorithms :)

I'll give you one hint: a database is not a search engine. A database is not
meant to be a search engine. And, until some decent text-processing/searching
functions are built in, MySQL is not a search engine. The trick of indexing
words and using LIKE '%...%' will only get you so far, and "LARGE" hardly
describes how far that is.

(Which doesn't mean MySQL doesn't do a fine job of searching through a
limited number of small documents, but it does mean it's still a database and
not a Google backend.)
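
To make the two approaches mentioned above concrete, here is a rough sketch in
plain Python (not MySQL). The names docs, tokenize, scan_search and build_index
and the sample documents are made up for illustration. It only shows the
difference between scanning every document for a substring and looking a word
up in a word index; it says nothing about how either behaves at real scale.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def scan_search(docs, needle):
    """Substring scan over every document, roughly what LIKE '%needle%' does."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if needle in text.lower())

def build_index(docs):
    """Map each word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

# Hypothetical sample data.
docs = {
    1: "The sun is a star",
    2: "Sunday papers",
    3: "A star chart",
}
index = build_index(docs)

scan_hits = scan_search(docs, "sun")  # reads every document, also hits "Sunday"
index_hits = sorted(index["sun"])     # one lookup, whole words only
```

Note that the two do not even return the same answer: the scan matches
substrings, the index matches whole words.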

You should, if only to get an idea of what's involved in search engines,
take a peek at the following URLs:

http://citeseer.nj.nec.com/moffat94compression.html
(There are a lot of links to similar, very educational articles here too.)

http://www.ping.be/~ping0658/avrank.html#zipf

http://www.swtech.com/server/websvr/wsindex/

Regards,

Sander

> -----Original Message-----
> From: Cedric Veilleux [mailto:[EMAIL PROTECTED]]
> Sent: 23 March 2001 04:30
> To: [EMAIL PROTECTED]
> Subject: Large search engine
>
>
> Hi,
>
>   I am planning a very large search engine. I've spent some time reading
> the archive and I found some suggestions on how to do this. The word
> indexing method is a very interesting alternative to slow "...where like
> '%foo%';" queries.
>
>   There are from 100k to 500k documents to index, each about 10
> KBytes large. Plain text.
>
>   The search engine will allow complex boolean queries (AND, OR, NEAR,
> NOT).
>
>   I have 2 plans in mind, and I'd like to have opinions on what would be
> the most efficient.
>
> First Way:
> We populate 2 tables:
> one containing the documents (text and ID)
> one containing all the words and the documents ID containing each word
>
> The idea is to first filter the documents and then to perform a query in
> the documents that contain at least one of the words, so we're supposed
> to get a decent speed.
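
As a rough illustration of the filtering step: if the word table is treated as
a mapping from each word to the set of document IDs containing it, then AND, OR
and NOT over whole words reduce to set operations, with no LIKE involved. This
is only a sketch in Python; the postings data and the docs_with helper are
hypothetical, not part of any MySQL API.

```python
# Hypothetical contents of the word table: word -> set of document IDs.
postings = {
    "sun":  {32, 45, 1302},
    "star": {45, 77, 1302},
    "moon": {45, 90},
}

def docs_with(word):
    """Return the set of document IDs for a word (empty if the word is unknown)."""
    return postings.get(word, set())

sun_and_star = docs_with("sun") & docs_with("star")  # sun AND star
sun_or_moon = docs_with("sun") | docs_with("moon")   # sun OR moon
sun_not_moon = docs_with("sun") - docs_with("moon")  # sun NOT moon
```
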
>
> I know this is used by many people and I know it gives good results,
> even when searching through 100k+ documents. However, I am wondering if
> there is a way to do it without any use of LIKE statements. I really
> don't know if what I have in mind is a good idea; it may be completely
> stupid and inefficient, as I have very little DB experience.
>
> Anyway, what if, in the table containing the words and the matching
> document IDs, we also specify where in each document the word is
> located?
>
> ex:
> WORD|   DOCS     |   LOCATIONS
> sun | 32;45;1302 | 3 ; 554,1022 ; 76,675,3445
>
> So the word sun is the third word of doc 32, the 554th and 1022nd word of
> doc 45, etc.
>
> Then the search script will do the whole job without sending any other
> queries. It may get quite complicated, but it should work, and it may also
> be easier to process sun NEAR star (maybe this is easy to do with LIKE too,
> I don't know, but I saw nothing in the docs.)
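
A minimal sketch of how sun NEAR star could be answered from such a table,
assuming the positions have already been parsed into Python lists. The "sun"
entry is the example above; the "star" entry, the positions name and the near
function are invented for illustration.

```python
# word -> {doc_id: sorted word positions}.
# "sun" is the example from the message; "star" is made up.
positions = {
    "sun":  {32: [3], 45: [554, 1022], 1302: [76, 675, 3445]},
    "star": {45: [560, 2000], 1302: [10], 99: [1]},
}

def near(word_a, word_b, max_distance):
    """Return sorted IDs of documents in which the two words occur
    within max_distance words of each other."""
    a_docs = positions.get(word_a, {})
    b_docs = positions.get(word_b, {})
    hits = []
    # Only documents containing both words can match.
    for doc_id in a_docs.keys() & b_docs.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in a_docs[doc_id]
               for pb in b_docs[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

close_hits = near("sun", "star", 10)
loose_hits = near("sun", "star", 100)
```

The inner comparison checks every pair of positions, which is fine for short
lists; since the lists are sorted, a single merge-style pass over both would
do the same job for long ones.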
>
>
> Thank you,
>
> Cedric Veilleux
>
>
> ---------------------------------------------------------------------
> Before posting, please check:
>    http://www.mysql.com/manual.php   (the manual)
>    http://lists.mysql.com/           (the list archive)
>
> To request this thread, e-mail <[EMAIL PROTECTED]>
> To unsubscribe, e-mail
> <[EMAIL PROTECTED]>
> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
>
>

