> -----Original Message-----
> From: David Morel [mailto:[EMAIL PROTECTED]
> Sent: 16 January 2004 17:32
> To: Brad Campbell
> Cc: George Ionescu; [EMAIL PROTECTED]
> Subject: Re: [sqlite] Full text search implementation
>
>
> > My regex patch should do that
> >
> > SELECT * FROM Categories WHERE CategoryDescription RLIKE
> 'Beverages" and CategoryDescription NOT
> > RLIKE 'Whiskey';
> >
>
> In such a simple string matching I suspect a regex search is totally
> overkill... that's ok for a db containing 1000 rows, but try it on
> 700,000 rows (390Mb) like the one i have here ;-)
>
I don't think that your LIKE version will perform much better - SQLite
doesn't use indexes when doing LIKE comparisons.

> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> --
> ***********************************************
> [EMAIL PROTECTED]
> OpenPGP public key: http://www.amakuru.net/dmorel.asc
> 28192ef126bc871757cb7d97f4a44536
>
>
>

Using LIKE as a means of doing a full text search is virtually useless in
the real world of text retrieval.  The query take no account of context,
which is essential when dealing with intelligent text queries.
A full-on full text engine (BASIS, BRS etc) has to maintain a set of
meta-data for each text column that can be searched i.e.

When data is added to a text column, the text must be parsed to split it up
into searchable words using a break character list.
These words must then be reduced to their searchable stem (pluralisation,
inflexions, Porter stemming etc) and insignificant words ('a', 'and', 'the'
etc (stop words)) removed.
The words are then added to the column index - the posting in the index
contains the row ID, the start character position of the word and the
original length of the word.  It may also contain grammatical context info
such as the sentence/paragraph number.
At this point, some systems may also add into the index other variants of
the words (common mispellings, morphs etc) to improve recall.

Now, when you do a search on that column, the system has to parse your query
terms, stem them and weed out stop words in the same way as when data was
added.  It then looks up the words in the column index and collates the
proximity of the words.
There's not normally much point in searching for 'SQLite' and 'document' if
you can't tell the system to find them with the same sentence, paragraph, or
adjacent.

As you can see, a proper full text search engine is considerably more work
than it first looks.  Add onto this all the complexities of applying this to
different languages and you have a pretty major coding effort on your hands.

I have a working prototype of such a beast using SQLite that I'd be
interested in sharing the devlopment of, if anyone is interested?

Steve




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to