RE: [sqlite] Full text search implementation

2004-01-16 Thread Steve O'Hara


> -Original Message-
> From: Bertrand Mansion [mailto:[EMAIL PROTECTED]
> Sent: 16 January 2004 19:18
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [sqlite] Full text search implementation
>
>
> <[EMAIL PROTECTED]> wrote :
>
> >
> >
> >> -Original Message-
> >> From: David Morel [mailto:[EMAIL PROTECTED]
> >> Sent: 16 January 2004 17:32
> >> To: Brad Campbell
> >> Cc: George Ionescu; [EMAIL PROTECTED]
> >> Subject: Re: [sqlite] Full text search implementation
> >>
> >>
> >>> My regex patch should do that
> >>>
> >>> SELECT * FROM Categories WHERE CategoryDescription RLIKE
> >> 'Beverages" and CategoryDescription NOT
> >>> RLIKE 'Whiskey';
> >>>
> >>
> >> In such a simple string matching I suspect a regex search is totally
> >> overkill... that's ok for a db containing 1000 rows, but try it on
> >> 700,000 rows (390Mb) like the one i have here ;-)
> >>
> > I don't think that your LIKE version will perform much better - SQLite
> > doesn't use indexes when doing LIKE comparisons.
> >
> >>> -
> >>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>> For additional commands, e-mail: [EMAIL PROTECTED]
> >> --
> >> ***
> >> [EMAIL PROTECTED]
> >> OpenPGP public key: http://www.amakuru.net/dmorel.asc
> >> 28192ef126bc871757cb7d97f4a44536
> >>
> >>
> >>
> >
> > Using LIKE as a means of doing a full text search is virtually
> useless in
> > the real world of text retrieval.  The query take no account of context,
> > which is essential when dealing with intelligent text queries.
> > A full-on full text engine (BASIS, BRS etc) has to maintain a set of
> > meta-data for each text column that can be searched i.e.
> >
> > When data is added to a text column, the text must be parsed to
> split it up
> > into searchable words using a break character list.
> > These words must then be reduced to their searchable stem
> (pluralisation,
> > inflexions, Porter stemming etc) and insignificant words ('a',
> 'and', 'the'
> > etc (stop words)) removed.
> > The words are then added to the column index - the posting in the index
> > contains the row ID, the start character position of the word and the
> > original length of the word.  It may also contain grammatical
> context info
> > such as the sentence/paragraph number.
> > At this point, some systems may also add into the index other
> variants of
> > the words (common mispellings, morphs etc) to improve recall.
> >
> > Now, when you do a search on that column, the system has to
> parse your query
> > terms, stem them and weed out stop words in the same way as
> when data was
> > added.  It then looks up the words in the column index and collates the
> > proximity of the words.
> > There's not normally much point in searching for 'SQLite' and
> 'document' if
> > you can't tell the system to find them with the same sentence,
> paragraph, or
> > adjacent.
> >
> > As you can see, a proper full text search engine is
> considerably more work
> > than it first looks.  Add onto this all the complexities of
> applying this to
> > different languages and you have a pretty major coding effort
> on your hands.
> >
> > I have a working prototype of such a beast using SQLite that I'd be
> > interested in sharing the devlopment of, if anyone is interested?
>
> I agree. You just forgot about the scoring algorithm, a full text query
> should also be able to return a score.
>
> IMO, search engines (with tokenizer, indexer, stemmer, stopwords,
> substrings, fuzzy, binary converter...) offer a good choice in terms of
> features when it comes to full text search. It seems that Mnogosearch
> (http://www.mnogosearch.org) has included sqlite as their default db for
> their search engine software. So this might be a good companion to sqlite.
>
> The best solution would be IMO to have all this optionally integrated into
> the database engine. That's what Oracle does with Context and last time I
> have used it, it was working very well (it was with Oracle 9i). But that
> might also make the engine too heavy.
>
> Mysql offers fulltext but it is only available AFAIK on the MyISAM table
> format. So you can't have foreign key constraints (available on InnoDB
> format) and full text at the same time.
>
> Bertrand Mansion

Re: [sqlite] Full text search implementation

2004-01-16 Thread Bertrand Mansion
<[EMAIL PROTECTED]> wrote :

> 
> 
>> -Original Message-
>> From: David Morel [mailto:[EMAIL PROTECTED]
>> Sent: 16 January 2004 17:32
>> To: Brad Campbell
>> Cc: George Ionescu; [EMAIL PROTECTED]
>> Subject: Re: [sqlite] Full text search implementation
>> 
>> 
>>> My regex patch should do that
>>> 
>>> SELECT * FROM Categories WHERE CategoryDescription RLIKE
>> 'Beverages" and CategoryDescription NOT
>>> RLIKE 'Whiskey';
>>> 
>> 
>> In such a simple string matching I suspect a regex search is totally
>> overkill... that's ok for a db containing 1000 rows, but try it on
>> 700,000 rows (390Mb) like the one i have here ;-)
>> 
> I don't think that your LIKE version will perform much better - SQLite
> doesn't use indexes when doing LIKE comparisons.
> 
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>> --
>> ***
>> [EMAIL PROTECTED]
>> OpenPGP public key: http://www.amakuru.net/dmorel.asc
>> 28192ef126bc871757cb7d97f4a44536
>> 
>> 
>> 
> 
> Using LIKE as a means of doing a full text search is virtually useless in
> the real world of text retrieval.  The query take no account of context,
> which is essential when dealing with intelligent text queries.
> A full-on full text engine (BASIS, BRS etc) has to maintain a set of
> meta-data for each text column that can be searched i.e.
> 
> When data is added to a text column, the text must be parsed to split it up
> into searchable words using a break character list.
> These words must then be reduced to their searchable stem (pluralisation,
> inflexions, Porter stemming etc) and insignificant words ('a', 'and', 'the'
> etc (stop words)) removed.
> The words are then added to the column index - the posting in the index
> contains the row ID, the start character position of the word and the
> original length of the word.  It may also contain grammatical context info
> such as the sentence/paragraph number.
> At this point, some systems may also add into the index other variants of
> the words (common mispellings, morphs etc) to improve recall.
> 
> Now, when you do a search on that column, the system has to parse your query
> terms, stem them and weed out stop words in the same way as when data was
> added.  It then looks up the words in the column index and collates the
> proximity of the words.
> There's not normally much point in searching for 'SQLite' and 'document' if
> you can't tell the system to find them with the same sentence, paragraph, or
> adjacent.
> 
> As you can see, a proper full text search engine is considerably more work
> than it first looks.  Add onto this all the complexities of applying this to
> different languages and you have a pretty major coding effort on your hands.
> 
> I have a working prototype of such a beast using SQLite that I'd be
> interested in sharing the devlopment of, if anyone is interested?

I agree. You just forgot about the scoring algorithm, a full text query
should also be able to return a score.

IMO, search engines (with tokenizer, indexer, stemmer, stopwords,
substrings, fuzzy, binary converter...) offer a good choice in terms of
features when it comes to full text search. It seems that Mnogosearch
(http://www.mnogosearch.org) has included sqlite as their default db for
their search engine software. So this might be a good companion to sqlite.

The best solution would be IMO to have all this optionally integrated into
the database engine. That's what Oracle does with Context and last time I
have used it, it was working very well (it was with Oracle 9i). But that
might also make the engine too heavy.

Mysql offers fulltext but it is only available AFAIK on the MyISAM table
format. So you can't have foreign key constraints (available on InnoDB
format) and full text at the same time.

Bertrand Mansion
Mamasam


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [sqlite] Full text search implementation

2004-01-16 Thread Steve O'Hara


> -Original Message-
> From: David Morel [mailto:[EMAIL PROTECTED]
> Sent: 16 January 2004 17:32
> To: Brad Campbell
> Cc: George Ionescu; [EMAIL PROTECTED]
> Subject: Re: [sqlite] Full text search implementation
>
>
> > My regex patch should do that
> >
> > SELECT * FROM Categories WHERE CategoryDescription RLIKE
> 'Beverages" and CategoryDescription NOT
> > RLIKE 'Whiskey';
> >
>
> In such a simple string matching I suspect a regex search is totally
> overkill... that's ok for a db containing 1000 rows, but try it on
> 700,000 rows (390Mb) like the one i have here ;-)
>
I don't think that your LIKE version will perform much better - SQLite
doesn't use indexes when doing LIKE comparisons.

> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> --
> ***
> [EMAIL PROTECTED]
> OpenPGP public key: http://www.amakuru.net/dmorel.asc
> 28192ef126bc871757cb7d97f4a44536
>
>
>

Using LIKE as a means of doing a full text search is virtually useless in
the real world of text retrieval.  The query take no account of context,
which is essential when dealing with intelligent text queries.
A full-on full text engine (BASIS, BRS etc) has to maintain a set of
meta-data for each text column that can be searched i.e.

When data is added to a text column, the text must be parsed to split it up
into searchable words using a break character list.
These words must then be reduced to their searchable stem (pluralisation,
inflexions, Porter stemming etc) and insignificant words ('a', 'and', 'the'
etc (stop words)) removed.
The words are then added to the column index - the posting in the index
contains the row ID, the start character position of the word and the
original length of the word.  It may also contain grammatical context info
such as the sentence/paragraph number.
At this point, some systems may also add into the index other variants of
the words (common mispellings, morphs etc) to improve recall.

Now, when you do a search on that column, the system has to parse your query
terms, stem them and weed out stop words in the same way as when data was
added.  It then looks up the words in the column index and collates the
proximity of the words.
There's not normally much point in searching for 'SQLite' and 'document' if
you can't tell the system to find them with the same sentence, paragraph, or
adjacent.

As you can see, a proper full text search engine is considerably more work
than it first looks.  Add onto this all the complexities of applying this to
different languages and you have a pretty major coding effort on your hands.

I have a working prototype of such a beast using SQLite that I'd be
interested in sharing the devlopment of, if anyone is interested?

Steve




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [sqlite] Full text search implementation

2004-01-16 Thread Brad Campbell
David Morel wrote:
My regex patch should do that

SELECT * FROM Categories WHERE CategoryDescription RLIKE 'Beverages" and CategoryDescription NOT 
RLIKE 'Whiskey';



In such a simple string matching I suspect a regex search is totally
overkill... that's ok for a db containing 1000 rows, but try it on
700,000 rows (390Mb) like the one i have here ;-)
Yup, as soon as I sent it I thought "What a perfect application for LIKE 
'%Whiskey%'..
And I do like Whiskey!
:p)

Brad

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [sqlite] Full text search implementation

2004-01-16 Thread David Morel
> My regex patch should do that
> 
> SELECT * FROM Categories WHERE CategoryDescription RLIKE 'Beverages" and 
> CategoryDescription NOT 
> RLIKE 'Whiskey';
> 

In such a simple string matching I suspect a regex search is totally
overkill... that's ok for a db containing 1000 rows, but try it on
700,000 rows (390Mb) like the one i have here ;-)

> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
-- 
***
[EMAIL PROTECTED]
OpenPGP public key: http://www.amakuru.net/dmorel.asc
28192ef126bc871757cb7d97f4a44536




signature.asc
Description: Ceci est une partie de message	=?ISO-8859-1?Q?num=E9riquement?= =?ISO-8859-1?Q?_sign=E9e=2E?=


Re: [sqlite] Full text search implementation

2004-01-16 Thread David Morel
Le ven 16/01/2004 à 15:30, George Ionescu a écrit :
> I've been using sqlite for quite some time now and I am a fan (thank you Dr.
> Hipp).

so am I

> Regarding Full Text Search syntax, I think CONTAINS statement would be ok
> (as implemented in other database engines).
> So:
> SELECT * FROM Categories
> WHERE CONTAINS(CategoryDescription, 'Beverages')
> would return all the rows where column CategoryDescription contains the
> word 'Beverages' :-)

doesn't 
SELECT * FROM Categories  WHERE CategoryDescription LIKE '%Beverages%';
do just that ?

I guess full-text indexing applies to columns containing larger chunks
of text...

D.Morel
-- 
***
[EMAIL PROTECTED]
OpenPGP public key: http://www.amakuru.net/dmorel.asc
28192ef126bc871757cb7d97f4a44536




signature.asc
Description: Ceci est une partie de message	=?ISO-8859-1?Q?num=E9riquement?= =?ISO-8859-1?Q?_sign=E9e=2E?=


Re: [sqlite] Full text search implementation

2004-01-16 Thread Brad Campbell
George Ionescu wrote:
Hello dear SQLite users.
Hello Dr. Hipp,
I've been using sqlite for quite some time now and I am a fan (thank you Dr.
Hipp).
I will try to rewrite my (unfinished) Document Management application using
SQLite and I was wondering if Full Text Search will be implemented.
I'vee seen a ticket posted by Dr. Hipp on 2003-09-03 where he got a response
from a guy who's oppinion was that Full Text Search is not useful in sqlite;
I have a different oppinion: since all major database engines implement this
feature, why should sqlite be left aside?
Regarding Full Text Search syntax, I think CONTAINS statement would be ok
(as implemented in other database engines).
So:
SELECT * FROM Categories
WHERE CONTAINS(CategoryDescription, 'Beverages')
would return all the rows where column CategoryDescription contains the
word 'Beverages' :-)
similar
SELECT * FROM Categories
WHERE CONTAINS(CategoryDescription, ' "Beverages" AND NOT "Whiskey"
')
would return all the rows where column CategoryDescription contains the
word 'Beverages' and does not contain the word 'Whiskey'
My regex patch should do that

SELECT * FROM Categories WHERE CategoryDescription RLIKE 'Beverages" and CategoryDescription NOT 
RLIKE 'Whiskey';

You can find a working test version here
http://www.wasp.net.au/~brad/sqlite-110104-snapshot-bkc1.tgz
Brad

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]