Re: [sqlite] FTS statistics and stemming

Stephen Woodbridge Mon, 28 Jul 2008 13:55:36 -0700

Scott Hess wrote:
> On Sat, Jul 26, 2008 at 1:28 PM, Stephen Woodbridge
> <[EMAIL PROTECTED]> wrote:
>> Alexey Pechnikov wrote:
>>> I know that ispell, myspell, hunspell and trigrams are used in PostgreSQL
>>> FTS, and many languages are supported this way. A soundex function is also
>>> useful for morphological search if the words are written in the Latin
>>> alphabet (transliteration: replace each symbol of the national alphabet
>>> with one or more Latin letters):
> <snip>
>>> There is stemming in Apache Lucene, Sphinx (which includes morphology via
>>> soundex) and Xapian too.
>>>
>>> Are these features planned for SQLite FTS?
>> Well, I will leave the question of plans to Scott Hess, the FTS developer,
>> to answer.
> 
> Unfortunately, my interests don't really run towards implementing
> useful new stemmers.  I mean, I could, but I'm unlikely to do a good
> job unless I'm doing it because it scratches some engineering itch I
> have.  I tend to have more interest in infrastructure-y things, like
> how to safely encode/decode data structures.  I know this is an
> unsatisfactory answer :-).


Scott,

This makes perfect sense to me. The stemmers are all written, and C
source is available from the Snowball site for anyone who needs one in
another language. The existing porter and simple tokenizers provide
adequate examples of how to integrate additional stemmers into SQLite, so
there is probably no need for you to mess with them.

I think that more interesting infrastructure stuff might be along the 
lines of:

1) looking at how PostgreSQL has integrated FTS, which makes it
very flexible and extensible
2) looking into ways to do fast fuzzy FTS with scoring of results
3) looking at a simple extension that would allow an FTS column of 
soundex tokens to be automatically built from the source documents.

    create virtual table mytable using fts3
       (sndx tokenize soundex on doc, doc tokenize porter);

    select docid from mytable where sndx MATCH 'list of words';

So the idea would be that doc is tokenized using porter, like today, and
the sndx column is built from soundex tokens generated from the doc
column tokens. On the select, it would be smart enough to compute the
soundex tokens for 'list of words' before running the query if the column
is soundex encoded. The syntax should be fixed to something appropriate;
I just wanted to get the idea across.
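
To make the idea a bit more concrete, here is a rough Python sketch of the
encoding step only. It is not SQLite code and does not use the FTS3 tokenizer
interface; the names soundex(), soundex_tokens() and matches() are made up for
illustration, and I am assuming the common American Soundex rules. The point
is just that the same encoding is applied both to the document tokens and to
the query words before they are compared.

```python
import re

# Letter-to-digit table for American Soundex (assumed variant).
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character soundex code of one word ('' if no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def soundex_tokens(text):
    """Encode every word of text; this is what the sndx column would hold."""
    return " ".join(soundex(w) for w in re.findall(r"[A-Za-z]+", text))


def matches(doc, query):
    """Hypothetical stand-in for MATCH: every query code must occur in doc."""
    doc_codes = set(soundex_tokens(doc).split())
    return all(code in doc_codes for code in soundex_tokens(query).split())


# The misspelled "wurds" still matches the query word "words".
print(soundex_tokens("list of words"))          # L230 O100 W632
print(matches("a list of wurds", "words list"))  # True
```

In a real implementation the encoding would presumably happen inside the
tokenizer, so that both the index and the query terms pass through it.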

Anyway, I'm sure that whatever you decide to do, it will be useful and 
helpful to the community and I for one really appreciate your efforts.

Thanks,
   -Steve
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
