Hi,

I would like to build a table of all the unique words occurring in
my corpus (for a spelling suggestion feature). At present I am using
the Porter stemming tokenizer, and I do not want to give up the
stemmer. If I were not using the Porter stemmer, I could easily obtain
the list of unique words in the corpus from the fts4aux module. With
the stemmer, however, all the words are stored in the index in their
stemmed form, which is not suitable for building a dictionary of
proper English words.

One solution is to use a custom tokenizer. I was thinking of taking the
stock Porter tokenizer supplied with SQLite and adding some code to
store each token in a separate table before stemming it. But I am not
sure whether it is OK to access or modify the database using SQL
statements inside a tokenizer. Now that I think of it, the tokenizer
code is also executed when a query is run against the FTS table (that
is, when searching), at which time I don't want my dictionary-building
code to execute. So perhaps this is not a good idea.
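
To make the "record the word before it is stemmed" step concrete, here
is a rough sketch of the same idea done from application code at insert
time rather than inside the tokenizer, so it never runs during a
search. This is only an illustration under assumptions: the table names
(docs, dictionary), the add_document helper and the regular expression
are hypothetical, and the regex only approximates how the tokenizer
splits ASCII text.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: a Porter-stemmed FTS4 table plus an ordinary
# table that collects the unstemmed words.
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=porter)")
conn.execute("CREATE TABLE dictionary(word TEXT PRIMARY KEY)")

# Assumption: runs of ASCII letters and digits count as words.
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def add_document(conn, text):
    """Record the unstemmed words, then index the text in the FTS table."""
    words = {w.lower() for w in WORD_RE.findall(text)}
    # INSERT OR IGNORE skips words already present in the dictionary.
    conn.executemany(
        "INSERT OR IGNORE INTO dictionary(word) VALUES (?)",
        [(w,) for w in sorted(words)],
    )
    conn.execute("INSERT INTO docs(body) VALUES (?)", (text,))

add_document(conn, "The runners were running quickly")
add_document(conn, "A runner runs")

# Searching only touches the FTS table; the dictionary is not modified.
hits = conn.execute(
    "SELECT count(*) FROM docs WHERE docs MATCH 'running'"
).fetchone()[0]
dictionary = [r[0] for r in conn.execute("SELECT word FROM dictionary ORDER BY word")]

print(hits)
print(dictionary)
```

The drawback is that the word splitting is duplicated outside the
tokenizer, so the dictionary may not match the tokenizer's idea of a
word in every case.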

What other options do I have?

Thanks
Abhinav