Hi, I would like to build a table of all the unique words occurring in my corpus (for a spelling suggestion feature). At present I am using the Porter stemming tokenizer, and I do not want to give up the stemmer at any cost. If I were not using the Porter stemmer, I could easily obtain the list of unique words in the corpus using the fts4aux module. But with the stemmer, all the words are stored in the index in their stemmed form, which is not suitable for building a dictionary of proper English words.
One solution is to use a custom tokenizer. I was thinking of taking the default Porter tokenizer supplied with SQLite and adding some code to store each token in a separate table before it is stemmed. But I am not sure whether it is OK to access or modify the database with SQL statements from inside a tokenizer. Now that I think of it, the tokenizer code is also executed when a query is run against the FTS table (when performing a search), and at that time I don't want my dictionary-building code to execute. So perhaps this is not a good idea.

What other options do I have?

Thanks,
Abhinav