> On Sep 21, 2018, at 3:26 AM, Domingo Alvarez Duarte <mingo...@gmail.com> 
> wrote:
> 
> looking at some fts5 tables it seems that an option to limit the minimum 
> number of characters to at least 2 or 3 would be a good shot as stopwords,

A real stop-word list is valuable, but I don’t think a simple minimum-length 
rule would be as useful. Maybe in a few contexts, but not in general. (It’s not 
useful even for English text; for example, I’m very glad that Google indexes 
the word “C” so I can look up questions about C programming!)

> another interest option would be a regex like black/white list of sequence of 
> characters to be indexed.

You can do all this and more with a custom tokenizer :)
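
To make that concrete, here is a rough sketch in Python of the filtering step
such a tokenizer might apply. It only illustrates the decision logic, not the
FTS5 API itself. Every name and value below (MIN_LENGTH, KEEP_ANYWAY,
STOP_WORDS, DENY_PATTERNS, should_index, filter_tokens) is made up for the
example.

```python
import re

# Hypothetical settings, chosen only for illustration.
MIN_LENGTH = 3
KEEP_ANYWAY = {"c", "r", "go"}          # short terms still worth indexing
STOP_WORDS = {"the", "and", "for"}
DENY_PATTERNS = [re.compile(r"^\d+$")]  # e.g. skip purely numeric tokens


def should_index(token: str) -> bool:
    """Decide whether a single token should reach the index."""
    t = token.lower()
    if t in KEEP_ANYWAY:
        # The allow list wins over the length rule, so "C" survives.
        return True
    if len(t) < MIN_LENGTH or t in STOP_WORDS:
        return False
    # Reject anything matching a deny-list pattern.
    return not any(p.search(t) for p in DENY_PATTERNS)


def filter_tokens(tokens):
    """Return only the tokens that pass should_index, in original order."""
    return [t for t in tokens if should_index(t)]
```

In a real FTS5 tokenizer, which is written in C, the equivalent would
presumably be to skip the xToken callback for any token that fails a check
like should_index.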

(Most real-world uses of FTS for natural language text will end up needing a 
custom tokenizer anyway, because IIRC the default tokenizer is very stupid and 
only breaks at whitespace. At a minimum you need one that can ignore inter-word 
punctuation like periods and commas, and recognize some non-ASCII characters 
like curly quotes and en-dashes.)
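
As a rough illustration of the splitting behavior described above, here is a
small Python sketch. It is not how any SQLite tokenizer is implemented; the
names _WORD and tokenize are invented for the example, and the choice to fold
a curly apostrophe into a straight one is just one possible assumption.

```python
import re

# A word is a run of Unicode letters/digits, optionally joined by a straight
# or curly apostrophe (so "it's" and "it’s" stay whole). Everything else,
# including periods, commas, curly double quotes and en-dashes, acts as a
# separator.
_WORD = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")


def tokenize(text):
    """Split text into lowercase tokens, ignoring inter-word punctuation."""
    return [
        # Fold the curly apostrophe so both spellings index identically.
        m.group(0).lower().replace("\u2019", "'")
        for m in _WORD.finditer(text)
    ]
```

For example, tokenize("Hello, world.") gives ["hello", "world"].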

—Jens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
