> On Sep 21, 2018, at 3:26 AM, Domingo Alvarez Duarte <mingo...@gmail.com> wrote:
>
> looking at some fts5 tables it seems that an option to limit the minimum
> number of characters to at least 2 or 3 would be a good shot as stopwords,

A real stop-word list is valuable, but I don’t think a simple minimum-length rule would be as useful. Maybe in a few contexts, but not in general. (It’s not useful even for English text; for example, I’m very glad that Google indexes the word “C” so I can look up questions about C programming!)

> another interest option would be a regex like black/white list of sequence of
> characters to be indexed.

You can do all this and more with a custom tokenizer :)

(Most real-world uses of FTS for natural language text will end up needing a custom tokenizer anyway, because IIRC the default tokenizer is very stupid and only breaks at whitespace. At a minimum you need one that can ignore inter-word punctuation like periods and commas, and recognize some non-ASCII characters like curly quotes and en-dashes.)

—Jens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
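
For illustration, here is a rough sketch of the filtering a custom tokenizer might do for the cases discussed in this thread: keep short words like "C", drop stop words, and skip tokens matching a deny pattern. It is plain Python and is not wired into FTS5; a real FTS5 tokenizer has to be registered through the FTS5 C API. The names `STOP_WORDS`, `WORD`, `tokenize`, and `deny` are made up for this example, and the stop-word list is a placeholder, not a recommended one.

```python
import re

# Hypothetical stop-word list; a real one would be larger and language-specific.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})

# A word is a run of letters/digits/underscores, optionally joined by a
# straight or curly apostrophe, so "don\u2019t" stays a single token.
# Everything else (periods, commas, curly quotes, en-dashes) separates words.
WORD = re.compile(r"\w+(?:['\u2019]\w+)*")


def tokenize(text, stop_words=STOP_WORDS, deny=None):
    """Return lowercase tokens, skipping stop words and deny-pattern matches.

    There is deliberately no minimum-length rule, so one-letter
    words such as "C" are still indexed.
    """
    tokens = []
    for match in WORD.finditer(text):
        token = match.group().lower()
        if token in stop_words:
            continue  # filtered by the stop-word list
        if deny is not None and deny.fullmatch(token):
            continue  # filtered by the regex blacklist
        tokens.append(token)
    return tokens
```

For example, `tokenize("route 66 is open", deny=re.compile(r"\d+"))` drops the stop word "is" and the all-digit token "66", leaving "route" and "open". A whitelist would work the same way with the test inverted.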