> On Sep 20, 2018, at 11:01 PM, 邱朗 <qiulang2...@126.com> wrote:
>
> https://www.sqlite.org/fts5.html says: "The unicode tokenizer classifies all unicode characters as either 'separator' or 'token' characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters..." I really doubt the unicode tokenizer requires whitespace; that is the ascii tokenizer.
Detecting word breaks in many East Asian languages (not just CJK; Thai is another) is a rather difficult task and requires a non-small database of character sequences to match against. I'm sure the SQLite maintainers considered it too large to build into their Unicode tokenizer. IIRC, ICU can do this, as can specialized libraries like MeCab.

—Jens

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
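[Editor's note: to illustrate the dictionary-driven matching Jens describes, here is a minimal sketch of forward longest-match segmentation. The tiny dictionary is a hypothetical example; real segmenters such as ICU's BreakIterator or MeCab use far larger lexicons plus statistical models.]

```python
# Toy forward maximal-match segmenter for unspaced text (e.g. Chinese).
# DICTIONARY is a made-up illustrative lexicon, not from ICU or MeCab.
DICTIONARY = {"我们", "喜欢", "数据库", "数据", "我", "们"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def segment(text: str) -> list[str]:
    """Greedy longest-match segmentation against DICTIONARY.

    Characters not covered by any dictionary entry fall back to
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, shrinking to 1 char.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("我们喜欢数据库"))  # → ['我们', '喜欢', '数据库']
```

Even this toy version shows why a whitespace-based tokenizer cannot do the job: correct token boundaries depend entirely on the lexicon ("数据库" vs. "数据" + "库"), which is exactly the "non-small database of character sequences" mentioned above.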