> On Sep 20, 2018, at 11:01 PM, 邱朗 <qiulang2...@126.com> wrote:
> 
> https://www.sqlite.org/fts5.html says: "The unicode tokenizer classifies
> all unicode characters as either "separator" or "token" characters. By
> default all space and punctuation characters, as defined by Unicode 6.1,
> are considered separators, and all other characters as token characters..."
> I really doubt that the unicode tokenizer requires white space; that is
> the behavior of the ascii tokenizer.

Detecting word breaks in many East Asian languages (not just CJK; Thai is
another) is a genuinely difficult task, and it requires a sizable dictionary
of character sequences to match against. The built-in unicode61 tokenizer
only classifies individual characters as separators or tokens, so a run of
Chinese or Thai text with no spaces between words gets indexed as one long
token. I’m sure the SQLite maintainers considered a segmentation dictionary
too large to build into their Unicode tokenizer.
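
A quick illustration of the consequence (table name and sample sentence are
made up; any SQLite build with FTS5 should behave this way):

  -- unicode61 classifies each character individually; CJK characters are
  -- "token" characters, so unspaced Chinese text is indexed as one big token.
  CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'unicode61');
  INSERT INTO docs VALUES ('我喜欢数据库');      -- "I like databases", no spaces
  SELECT * FROM docs WHERE docs MATCH '数据库';  -- no rows: '数据库' was never
                                                 -- indexed as a standalone token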

IIRC, ICU can do this, as can specialized libraries like MeCab.
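
For what it's worth, SQLite's optional ICU extension exposes this at the SQL
level for FTS3/FTS4 (FTS5 has no built-in ICU tokenizer; you'd have to
register a custom one through the fts5_tokenizer API). A sketch, assuming a
build compiled with SQLITE_ENABLE_ICU and the same hypothetical sample text:

  -- FTS3/FTS4 accept "tokenize=icu" plus an optional locale when the
  -- library is built with SQLITE_ENABLE_ICU.
  CREATE VIRTUAL TABLE docs_icu USING fts4(body, tokenize=icu zh_CN);
  INSERT INTO docs_icu VALUES ('我喜欢数据库');
  -- ICU's dictionary-based word-break iterator can split the sentence
  -- into words, so a query for a single word can match:
  SELECT * FROM docs_icu WHERE docs_icu MATCH '数据库';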

—Jens