On Fri, Sep 21, 2018 at 12:02 AM 邱朗 <qiulang2...@126.com> wrote:
>
> https://www.sqlite.org/fts5.html said "The unicode tokenizer classifies all
> unicode characters as either "separator" or "token" characters. By default
> all space and punctuation characters, as defined by Unicode 6.1, are
> considered separators, and all other characters as token characters..."
> I really doubt the unicode tokenizer requires whitespace; that is the ascii
> tokenizer.
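To make the consequence of that quote concrete, here is a minimal sketch using Python's stdlib sqlite3 module (it assumes the underlying SQLite build has FTS5 compiled in): because unicode61 classifies CJK ideographs as token characters, an unseparated run of Chinese text is indexed as one long token, so no sub-word query can match inside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# unicode61 is FTS5's default tokenizer; spelled out here for clarity.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='unicode61')")
# '全文检索引擎' ("full-text search engine") has no separators, so it is
# indexed as a single six-character token.
conn.execute("INSERT INTO docs(body) VALUES ('全文检索引擎')")

# Querying for the whole run matches that one token...
whole = conn.execute(
    "SELECT count(*) FROM docs WHERE docs MATCH '\"全文检索引擎\"'").fetchone()[0]
# ...but a sub-word query finds nothing: there is no token boundary inside it.
sub = conn.execute(
    "SELECT count(*) FROM docs WHERE docs MATCH '\"全文\"'").fetchone()[0]
print(whole, sub)  # prints: 1 0
```

That is exactly the behavior the original poster is worried about: the tokenizer isn't wrong, it just has no way to find word boundaries that the text never wrote down.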
Forgive my imprecise use of language; I should have said "separators" rather than "whitespace". Regardless, CJK scripts separate words only implicitly (nothing is written between them), and that description indicates the unicode tokenizer expects explicit separators (be they whitespace, punctuation, or something else) between tokens.

> That was why I thought it might work for CJK.

I think it could be made to work, or at least I have experience making it work with CJK using functionality exposed via ICU. I don't know whether the unicode tokenizer uses ICU, or whether the ICU functionality I used is available in the unicode tables. Not understanding any of the languages CJK covers, I can't say with any confidence how good my solution was, but it seemed good enough for the use case of my management and customers in the affected regions.

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
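The ICU-based approach described above can be approximated without ICU at all: pre-segment CJK text before it reaches FTS5, so unicode61 sees explicit separators. In this sketch a crude per-character (unigram) split stands in for ICU's word BreakIterator, which would give real word boundaries; `presegment` is a hypothetical helper, not part of SQLite or ICU.

```python
import sqlite3
import unicodedata

def presegment(text):
    # Crude stand-in for a real segmenter: surround every CJK ideograph
    # with spaces so unicode61 sees a token boundary at each character.
    pieces = []
    for ch in text:
        if 'CJK UNIFIED' in unicodedata.name(ch, ''):
            pieces.append(f' {ch} ')
        else:
            pieces.append(ch)
    return ' '.join(''.join(pieces).split())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='unicode61')")
# Index the pre-segmented form: '全 文 检 索 引 擎' instead of '全文检索引擎'.
conn.execute("INSERT INTO docs(body) VALUES (?)", (presegment('全文检索引擎'),))

# Query terms must be segmented the same way, issued as a phrase so the
# characters must appear consecutively.
hits = conn.execute(
    "SELECT count(*) FROM docs WHERE docs MATCH '\"全 文\"'").fetchone()[0]
print(hits)  # prints: 1
```

With a real segmenter in place of the per-character split, the phrase queries line up with actual words rather than single ideographs; the per-character version trades precision for recall but is a common fallback when no dictionary is available.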