https://www.sqlite.org/fts5.html says: "The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters..." I really doubt the unicode tokenizer requires whitespace; that is the ascii tokenizer.
That was why I thought it might work for CJK.

Qiulang

At 2018-09-21 13:03:54, "Scott Robison" <sc...@casaderobison.com> wrote:
>On Thu, Sep 20, 2018, 8:21 PM 邱朗 <qiulang2...@126.com> wrote:
>
>> Hi,
>> I had thought the Unicode61 tokenizer could support CJK (Chinese,
>> Japanese, Korean). I verified that my sqlite supports fts5.
>>
>> {snipped}
>>
>> But to my surprise it can't find any CJK word at all. Why is that?
>
>Based on my experience with such things, I suspect that the tokenizer
>requires whitespace between adjacent words, which is not the case with CJK.
>Word breaks are implicit, not explicit.
>
>Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm
>for software at work that used functionality from ICU to find CJK word
>boundaries, so I believe it is possible, just not as straightforward as
>whitespace-delimited words.
>_______________________________________________
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
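Scott's hypothesis is easy to check directly. Unicode61 does not require whitespace as such, but because CJK ideographs are classified as "token" characters, an unbroken run of them is indexed as one long token, so individual words inside the run are not searchable. A minimal sketch in Python, assuming an SQLite build with FTS5 enabled (the table and sample text here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# unicode61 is the default FTS5 tokenizer; spelled out for clarity.
con.execute("CREATE VIRTUAL TABLE doc USING fts5(body, tokenize='unicode61')")
con.execute("INSERT INTO doc VALUES ('SQLite 中文测试')")

# The whole CJK run '中文测试' was indexed as a single token, so:
full = con.execute("SELECT * FROM doc WHERE doc MATCH '中文测试'").fetchall()
part = con.execute("SELECT * FROM doc WHERE doc MATCH '中文'").fetchall()
prefix = con.execute("SELECT * FROM doc WHERE doc MATCH '中文*'").fetchall()

print(len(full), len(part), len(prefix))  # 1 0 1
```

The exact token matches, the embedded word does not, and only a prefix query recovers it. This is why real CJK support needs a segmenting tokenizer (e.g. one built on ICU word-boundary analysis) rather than unicode61's character-class rules.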