On Thu, Sep 20, 2018, 8:21 PM 邱朗 <qiulang2...@126.com> wrote:

> Hi,
> I had thought the Unicode61 tokenizer could support CJK -- Chinese,
> Japanese, Korean. I verified that my sqlite supports fts5.
>
> {snipped}
>
> But to my surprise it can't find any CJK word at all. Why is that?
Based on my experience with such things, I suspect the tokenizer requires whitespace between adjacent words, which is not the case with CJK: word breaks are implicit, not explicit.

Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm for software at work that used ICU functionality to find CJK word boundaries, so I believe it is possible, just not as straightforward as whitespace-delimited words.
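For what it's worth, here is a minimal sketch of the kind of approach I mean, using ICU's BreakIterator to find word boundaries in Chinese text. The sample sentence and program structure are only illustrative, not the code I wrote at work:

    // Sketch: segment CJK text with ICU's word BreakIterator.
    // Build with something like: g++ seg.cpp -licuuc -licui18n
    #include <unicode/brkiter.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        // Example sentence with no whitespace between words.
        icu::UnicodeString text =
            icu::UnicodeString::fromUTF8("我爱自然语言处理");

        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createWordInstance(
                icu::Locale::getChinese(), status));
        if (U_FAILURE(status)) return 1;

        it->setText(text);
        int32_t start = it->first();
        for (int32_t end = it->next(); end != icu::BreakIterator::DONE;
             start = end, end = it->next()) {
            std::string word;
            text.tempSubStringBetween(start, end).toUTF8String(word);
            // Each segment ICU considers a word, one per line.
            std::cout << word << "\n";
        }
        return 0;
    }

A custom FTS5 tokenizer could, in principle, do the same kind of boundary analysis instead of splitting on whitespace.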