On Thu, Sep 20, 2018, 8:21 PM 邱朗 <qiulang2...@126.com> wrote:

> Hi,
> I had thought the Unicode61 tokenizer could support CJK (Chinese,
> Japanese, Korean). I verified that my SQLite build supports FTS5:
>
> {snipped}
>
> But to my surprise it can't find any CJK words at all. Why is that?


Based on my experience with such things, I suspect that the tokenizer
requires separator characters (whitespace or punctuation) between
adjacent words, which CJK text does not provide: its word breaks are
implicit, not explicit.
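
For example, with unicode61 an unbroken run of CJK characters appears
to be indexed as a single token, so a query for part of the run finds
nothing while the whole run matches. Here is a minimal sketch against
the C API (untested; assumes a build with FTS5 enabled, and the table
and column names are made up for the demo):

#include <stdio.h>
#include <sqlite3.h>

/* Print each result row returned by sqlite3_exec(). */
static int print_row(void *unused, int argc, char **argv, char **colname) {
  (void)unused; (void)colname;
  for (int i = 0; i < argc; i++)
    printf("%s\n", argv[i] ? argv[i] : "NULL");
  return 0;
}

int main(void) {
  sqlite3 *db;
  char *errmsg = 0;

  if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

  const char *sql =
    "CREATE VIRTUAL TABLE doc USING fts5(body, tokenize='unicode61');"
    "INSERT INTO doc VALUES ('\u6211\u7231\u5317\u4eac');"  /* "I love Beijing" */
    /* Returns the row: the whole CJK run was indexed as one token,
       and the query string tokenizes to that same single token. */
    "SELECT 'query 1: ' || body FROM doc WHERE doc MATCH '\u6211\u7231\u5317\u4eac';"
    /* Returns nothing: '\u5317\u4eac' (Beijing) is only a substring
       of that single indexed token. */
    "SELECT 'query 2: ' || body FROM doc WHERE doc MATCH '\u5317\u4eac';";

  if (sqlite3_exec(db, sql, print_row, 0, &errmsg) != SQLITE_OK) {
    fprintf(stderr, "SQL error: %s\n", errmsg);
    sqlite3_free(errmsg);
  }
  sqlite3_close(db);
  return 0;
}

FTS5 also lets you register a custom tokenizer, so plugging in a
CJK-aware segmenter should be possible.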

Is the Unicode61 tokenizer based on ICU? I once had to implement an
algorithm for software at work that used ICU functionality to find CJK
word boundaries, so I believe it is possible, just not as
straightforward as handling whitespace-delimited words.
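
For reference, ICU exposes that through its C break-iterator API. A
minimal sketch of what I mean (untested; assumes ICU4C, linked with
something like -licuuc; the sample string is arbitrary):

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void) {
  UErrorCode status = U_ZERO_ERROR;

  /* "I love Beijing" in Chinese, converted from UTF-8 to UTF-16. */
  const char *utf8 = "\u6211\u7231\u5317\u4eac";
  UChar text[64];
  int32_t len = 0;
  u_strFromUTF8(text, 64, &len, utf8, -1, &status);
  if (U_FAILURE(status)) return 1;

  /* Word-boundary iterator; "zh" is the requested locale. As far as
     I know, ICU uses a dictionary-based engine for CJK word breaks,
     so this should find the implicit boundaries. */
  UBreakIterator *bi = ubrk_open(UBRK_WORD, "zh", text, len, &status);
  if (U_FAILURE(status)) return 1;

  int32_t start = ubrk_first(bi);
  for (int32_t end = ubrk_next(bi); end != UBRK_DONE;
       start = end, end = ubrk_next(bi)) {
    /* Convert each segment back to UTF-8 for printing. */
    char word[64];
    u_strToUTF8(word, sizeof word, NULL, text + start, end - start, &status);
    if (U_SUCCESS(status))
      printf("word: %s (UChar offsets %d-%d)\n", word, start, end);
  }
  ubrk_close(bi);
  return 0;
}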
