On Thu, Sep 20, 2018, 8:21 PM 邱朗 <qiulang2...@126.com> wrote:

> Hi,
> I had thought the Unicode61 tokenizer could support CJK (Chinese,
> Japanese, Korean). I verified that my SQLite build supports FTS5:
>
> {snipped}
>
> But to my surprise it can't find any CJK words at all. Why is that?


Based on my experience with such things, I suspect that the tokenizer
requires separator characters (whitespace or punctuation) between
adjacent words, which CJK text does not provide: its word breaks are
implicit, not explicit.
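
For example, with unicode61 an unbroken run of CJK characters appears
to be indexed as a single token, so a query for part of the run finds
nothing while the whole run matches. Here is a minimal sketch against
the C API (untested; assumes a build with FTS5 enabled, and the table
and column names are made up for the demo):

#include <stdio.h>
#include <sqlite3.h>

/* Print each result row returned by sqlite3_exec(). */
static int print_row(void *unused, int argc, char **argv, char **colname) {
  (void)unused; (void)colname;
  for (int i = 0; i < argc; i++)
    printf("%s\n", argv[i] ? argv[i] : "NULL");
  return 0;
}

int main(void) {
  sqlite3 *db;
  char *errmsg = 0;

  if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

  const char *sql =
    "CREATE VIRTUAL TABLE doc USING fts5(body, tokenize='unicode61');"
    "INSERT INTO doc VALUES ('\u6211\u7231\u5317\u4eac');"  /* "I love Beijing" */
    /* Returns the row: the whole CJK run was indexed as one token,
       and the query string tokenizes to that same single token. */
    "SELECT 'query 1: ' || body FROM doc WHERE doc MATCH '\u6211\u7231\u5317\u4eac';"
    /* Returns nothing: '\u5317\u4eac' (Beijing) is only a substring
       of that single indexed token. */
    "SELECT 'query 2: ' || body FROM doc WHERE doc MATCH '\u5317\u4eac';";

  if (sqlite3_exec(db, sql, print_row, 0, &errmsg) != SQLITE_OK) {
    fprintf(stderr, "SQL error: %s\n", errmsg);
    sqlite3_free(errmsg);
  }
  sqlite3_close(db);
  return 0;
}

FTS5 also lets you register a custom tokenizer, so plugging in a
CJK-aware segmenter should be possible.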

Is the Unicode61 tokenizer based on ICU? I once had to implement an
algorithm for software at work that used ICU functionality to find CJK
word boundaries, so I believe it is possible, just not as
straightforward as handling whitespace-delimited words.
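
For reference, ICU exposes that through its C break-iterator API. A
minimal sketch of what I mean (untested; assumes ICU4C, linked with
something like -licuuc; the sample string is arbitrary):

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void) {
  UErrorCode status = U_ZERO_ERROR;

  /* "I love Beijing" in Chinese, converted from UTF-8 to UTF-16. */
  const char *utf8 = "\u6211\u7231\u5317\u4eac";
  UChar text[64];
  int32_t len = 0;
  u_strFromUTF8(text, 64, &len, utf8, -1, &status);
  if (U_FAILURE(status)) return 1;

  /* Word-boundary iterator; "zh" is the requested locale. As far as
     I know, ICU uses a dictionary-based engine for CJK word breaks,
     so this should find the implicit boundaries. */
  UBreakIterator *bi = ubrk_open(UBRK_WORD, "zh", text, len, &status);
  if (U_FAILURE(status)) return 1;

  int32_t start = ubrk_first(bi);
  for (int32_t end = ubrk_next(bi); end != UBRK_DONE;
       start = end, end = ubrk_next(bi)) {
    /* Convert each segment back to UTF-8 for printing. */
    char word[64];
    u_strToUTF8(word, sizeof word, NULL, text + start, end - start, &status);
    if (U_SUCCESS(status))
      printf("word: %s (UChar offsets %d-%d)\n", word, start, end);
  }
  ubrk_close(bi);
  return 0;
}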
