On Fri, Sep 21, 2018 at 12:02 AM 邱朗 <qiulang2...@126.com> wrote:
>
> https://www.sqlite.org/fts5.html says: "The unicode tokenizer classifies all 
> unicode characters as either "separator" or "token" characters. By default 
> all space and punctuation characters, as defined by Unicode 6.1, are 
> considered separators, and all other characters as token characters..."  I 
> really doubt the unicode tokenizer requires white space; that is the ascii 
> tokenizer.

Forgive my imprecise use of language. I should have said separators
instead of whitespace. Regardless, CJK uses implicit separation
between words, and that description seems to indicate that the unicode
tokenizer expects explicit separators (be they whitespace or
punctuation or something else) between tokens.

> That was why I thought it might work for CJK.

I think it could be made to work; at least, I have experience making it
work for CJK using functionality exposed via ICU. I don't know whether
the unicode tokenizer uses ICU, or whether the ICU functionality I
relied on is available in the Unicode tables. Since I don't read any of
the languages covered by CJK, I can't say with any confidence how good
my solution was, but it seemed good enough for the use case of my
management and customers in the affected regions.
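
For what it's worth, the ICU route usually means a dictionary-based word
BreakIterator; I don't have the original code to hand, but as a rough,
dependency-free sketch of the general idea (this is hypothetical
illustration, not the solution described above), a common fallback is to
emit each CJK ideograph as its own single-character token while keeping
alphanumeric runs intact, so that text with no explicit separators still
produces indexable tokens:

```python
def tokenize(text):
    """Unigram fallback for CJK: each ideograph in the CJK Unified
    Ideographs block (U+4E00-U+9FFF) becomes its own token; runs of
    other alphanumeric characters become whole tokens; everything
    else acts as a separator."""
    tokens = []
    buf = []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':
            # CJK ideograph: flush any pending run, emit it alone.
            if buf:
                tokens.append(''.join(buf))
                buf = []
            tokens.append(ch)
        elif ch.isalnum():
            buf.append(ch)
        else:
            # Separator (space, punctuation, etc.): flush the run.
            if buf:
                tokens.append(''.join(buf))
                buf = []
    if buf:
        tokens.append(''.join(buf))
    return tokens

print(tokenize("SQLite支持全文搜索"))
# ['SQLite', '支', '持', '全', '文', '搜', '索']
```

Unigram tokens over-generate matches compared with real dictionary-based
segmentation, but they at least make phrase queries possible without a
language model.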
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users