https://www.sqlite.org/fts5.html says: "The unicode tokenizer classifies
all unicode characters as either "separator" or "token" characters. By
default all space and punctuation characters, as defined by Unicode 6.1,
are considered separators, and all other characters as token
characters..." So I really doubt the unicode tokenizer requires
whitespace between words; that is the ascii tokenizer.


That was why I thought it might work for CJK. 
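
The catch, though, is that CJK ideographs are themselves classified as
token characters, and running CJK text contains no separator characters
at all, so unicode61 indexes an entire run as one long token. A minimal
sketch of the effect (the table name and sample text are just for
illustration):

  -- unicode61 sees no separator characters in the CJK run, so the
  -- whole string is indexed as the single token '中文全文检索'.
  CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'unicode61');
  INSERT INTO docs(body) VALUES ('中文全文检索');

  SELECT * FROM docs WHERE docs MATCH '全文';          -- no rows
  SELECT * FROM docs WHERE docs MATCH '中文全文检索';  -- matches

A prefix query such as '中文*' can still match the long token, but there
is no way to match an interior word like '检索'.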


Qiulang


At 2018-09-21 13:03:54, "Scott Robison" <sc...@casaderobison.com> wrote:
>On Thu, Sep 20, 2018, 8:21 PM 邱朗 <qiulang2...@126.com> wrote:
>
>> Hi,
>> I had thought the Unicode61 tokenizer could support CJK (Chinese,
>> Japanese, Korean). I verified that my sqlite build supports fts5:
>>
>> {snipped}
>>
>> But to my surprise it can't find any CJK word at all. Why is that?
>
>
>Based on my experience with such things, I suspect that the tokenizer
>requires whitespace between adjacent words, which is not the case with CJK.
>Word breaks are implicit, not explicit.
>
>Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm
>for software at work that used functionality from ICU to find CJK word
>boundaries, so I believe it is possible, just not as straightforward as
>whitespace-delimited words.
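
No, unicode61 is not based on ICU; it implements the Unicode 6.1
classification on its own. SQLite does ship a separate ICU tokenizer
(enabled at compile time with SQLITE_ENABLE_ICU) that uses ICU's
word-breaking, but it is an FTS3/FTS4 feature rather than FTS5. A rough
sketch, assuming an ICU-enabled build; the table name and locale are
illustrative:

  -- Requires a build compiled with SQLITE_ENABLE_ICU.
  CREATE VIRTUAL TABLE docs_icu USING fts4(body, tokenize=icu zh_CN);
  INSERT INTO docs_icu(body) VALUES ('中文全文检索');

  -- ICU segments the CJK text into words at index time, so a
  -- word-level query has a chance to match:
  SELECT * FROM docs_icu WHERE body MATCH '检索';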