Hi Hideaki,

Thanks for your reply. It helped me figure out why I said the icu tokenizer does
"not" support Chinese: in Chinese, '中文' could reasonably be tokenized either as
'中文' or as '中' and '文', but icu indexes it only as the single token '中文'. So I
get a result when I query '中文' or '中*', but no result when I query '文'. The same
goes for '为什么', which could be tokenized either as '为什么' or as '为' and '什么', so
there is no result when I query '什么'.
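
To make that concrete, here is a minimal sketch in Python of how I understand a
single-term lookup against the icu tokens. It is only a toy model, not SQLite's
actual code path: the token list is copied from the fts3tokenize output quoted
below, and term_matches is a hypothetical helper name I made up for illustration.

```python
# Tokens icu (zh_CN) produced for the Chinese part of the sample text,
# copied from the fts3tokenize output quoted below.
ICU_TOKENS = ["为什么", "不", "支持", "中文"]


def term_matches(tokens, query):
    """Toy model of a single-term lookup; not SQLite's real code path.

    A trailing '*' means prefix match; anything else must equal a whole token.
    """
    if query.endswith("*"):
        prefix = query[:-1]
        return any(token.startswith(prefix) for token in tokens)
    return query in tokens


for query in ["中文", "中*", "文", "为什么", "什么"]:
    print(query, term_matches(ICU_TOKENS, query))
```

In this model '文' and '什么' miss because they are only suffixes of an indexed
token, which matches what I see with the real index.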

And sadly, fts5 + unicode61 definitely does not support Chinese: as your
fts5vocab output shows, it indexes the whole run '为什么不支持中文' as a single token.
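
For anyone who wants to check the unicode61 behaviour without the sqlite3
shell, the following is a rough sketch using Python's built-in sqlite3 module.
It assumes the linked SQLite library was compiled with FTS5 (the function
returns None otherwise); fts5_unicode61_hits is a hypothetical helper name, and
I use plain 'unicode61' rather than the 'porter unicode61' setup quoted below.

```python
import sqlite3


def fts5_unicode61_hits(text, queries):
    """Index one row with fts5 + unicode61 and count the rows each query matches.

    Returns None if this SQLite build has no FTS5 module.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize = 'unicode61')"
        )
    except sqlite3.OperationalError:
        conn.close()
        return None
    conn.execute("INSERT INTO ft5_test VALUES (?)", (text,))
    hits = {}
    for query in queries:
        rows = conn.execute(
            "SELECT content FROM ft5_test WHERE ft5_test MATCH ?", (query,)
        ).fetchall()
        hits[query] = len(rows)
    conn.close()
    return hits


sample = "为什么不支持中文 fts5 does not seem to work for chinese"
print(fts5_unicode61_hits(sample, ["中文", "为什么*", "chinese"]))
```

On a build with FTS5 I would expect no hit for '中文', but one hit each for the
prefix query '为什么*' and for 'chinese', because the whole Chinese run is stored
as one token.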



BTW, it also helped me realize that I had already answered this question myself
in 2014, here: https://stackoverflow.com/a/31396975/301513. So basically icu
does the same as iOS CFStringTokenizer.


Qiulang 

At 2018-09-22 22:49:24, "Hideaki Takahashi" <mym...@gmail.com> wrote:
>Hello,
>
>full text search index can be used to see how the text is tokenized for
>both FTS4 and FTS5.
>for FTS4, fts3tokenize can be used too.
>
>sqlite> CREATE VIRTUAL TABLE icu_zh_cn USING fts3tokenize(icu, zh_CN);
>sqlite> SELECT token, start, end, position FROM icu_zh_cn WHERE
>INPUT='为什么不支持中文 fts5 does not seem to work for chinese';
>为什么|0|9|0
>不|9|12|1
>支持|12|18|2
>中文|18|24|3
>fts5|25|29|4
>does|30|34|5
>not|35|38|6
>seem|39|43|7
>to|44|46|8
>work|47|51|9
>for|52|55|10
>chinese|56|63|11
>
>based on the output above, the query below works and makes sense to me.
>sqlite> select * from zh_text where text match '中文';
>为什么不支持中文 icu does not seem to work for chinese
>
>
>FTS5 + unicode61
>sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize =
>'porter unicode61 remove_diacritics 1');
>sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work
>for chinese');
>sqlite> CREATE VIRTUAL TABLE ft5_test_vocab_i USING fts5vocab(ft5_test,
>'instance');
>sqlite> SELECT term, doc, col, offset FROM ft5_test_vocab_i;
>(snip non-Chinese portion)
>为什么不支持中文|1|content|0
>
>FTS4 + ICU(zh_CN)
>sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
>sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
>chinese');
>sqlite> CREATE VIRTUAL TABLE zh_terms USING fts4aux(zh_text);
>sqlite> SELECT term, col, documents FROM zh_terms;
>(snip non-Chinese portion)
>不|*|1
>不|0|1
>中文|*|1
>中文|0|1
>为什么|*|1
>为什么|0|1
>支持|*|1
>支持|0|1
>
>Thanks,
>Hideaki
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
