Hello,

> Takayuki SHIMIZUKAWA wrote:
> FYI, the sphinx built-in search feature provides 2 language mode:'en' and
> 'ja'.

Does Sphinx have a plan to introduce a language-independent tokenizer, to 
support not only Japanese but also Chinese, Korean and Thai? Like Japanese, 
these languages do not reliably separate words with white-space. 

TinySegmenter, which is Sphinx's tokenizer for Japanese, does not work well 
for Chinese/Korean/Thai.

I tested TinySegmenter on Chinese and Korean using the TinySegmenter online demo.

TinySegmenter Online Demo:
http://chasen.org/~taku/software/TinySegmenter/

Here are the results:

北京首都国际机场 (Beijing Capital International Airport)
TinySegmenter: 北京首 | 都国 | 际机 | 场
Expected: 北京 | 首都 | 国际 | 机场

인천국제공항 (Incheon International Airport)
TinySegmenter: 인 | 천 | 국제 | 공 | 항
Expected: 인천 | 국제 | 공항

As you can see, TinySegmenter does not work well for these languages.

I think the Mozilla Thunderbird team's approach could be adapted to Sphinx 
as well. The following post describes the problem that their full-text 
search did not work for CJK, and how they solved it. 

Thunderbird 3.0 global / full-text search support for CJK languages landed,
will show up in nightlies tomorrow, requires a new database.
https://groups.google.com/forum/#!topic/mozilla.dev.apps.thunderbird/v0_gbw4LIKo

They solved it by enhancing SQLite's porter tokenizer with a bi-gram 
algorithm.
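
To illustrate the bi-gram idea, here is a minimal Python sketch. It is my 
own simplification, not Thunderbird's code and not anything in Sphinx: the 
function names (classify, bigram_tokenize) and the character ranges are 
assumptions chosen for the example. Thai is left out because it would need 
extra handling. Runs of CJK characters are split into overlapping 
two-character tokens, and other words are kept whole.

```python
from itertools import groupby


def classify(ch):
    """Classify one character as 'cjk', 'word' or 'sep'.

    The ranges below (kana, CJK Unified Ideographs, Hangul syllables)
    are a rough assumption for this sketch, not a complete table.
    """
    if ("\u3040" <= ch <= "\u30ff"
            or "\u4e00" <= ch <= "\u9fff"
            or "\uac00" <= ch <= "\ud7a3"):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def bigram_tokenize(text):
    """Split text into tokens, using overlapping bi-grams for CJK runs."""
    tokens = []
    for kind, group in groupby(text, key=classify):
        run = "".join(group)
        if kind == "cjk":
            if len(run) == 1:
                # A lone CJK character becomes its own token.
                tokens.append(run)
            else:
                # Every adjacent pair of characters becomes a token.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        elif kind == "word":
            # Non-CJK words are kept whole and lowercased.
            tokens.append(run.lower())
        # Separators (spaces, punctuation) are dropped.
    return tokens


print(bigram_tokenize("北京首都国际机场"))
# ['北京', '京首', '首都', '都国', '国际', '际机', '机场']
print(bigram_tokenize("인천국제공항"))
# ['인천', '천국', '국제', '제공', '공항']
```

If the query is tokenized the same way, a search for 机场 or 공항 matches 
the indexed bi-grams without any dictionary. The cost is a larger index 
and some false matches on pairs that cross word boundaries, such as 京首.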

SQLite's fts3_porter.c, enhanced with the bi-gram algorithm by the Mozilla 
Thunderbird team:
http://hg.mozilla.org/comm-central/file/tip/mailnews/extensions/fts3/src/fts3_porter.c

I think introducing SQLite FTS into Sphinx may be difficult and not 
appropriate, but their approach itself is worth considering as a way to 
support multi-language search.

Best regards,

