Hello,

> Takayuki SHIMIZUKAWA wrote:
> FYI, the sphinx built-in search feature provides 2 language mode:'en' and 'ja'.

Does Sphinx have a plan to introduce a language-independent tokenizer, to support not only Japanese but also Chinese, Korean, and Thai? Like Japanese, these languages do not separate words with white space.

TinySegmenter, which is Sphinx's tokenizer for Japanese, does not work well for Chinese, Korean, or Thai. I tested it on Chinese and Korean with the TinySegmenter online demo:

http://chasen.org/~taku/software/TinySegmenter/

The results are as follows:

北京首都国际机场 (Beijing Capital International Airport)
  TinySegmenter: 北京首 | 都国 | 际机 | 场
  Expected:      北京 | 首都 | 国际 | 机场

인천국제공항 (Incheon International Airport)
  TinySegmenter: 인 | 천 | 국제 | 공 | 항
  Expected:      인천 | 국제 | 공항

As you can see, TinySegmenter does not work well for these languages.

I think the Mozilla Thunderbird team's approach could be adapted to Sphinx as well. The following post describes how their full-text search did not work for CJK and how they solved it:

Thunderbird 3.0 global / full-text search support for CJK languages landed, will show up in nightlies tomorrow, requires a new database.
https://groups.google.com/forum/#!topic/mozilla.dev.apps.thunderbird/v0_gbw4LIKo

They solved it by enhancing SQLite's porter tokenizer with a bi-gram algorithm. The fts3_porter.c enhanced by the Thunderbird team is here:

http://hg.mozilla.org/comm-central/file/tip/mailnews/extensions/fts3/src/fts3_porter.c

I think introducing SQLite FTS into Sphinx may be difficult and not appropriate, but their approach itself is worth considering as a way to support multi-language search.

Best regards,
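P.S. To make the bi-gram idea concrete, here is a minimal sketch in Python. It is not the Thunderbird code and not an existing Sphinx API: the function names (is_cjk, bigram_tokenize) and the Unicode ranges are assumptions chosen for illustration, and the ranges are not exhaustive. Runs of CJK characters are split into overlapping two-character tokens; everything else is split on non-alphanumeric characters as usual.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough check for characters written without word separators.

    The ranges below are an assumption for this sketch, not a complete list.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def _kind(ch):
    # Check is_cjk first, because str.isalnum() is also true for CJK characters.
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def bigram_tokenize(text):
    """Return lowercase words for space-separated text and
    overlapping bi-grams for each run of CJK characters."""
    tokens = []
    for kind, group in groupby(text, key=_kind):
        run = "".join(group)
        if kind == "word":
            tokens.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        # Separator runs (spaces, punctuation) produce no tokens.
    return tokens


if __name__ == "__main__":
    print(bigram_tokenize("北京首都国际机场"))
    print(bigram_tokenize("인천국제공항"))
```

If the query is tokenized the same way as the document, a search for 首都 or 공항 produces a bi-gram that also appears in the index, so no dictionary or language-specific model is needed. The cost is a larger index and some false matches across word boundaries (for example 都国).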
