On Tue, 09 Jun 2009, KU Kam-ming wrote:
> however, a simple word segmentation can help.  i.e. if the UTF8
> character falls into CJK code zone, your indexer will segment the
> sentence wordly.

So do you mean that when we see an input phrase ABC, where A, B, and C
are characters from the CJK zone, we can simply index A, B, and C
separately as if they were standalone words, and then on the retrieval
side break the user query in the same way and use the boolean `and' to
find the matching records?  Would this work well for typical CJK
queries?  We would probably need to pay attention to `word' positions
for this to work really well, since otherwise a record containing the
same characters in a different order would match too.

(This seems to be what mnoGoSearch's CJK phrase segmenter does:
<http://www.mnogosearch.org/doc33/msearch-cjk.html>)
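
To make the question concrete, here is a minimal Python sketch of what I
have in mind.  It is only an illustration under my own assumptions: the
code point ranges standing in for the "CJK zone" are a rough
approximation, and the names (tokenize, build_index, search_and,
search_phrase) are hypothetical, not taken from any existing indexer or
from mnoGoSearch.

```python
import re
from collections import defaultdict

# Assumption: the "CJK zone" is approximated by a few common Unicode blocks.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls into one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize(text):
    """Split text into tokens: each CJK character alone, other words whole."""
    tokens = []
    for run in re.findall(r"\w+", text):
        buf = ""
        for ch in run:
            if is_cjk(ch):
                if buf:  # flush any non-CJK word collected so far
                    tokens.append(buf.lower())
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf.lower())
    return tokens


def build_index(docs):
    """Map token -> record id -> list of token positions in that record."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index


def search_and(index, query):
    """Boolean `and': records containing every query token, in any order."""
    toks = tokenize(query)
    if not toks:
        return set()
    return set.intersection(*(set(index.get(t, {})) for t in toks))


def search_phrase(index, query):
    """Like search_and, but the tokens must also occur at adjacent positions."""
    toks = tokenize(query)
    hits = set()
    for doc_id in search_and(index, query):
        starts = index[toks[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(toks))
               for p in starts):
            hits.add(doc_id)
    return hits


# Two made-up records that share the same characters in different order.
docs = {1: "中文信息检索", 2: "文中有信息"}
index = build_index(docs)
print(search_and(index, "中文"))     # both records match
print(search_phrase(index, "中文"))  # only record 1 matches
```

The two toy records show why positions matter: the plain boolean `and'
also returns the record where the two characters appear in reverse
order, while the position-aware variant does not.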

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>