Yes, for the moment.  This is the simplest way to do CJK segmentation,
though it is not lexically correct.  Good CJK segmentation requires
language processing (AI-based or dictionary-based), and that takes time
to implement.  In the meantime, handling CJK the way you describe below
would be the better option.

-----Original Message-----
From: Tibor Simko [mailto:[email protected]] 
Sent: Tuesday, June 09, 2009 11:09 PM
To: KU Kam-ming
Cc: project-cdsware-users (CDSware users list.)
Subject: Re: invenio indexes CJK ?

On Tue, 09 Jun 2009, KU Kam-ming wrote:
> However, a simple word segmentation can help, i.e. if a UTF-8
> character falls into the CJK code zone, your indexer will segment the
> sentence into words.

So do you mean that when we see an input phrase ABC where A, B, and C
are characters from the CJK zone, then we can simply index separately A,
B, and C as if they were standalone words, and then on the retrieval
side break the user query in the same way and use the boolean `and' to
find the matching records?  Would this work well for the typical CJK
queries?  We would probably need to pay attention to `word' positions
for this to work really well.

(This seems to be what mnoGoSearch's CJK phrase segmenter does:
<http://www.mnogosearch.org/doc33/msearch-cjk.html>)
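
The idea above can be sketched roughly as follows.  This is only an
illustration, not Invenio's or mnoGoSearch's actual code: the code point
ranges in CJK_RANGES are an approximate, non-exhaustive assumption, and
the names tokenize, build_index, search_and and search_phrase are
hypothetical.  Each CJK character is indexed as a standalone `word'
together with its position; the query is split the same way and combined
with a boolean `and', and an optional position check keeps only the
records where the characters are adjacent.

```python
import re
from collections import defaultdict

# Approximate CJK code point ranges (an assumption, not exhaustive):
# CJK Unified Ideographs, Extension A, Hiragana/Katakana, Hangul syllables.
CJK_RANGES = (
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0x3040, 0x30FF),
    (0xAC00, 0xD7AF),
)


def is_cjk(ch):
    """Return True if the character falls into one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize(text):
    """Split text into tokens: every CJK character becomes its own token,
    while other runs of word characters are kept whole and lowercased."""
    tokens = []
    for run in re.findall(r"\w+", text):
        buf = ""
        for ch in run:
            if is_cjk(ch):
                if buf:
                    tokens.append(buf.lower())
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf.lower())
    return tokens


def build_index(docs):
    """Build a positional index: token -> record id -> list of positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index


def search_and(index, query):
    """Records containing every query token (boolean AND, order ignored)."""
    toks = tokenize(query)
    if not toks:
        return set()
    return set.intersection(*(set(index.get(t, {})) for t in toks))


def search_phrase(index, query):
    """Records where the query tokens occur at consecutive positions."""
    toks = tokenize(query)
    hits = set()
    for doc_id in search_and(index, query):
        starts = index[toks[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(toks))
               for p in starts):
            hits.add(doc_id)
    return hits


docs = {1: "中国人民", 2: "人民中国", 3: "CERN 中文 search"}
index = build_index(docs)
and_hits = search_and(index, "国人")        # both records contain 国 and 人
phrase_hits = search_phrase(index, "国人")  # only record 1 has them adjacent
```

In this toy data the plain boolean `and' also returns record 2, where the
two characters are not next to each other, which is why the `word'
positions matter for precision.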

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
