On Tue, 09 Jun 2009, KU Kam-ming wrote:

> however, a simple word segmentation can help. i.e. if the UTF8
> character falls into CJK code zone, your indexer will segment the
> sentence wordly.

So do you mean that when we see an input phrase ABC, where A, B, and C
are characters from the CJK zone, we can simply index A, B, and C
separately as if they were standalone words, and then on the retrieval
side break the user query in the same way and use the boolean `and' to
find the matching records?

Would this work well for typical CJK queries? We would probably need
to pay attention to `word' positions for this to work really well.
(This seems to be what mnoGoSearch's CJK phrase segmenter does:
<http://www.mnogosearch.org/doc33/msearch-cjk.html>)

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
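
To make the question above concrete, here is a minimal sketch of the
per-character scheme: every CJK character is indexed as its own `word'
together with its position, and a query is split the same way. It is
only an illustration under stated assumptions, not mnoGoSearch's or any
existing indexer's code. The names (CJK_RANGES, tokenize, build_index,
search_and, search_phrase) are hypothetical, and the `CJK zone' is
assumed to be just three Unicode blocks.

```python
from collections import defaultdict

# Assumption: the "CJK zone" is limited to these blocks; a real indexer
# would likely cover more (extensions, compatibility ideographs, etc.).
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls into one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize(text):
    """Return (position, token) pairs.

    Each CJK character becomes its own token; other text is split on
    whitespace and lowercased.
    """
    tokens, buf = [], []
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            if buf:  # flush the pending non-CJK word
                tokens.append("".join(buf))
                buf = []
            if is_cjk(ch):
                tokens.append(ch)
        else:
            buf.append(ch.lower())
    if buf:
        tokens.append("".join(buf))
    return list(enumerate(tokens))


def build_index(records):
    """Build token -> record id -> list of positions."""
    index = defaultdict(lambda: defaultdict(list))
    for rec_id, text in records.items():
        for pos, tok in tokenize(text):
            index[tok][rec_id].append(pos)
    return index


def search_and(index, query):
    """Boolean `and': records containing every query token, in any order."""
    toks = [t for _, t in tokenize(query)]
    if not toks:
        return set()
    return set.intersection(*(set(index.get(t, {})) for t in toks))


def search_phrase(index, query):
    """Like search_and, but the tokens must occur at consecutive positions."""
    toks = [t for _, t in tokenize(query)]
    hits = set()
    for rec_id in search_and(index, query):
        for start in index[toks[0]][rec_id]:
            if all(start + i in index[toks[i]][rec_id]
                   for i in range(1, len(toks))):
                hits.add(rec_id)
                break
    return hits


records = {1: "北京大学", 2: "大北学京", 3: "CERN 大学"}
index = build_index(records)
and_hits = search_and(index, "大学")        # {1, 2, 3}
phrase_hits = search_phrase(index, "大学")  # {1, 3}
```

In this toy data the plain boolean `and' also returns record 2, where
the two characters occur but are not adjacent, while the position-aware
variant returns only records 1 and 3.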
