Hi,

   We have been using Lucene to index Chinese text for our KM system, with 
a Chinese text analyzer that we added ourselves. However, the current 
implementation is a hack that glues everything together.

   We are rearchitecting this and would like to get some suggestions from 
the group.

   Before that, let me first explain the difficulties we face in the 
analysis of Chinese text. Unlike English, Chinese and most other Asian 
languages, such as Japanese, Korean, and Thai, do not have clear word 
boundaries. An additional component, called a segmentor, needs to be 
implemented to split the string into a list of words.
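To make the problem concrete, here is a minimal sketch in Java of what a segmentor's job looks like. The `Segmentor` interface and the per-character fallback below are names invented for this illustration (one token per ideograph is roughly what you get without a dictionary); a real dictionary-based segmentor would replace it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface, invented for this sketch: turns unsegmented
// text into a list of word tokens.
interface Segmentor {
    List<String> segment(String text);
}

// Naive fallback: one token per CJK ideograph, runs of Latin letters and
// digits kept together, everything else treated as a separator.
class PerCharacterSegmentor implements Segmentor {
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder latin = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.UnicodeBlock.of(c)
                    == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                if (latin.length() > 0) {
                    tokens.add(latin.toString());
                    latin.setLength(0);
                }
                tokens.add(String.valueOf(c));  // each ideograph on its own
            } else if (Character.isLetterOrDigit(c)) {
                latin.append(c);                // accumulate Latin runs
            } else if (latin.length() > 0) {
                tokens.add(latin.toString());   // separator flushes the run
                latin.setLength(0);
            }
        }
        if (latin.length() > 0) tokens.add(latin.toString());
        return tokens;
    }
}
```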

   The question is where this segmentor should go. Currently, we are 
designing it to sit below the analyzer, so that each token passed to the 
Analyzer is either a word or a symbol token, to which filtering can then 
be applied.
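A rough sketch of that layering, with invented names (this is not Lucene's actual Analyzer/TokenStream API, and the whitespace splitter merely stands in for the Chinese segmentor):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the proposed layering: the segmentor sits below the analyzer,
// so by the time tokens reach the filter chain they are already words,
// and ordinary filters apply unchanged.
class SegmentingAnalyzer {
    // Stand-in for the real Chinese segmentor: splits on whitespace.
    static List<String> segment(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // The analyzer only sees word tokens, so generic filters just work.
    static List<String> analyze(String text) {
        return segment(text).stream()
                .map(String::toLowerCase)     // e.g. a lowercase filter
                .filter(t -> t.length() > 1)  // e.g. drop one-char symbols
                .collect(Collectors.toList());
    }
}
```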

   I think this is a good place to put the segmentor.

   In addition to our Chinese segmentor, IBM's ICU4J has a 
BreakIterator, which offers both rule-based and dictionary-based break 
iterators and currently supports Thai.

   Functionally, the BreakIterator is very closely related to the 
segmentor, so we are looking into the possibility of integrating our 
segmentor into the BreakIterator framework.
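For anyone unfamiliar with the API, here is what word segmentation through a break iterator looks like, using the JDK's java.text.BreakIterator, which shares its design with ICU4J's com.ibm.text.BreakIterator:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Extract word tokens from text by walking the boundaries that a
// word-mode BreakIterator reports.
class BreakIteratorDemo {
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep only
            // pieces that contain a letter or digit.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }
}
```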

   If we have Lucene use ICU4J at the bottom, Lucene could become a 
search engine capable of handling as many languages as ICU4J supports.

I'd like to know what the group thinks of this idea.

Thanks

ICU4J:

http://oss.software.ibm.com/developerworks/opensource/icu4j/index.html

BreakIterator:

http://oss.software.ibm.com/icu4j/doc/com/ibm/text/BreakIterator.html

David Li
DigitalSesame


_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-dev
