I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) is often written without spaces between
words, so segmenting text into words is more difficult than simply
splitting on whitespace.
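
To illustrate the problem, here is a minimal sketch of one possible approach: greedy longest-match segmentation against a word list. This is only an assumption about how segmentation might be done, not something Nutch or Lucene provides. The class name LongestMatchSegmenter and the two-word dictionary are hypothetical, and a real segmenter would need a much larger dictionary and proper handling of syllable boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: greedy longest-match word segmentation.
// Not part of Nutch or Lucene; the dictionary below is illustrative only.
public class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    public LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Splits text into tokens, preferring the longest dictionary word
    // starting at each position. Characters that start no known word
    // are emitted as single-character tokens. Whitespace is skipped.
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
                continue;
            }
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word
            // or only one character is left.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two illustrative words, written as Unicode escapes:
        // "Myanmar" (6 code units) and "text/letter" (2 code units).
        String myanmar = "\u1019\u103C\u1014\u103A\u1019\u102C";
        String text = "\u1005\u102C";
        Set<String> words = new HashSet<>(Arrays.asList(myanmar, text));
        LongestMatchSegmenter segmenter = new LongestMatchSegmenter(words);

        String unspaced = myanmar + text; // no space between the words
        System.out.println("whitespace split: " + unspaced.split("\\s+").length + " token(s)");
        System.out.println("longest match:    " + segmenter.segment(unspaced).size() + " token(s)");
    }
}
```

Splitting the unspaced input on whitespace yields the whole string as a single token, while the dictionary-based sketch separates it into the two words. Greedy matching is simple but can choose the wrong boundary when words overlap, so it is shown here only to make the problem concrete.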

I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer in org.apache.lucene.analysis, but what is needed to use this
within Nutch? Are there examples of other non-whitespace Tokenizers
being used in Nutch? I notice there is a translation for Thai, but I
couldn't find any Thai-specific segmentation.

Out of the box, Nutch seems able to search space-delimited Myanmar, but
it is usually unable to pick out words that are not separated by spaces.

Presumably, I'll need to adapt the code in net.nutch.analysis, but are
there other areas that I need to look at as well? Any tips would be much
appreciated.

Thanks,
Keith Stribley


