I am interested in adding support to Nutch for searching Myanmar language text. Myanmar (Burmese) is often written without spaces between words, so segmenting text into words is harder than simply splitting on whitespace.
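To make the segmentation problem concrete, here is a minimal sketch of greedy longest-match segmentation against a word list. It is only an illustration under stated assumptions: the class name `LongestMatchSegmenter`, the `segment` method, and the toy dictionary are hypothetical, the code does not use any Lucene or Nutch API, and unspaced Latin text stands in for Myanmar text. A real Myanmar segmenter would probably need syllable-aware rules and a proper lexicon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Nutch.
final class LongestMatchSegmenter {

    // Splits text into words by taking, at each position, the longest
    // dictionary entry that starts there. Characters that start no known
    // word are emitted as single-character tokens. Whitespace is skipped.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
                continue;
            }
            int limit = Math.min(text.length(), pos + maxWordLength);
            int matchEnd = -1;
            // Try the longest candidate first, then shrink.
            for (int end = limit; end > pos; end--) {
                if (dictionary.contains(text.substring(pos, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd < 0) {
                matchEnd = pos + 1; // unknown character: emit it alone
            }
            words.add(text.substring(pos, matchEnd));
            pos = matchEnd;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary; Latin text is a stand-in for unspaced Myanmar text.
        Set<String> dictionary = new HashSet<>(Arrays.asList("nutch", "search", "text"));
        System.out.println(segment("nutchsearchtext", dictionary, 6));
        // prints [nutch, search, text]
    }
}
```

A segmenter along these lines would presumably sit inside a custom Tokenizer so that each segmented word is emitted as a token; how that is wired into Nutch is exactly the open question here.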
I assume that I need to start by creating a Myanmar Tokenizer and Analyzer in org.apache.lucene.analysis, but what is needed to use these within Nutch? Are there examples of other non-whitespace Tokenizers being used in Nutch? I notice there is a translation for Thai, but I couldn't find any Thai-specific segmentation.

Out of the box, Nutch seems able to search space-delimited Myanmar text, but it is usually unable to pick out words that are not separated by spaces. Presumably I'll need to adapt the code in net.nutch.analysis, but are there other areas that I need to look at as well?

Any tips would be much appreciated.

Thanks,
Keith Stribley
