The Bayes algorithms favor sparse data with large numbers of potential
features.  Text is one example of such data.

Using Naive Bayes with Unicode text should be fine.  The simplest method
for processing CJK text is to use character unigrams and bigrams as the
terms.  This works very well in retrieval systems.  I haven't heard whether
it works for classification, but I expect it would.
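
As a rough sketch of what character unigrams and bigrams look like, here is
a few lines of plain Python.  This is not Mahout code, and the function name
char_ngrams is made up for illustration.  It assumes the text has already
been decoded to a Unicode string, and it splits on whitespace first so that
no bigram spans a space.

```python
def char_ngrams(text, n_values=(1, 2)):
    """Return character n-grams of `text` for each n in `n_values`.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    tokens = []
    # Split on whitespace first so that n-grams never span a space.
    for chunk in text.split():
        for n in n_values:
            # Slide a window of width n across the chunk.
            for i in range(len(chunk) - n + 1):
                tokens.append(chunk[i:i + n])
    return tokens


if __name__ == "__main__":
    # Four characters give four unigrams followed by three bigrams.
    print(char_ngrams("机器学习"))
```

The resulting tokens would then be treated as the terms of the document,
the same way words are for English text.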

On Sun, Jan 1, 2012 at 7:02 PM, Lingxiang Cheng <[email protected]> wrote:

> It's interesting that the Bayes algorithms in Mahout strongly favor text
> data over numeric data. I am thinking about using them to categorize
> Chinese websites. Has anyone used them to process Unicode text?
