Hi!

I'm testing classification with the CBayes (and Bayes) algorithms and I'm
having an issue when I try to classify a document containing words (features)
that don't exist in my model. For example, if I classify a document with a
single unknown word, it returns a constant score (12.386649147018964) for all
labels instead of returning the unknown label.

After looking at the CBayesAlgorithm class, I wrote my own subclass and
overrode the "featureWeight" function to return 0 if the weight of the
feature in the current label is 0, instead of returning the theta-normalized
weight. That fixed the problem in my case.
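
To make the idea concrete, here is a rough, self-contained sketch. It does not
use the Mahout classes at all: BaseScorer, ZeroUnknownScorer, trainToy, the
SMOOTHING constant and the weight formula are all made up for illustration, and
the real "featureWeight" signature may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class UnknownFeatureSketch {

    /** Toy scorer. The smoothing formula is invented; it is NOT Mahout's. */
    static class BaseScorer {
        static final double SMOOTHING = 1.0; // hypothetical smoothing constant
        private final Map<String, Map<String, Double>> rawWeights = new HashMap<>();
        private final Map<String, Double> normalizers = new HashMap<>();

        void train(String label, String feature, double weight) {
            rawWeights.computeIfAbsent(label, k -> new HashMap<>())
                      .merge(feature, weight, Double::sum);
            normalizers.merge(label, weight, Double::sum);
        }

        /** Raw weight of a feature for a label, 0 if it was never seen. */
        double rawWeight(String label, String feature) {
            Map<String, Double> perLabel = rawWeights.get(label);
            return perLabel == null ? 0.0 : perLabel.getOrDefault(feature, 0.0);
        }

        /** Smoothed, normalized weight: positive even for unseen features. */
        double featureWeight(String label, String feature) {
            return (rawWeight(label, feature) + SMOOTHING)
                    / normalizers.getOrDefault(label, 1.0);
        }

        /** Document score for a label: sum of its feature weights. */
        double score(String label, String[] document) {
            double sum = 0.0;
            for (String feature : document) {
                sum += featureWeight(label, feature);
            }
            return sum;
        }
    }

    /** The workaround: a feature with zero raw weight contributes nothing. */
    static class ZeroUnknownScorer extends BaseScorer {
        @Override
        double featureWeight(String label, String feature) {
            if (rawWeight(label, feature) == 0.0) {
                return 0.0;
            }
            return super.featureWeight(label, feature);
        }
    }

    /** Picks the highest-scoring label, or defaultLabel if no score is above 0. */
    static String classify(BaseScorer scorer, String[] labels,
                           String[] document, String defaultLabel) {
        String best = defaultLabel;
        double bestScore = 0.0;
        for (String label : labels) {
            double s = scorer.score(label, document);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    /** Loads a tiny made-up training set into the given scorer. */
    static <T extends BaseScorer> T trainToy(T scorer) {
        scorer.train("sports", "goal", 2.0);
        scorer.train("sports", "team", 1.0);
        scorer.train("politics", "vote", 2.0);
        scorer.train("politics", "law", 1.0);
        return scorer;
    }

    public static void main(String[] args) {
        String[] labels = {"sports", "politics"};
        String[] unseen = {"zebra"}; // word that is not in the toy model
        System.out.println(classify(trainToy(new BaseScorer()), labels, unseen, "unknown"));
        System.out.println(classify(trainToy(new ZeroUnknownScorer()), labels, unseen, "unknown"));
    }
}
```

With this toy data, the base scorer gives both labels the same non-zero score
for the unseen word, so a label gets picked arbitrarily, while the overriding
scorer gives 0 everywhere and the default "unknown" label is returned.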

My guess is that most classification examples are built on fairly large
datasets (Wikipedia, newsgroups) with a huge vocabulary. My dataset doesn't
have a complete vocabulary, which causes problems with unknown words.

Should I file an issue? Is this a known / expected behavior?

Thanks!

André-Philippe Paquet
