Hi,
after my initial experiments with Mahout's Bayes and CBayes
implementations on my company's dataset, we're now trying to integrate
Mahout to classify our data in a production environment, having
successfully trained a classifier (using CBayes).
We're loading the trained model into an InMemoryBayesDatastore and
are able to get classification results (i.e. categories plus weights).
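For reference, our classification path looks roughly like the sketch
below; the model path, gram size and tokens are placeholders, and the
package names are from the 0.x Bayes API layout as we understand it,
so take those with a grain of salt:

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.algorithm.CBayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;

public class ClassifySketch {
  public static void main(String[] args) throws Exception {
    // Point the datastore at the model the (C)Bayes trainer wrote.
    BayesParameters params = new BayesParameters(1);   // gram size used at training time
    params.set("basePath", "/path/to/model");          // placeholder path
    params.set("classifierType", "cbayes");

    Algorithm algorithm = new CBayesAlgorithm();
    Datastore datastore = new InMemoryBayesDatastore(params);
    ClassifierContext classifier = new ClassifierContext(algorithm, datastore);
    classifier.initialize();                           // loads the model into memory

    // Pre-tokenized document; "unknown" is the fallback category.
    String[] tokens = {"some", "pretokenized", "text"};
    for (ClassifierResult r : classifier.classifyDocument(tokens, "unknown", 5)) {
      System.out.println(r.getLabel() + " => " + r.getScore());
    }
  }
}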
However, we're seeing two odd issues:
1) the classifier's memory use grows with every document we classify;
as a result, after classifying enough documents we run into memory
problems (see the repro sketch after this list).
2) somehow, classification is not consistent: if we classify texts 1,
2 and 3, and then text 1 again, the second pass over text 1 yields
slightly different weights - not by a lot, but by too much to dismiss
as floating-point rounding. If we classify text 1 twice in a row,
without classifying any other texts in between, the weights do not
change.
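For completeness, here is a minimal sketch of how both issues show up
for us (the classifier is set up as in the sketch above; docA and docB
are placeholder token arrays, and the method names are ours):

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;

public class OddIssuesRepro {

  // Issue 1: memory grows per classified document. Feeding many distinct
  // documents through the classifier eventually exhausts the heap.
  static void memoryGrowth(ClassifierContext classifier, Iterable<String[]> docs)
      throws Exception {
    for (String[] doc : docs) {
      classifier.classifyDocument(doc, "unknown");
      // heap use climbs steadily here, even though we keep no references
    }
  }

  // Issue 2: scores for the same document change once other documents
  // have been classified in between.
  static void inconsistency(ClassifierContext classifier, String[] docA, String[] docB)
      throws Exception {
    ClassifierResult first = classifier.classifyDocument(docA, "unknown");
    classifier.classifyDocument(docB, "unknown");   // unrelated classification in between
    ClassifierResult second = classifier.classifyDocument(docA, "unknown");
    // Same input, so we'd expect identical scores; in practice they differ slightly.
    System.out.println(first.getScore() + " vs " + second.getScore());
  }
}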
My colleagues and I have looked at the Mahout code, and the memory
growth seems to come from getLabelID in InMemoryBayesDatastore: it
adds a label to the dictionary if it's not in there yet, but never
seems to remove anything from it. Could this be the source of the
memory issue? I can imagine that interning words that occur in a text
to be classified but were not in the model would increase memory use,
but that probably shouldn't be happening at all, as this is
classification, not training.
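To illustrate, here is our paraphrase of the pattern we suspect
(illustrative only, not Mahout's actual source):

import java.util.HashMap;
import java.util.Map;

// A read-path lookup that interns unseen keys, so merely classifying a
// document mutates the dictionary and grows it permanently.
class GrowingDictionary {
  private final Map<String, Integer> ids = new HashMap<String, Integer>();

  int getId(String key) {
    Integer id = ids.get(key);
    if (id == null) {
      id = ids.size();   // assign a fresh id to the unseen key...
      ids.put(key, id);  // ...and never remove it, even for query-only terms
    }
    return id;
  }
}

If the weights are normalized against totals that include these newly
interned entries, that could also explain issue 2: classifying texts 2
and 3 would intern new terms and shift the normalization, changing
text 1's scores on the second pass, while classifying text 1 twice in
a row interns nothing new the second time. But that's speculation on
our part.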
Any thoughts on these two issues, whether they're related, and what to
do about them?
Robin, I suspect (and hope) you're able to help here?
Regards,
Loek