Hi,

After my initial experiments with Mahout's Bayes and CBayes implementations on my company's dataset, we are now trying to integrate Mahout to classify our data in a production environment. However, after successfully training a classifier (using CBayes), we are running into two odd issues.

We load the trained model into an InMemoryBayesDatastore and are able to get classification results (i.e. categories plus weights). The two issues we see are:

1) The classifier's memory use grows with every document it classifies; after enough documents, we run into out-of-memory errors.

2) Classification is not consistent: if we classify texts 1, 2, 3, and then text 1 again, the second pass over text 1 yields slightly different weights. The differences are small, but too large to dismiss as floating-point rounding. If we classify text 1 twice in a row, with no other texts in between, the weights do not change.
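To make the second issue concrete, here is roughly what our classification loop looks like. I'm typing this from memory of the 0.x Bayes API, so package names and parameter keys may be slightly off, and the model path is a placeholder:

    import org.apache.mahout.classifier.ClassifierResult;
    import org.apache.mahout.classifier.bayes.algorithm.CBayesAlgorithm;
    import org.apache.mahout.classifier.bayes.common.BayesParameters;
    import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
    import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
    import org.apache.mahout.classifier.bayes.interfaces.Datastore;
    import org.apache.mahout.classifier.bayes.model.ClassifierContext;

    public class RepeatClassification {
      public static void main(String[] args) throws Exception {
        BayesParameters params = new BayesParameters();
        params.set("basePath", "/path/to/trained/model");  // our trained CBayes model

        Algorithm algorithm = new CBayesAlgorithm();
        Datastore datastore = new InMemoryBayesDatastore(params);
        ClassifierContext classifier = new ClassifierContext(algorithm, datastore);
        classifier.initialize();

        String[] text1 = {"some", "tokenized", "document"};
        String[] text2 = {"a", "completely", "different", "document"};

        ClassifierResult first = classifier.classifyDocument(text1, "unknown");
        classifier.classifyDocument(text2, "unknown");  // intervening classification
        ClassifierResult second = classifier.classifyDocument(text1, "unknown");

        // Expected: identical label and score for the same input.
        // Observed: the scores differ slightly whenever another
        // document was classified in between.
        System.out.println(first.getLabel() + " " + first.getScore());
        System.out.println(second.getLabel() + " " + second.getScore());
      }
    }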

My colleagues and I have looked at the Mahout code, and it seems the memory growth is due to getLabelID in InMemoryBayesDatastore: it adds a label to a dictionary if the label is not there yet, but never removes anything from the dictionary. Could this be the source of the memory issue? I can imagine that adding words that occur in the texts being classified but were absent from the model would increase memory use, but that presumably should not be happening during classification (as opposed to training).
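For reference, the pattern we think we're seeing is roughly the following. This is my paraphrase for illustration, not the actual Mahout source, and the names are approximate:

    import java.util.HashMap;
    import java.util.Map;

    class LabelDictionarySketch {
      private final Map<String, Integer> labelToId = new HashMap<String, Integer>();

      // Looks up an id, but also mutates the dictionary: every previously
      // unseen key grows the map, and nothing is ever removed.
      int getLabelID(String label) {
        Integer id = labelToId.get(label);
        if (id == null) {
          id = labelToId.size();     // assign the next free id
          labelToId.put(label, id);  // side effect during what should be read-only classification
        }
        return id;
      }
    }

If unseen words from the documents being classified go through a path like this, that would explain the unbounded memory growth, and perhaps even the shifting weights, since the datastore's state would change between calls.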

Any thoughts on these two issues, whether they're related, and what to do about them?

Robin, I suspect (and hope) you may be able to help here.

Regards,
Loek
