Hi,
after my initial experiments with Mahout's Bayes and CBayes
implementations on my company's dataset, we're now trying to integrate
Mahout to classify our data in a production environment, having
successfully trained a classifier (using CBayes).
We're loading the trained model into an InMemoryBayesDatastore and
are able to get classification results (i.e. categories plus weights).
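For reference, our classification path looks roughly like the sketch
below; the model path, gram size and tokens are placeholders, and the
package names are from the 0.x Bayes API layout as we understand it,
so take those with a grain of salt:

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.algorithm.CBayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;

public class ClassifySketch {
  public static void main(String[] args) throws Exception {
    // Point the datastore at the model the (C)Bayes trainer wrote.
    BayesParameters params = new BayesParameters(1);   // gram size used at training time
    params.set("basePath", "/path/to/model");          // placeholder path
    params.set("classifierType", "cbayes");

    Algorithm algorithm = new CBayesAlgorithm();
    Datastore datastore = new InMemoryBayesDatastore(params);
    ClassifierContext classifier = new ClassifierContext(algorithm, datastore);
    classifier.initialize();                           // loads the model into memory

    // Pre-tokenized document; "unknown" is the fallback category.
    String[] tokens = {"some", "pretokenized", "text"};
    for (ClassifierResult r : classifier.classifyDocument(tokens, "unknown", 5)) {
      System.out.println(r.getLabel() + " => " + r.getScore());
    }
  }
}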
However, we're seeing two odd issues:
1) the classifier's memory use grows with every document we classify;
as a result, after classifying enough documents we run into memory
problems (see the repro sketch after this list).
2) somehow, classification is not consistent: if we classify texts 1,
2 and 3, and then text 1 again, the second pass over text 1 yields
slightly different weights - not by a lot, but by too much to dismiss
as floating-point rounding. If we classify text 1 twice in a row,
without classifying any other texts in between, the weights do not
change.
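For completeness, here is a minimal sketch of how both issues show up
for us (the classifier is set up as in the sketch above; docA and docB
are placeholder token arrays, and the method names are ours):

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;

public class OddIssuesRepro {

  // Issue 1: memory grows per classified document. Feeding many distinct
  // documents through the classifier eventually exhausts the heap.
  static void memoryGrowth(ClassifierContext classifier, Iterable<String[]> docs)
      throws Exception {
    for (String[] doc : docs) {
      classifier.classifyDocument(doc, "unknown");
      // heap use climbs steadily here, even though we keep no references
    }
  }

  // Issue 2: scores for the same document change once other documents
  // have been classified in between.
  static void inconsistency(ClassifierContext classifier, String[] docA, String[] docB)
      throws Exception {
    ClassifierResult first = classifier.classifyDocument(docA, "unknown");
    classifier.classifyDocument(docB, "unknown");   // unrelated classification in between
    ClassifierResult second = classifier.classifyDocument(docA, "unknown");
    // Same input, so we'd expect identical scores; in practice they differ slightly.
    System.out.println(first.getScore() + " vs " + second.getScore());
  }
}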
My colleagues and I have looked at the Mahout code, and the memory
growth seems to come from getLabelID in InMemoryBayesDatastore: it
adds a label to the dictionary if it's not in there yet, but never
seems to remove anything from it. Could this be the source of the
memory issue? I can imagine that interning words that occur in a text
to be classified but were not in the model would increase memory use,
but that probably shouldn't be happening at all, as this is
classification, not training.
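To illustrate, here is our paraphrase of the pattern we suspect
(illustrative only, not Mahout's actual source):

import java.util.HashMap;
import java.util.Map;

// A read-path lookup that interns unseen keys, so merely classifying a
// document mutates the dictionary and grows it permanently.
class GrowingDictionary {
  private final Map<String, Integer> ids = new HashMap<String, Integer>();

  int getId(String key) {
    Integer id = ids.get(key);
    if (id == null) {
      id = ids.size();   // assign a fresh id to the unseen key...
      ids.put(key, id);  // ...and never remove it, even for query-only terms
    }
    return id;
  }
}

If the weights are normalized against totals that include these newly
interned entries, that could also explain issue 2: classifying texts 2
and 3 would intern new terms and shift the normalization, changing
text 1's scores on the second pass, while classifying text 1 twice in
a row interns nothing new the second time. But that's speculation on
our part.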
Any thoughts on these two issues, whether they're related, and what to
do about them?
Robin, I suspect (and hope) you're able to help here?
Regards,
Loek