On Tue, Mar 30, 2010 at 10:45 PM, Loek Cleophas <[email protected]> wrote:
> Hi,
>
> after my initial experiments with Mahout's Bayes and CBayes implementations
> on my company's dataset, we're now trying to integrate Mahout to classify
> our data in a production environment. We are however running into two odd
> issues, after having successfully trained a classifier (using CBayes).
>
> We're loading the trained model into an InMemoryBayesDatastore, and are
> able to get classification results (i.e. categories plus weights). However,
> we're seeing two odd issues:
>
> 1) it turns out the classifier's memory use increases by classifying a
> document; as a result, after a number of documents to classify, we run into
> memory issues.
> 2) somehow, classification is not consistent: e.g. if we classify texts 1,
> 2, 3, and then 1 again, the second time text 1 is fed we get slightly
> different weights - not by a lot, but not by little enough to discard it as
> floating-point rounding issues; whereas if we classify text 1 and then 1 again
> without any intermediate classification of other texts, the weights do not
> change.

This shouldn't be happening. I mean, there is nothing getting changed in
there.

> My colleagues and I have looked at the Mahout code, and it seems that the
> memory use increase is due to getLabelID in InMemoryBayesDatastore - which
> adds a label to a dictionary if it's not in there yet, but never seems to
> remove any labels from the dictionary. Could this be the source of the
> memory issue? I can imagine that if you're adding words that were not in the
> model but occur in text to be classified, this might increase memory use, but
> that probably shouldn't be happening (as it's classification, not training).

The number of labels is fixed, right? Anywhere from two or three up to a few
thousand, and not more. I don't see why that should be a reason for alarm. If
there is a leak, it could be because of something else.

> Any thoughts on these two issues, whether they're related, and what to do
> about them?
> Robin, I suspect/hope you're able to help here?

Can you tell me the model size in memory after the first load, the dictionary
size, the label size, and the increment in memory usage after a
classification?

> Regards,
> Loek

Robin
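For reference, the pattern Loek describes can be sketched in a few lines. This is a hypothetical illustration, not Mahout's actual code: a lookup method that inserts any unseen key grows without bound if it is also called on the classification path with terms that never appeared during training, whereas a read-only lookup does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the suspected pattern (class and method names are
// illustrative, not Mahout's real API).
public class LabelLookupSketch {
    private final Map<String, Integer> dictionary = new HashMap<>();

    // Mutating lookup: assigns a fresh id to any unseen key, so calling it
    // during classification grows the dictionary on every unknown term.
    int getLabelID(String label) {
        return dictionary.computeIfAbsent(label, k -> dictionary.size());
    }

    // Read-only lookup suitable for classification: unknown keys return -1
    // and the dictionary does not grow.
    int lookupLabelID(String label) {
        Integer id = dictionary.get(label);
        return id == null ? -1 : id;
    }

    int dictionarySize() {
        return dictionary.size();
    }

    public static void main(String[] args) {
        LabelLookupSketch store = new LabelLookupSketch();
        store.getLabelID("sports");    // training time: dictionary grows
        store.getLabelID("politics");

        int before = store.dictionarySize();
        store.lookupLabelID("unseen"); // read-only: no growth
        System.out.println(store.dictionarySize() == before); // true

        store.getLabelID("unseen");    // mutating lookup: grows again
        System.out.println(store.dictionarySize());           // 3
    }
}
```

If the label set really is fixed, as Robin notes, the label dictionary alone cannot account for unbounded growth; the same mutating-lookup pattern applied to the much larger, open-ended *feature* (word) dictionary would, which is worth ruling out when profiling.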
