Just solved it. You're right about resetting the datastore.
During classification, every word that is still unknown is added to
featureDictionary. This leads to excessive growth when many texts
containing unknown words are classified. The inconsistency is caused
by a "vocabCount" that is not reset after each classification:
featureDictionary.size() is used as "vocabCount", and it increases
every time new unknown words are discovered.
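To see why the growing dictionary also breaks reproducibility, here is
a minimal sketch (not Mahout code; the actual weighting in the Bayes
algorithm differs, but vocabCount ends up in a smoothing denominator of
roughly this form):

// Minimal sketch, not Mahout code: a generic add-one (Laplace) smoothed
// estimate of the kind that vocabCount typically feeds into.
public final class VocabCountEffect {

  static double smoothed(double featureLabelCount, double labelWeight,
                         double vocabCount) {
    return (featureLabelCount + 1.0) / (labelWeight + vocabCount);
  }

  public static void main(String[] args) {
    // Identical counts for the same word and label, but vocabCount has
    // grown between two classifications because unknown words kept
    // being added to the dictionary.
    System.out.println(smoothed(3, 100, 50000)); // first document
    System.out.println(smoothed(3, 100, 50200)); // a later, identical document
    // The two estimates differ, so the same text no longer gets the
    // same label scores.
  }
}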
My suggestions for a patch:
1) Add to Datastore: *public void reset();*
2) Call this method in ClassifierContext, in both classifyDocument(...)
methods; it should be the first statement (see the sketch after this list).
3) Changes to InMemoryBayesDatastore:
3a) Add (together with imports for java.util.Set and java.util.HashSet,
if they are not already there):

private boolean keepState;
// initialized here so the changed vocabCount (see 3b) also works
// before the first reset()
private Set<String> temporaryNewFeatures = new HashSet<String>();

public void reset() {
  // drop the unknown words collected while classifying the previous document
  temporaryNewFeatures = new HashSet<String>();
}
3b) Change line 119 from *return featureDictionary.size();* to
*return featureDictionary.size() + temporaryNewFeatures.size();*
3c) Change method getFeatureID to:
private int getFeatureID(String feature) {
  if (featureDictionary.containsKey(feature)) {
    return featureDictionary.get(feature);
  }
  if (keepState) {
    // model already loaded: count the unknown word only towards this
    // document's vocabCount, without growing featureDictionary
    temporaryNewFeatures.add(feature);
    return -1;
  }
  // still loading the model: assign a permanent id as before
  int id = featureDictionary.size();
  featureDictionary.put(feature, id);
  return id;
}
3d) Finally, add *keepState = true;* as the last statement of initialize().
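For step 2, this is roughly what the call site could look like (a
sketch only; the exact classifyDocument(...) signatures and field names
in ClassifierContext may differ in your checkout):

// Sketch of step 2 above, assuming ClassifierContext keeps references
// to the algorithm and datastore it was constructed with.
public ClassifierResult classifyDocument(String[] document, String defaultCategory)
    throws InvalidDatastoreException {
  // first statement: forget the unknown words collected for the
  // previously classified document
  datastore.reset();
  return algorithm.classifyDocument(document, datastore, defaultCategory);
}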
This worked for us; at least the consistency problem is solved. The
first classified document returns the same result as it would without
these changes, and subsequently classified documents get slightly
different label scores (compared to before the changes), which makes
perfect sense.
I'm sorry for not making a proper patch myself, but I'm still having
trouble setting up Mahout in Eclipse, so I made the changes by
copy-pasting the complete classes and modifying them slightly.
Sean Owen wrote:
I'm no expert here but isn't it more likely the memory consumption is
coming from the feature dictionary, and feature-label matrix? Because
the label count indeed is fixed and small.
This is a great job for a profiler like JProfiler -- it would let you
easily see where the heap is being consumed. A bit more manual, but
free, is the jmap tool in Java:
http://www.startux.de/index.php/java/45-java-heap-dump
A quick moment with this might easily demonstrate what's taking the
memory.
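For example, jmap -histo <pid> prints a histogram of heap objects by
class, and jmap -dump:format=b,file=heap.bin <pid> writes a full dump
you can load into a viewer; either should make a growing feature
dictionary obvious.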
Resetting the data store might well be the right thing to do, if my
guess is right.