Just solved it. You're right about resetting the datastore.
During classification, every word that is still unknown is added to
featureDictionary. This leads to excessive growth when many texts
containing unknown words are classified. The inconsistency is caused
by a "vocabCount" that is not reset after each classification:
featureDictionary.size() is used as "vocabCount", and it increases
every time new unknown words are discovered.
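To see why the growing dictionary also breaks reproducibility, here is
a minimal sketch (not Mahout code; the actual weighting in the Bayes
algorithm differs, but vocabCount ends up in a smoothing denominator of
roughly this form):

// Minimal sketch, not Mahout code: a generic add-one (Laplace) smoothed
// estimate of the kind that vocabCount typically feeds into.
public final class VocabCountEffect {

  static double smoothed(double featureLabelCount, double labelWeight,
                         double vocabCount) {
    return (featureLabelCount + 1.0) / (labelWeight + vocabCount);
  }

  public static void main(String[] args) {
    // Identical counts for the same word and label, but vocabCount has
    // grown between two classifications because unknown words kept
    // being added to the dictionary.
    System.out.println(smoothed(3, 100, 50000)); // first document
    System.out.println(smoothed(3, 100, 50200)); // a later, identical document
    // The two estimates differ, so the same text no longer gets the
    // same label scores.
  }
}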
My suggestions for a patch:
1) Add to Datastore: *public void reset();*
2) Call this method in ClassifierContext, in both classifyDocument(...)
methods; it should be the first statement (see the sketch after this list).
3) Changes to InMemoryBayesDatastore:
3a) Add (together with imports for java.util.Set and java.util.HashSet,
if they are not already there):

private boolean keepState;
// initialized here so the changed vocabCount (see 3b) also works
// before the first reset()
private Set<String> temporaryNewFeatures = new HashSet<String>();

public void reset() {
  // drop the unknown words collected while classifying the previous document
  temporaryNewFeatures = new HashSet<String>();
}
3b) Change line 119 from *return featureDictionary.size();* to
*return featureDictionary.size() + temporaryNewFeatures.size();*
3c) Change method getFeatureID to:
private int getFeatureID(String feature) {
  if (featureDictionary.containsKey(feature)) {
    return featureDictionary.get(feature);
  }
  if (keepState) {
    // model already loaded: count the unknown word only towards this
    // document's vocabCount, without growing featureDictionary
    temporaryNewFeatures.add(feature);
    return -1;
  }
  // still loading the model: assign a permanent id as before
  int id = featureDictionary.size();
  featureDictionary.put(feature, id);
  return id;
}
3d) Finally, add *keepState = true;* as the last statement of initialize().
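For step 2, this is roughly what the call site could look like (a
sketch only; the exact classifyDocument(...) signatures and field names
in ClassifierContext may differ in your checkout):

// Sketch of step 2 above, assuming ClassifierContext keeps references
// to the algorithm and datastore it was constructed with.
public ClassifierResult classifyDocument(String[] document, String defaultCategory)
    throws InvalidDatastoreException {
  // first statement: forget the unknown words collected for the
  // previously classified document
  datastore.reset();
  return algorithm.classifyDocument(document, datastore, defaultCategory);
}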
This worked for us; at least the consistency problem is solved. The
first classified document returns the same result as it would without
these changes, and subsequently classified documents get slightly
different label scores (compared to before the changes), which makes
perfect sense.
I'm sorry for not making a proper patch myself, but I'm still having
trouble setting up Mahout in Eclipse, so I made the changes by
copy-pasting the complete classes and modifying them slightly.
Sean Owen wrote:
I'm no expert here but isn't it more likely the memory consumption is
coming from the feature dictionary, and feature-label matrix? Because
the label count indeed is fixed and small.
This is a great job for a profiler like JProfiler -- it would let you
easily see where the heap is being consumed. A bit more manual, but
free, is the jmap tool in Java:
http://www.startux.de/index.php/java/45-java-heap-dump
A quick moment with this might easily demonstrate what's taking the
memory.
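For example, jmap -histo <pid> prints a histogram of heap objects by
class, and jmap -dump:format=b,file=heap.bin <pid> writes a full dump
you can load into a viewer; either should make a growing feature
dictionary obvious.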
Resetting the data store might well be the right thing to do, if my
guess is right.