This will work for now. We could also remove the addition of features to the dictionary altogether: that would yield better performance and lock down the model, but it will require a bit more modification.
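If we go that route, a minimal sketch of what getFeatureID could become (it reuses the same featureDictionary calls as the patch quoted below; this is only an illustration, not committed code):

    private int getFeatureID(String feature) {
      if (featureDictionary.containsKey(feature)) {
        return featureDictionary.get(feature);
      }
      // After training, unknown features are never added. Callers would
      // treat -1 as "skip this feature", so the model stays fixed at
      // classification time and the dictionary cannot grow.
      return -1;
    }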
On Wed, Mar 31, 2010 at 7:59 PM, Ferdy <[email protected]> wrote:
> Just solved it. You're right about resetting the datastore.
>
> During classification, every word that is still unknown is added to
> featureDictionary. This leads to excessive growth if many texts with
> unknown words are classified. The inconsistency is caused by using a
> "vocabCount" that is not reset after each classification. Indeed,
> featureDictionary.size() is used for "vocabCount", which increases every
> time new unknown words are discovered.
>
> My suggestions for a patch:
>
> 1) Add to Datastore: public void reset();
>
> 2) Call this method in ClassifierContext, in both classifyDocument(...)
> methods. It should be the first statement.
>
> 3) Changes to InMemoryBayesDatastore:
>
> 3a) Add:
>
>     private boolean keepState;
>     private Set<String> temporaryNewFeatures;
>
>     public void reset() {
>       temporaryNewFeatures = new HashSet<String>();
>     }
>
> 3b) Change line 119 from
>
>     return featureDictionary.size();
>
>     to
>
>     return featureDictionary.size() + temporaryNewFeatures.size();
>
> 3c) Change the method getFeatureID to:
>
>     private int getFeatureID(String feature) {
>       if (featureDictionary.containsKey(feature)) {
>         return featureDictionary.get(feature);
>       } else {
>         if (keepState) {
>           temporaryNewFeatures.add(feature);
>           return -1;
>         }
>         int id = featureDictionary.size();
>         featureDictionary.put(feature, id);
>         return id;
>       }
>     }
>
> 3d) Finally, add a final statement to initialize(): keepState = true;
>
> This worked for us. At least the consistency problem is solved. The first
> classified document returns the same result as it would without these
> changes. Documents classified after that have slightly different label
> scores (compared to before these changes), which makes perfect sense.
>
> I'm sorry for not making a patch myself, but I'm still having trouble
> setting up Mahout in Eclipse. I created completely new implementations by
> copy-pasting complete classes with small modifications.
>
>
> Sean Owen wrote:
>
> I'm no expert here, but isn't it more likely that the memory consumption
> is coming from the feature dictionary and the feature-label matrix?
> The label count, after all, is fixed and small.
>
> This is a great job for a profiler like JProfiler -- it would let you
> easily see where the heap is being consumed. A bit more manual, but
> free, is the jmap tool in Java:
> http://www.startux.de/index.php/java/45-java-heap-dump
> A quick moment with this might easily demonstrate what's taking the
> memory.
>
> Resetting the datastore might well be the right thing to do, if my
> guess is right.
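For anyone who wants to see the suggested patch end to end, here is a self-contained toy version of the patched bookkeeping. Only the keepState/reset/getFeatureID logic mirrors the quoted suggestions; the surrounding class, the train(...) helper, and the main(...) driver are invented for illustration:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy model of the patched InMemoryBayesDatastore bookkeeping.
    // Everything except the feature-dictionary logic is omitted.
    public class ToyDatastore {

      private final Map<String, Integer> featureDictionary =
          new HashMap<String, Integer>();
      private boolean keepState;                // true once training is done
      private Set<String> temporaryNewFeatures; // unknowns in the current doc

      // Training-time path: unknown features are assigned fresh ids.
      public void train(String... features) {
        for (String f : features) {
          getFeatureID(f);
        }
      }

      // Suggestion 3d: freeze the dictionary once initialization finishes.
      public void initialize() {
        keepState = true;
        reset(); // defensive; ClassifierContext would call reset() anyway
      }

      // Suggestions 1 and 3a: called at the start of classifyDocument(...).
      public void reset() {
        temporaryNewFeatures = new HashSet<String>();
      }

      // Suggestion 3b: the vocabulary counts the frozen dictionary plus the
      // unknown words of the current document only, not of all past documents.
      public int getVocabCount() {
        return featureDictionary.size() + temporaryNewFeatures.size();
      }

      // Suggestion 3c: after training, unknown features are remembered only
      // for the duration of one classification and never get a real id.
      private int getFeatureID(String feature) {
        if (featureDictionary.containsKey(feature)) {
          return featureDictionary.get(feature);
        }
        if (keepState) {
          temporaryNewFeatures.add(feature);
          return -1;
        }
        int id = featureDictionary.size();
        featureDictionary.put(feature, id);
        return id;
      }

      public static void main(String[] args) {
        ToyDatastore ds = new ToyDatastore();
        ds.train("alpha", "beta", "gamma");
        ds.initialize();

        ds.reset();                             // start of document 1
        ds.getFeatureID("delta");               // unknown word
        System.out.println(ds.getVocabCount()); // 4

        ds.reset();                             // start of document 2
        ds.getFeatureID("alpha");               // known word only
        System.out.println(ds.getVocabCount()); // 3, not 4: no cumulative growth
      }
    }

Without the reset() call, "delta" would stay counted (or, in the unpatched code, be added to featureDictionary permanently), which is exactly the excessive growth and vocabCount inconsistency described above.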
