This will work for now. Alternatively, we could remove the addition of new
features to the dictionary altogether. That would yield better performance
and lock down the model, but it would require a bit more modification.
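
A rough sketch of that locked-down variant (hypothetical, reusing the
featureDictionary/getFeatureID names from Ferdy's patch below):

  // With the model locked, unknown features get no id and are never
  // added to the dictionary, so vocabCount stays fixed after training.
  private int getFeatureID(String feature) {
    Integer id = featureDictionary.get(feature);
    return id != null ? id : -1; // -1 means "feature not in the model"
  }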

On Wed, Mar 31, 2010 at 7:59 PM, Ferdy <[email protected]> wrote:

>  Just solved it. You're right about resetting the datastore.
>
> During classification, every word that is still unknown is added to
> featureDictionary. This leads to excessive growth when many texts
> containing unknown words are classified. The inconsistency is caused by a
> "vocabCount" that is not reset after each classification: indeed,
> featureDictionary.size() is used as "vocabCount", and it increases every
> time new unknown words are discovered.
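>
> To make the drift concrete, here is a minimal self-contained sketch (the
> names mirror the description above, not the actual Mahout classes):
>
> import java.util.HashMap;
> import java.util.Map;
>
> public class VocabDriftDemo {
>   public static void main(String[] args) {
>     Map<String, Integer> featureDictionary = new HashMap<String, Integer>();
>     featureDictionary.put("known", 0);
>     System.out.println("vocabCount before: " + featureDictionary.size()); // 1
>     // classification encounters an unknown word and registers it
>     featureDictionary.put("unknown", featureDictionary.size());
>     System.out.println("vocabCount after: " + featureDictionary.size()); // 2
>     // every later classification now uses a larger vocabCount, so
>     // identical documents no longer get identical label scores
>   }
> }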
>
> My suggestions for a patch:
> 1) Add to Datastore: *public void reset();*
> 2) Call this method in ClassifierContext, in both classifyDocument(...)
> methods; it should be the first statement (see the sketch after this list).
> 3) Changes to InMemoryBayesDatastore:
> 3a) Add:
> private boolean keepState;
> private Set<String> temporaryNewFeatures;
> public void reset() {
>   // forget unknown features seen during the previous classification
>   temporaryNewFeatures = new HashSet<String>();
> }
> 3b) Change line 119 from *return featureDictionary.size();* to *return
> featureDictionary.size() + temporaryNewFeatures.size();*
> 3c) Change method getFeatureID to:
>   private int getFeatureID(String feature) {
>     if (featureDictionary.containsKey(feature)) {
>       return featureDictionary.get(feature);
>     } else {
>       if (keepState) {
>         // After training, don't mutate the dictionary; just remember the
>         // unknown feature so getVocabCount() can account for it.
>         temporaryNewFeatures.add(feature);
>         return -1; // callers must treat -1 as "unknown feature"
>       }
>       // During training, assign the next sequential id.
>       int id = featureDictionary.size();
>       featureDictionary.put(feature, id);
>       return id;
>     }
>   }
> 3d) Finally, add a final statement to initialize(): *keepState = true;*
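>
> For step 2, the call might look like this in ClassifierContext (a sketch
> only; it assumes ClassifierContext holds the datastore and algorithm as
> fields, and the exact classifyDocument signature is from memory and may
> differ):
>
>   public ClassifierResult classifyDocument(String[] document, String defaultCategory)
>       throws InvalidDatastoreException {
>     datastore.reset(); // must be the first statement
>     return classifier.classifyDocument(document, datastore, defaultCategory);
>   }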
>
> This worked for us. At least the consistency problem is solved. The first
> classified document returns the same result as it would without these
> changes. Subsequently classified documents have slightly different label
> scores (compared to before these changes), which makes perfect sense.
>
> I'm sorry for not making a patch myself, but I'm still having trouble
> setting up Mahout in Eclipse. I created completely new implementations by
> copy-pasting complete classes with small modifications.
>
>
>
> Sean Owen wrote:
>
> I'm no expert here but isn't it more likely the memory consumption is
> coming from the feature dictionary and the feature-label matrix? The
> label count, after all, is fixed and small.
>
> This is a great job for a profiler like JProfiler -- would let you
> easily see where the heap is being consumed. A bit more manual, but
> free, is the jmap tool in Java:
> http://www.startux.de/index.php/java/45-java-heap-dump
> A quick moment with this might easily demonstrate what's taking the
> memory.
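>
> For example, standard jps/jmap usage would be something like this,
> with <pid> being the classifier JVM's process id:
>
>   jps                                            # find the pid
>   jmap -dump:live,format=b,file=heap.hprof <pid> # write a heap dump
>
> The resulting heap.hprof can then be inspected with jhat or the Eclipse
> Memory Analyzer to see which objects dominate the heap.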
>
> Resetting the data store might well be the right thing to do, if my
> guess is right.
>
>
>
