[ 
https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-299:
-------------------------------

    Attachment: MAHOUT-299.patch

Patch as described above:

Included other cleanups:

* Gram is no longer mutable, except in the case of readFields of course.
* Added explicit NGRAM type, removed constructors that implicitly set type.
* Added unit tests for constructors, writability. One should be added for 
sortability/comparison.
* Better unigram handling in the mappers/reducers (no need to setType on these 
anymore)
* Switched to adjustOrPutValue when accumulating frequencies in 
OpenObjectIntHashMaps
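
To illustrate the last item: a minimal sketch of accumulating frequencies with one map operation per gram, instead of a lookup followed by a separate put. It uses java.util.HashMap.merge as a stand-in so it runs without Mahout on the classpath. The class and method names are hypothetical, and this is not the code in the patch. With an OpenObjectIntHashMap the equivalent call would be, as far as I know, adjustOrPutValue(key, 1, 1).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the MAHOUT-299 patch.
final class FrequencyAccumulationSketch {

    // Counts occurrences of each gram with a single map operation per item.
    // merge() inserts 1 when the key is absent and otherwise adds 1 to the
    // stored value, which is the same "put or adjust" idea.
    static Map<String, Integer> count(String[] grams) {
        Map<String, Integer> freq = new HashMap<>();
        for (String gram : grams) {
            freq.merge(gram, 1, Integer::sum);
        }
        return freq;
    }
}
```

The benefit is simply that the map is consulted once per gram rather than twice.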

Also, NGramCollector and NGramCollectorTest should be removed from the repo. They 
are no longer relevant. Applying this patch with -E will empty and erase these 
files, but it's up to svn to do the rest.



> Collocations: improve performance by making Gram BinaryComparable
> -----------------------------------------------------------------
>
>                 Key: MAHOUT-299
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-299
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-299.patch
>
>
> Robin's profiling indicated that a large portion of a run was spent in 
> readFields() in Gram due to the deserialization occurring as part of Gram 
> comparisons for sorting. He pointed me to BinaryComparable and the 
> implementation in Text.
> Like Text, in this new implementation, Gram stores its string in binary form. 
> When encoding the string at construction time we allocate an extra 
> character's worth of data to hold the Gram type information. When sorting 
> Grams, the binary arrays are compared instead of deserializing and comparing 
> fields.
>  
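
The idea in the description can be sketched roughly as follows. This is a standalone illustration, not the Gram class from the patch: the class name, the type codes, and the choice to put the type in the first byte are all assumptions made for the example. The comparison is an unsigned, lexicographic byte comparison, which is my understanding of how Hadoop's BinaryComparable orders keys.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the Gram class from MAHOUT-299.
final class GramBytesSketch {

    // Assumed type codes, invented for this example.
    static final byte TYPE_HEAD = 1;
    static final byte TYPE_TAIL = 2;

    // Encodes the string as UTF-8 and reserves one extra byte for the type.
    // Placing the type first is an assumption; it makes grams sort by type,
    // then by text.
    static byte[] encode(byte type, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[utf8.length + 1];
        out[0] = type;
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
    }

    // Recovers the string; only needed when the text is actually used,
    // not during sorting.
    static String decodeText(byte[] encoded) {
        return new String(encoded, 1, encoded.length - 1, StandardCharsets.UTF_8);
    }

    // Compares two encoded grams byte by byte (unsigned), with no
    // deserialization of fields.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

Because the type lives in the same array as the text, sorting never has to rebuild a String or read separate fields, which is where the profiling showed the time going.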

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
