[ https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-299:
-------------------------------

    Attachment: MAHOUT-299.patch

Patch as described above.

Included other cleanups:
* Gram is no longer mutable, except in the case of readFields, of course.
* Added an explicit NGRAM type and removed the constructors that implicitly set the type.
* Added unit tests for constructors and writability. One should be added for sortability/comparison.
* Better unigram handling in the mappers/reducers (no need to call setType on these anymore).
* Switched to adjustOrPutValue when accumulating frequencies in OpenObjectIntHashMaps.

Also, NGramCollector and NGramCollectorTest should be removed from the repo; they are no longer relevant. Applying this patch with -E will empty and erase these files, but it's up to svn to do the rest.

> Collocations: improve performance by making Gram BinaryComparable
> -----------------------------------------------------------------
>
>                 Key: MAHOUT-299
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-299
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-299.patch
>
>
> Robin's profiling indicated that a large portion of a run was spent in
> readFields() in Gram, due to the deserialization occurring as part of Gram
> comparisons for sorting. He pointed me to BinaryComparable and the
> implementation in Text.
> Like Text, in this new implementation, Gram stores its string in binary form.
> When encoding the string at construction time we allocate an extra
> character's worth of data to hold the Gram type information. When sorting
> Grams, the binary arrays are compared instead of deserializing and comparing
> fields.
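
For readers who do not have the patch at hand, the encoding described in the issue can be sketched roughly as follows. This is a minimal illustration, not the code in MAHOUT-299.patch: the class name GramSketch, the single-character type codes, and the choice to put the type byte at the front of the array are all assumptions made for the example, and it uses only the JDK rather than Hadoop's BinaryComparable.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch only; this is NOT Mahout's Gram class.
final class GramSketch implements Comparable<GramSketch> {

  // Assumed type codes, chosen for illustration.
  enum Type {
    UNIGRAM('u'), NGRAM('n');

    final byte code;

    Type(char c) {
      this.code = (byte) c;
    }
  }

  // Layout (an assumption): [1 type byte][UTF-8 bytes of the string].
  private final byte[] bytes;

  GramSketch(String text, Type type) {
    byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
    // Allocate one extra byte to carry the type alongside the string.
    bytes = new byte[encoded.length + 1];
    bytes[0] = type.code;
    System.arraycopy(encoded, 0, bytes, 1, encoded.length);
  }

  Type getType() {
    for (Type t : Type.values()) {
      if (t.code == bytes[0]) {
        return t;
      }
    }
    throw new IllegalStateException("unknown type code: " + bytes[0]);
  }

  String getString() {
    // Decoding happens only when the string is actually requested.
    return new String(bytes, 1, bytes.length - 1, StandardCharsets.UTF_8);
  }

  // Unsigned, lexicographic comparison of the raw arrays; no field is decoded.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return a.length - b.length;
  }

  @Override
  public int compareTo(GramSketch other) {
    return compareBytes(bytes, other.bytes);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof GramSketch && Arrays.equals(bytes, ((GramSketch) o).bytes);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(bytes);
  }
}
```

With this layout, two grams with the same type sort by the bytes of their strings, and grams of different types are grouped by type code first. Whether the real patch orders type before string is not stated in the issue, so treat that ordering as a property of the sketch only.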