High Document Frequency pruning for seq2sparse
----------------------------------------------
Key: MAHOUT-688
URL: https://issues.apache.org/jira/browse/MAHOUT-688
Project: Mahout
Issue Type: Improvement
Reporter: Vasil Vasilev
Priority: Minor
This improvement allows words with high document frequencies to be pruned from
the tf and tf-idf vectors produced by seq2sparse. Pruning is based on the
standard deviation of the words' document frequencies: the cutoff for which
words to prune is specified as a multiple of this standard deviation. A good
choice is 3 times the standard deviation.
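
The pruning rule described above can be illustrated with a minimal Python sketch. This is not the Mahout implementation. The function name prune_high_df, the sigma_multiplier parameter, and the choice of cutoff (mean plus a multiple of the population standard deviation) are all assumptions made for illustration; the issue text only says the cutoff is expressed in multiples of the standard deviation and does not say whether the mean is added.

```python
from statistics import mean, pstdev


def prune_high_df(doc_freq, sigma_multiplier=3.0):
    """Return a copy of doc_freq without the high-document-frequency words.

    doc_freq maps each word to the number of documents that contain it.
    Assumption: a word is pruned when its document frequency exceeds
    mean + sigma_multiplier * (population standard deviation).
    """
    if not doc_freq:
        return {}
    freqs = list(doc_freq.values())
    cutoff = mean(freqs) + sigma_multiplier * pstdev(freqs)
    # Keep only the words whose document frequency is at or below the cutoff.
    return {word: df for word, df in doc_freq.items() if df <= cutoff}


if __name__ == "__main__":
    # Hypothetical corpus statistics: 20 rare words and one very common word.
    doc_freq = {f"word{i}": 2 for i in range(20)}
    doc_freq["the"] = 100
    kept = prune_high_df(doc_freq, sigma_multiplier=3.0)
    print(sorted(set(doc_freq) - set(kept)))
```

With very small vocabularies a 3-sigma cutoff may prune nothing, since a single outlier also inflates the standard deviation; a lower multiplier prunes more aggressively.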