High Document Frequency pruning for seq2sparse
----------------------------------------------

                 Key: MAHOUT-688
                 URL: https://issues.apache.org/jira/browse/MAHOUT-688
             Project: Mahout
          Issue Type: Improvement
            Reporter: Vasil Vasilev
            Priority: Minor


This improvement makes it possible to prune words with high document 
frequencies from the tf and tf-idf vectors produced by seq2sparse. Pruning is 
based on the standard deviation of the words' document frequencies: the user 
specifies which words to prune as a multiple of this standard deviation. A good 
choice is 3 times the standard deviation.
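
The pruning rule described above can be illustrated with a small, 
self-contained sketch. This is not the Mahout implementation: the function 
name prune_high_df_terms and the parameter sigma_multiple are hypothetical, 
and the cutoff used here (document frequency greater than sigma_multiple 
times the population standard deviation) is one possible reading of the 
description. Another reading would add the mean document frequency to the 
cutoff.

```python
from statistics import pstdev


def prune_high_df_terms(doc_term_weights, sigma_multiple=3.0):
    """Drop terms whose document frequency exceeds sigma_multiple * std dev.

    doc_term_weights: list of dicts mapping term -> weight (tf or tf-idf),
    one dict per document. Names and cutoff rule are illustrative
    assumptions, not Mahout's API.
    """
    # Document frequency: number of documents in which each term occurs.
    doc_freq = {}
    for weights in doc_term_weights:
        for term in weights:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    if not doc_freq:
        return [dict(weights) for weights in doc_term_weights]

    # Population standard deviation of the document frequencies.
    std_dev = pstdev(list(doc_freq.values()))
    if std_dev == 0:
        # All terms are equally frequent; nothing stands out, so keep all.
        return [dict(weights) for weights in doc_term_weights]

    # Assumed cutoff: a multiple of the standard deviation.
    max_doc_freq = sigma_multiple * std_dev
    kept_terms = {t for t, df in doc_freq.items() if df <= max_doc_freq}

    # Rebuild each vector without the pruned terms.
    return [
        {t: w for t, w in weights.items() if t in kept_terms}
        for weights in doc_term_weights
    ]


if __name__ == "__main__":
    # "the" occurs in every document; each "rareN" occurs in exactly one.
    docs = [{"the": 1.0, f"rare{i}": 2.0} for i in range(10)]
    for vector in prune_high_df_terms(docs, sigma_multiple=3.0):
        print(vector)
```

In this toy corpus the term that appears in every document falls above the 
cutoff and is removed, while the terms that appear in a single document are 
kept. A smaller multiple prunes more aggressively.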

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
