[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795135#action_12795135 ]
Ted Dunning commented on MAHOUT-220:
------------------------------------

{quote}
Robin: I am not very clear what is happening there when two words have the same hash. Aren't we losing a lot of information? The one I am proposing is going to do exact numbering of the features.
{quote}

That is the point of the "probes" parameter. It allows for multiple hashing, as Jake is suggesting. If you have, for example, 4 probes for each word, the chance of a complete collision is minuscule, and where there are collisions, the learning algorithm puts the weight on the non-colliding probes.

The extreme case is the DenseRandomizer. Every term gets spread out to every feature, so you have collisions on every term on every feature. Because of the random weighting, you preserve enough information to allow effective learning.

See vowpal wabbit for a practical example. They handle 10^12 (very) sparse features in memory and can learn at disk bandwidth in some applications.

{quote}
Jake: They might belong in a more general place, actually. If I'm going to use some of this stuff in the decompositions (although I'm not sure yet of the efficacy of the single hash for doing SVD), it should go somewhere in the math module.
{quote}

Should we generalize this concept to Vectorizer?

The dictionary approach can accept a previously computed dictionary (possibly augmenting it on the fly) and might be called a DictionaryVectorizer or WeightedDictionaryVectorizer. At the level I have been working, the storage of the dictionary is an open question.

The randomizers could inherit from the same basic interface (or abstract class).
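To make the "probes" idea concrete, here is a minimal sketch of multi-probe hashing and of the dense extreme. It is not the code in the patch: the class name, the method names, and the seeded FNV-1a-style hash are all assumptions chosen for illustration. Each word is hashed once per probe into a fixed-size vector, so two words have to collide on every probe before they become indistinguishable.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not Mahout's randomizer classes.
public class MultiProbeHashSketch {

  // FNV-1a-style hash of the word's UTF-8 bytes, seeded by the probe number
  // so that each probe behaves like a different hash function.
  static int hash(String word, int probe) {
    int h = 0x811C9DC5 ^ (probe * 0x9E3779B9);
    for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x01000193;
    }
    return h;
  }

  // The feature indexes one word is mapped to, one index per probe.
  static int[] probeIndexes(String word, int probes, int numFeatures) {
    int[] idx = new int[probes];
    for (int p = 0; p < probes; p++) {
      idx[p] = Math.floorMod(hash(word, p), numFeatures);
    }
    return idx;
  }

  // Sparse case: every occurrence of a word adds 1 at each of its probe positions.
  static double[] vectorize(List<String> words, int probes, int numFeatures) {
    double[] v = new double[numFeatures];
    for (String w : words) {
      for (int i : probeIndexes(w, probes, numFeatures)) {
        v[i] += 1.0;
      }
    }
    return v;
  }

  // Dense extreme: every word contributes a pseudo-random weight to every feature.
  // The weights are reproducible because the generator is seeded from the word's hash.
  static double[] denseVectorize(List<String> words, int numFeatures) {
    double[] v = new double[numFeatures];
    for (String w : words) {
      Random r = new Random(hash(w, 0));
      for (int i = 0; i < numFeatures; i++) {
        v[i] += r.nextGaussian();
      }
    }
    return v;
  }
}
```

For example, `MultiProbeHashSketch.vectorize(List.of("mahout", "bayes"), 4, 1000)` yields a 1000-element vector whose entries sum to 8 (two words times four probes), whether or not any probes collide. No dictionary is stored in either case, which is the trade-off against exact feature numbering.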
> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with the following exceptions
> 1. Line length used is 120 instead of 80.
> 2. static final log is kept as is. not LOG.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.