[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795135#action_12795135
 ] 

Ted Dunning commented on MAHOUT-220:
------------------------------------

{quote}
Robin: I am not very clear on what happens there when two words have the same 
hash. Aren't we losing a lot of information? The one I am proposing is 
going to do exact numbering of the features.
{quote}

That is the point of the "probes" parameter.  It allows for multiple hashing, 
as Jake is suggesting.  If you have, for example, 4 probes for each word, the 
chance of a complete collision is minuscule, and where there are collisions, 
the learning algorithm puts the weight on the non-colliding probes.
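
As a rough sketch of the probes idea (this is not the code in the patch; the 
class name, the FNV-style hash mixing, and the constructor parameters are all 
made up for illustration), each term is hashed once per probe with a different 
seed, so two terms only collide completely if all of their probes collide:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch of multi-probe feature hashing; not Mahout's actual API. */
class ProbeHashingSketch {
    private final int numFeatures;
    private final int probes;

    ProbeHashingSketch(int numFeatures, int probes) {
        if (numFeatures <= 0 || probes <= 0) {
            throw new IllegalArgumentException("numFeatures and probes must be positive");
        }
        this.numFeatures = numFeatures;
        this.probes = probes;
    }

    /** Feature index for a term under a given probe; the probe number acts as a hash seed. */
    int index(String term, int probe) {
        int h = 0x9747b28c + probe * 0x9e3779b9;      // assumed seeding scheme
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h = (h ^ (b & 0xff)) * 16777619;          // FNV-1a style mixing step
        }
        h ^= h >>> 16;                                // fold high bits into low bits
        return Math.floorMod(h, numFeatures);         // always in [0, numFeatures)
    }

    /** Adds 1.0 at every probe location of every term; collisions simply accumulate. */
    double[] vectorize(List<String> terms) {
        double[] v = new double[numFeatures];
        for (String term : terms) {
            for (int p = 0; p < probes; p++) {
                v[index(term, p)] += 1.0;
            }
        }
        return v;
    }
}
```

With something like `new ProbeHashingSketch(1000, 4)`, each word touches up to 
four locations, and the learner is free to down-weight the ones that happen to 
be shared with other words.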

The extreme case is the DenseRandomizer.  Every term gets spread out to every 
feature so you have collisions on every term on every feature.  Because of the 
random weighting, you preserve enough information to allow effective learning.
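
To make the dense case concrete, here is a minimal sketch under my own 
assumptions (Gaussian weights, a per-term seed taken from `String.hashCode()`, 
and a hypothetical class name; the actual DenseRandomizer may differ on all of 
these).  Each term contributes a reproducible pseudo-random weight to every 
feature:

```java
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a dense random encoding; not the actual DenseRandomizer. */
class DenseRandomProjectionSketch {
    private final int numFeatures;

    DenseRandomProjectionSketch(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.numFeatures = numFeatures;
    }

    /** Every term adds a random weight to every feature, so all terms collide everywhere. */
    double[] vectorize(List<String> terms) {
        double[] v = new double[numFeatures];
        for (String term : terms) {
            // Seeding from the term makes its weight pattern reproducible (assumed scheme).
            Random rng = new Random(term.hashCode());
            for (int i = 0; i < numFeatures; i++) {
                v[i] += rng.nextGaussian();
            }
        }
        return v;
    }
}
```

The result is just the sum of one random vector per term, which is why the 
information survives even though every feature is shared by every term.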

See Vowpal Wabbit for a practical example.  They handle 10^12 (very) sparse 
features in memory and can learn at disk bandwidth in some applications.

{quote}
Jake: They might belong in a more general place, actually. If I'm going to use 
some of this stuff in the decompositions (although I'm not sure yet of the 
efficacy of the single hash for doing SVD), it should go somewhere in the math 
module.
{quote}

Should we generalize this concept to Vectorizer?  The dictionary approach can 
accept a previously computed dictionary (possibly augmenting it on the fly) and 
might be called a DictionaryVectorizer or WeightedDictionaryVectorizer.  At the 
level I have been working, the storage of the dictionary is an open question.  
The randomizers could inherit from the same basic interface (or abstract class).
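
Something along these lines is what I have in mind.  This is only a sketch: 
the interface shape, the sparse `Map` return type, the `augment` flag, and 
every name ending in "Sketch" are assumptions for illustration, not existing 
Mahout classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical common interface: terms in, sparse feature vector (index -> weight) out. */
interface VectorizerSketch {
    Map<Integer, Double> vectorize(List<String> terms);
}

/** Exact numbering of features from a dictionary, optionally augmented on the fly. */
class DictionaryVectorizerSketch implements VectorizerSketch {
    private final Map<String, Integer> dictionary;
    private final boolean augment;
    private int nextIndex;

    DictionaryVectorizerSketch(Map<String, Integer> initial, boolean augment) {
        this.dictionary = new HashMap<>(initial);   // copy of the precomputed dictionary
        this.augment = augment;
        this.nextIndex = initial.values().stream()
                .mapToInt(Integer::intValue).max().orElse(-1) + 1;
    }

    @Override
    public Map<Integer, Double> vectorize(List<String> terms) {
        Map<Integer, Double> v = new HashMap<>();
        for (String term : terms) {
            Integer index = dictionary.get(term);
            if (index == null) {
                if (!augment) {
                    continue;                       // unknown term, dictionary is frozen
                }
                index = nextIndex++;
                dictionary.put(term, index);
            }
            v.merge(index, 1.0, Double::sum);
        }
        return v;
    }
}

/** Single-hash randomizer behind the same interface; needs no dictionary storage. */
class HashingVectorizerSketch implements VectorizerSketch {
    private final int numFeatures;

    HashingVectorizerSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    @Override
    public Map<Integer, Double> vectorize(List<String> terms) {
        Map<Integer, Double> v = new HashMap<>();
        for (String term : terms) {
            v.merge(Math.floorMod(term.hashCode(), numFeatures), 1.0, Double::sum);
        }
        return v;
    }
}
```

Callers would then depend only on the interface, and where the dictionary 
lives stays a detail of the dictionary-based implementation.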


> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanups, with 
> the following exceptions:
> 1.  Line length used is 120 instead of 80. 
> 2.  The static final log is kept as is, not renamed to LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
