[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795131#action_12795131
 ] 

Jake Mannix commented on MAHOUT-220:
------------------------------------

bq. I am not very clear what is happening there when two words have the same 
hash. Aren't we losing out on a lot of information?

You can lose some information, sure, but there are *tons* of words, and you 
don't lose much information.  It is a probabilistic technique though.

Personally I prefer the multi-hash approach, because at least there I really 
believe the projection is preserving distances properly.  In the single-hash 
case, sometimes (i.e. for some single-word documents with different words), the 
collapse of distance is extreme (as Robin is alluding to).
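
To make the collapse concrete, here is a minimal sketch. It is not Mahout's 
actual encoder: the class name, the polynomial hash, the bases in BASES and the 
vector size of 16 are all assumptions picked for illustration. Base 31 matches 
String.hashCode(), under which "Aa" and "BB" are known to collide, so two 
single-word documents end up at distance zero with one hash but stay apart once 
extra probes are added.

```java
class HashedEncodingSketch {
    // Hypothetical hash bases, one per probe. Base 31 reproduces String.hashCode().
    private static final int[] BASES = {31, 37, 41};

    // Polynomial hash of a word for the given base.
    public static int hash(String word, int base) {
        int h = 0;
        for (int i = 0; i < word.length(); i++) {
            h = base * h + word.charAt(i);
        }
        return h;
    }

    // Encode a bag of words into a dim-sized vector, adding 1 to numProbes
    // buckets per word. numProbes must not exceed BASES.length.
    public static double[] encode(String[] words, int dim, int numProbes) {
        double[] v = new double[dim];
        for (String w : words) {
            for (int p = 0; p < numProbes; p++) {
                v[Math.floorMod(hash(w, BASES[p]), dim)] += 1.0;
            }
        }
        return v;
    }

    // Euclidean distance between two vectors of equal length.
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        String[] docA = {"Aa"};
        String[] docB = {"BB"};
        int dim = 16;
        // One hash: both words land in the same bucket, so the distance is 0.0.
        System.out.println("single hash:  "
            + distance(encode(docA, dim, 1), encode(docB, dim, 1)));
        // Three hashes: the words agree on one bucket only, so the distance is 2.0.
        System.out.println("three hashes: "
            + distance(encode(docA, dim, 3), encode(docB, dim, 3)));
    }
}
```

The vectors are left unscaled here; dividing each entry by the square root of 
the number of probes would be one way to keep norms comparable across settings.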

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions:
> 1.  Line length used is 120 instead of 80. 
> 2.  The static final log is kept as is, not renamed to LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
