[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795127#action_12795127
 ] 

Robin Anil commented on MAHOUT-220:
-----------------------------------

I am not very clear on what happens there when two words have the same hash. 
Aren't we losing a lot of information? The one I am proposing does exact 
numbering of the features. 
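
A minimal sketch of the difference in question, not taken from the patch: the function names, the word list, and the use of MD5 as a stable hash are all assumptions made for illustration, and none of this is Mahout's API. With a fixed number of hash buckets smaller than the vocabulary, at least two words must share an id, while a dictionary gives each word its own id.

```python
import hashlib


def hashed_id(word, num_features):
    """Map a word to a bucket in [0, num_features) using a stable hash.

    MD5 is only a stand-in here; any deterministic hash would do.
    """
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def dictionary_ids(words):
    """Give each distinct word its own id, in order of first appearance."""
    ids = {}
    for word in words:
        if word not in ids:
            ids[word] = len(ids)
    return ids


# Hypothetical vocabulary: six words, but only four hash buckets.
words = ["apple", "banana", "cherry", "date", "elder", "fig"]

hashed = {word: hashed_id(word, 4) for word in words}
exact = dictionary_ids(words)

# Six words in four buckets: some words necessarily share a hashed id.
print("distinct hashed ids:", len(set(hashed.values())))
# The dictionary keeps one id per word.
print("distinct exact ids:", len(set(exact.values())))
```

Words that land in the same bucket become indistinguishable to the classifier, which is the information loss being asked about. How much that matters in practice depends on the bucket count relative to the vocabulary size.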

One thing my method suffers from is the addition of new data. It will take 
another couple of M/R jobs to create the new dictionary file while preserving 
the old ids. It's cumbersome but doable.
What happens in the Randomizer approach? Since you are fixing the feature 
set size, the hash ids will also change when that feature set size increases, 
right?
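
The trade-off can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch or from the Randomizer: `extend_dictionary` and `hashed_id` are hypothetical names, MD5 stands in for whatever hash is used, and the id is assumed to be the hash taken modulo the feature set size. Extending a dictionary keeps every old id, whereas a modulo-based hashed id depends on the feature set size and so generally changes when that size changes.

```python
import hashlib


def hashed_id(word, num_features):
    """Stable hash of a word, reduced modulo the feature set size."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def extend_dictionary(old_ids, new_words):
    """Return a copy of old_ids with unseen words appended.

    Existing words keep their ids; new words get the next free ids.
    """
    ids = dict(old_ids)
    for word in new_words:
        if word not in ids:
            ids[word] = len(ids)
    return ids


# Dictionary approach: old ids survive the arrival of new data.
old_ids = {"apple": 0, "banana": 1}
new_ids = extend_dictionary(old_ids, ["banana", "cherry"])

# Hashing approach: ids are stable only while the size stays fixed.
vocabulary = ["word%d" % i for i in range(50)]
ids_small = {w: hashed_id(w, 1000) for w in vocabulary}
ids_large = {w: hashed_id(w, 1001) for w in vocabulary}
changed = [w for w in vocabulary if ids_small[w] != ids_large[w]]
print("words whose hashed id changed:", len(changed), "of", len(vocabulary))
```

In the dictionary case the cost is the extra pass needed to merge new words into the existing id file. In the hashing case no dictionary is needed, but anything built on the old ids has to be recomputed if the feature set size changes.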

> Mahout Bayes Code cleanup
> -------------------------
>
>                 Key: MAHOUT-220
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-220
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions:
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is, not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
