Also, in a weight-normalized Bayes/CBayes implementation, the frequency of a word in a document is divided by the document length, so this uniqueness gets taken care of in the Feature Mapper/Reducer stage. If a word occurs more often in the documents of a certain class, it is assumed to be a good feature for that class. But if the same word occurs with the same frequency in two documents of different classes, then the amount each occurrence contributes towards class discrimination depends on the relative sizes of the documents. In that case the smaller document gives the word a higher normalized weight, which makes the word a better feature for the class that the smaller document belongs to.
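To make the length normalization concrete, here is a rough sketch in Python. It is not the actual Mahout mapper code: the function name and the two toy documents are made up for illustration, and I am assuming "document length" simply means the number of tokens in the document.

```python
from collections import Counter


def normalized_term_frequencies(tokens):
    """Return {word: count / document_length} for one tokenized document.

    Assumption: document length is the total number of tokens.
    """
    if not tokens:
        return {}
    length = len(tokens)
    counts = Counter(tokens)
    return {word: count / length for word, count in counts.items()}


# Two toy documents from different classes; "goal" occurs twice in each.
short_doc = "goal goal match win".split()
long_doc = "goal goal budget vote tax law senate bill court judge".split()

short_weights = normalized_term_frequencies(short_doc)
long_weights = normalized_term_frequencies(long_doc)

# Same raw frequency (2), but the shorter document yields the larger weight.
print(short_weights["goal"])  # 2 / 4
print(long_weights["goal"])   # 2 / 10
```

Both documents contain the word twice, yet after dividing by length the word carries more weight in the 4-token document than in the 10-token one, so it counts as a stronger feature for the class of the smaller document.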
I hope I am making sense in that long explanation.

On Mon, Aug 18, 2008 at 11:13 PM, Robin Anil (JIRA) <[EMAIL PROTECTED]> wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623415#action_12623415]
>
> Robin Anil commented on MAHOUT-60:
> ----------------------------------
>
> I am generating the bigrams. So if you keep only unique words then bigrams
> dont get generated correctly.
>
> > Complementary Naive Bayes
> > -------------------------
> >
> >                 Key: MAHOUT-60
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-60
> >             Project: Mahout
> >          Issue Type: Sub-task
> >          Components: Classification
> >            Reporter: Robin Anil
> >            Assignee: Grant Ingersoll
> >            Priority: Minor
> >             Fix For: 0.1
> >
> >         Attachments: country.txt, MAHOUT-60-13082008.patch,
> > MAHOUT-60-15082008.patch, MAHOUT-60-17082008.patch, MAHOUT-60.patch,
> > MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch,
> > twcnb.jpg
> >
> >
> > The focus is to implement an improved text classifier based on this paper
> > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.
