[
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623397#action_12623397
]
Grant Ingersoll commented on MAHOUT-60:
---------------------------------------
I'm getting failures in BayesFileFormatterTest, namely due to the change
to \t, which is an easy fix. However, I wonder why the check against the "seen"
CharArraySet was removed. I seem to recall that we only want unique words for
training; otherwise the calculations get thrown off, at least in the NB
implementation (not sure what you want in CNB).
The loop used to look like:
{code}
while ((token = ts.next(token)) != null) {
  char[] termBuffer = token.termBuffer();
  int termLen = token.termLength();
  if (seen.contains(termBuffer, 0, termLen) == false) {
    if (numTokens > 0) {
      writer.write(' ');
    }
    writer.write(termBuffer, 0, termLen);
    char[] tmp = new char[termLen];
    System.arraycopy(termBuffer, 0, tmp, 0, termLen);
    seen.add(tmp); // do this b/c CharArraySet doesn't allow offsets
  }
}
{code}
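To make the intent concrete, here is a minimal standalone sketch of the same "write each word only once" idea. It is not the code from the patch: it uses a plain java.util.HashSet<String> instead of Lucene's CharArraySet, whitespace splitting stands in for the analyzer's TokenStream, and the names UniqueTokenSketch and uniqueLine are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Mahout or Lucene.
final class UniqueTokenSketch {

  // Returns the input with repeated tokens dropped, keeping first-seen order
  // and separating the surviving tokens with single spaces.
  static String uniqueLine(String text) {
    Set<String> seen = new HashSet<>();
    StringBuilder out = new StringBuilder();
    int numTokens = 0;
    // Assumption: whitespace tokenization replaces the real analyzer.
    for (String token : text.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // splitting an empty string yields one empty token
      }
      // Set.add returns false when the token was already present.
      if (seen.add(token)) {
        if (numTokens > 0) {
          out.append(' ');
        }
        out.append(token);
        numTokens++;
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(uniqueLine("the cat saw the other cat"));
  }
}
```

With the "seen" check in place, each document contributes a given word at most once to the training line; without it, repeated words are written every time they occur.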
> Complementary Naive Bayes
> -------------------------
>
> Key: MAHOUT-60
> URL: https://issues.apache.org/jira/browse/MAHOUT-60
> Project: Mahout
> Issue Type: Sub-task
> Components: Classification
> Reporter: Robin Anil
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.1
>
> Attachments: country.txt, MAHOUT-60-13082008.patch,
> MAHOUT-60-15082008.patch, MAHOUT-60-17082008.patch, MAHOUT-60.patch,
> MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.