[
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615035#action_12615035
]
Robin Anil commented on MAHOUT-60:
----------------------------------
I thought of that, but I wasn't sure. If classifying one document requires a
map-reduce pass over the whole model, then it's more or less a waste of
resources. But if I do batch classification, this is what I would do. I want
to know whether there is some tweak that can be done.
For the model, Map outputs (docid:label:featureid, weight), which leads to
(number of docs * number of labels * features) keys (too huge).
For each document, Map outputs (docid:label:featureid, featureFrequency).
Reducer: each reducer gets either one or two values. If it's one value,
ignore it; if it's two values, multiply them and output
(docid:label:featureid, weight).
Start a second map-reduce on this: Map outputs (docid:label, weight),
then the Reducer sums up the weights for each (docid, label) pair.
A third map-reduce can take the (docid, label) pairs: Map emits
docid => (label, weight), then the Reducer takes the minimum-weight label
and outputs the result.
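To make the three passes concrete, here is a minimal local simulation: plain dicts stand in for the map-reduce key/value streams, and all names and data are illustrative (this is not Mahout's API). Lower summed weight wins, matching the complement-weight convention:

```python
from collections import defaultdict

def classify_batch(model, docs):
    """Simulate the three map-reduce passes on in-memory dicts.

    model: {(label, featureid): weight}   -- complement NB weights
    docs:  {docid: {featureid: frequency}}
    """
    labels = {label for (label, _f) in model}

    # Pass 1, model mapper: emit (docid:label:featureid, weight) per doc.
    grouped = defaultdict(list)
    for docid in docs:
        for (label, featureid), weight in model.items():
            grouped[(docid, label, featureid)].append(weight)
    # Pass 1, document mapper: emit (docid:label:featureid, frequency).
    for docid, features in docs.items():
        for featureid, freq in features.items():
            for label in labels:
                grouped[(docid, label, featureid)].append(freq)
    # Pass 1 reducer: one value -> ignore; two values -> multiply.
    joined = {k: v[0] * v[1] for k, v in grouped.items() if len(v) == 2}

    # Pass 2: sum the weights for each (docid, label) pair.
    sums = defaultdict(float)
    for (docid, label, _featureid), w in joined.items():
        sums[(docid, label)] += w

    # Pass 3: take the minimum-weight label per docid.
    best = {}
    for (docid, label), w in sums.items():
        if docid not in best or w < best[docid][1]:
            best[docid] = (label, w)
    return {docid: label for docid, (label, _w) in best.items()}
```

A key only survives the first reducer when the feature appears in both the model and the document, which is exactly the join the one-or-two-values trick implements.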
Any thoughts
Robin
On Sat, Jul 19, 2008 at 1:18 AM, Steven Handerson (JIRA) <[EMAIL PROTECTED]> wrote:
> Complementary Naive Bayes
> -------------------------
>
> Key: MAHOUT-60
> URL: https://issues.apache.org/jira/browse/MAHOUT-60
> Project: Mahout
> Issue Type: Sub-task
> Components: Classification
> Reporter: Robin Anil
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.1
>
> Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch,
> twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.