[ 
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615035#action_12615035
 ] 

robinanil edited comment on MAHOUT-60 at 7/19/08 12:24 PM:
------------------------------------------------------------

I thought of that, but I wasn't sure. If classifying one document requires
a full map-reduce pass over the whole model, then it is more or less a waste
of resources. For batch classification, though, this is what I would do. I
would like to know whether there is some tweak that can improve it.

For the model, Map: output (docid:label:featureid, weight). This will lead to
(number of docs * number of label,feature pairs) keys (too huge).
For each document, Map: output (docid:label:featureid, featureFrequency).
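
To get a feel for why the model-side key count is "too huge", here is a rough back-of-the-envelope sketch. The batch size, label count, and vocabulary size are made-up numbers for illustration only, and the function name is hypothetical; it assumes one key per (docid, label, featureid) combination.

```python
def model_side_key_count(num_docs, num_labels, num_features):
    """Keys emitted if every (label, feature) weight is replicated per document."""
    return num_docs * num_labels * num_features

# Hypothetical sizes, not taken from any real dataset.
num_docs, num_labels, num_features = 1_000, 20, 50_000
total_keys = model_side_key_count(num_docs, num_labels, num_features)
print(total_keys)
```

Even a modest batch multiplies the model size by the number of documents, which is the cost the scheme would have to pay.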

Reducer: each key will arrive with either 1 or 2 values. If there is only
one value, ignore it.
If there are 2 values, multiply them and output (docid:label:featureid, weight).
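
The first reducer described above could be sketched in plain Python as an in-memory stand-in for the shuffle and reduce. This is only an illustration, not Hadoop code: the function name and sample records are hypothetical, and keys are tuples rather than colon-joined strings.

```python
from collections import defaultdict

def join_and_multiply(model_records, document_records):
    """Group both map outputs by key; keep only keys that received two values."""
    grouped = defaultdict(list)
    for key, value in list(model_records) + list(document_records):
        grouped[key].append(value)

    output = {}
    for key, values in grouped.items():
        if len(values) == 2:  # model weight and feature frequency both present
            output[key] = values[0] * values[1]
        # a single value means the feature is absent from the document: ignore
    return output

# Hypothetical sample data: key = (docid, label, featureid)
model_records = [
    (("d1", "sports", "f1"), 0.5),
    (("d1", "sports", "f2"), 2.0),
    (("d1", "politics", "f1"), 1.5),
]
document_records = [
    (("d1", "sports", "f1"), 3),
    (("d1", "politics", "f1"), 3),
]
products = join_and_multiply(model_records, document_records)
```

Here feature f2 never appears in the document, so its key arrives with one value and is dropped.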

Start a second map-reduce on this output. Map: output (docid:label, weight);
then the reducer sums up the probabilities for each docid:label pair.
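
A minimal sketch of this second pass, again simulated in memory with hypothetical names and sample values: the map step drops the featureid from the key and the reduce step sums per (docid, label).

```python
from collections import defaultdict

def sum_per_doc_label(products):
    """Re-key (docid, label, featureid) -> (docid, label) and sum the weights."""
    totals = defaultdict(float)
    for (doc_id, label, _feature_id), weight in products.items():
        totals[(doc_id, label)] += weight
    return dict(totals)

# Hypothetical output of the first pass.
products = {
    ("d1", "sports", "f1"): 1.5,
    ("d1", "sports", "f2"): 2.0,
    ("d1", "politics", "f1"): 4.5,
}
scores = sum_per_doc_label(products)
```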

A third map-reduce can take the (docid:label, weight) pairs and emit
docid => label:weight; then the reducer takes the label with the minimum
weight and outputs the result.
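
The final selection step might look like the following sketch. It assumes, as stated above, that the label with the minimum summed weight wins; the function name and the sample scores are hypothetical.

```python
def pick_min_weight_label(scores):
    """For each docid, keep the label whose summed weight is smallest."""
    best = {}
    for (doc_id, label), weight in scores.items():
        if doc_id not in best or weight < best[doc_id][1]:
            best[doc_id] = (label, weight)
    return {doc_id: label for doc_id, (label, _weight) in best.items()}

# Hypothetical output of the second pass.
scores = {
    ("d1", "sports"): 3.5,
    ("d1", "politics"): 4.5,
    ("d2", "sports"): 2.0,
    ("d2", "politics"): 1.0,
}
predictions = pick_min_weight_label(scores)
```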

Any thoughts?

Robin




On Sat, Jul 19, 2008 at 1:18 AM, Steven Handerson (JIRA) <[EMAIL PROTECTED]>


  
> Complementary Naive Bayes
> -------------------------
>
>                 Key: MAHOUT-60
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-60
>             Project: Mahout
>          Issue Type: Sub-task
>          Components: Classification
>            Reporter: Robin Anil
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.1
>
>         Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, 
> twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper 
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
