How big is your model when you are done training? Just curious.
At least in the M-9 case, the storage model is not optimized at all.
I think Karl used Lucene to store the model in a non-M/R version.
Maybe that would be useful.
On Jul 21, 2008, at 12:45 PM, Steven Handerson (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615302#action_12615302 ]
Steven Handerson commented on MAHOUT-60:
----------------------------------------
Well, yes, I was thinking of batch classification (like constructing
confusion matrices, or running the training data back through the
model to test it).
But the problem I'm running into with the code is that the model
is too large to load in a single process, let alone in multiple mappers.
So classifying fast doesn't help if simply loading the model is very
slow (and I mean very slow), and even then it doesn't necessarily
succeed -- it can run out of memory.
I also admit that "batch classification" -- in the sense that there is
overlap in the feature sets of different documents -- makes it more
interesting / saves some work perhaps, but you can't count on that
overlap anyway.
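To make the batch-testing idea concrete, here is a minimal sketch of
the confusion-matrix bookkeeping I have in mind -- illustrative Java
only, not Mahout's API; the class and method names are my own:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal confusion-matrix accumulator: counts (actual, predicted)
// label pairs as classified documents stream past.
public class ConfusionMatrix {
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  // Record one classified document.
  public void add(String actual, String predicted) {
    counts.computeIfAbsent(actual, k -> new HashMap<>())
          .merge(predicted, 1, Integer::sum);
  }

  // How many documents with true label 'actual' were labeled 'predicted'.
  public int count(String actual, String predicted) {
    return counts.getOrDefault(actual, new HashMap<>())
                 .getOrDefault(predicted, 0);
  }
}
```

Running held-out (or training) data through the model and feeding each
(actual, predicted) pair into something like this is all I mean by
"results" below.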
Yes, you might want something fast for single-document classification,
but map-reduce isn't the right tool for that. Indexed structures
are better.
The choices are either some indexed structure (like HBase) that can
handle large datasets / models, or just using map-reduce to join
the model to the data. The latter is definitely not useless --
use cases similarly divide into people who have a lot of data / docs
to classify versus people who are building some kind of online system:
throughput versus round-trip time.
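To illustrate the join I mean: key both the model and the documents by
feature, and at the reduce step combine one feature's per-label weights
with that feature's per-document counts into partial scores, which a
later pass sums per (document, label). A hypothetical sketch of just
the reduce logic (plain Java, no Hadoop API; all names are mine, not
from any patch):

```java
import java.util.HashMap;
import java.util.Map;

// One reduce call in a feature-keyed join of model to documents.
public class ModelJoinReduce {
  // labelWeights: the model's per-label weight for this one feature.
  // docCounts: occurrence count of this feature in each document.
  // Returns partial scores keyed "docId/label"; a downstream step
  // sums these per key and picks the best label per document.
  public static Map<String, Double> reduce(Map<String, Double> labelWeights,
                                           Map<String, Integer> docCounts) {
    Map<String, Double> partial = new HashMap<>();
    for (Map.Entry<String, Integer> doc : docCounts.entrySet()) {
      for (Map.Entry<String, Double> lw : labelWeights.entrySet()) {
        partial.put(doc.getKey() + "/" + lw.getKey(),
                    doc.getValue() * lw.getValue());
      }
    }
    return partial;
  }
}
```

The point is that no process ever holds the whole model: each reducer
sees only one feature's slice of it.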
Also, note that with an indexed solution you might have contention
for the indexed data -- if there's only one copy (which should
probably be the case for large models).
So I'd suggest implementing both, and considering the cases where the
models are very large (which is where map-reduce shines anyway).
I might be the only person commenting who has tried a lot of data
(an 800MB input document file), and as I said it would be nice to have
some results (confusion matrices) to see if the method is working for
me and my particular data.
If nobody else agrees, I might have to try it myself, but I'm new at
this and sometimes get pulled away for other work.
Complementary Naive Bayes
-------------------------
Key: MAHOUT-60
URL: https://issues.apache.org/jira/browse/MAHOUT-60
Project: Mahout
Issue Type: Sub-task
Components: Classification
Reporter: Robin Anil
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 0.1
Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, twcnb.jpg
The focus is to implement an improved text classifier based on this
paper http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ