How big is your model when you are done training? Just curious.
At least in the M-9 case, the storage model is not optimized at all.
I think Karl used Lucene to store the model in a non-M/R version.
Maybe that would be useful.
On Jul 21, 2008, at 12:45 PM, Steven Handerson (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615302#action_12615302 ]
Steven Handerson commented on MAHOUT-60:
----------------------------------------
Well, yes, I was thinking of batch classification (like constructing
confusion matrices, or running the training data back through the
model to test it).
But the problem I'm running into with the code is that the model
is too large to load in a single process, let alone in multiple mappers.
So classifying fast doesn't help if simply loading the model is very
slow (and I mean very slow), and even then it doesn't necessarily
succeed -- it can run out of memory.
I also admit that "batch classification" -- in the sense that there is
overlap in the feature sets of different documents -- makes it more
interesting / saves some work perhaps, but you can't count on that
overlap anyway.
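To make the batch-testing idea concrete, here is a minimal sketch of
the confusion-matrix bookkeeping I have in mind -- illustrative Java
only, not Mahout's API; the class and method names are my own:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal confusion-matrix accumulator: counts (actual, predicted)
// label pairs as classified documents stream past.
public class ConfusionMatrix {
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  // Record one classified document.
  public void add(String actual, String predicted) {
    counts.computeIfAbsent(actual, k -> new HashMap<>())
          .merge(predicted, 1, Integer::sum);
  }

  // How many documents with true label 'actual' were labeled 'predicted'.
  public int count(String actual, String predicted) {
    return counts.getOrDefault(actual, new HashMap<>())
                 .getOrDefault(predicted, 0);
  }
}
```

Running held-out (or training) data through the model and feeding each
(actual, predicted) pair into something like this is all I mean by
"results" below.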
Yes, you might want something fast for single-document classification,
but map-reduce isn't the right tool for that. Indexed structures
are better.
The choices are either some indexed structure (like HBase) that can
handle large datasets / models, or just using map-reduce to join
the model to the data. The latter is definitely not useless --
use cases similarly divide into people who have a lot of data / docs
to classify versus people who are building some kind of online system:
throughput versus round-trip time.
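To illustrate the join I mean: key both the model and the documents by
feature, and at the reduce step combine one feature's per-label weights
with that feature's per-document counts into partial scores, which a
later pass sums per (document, label). A hypothetical sketch of just
the reduce logic (plain Java, no Hadoop API; all names are mine, not
from any patch):

```java
import java.util.HashMap;
import java.util.Map;

// One reduce call in a feature-keyed join of model to documents.
public class ModelJoinReduce {
  // labelWeights: the model's per-label weight for this one feature.
  // docCounts: occurrence count of this feature in each document.
  // Returns partial scores keyed "docId/label"; a downstream step
  // sums these per key and picks the best label per document.
  public static Map<String, Double> reduce(Map<String, Double> labelWeights,
                                           Map<String, Integer> docCounts) {
    Map<String, Double> partial = new HashMap<>();
    for (Map.Entry<String, Integer> doc : docCounts.entrySet()) {
      for (Map.Entry<String, Double> lw : labelWeights.entrySet()) {
        partial.put(doc.getKey() + "/" + lw.getKey(),
                    doc.getValue() * lw.getValue());
      }
    }
    return partial;
  }
}
```

The point is that no process ever holds the whole model: each reducer
sees only one feature's slice of it.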
Also, note that with an indexed solution you might have contention
for the indexed data -- if there's only one copy (which should
probably be the case for large models).
So I'd suggest implementing both, and considering the cases where the
models are very large (which is where map-reduce shines anyway).
I might be the only person commenting who has tried a lot of data
(an 800MB input document file), and as I said it would be nice to have
some results (confusion matrices) to see if the method is working for
me and my particular data.
If nobody else agrees, I might have to try it myself, but I'm new at
this and sometimes get pulled away for other work.
Complementary Naive Bayes
-------------------------
Key: MAHOUT-60
URL: https://issues.apache.org/jira/browse/MAHOUT-60
Project: Mahout
Issue Type: Sub-task
Components: Classification
Reporter: Robin Anil
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 0.1
Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, twcnb.jpg
The focus is to implement an improved text classifier based on this
paper http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ