On Sep 27, 2010, at 3:51 PM, Neal Richter wrote:

> Neil,
>
> Is your classification task online or offline? I.e., will you need a
> classification for a piece of text live within some web service?
>
> IF OFFLINE:
>
> I've put up a very easy-to-use implementation of Naive Bayes here:
> http://github.com/nealrichter/ddj_naivebayes
>
> It's an extension of a Perl implementation from Dr. Dobb's Journal.
> The article is a good reference as well for people not already
> familiar with the math. I suggest you experiment with this before
> attempting to scale up with Mahout+Hadoop.
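For anyone who wants to see the math in one screenful, here is a
from-scratch sketch of the same idea in Java. It is NOT the
ddj_naivebayes code (that repo is Perl); it is just the textbook
multinomial model with add-one smoothing, and its train() method is
essentially the per-label word counting Neal describes below.

    import java.util.*;

    // Textbook multinomial Naive Bayes (a sketch, not ddj_naivebayes).
    // train() does per-label word counting; classify() picks the label
    // maximizing log P(label) + sum_w log P(w|label), with add-one
    // (Laplace) smoothing so unseen words don't zero out a label.
    public class TinyNaiveBayes {
        private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // label -> word -> count
        private final Map<String, Integer> wordTotals = new HashMap<>();           // label -> total words
        private final Map<String, Integer> docCounts = new HashMap<>();            // label -> training docs
        private final Set<String> vocab = new HashSet<>();
        private int totalDocs = 0;

        public void train(String label, String text) {
            totalDocs++;
            docCounts.merge(label, 1, Integer::sum);
            Map<String, Integer> c = counts.computeIfAbsent(label, k -> new HashMap<>());
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                c.merge(w, 1, Integer::sum);
                wordTotals.merge(label, 1, Integer::sum);
                vocab.add(w);
            }
        }

        public String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : docCounts.keySet()) {
                double score = Math.log(docCounts.get(label) / (double) totalDocs);
                double denom = wordTotals.getOrDefault(label, 0) + vocab.size();
                Map<String, Integer> c = counts.getOrDefault(label, Collections.emptyMap());
                for (String w : text.toLowerCase().split("\\W+")) {
                    if (w.isEmpty()) continue;
                    score += Math.log((c.getOrDefault(w, 0) + 1) / denom);
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }

A couple of train("pos", ...) / train("neg", ...) calls and one
classify() call are enough to see it work; TFIDF and real tokenization
are the obvious next steps, as Neal says.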
Running a single-node, local version is trivial in Mahout. It might
take a little bit longer, but likely not all that much.

> Note that this implementation does not do any TFIDF normalization as
> the Mahout ones do.
>
> The great thing about the above Naive Bayes is that the 'training' of
> the model is a trivial extension of the "word count" job of Hadoop
> 101. Output is "<word>, <label>, <count>".
>
> Obviously one should layer in TFIDF for better accuracy, once you
> understand the basics.
>
> IF ONLINE:
>
> If your application will require online classification of text,
> Mahout+Hadoop really only helps for the training phase... assuming
> your software can't wait minutes for an answer from Hadoop.
>
> For quick-n-dirty text classification I've simply used Solr:
>
> 1) Load your training examples as documents into Solr. The simple
>    approach is one document per label.
> 2) Search the index with the text you wish to classify.
> 3) Come up with some mechanism to use the Solr scores to make final
>    decisions.
> 4) Full boosting syntax and fields from Solr are usable for more
>    structured classifications.
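Steps 2 and 3 are only a few lines with SolrJ. A minimal sketch, with
made-up field names (a "label" field holding the class, the default
search field holding the training text) and an arbitrary decision
threshold; CommonsHttpSolrServer is the SolrJ client class of this
era (later releases renamed it HttpSolrServer, then HttpSolrClient).

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    // Classify text by searching an index that holds one document per
    // label (step 1 above) and reading the top hits' scores (step 3).
    public class SolrClassifier {
        private final CommonsHttpSolrServer solr;

        public SolrClassifier(String url) throws Exception {
            this.solr = new CommonsHttpSolrServer(url);
        }

        public String classify(String text) throws Exception {
            // Strip Lucene query syntax characters so quotes, colons,
            // etc. in the input can't break the query.
            String cleaned = text.replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
            SolrQuery q = new SolrQuery(cleaned);
            q.set("fl", "label,score"); // ask Solr to return scores too
            q.setRows(2);
            QueryResponse rsp = solr.query(q);
            if (rsp.getResults().isEmpty()) return null;
            SolrDocument top = rsp.getResults().get(0);
            // One possible decision rule: accept the top label only if
            // it clearly beats the runner-up. The 1.2 margin is made up.
            if (rsp.getResults().size() > 1) {
                float s0 = (Float) top.getFieldValue("score");
                float s1 = (Float) rsp.getResults().get(1).getFieldValue("score");
                if (s0 < 1.2f * s1) return null; // too close to call
            }
            return (String) top.getFieldValue("label");
        }
    }

Skip the character stripping if you want step 4's boosting and field
syntax to pass through in the query.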
> A system I wrote to do this has been live in EC2 for almost 2 years,
> doing about 50M classifications per day across 35+ topical labels.
> About 20B usages so far; it works fine and is accurate enough for our
> needs.
>
> Here are some references on using TFIDF for text classification:
> http://www.google.com/search?q=tfidf+for+text+classification
>
> Thanks - Neal
>
> On Mon, Sep 27, 2010 at 11:53 AM, Neil Ghosh <[email protected]> wrote:
>> Hi Grant,
>>
>> Thanks so much for responding. You can reply to this on the mailing
>> list. I have changed my problem to a slightly more common one.
>>
>> I have already gone through the tutorial written by you on the IBM
>> site. It was very good to start with. Thanks anyway.
>>
>> To be specific, my problem is to classify a piece of text crawled
>> from the web into two classes:
>>
>> 1. It is +ve feedback.
>> 2. It is -ve feedback.
>>
>> I can use the Twenty Newsgroups example and create a model with some
>> text (maybe a large number of texts) by feeding the trainer these
>> two labels. Should I leave everything to the trainer completely like
>> this?
>>
>> Or do I have the flexibility to give some other input specific to my
>> problem? For example, that words like "problem" and "complaint" are
>> more likely to appear in a text containing a grievance.
>>
>> Please let me know if you have any ideas or need more info from my
>> side.
>>
>> Thanks
>> Neil
>>
>> On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
>>>
>>>> Are there any other examples/documents/references on how to use
>>>> Mahout for text classification?
>>>>
>>>> I went through and ran the following:
>>>>
>>>> 1. Wikipedia Bayes Example
>>>>    <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html>
>>>>    - Classify Wikipedia data.
>>>> 2. Twenty Newsgroups
>>>>    <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html>
>>>>    - Classify the classic Twenty Newsgroups data.
>>>>
>>>> However, these two are not very definitive and there isn't much
>>>> explanation for the examples. Please share if there is more
>>>> documentation.
>>>
>>> What kinds of problems are you looking to solve? In general, we
>>> don't have too much in the way of special things for text, other
>>> than various utilities for converting text into Mahout's vector
>>> format based on various weighting schemes. Both of those examples
>>> just convert the text into vectors and then either train or test on
>>> them. I would agree, though, that a good tutorial is needed. It's a
>>> bit out of date in terms of the actual commands, but I believe the
>>> concepts are still accurate:
>>> http://www.ibm.com/developerworks/java/library/j-mahout/
>>>
>>> See
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground
>>> (and the creating vectors section). Also see the Algorithms section.
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>
>> --
>> Thanks and Regards
>> Neil
>> http://neilghosh.com
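One footnote on the TFIDF weighting that comes up several times above:
the core formula is tiny. This is the textbook version; Mahout's
actual implementation layers normalization and smoothing on top, so
treat it as the idea rather than Mahout's exact code.

    // Textbook TFIDF: weight(t, d) = tf(t, d) * log(N / df(t)).
    public class Tfidf {
        // tf      = the term's count within one document
        // df      = how many documents contain the term
        // numDocs = total documents in the corpus
        public static double weight(int tf, int df, int numDocs) {
            return tf * Math.log(numDocs / (double) df);
        }

        public static void main(String[] args) {
            // A term occurring 3 times in a doc, in 10 of 1000 docs:
            System.out.println(weight(3, 10, 1000)); // ~13.8
        }
    }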
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search