Hi Neal,

Thanks for your input. I am doing the classification task offline, and I have to use Mahout and Hadoop for scalability anyway. Does the Perl implementation work with Mahout and Hadoop?
Thanks
Neil

On Tue, Sep 28, 2010 at 1:21 AM, Neal Richter <[email protected]> wrote:
> Neil,
>
> Is your classification task online or offline? I.e., will you need a
> classification for a piece of text live within some web service?
>
> IF OFFLINE:
>
> I've put up a very easy-to-use implementation of Naive Bayes here:
> http://github.com/nealrichter/ddj_naivebayes
>
> It's an extension of a Perl implementation from Dr. Dobb's Journal.
> The article is also a good reference for people not already familiar
> with the math. I suggest you experiment with this before attempting
> to scale up with Mahout + Hadoop. Note that this implementation does
> not do any TFIDF normalization as the Mahout ones do.
>
> The great thing about the above Naive Bayes is that 'training' the
> model is a trivial extension of the "word count" job of Hadoop 101.
> The output is "<word>, <label>, <count>".
>
> Obviously one should layer in TFIDF for better accuracy, once you
> understand the basics.
>
> IF ONLINE:
>
> If your application will require online classification of text,
> Mahout + Hadoop really only helps for the training phase... assuming
> your software can't wait minutes for an answer from Hadoop.
>
> For quick-and-dirty text classification I've simply used Solr:
>
> 1) Load your training examples as documents into Solr. A simple
> approach is one document per label.
> 2) Search the index with the text you wish to classify.
> 3) Come up with some mechanism to use the Solr scores to make the
> final decision.
> 4) Solr's full boosting syntax and fields are usable for more
> structured classifications.
>
> A system I wrote to do this has been live in EC2 for almost 2 years,
> doing about 50M classifications per day across 35+ topical labels.
> About 20B usages so far; it works fine and is accurate enough for our
> needs.
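[To make Neal's point concrete: the "word count" training he describes can be sketched in a few lines. This is a hypothetical Python illustration, not the actual ddj_naivebayes Perl code; names and the add-one smoothing are my assumptions.]

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count (word, label) pairs -- the 'word count' style training.
    The counts correspond to the "<word>, <label>, <count>" output."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one smoothing so unseen words don't zero out a label."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (made up for illustration)
examples = [
    ("great product works well", "positive"),
    ("love it great service", "positive"),
    ("terrible problem complaint", "negative"),
    ("bad service big problem", "negative"),
]
wc, lc = train(examples)
print(classify("great service", wc, lc))   # -> positive
print(classify("big complaint", wc, lc))   # -> negative
```

[The training step really is just a grouped word count, which is why it maps so naturally onto a Hadoop map-reduce job; only the classify step needs the probabilities.]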
>
> Here are some references on using TFIDF for text classification:
> http://www.google.com/search?q=tfidf+for+text+classification
>
> Thanks - Neal
>
> On Mon, Sep 27, 2010 at 11:53 AM, Neil Ghosh <[email protected]> wrote:
> > Hi Grant,
> >
> > Thanks so much for responding. You can reply to this on the mailing
> > list. I have changed my problem to a slightly more common one.
> >
> > I have already gone through the tutorial you wrote on the IBM site.
> > It was very good to start with. Thanks anyway.
> > To be specific, my problem is to classify a piece of text crawled
> > from the web into two classes:
> >
> > 1. It is positive feedback.
> > 2. It is negative feedback.
> >
> > I can use the Twenty Newsgroups example and create a model with some
> > text (maybe a large number of texts) by inputting the trainer with
> > these two labels. Should I leave everything to the trainer completely
> > like this?
> >
> > Or do I have the flexibility to give some other input specific to my
> > problem? For example, words like "problem", "complaint", etc. are
> > more likely to appear in a text containing a grievance.
> >
> > Please let me know if you have any ideas or need more info from my
> > side.
> >
> > Thanks
> > Neil
> >
> > On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
> >>
> >> > Are there any other examples/documents/references on how to use
> >> > Mahout for text classification?
> >> >
> >> > I went through and ran the following:
> >> >
> >> > 1. Wikipedia Bayes Example
> >> > <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html> -
> >> > Classify Wikipedia data.
> >> >
> >> > 2. Twenty Newsgroups
> >> > <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html> -
> >> > Classify the classic Twenty Newsgroups data.
> >> >
> >> > However, these two are not very definitive, and there isn't much
> >> > explanation for the examples. Please share if there is more
> >> > documentation.
> >>
> >> What kinds of problems are you looking to solve? In general, we
> >> don't have too much in the way of special things for text, other
> >> than various utilities for converting text into Mahout's vector
> >> format based on various weighting schemes. Both of those examples
> >> just take the text, convert it into vectors, and then either train
> >> or test on them. I would agree, though, that a good tutorial is
> >> needed. It's a bit out of date in terms of the actual commands, but
> >> I believe the concepts are still accurate:
> >> http://www.ibm.com/developerworks/java/library/j-mahout/
> >>
> >> See
> >> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground
> >> (and the creating-vectors section). Also see the Algorithms section.
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >>
> >
> > --
> > Thanks and Regards
> > Neil
> > http://neilghosh.com

--
Thanks and Regards
Neil
http://neilghosh.com
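[For reference, the TFIDF weighting that both Neal and the Mahout vector utilities mention can be sketched as below. This is a minimal illustration of the standard tf * log(N/df) formula, not Mahout's actual weighting code; the function name and example documents are made up.]

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by term frequency times inverse document
    frequency, so terms common across all documents are down-weighted."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                     # term -> number of docs containing it
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

docs = ["the product is great",
        "the product has a problem",
        "great service"]
vecs = tfidf(docs)
# "the" appears in two of the three docs, so in the second document it
# gets a lower weight than "problem", which appears in only one
```

[This is the "layer in TFIDF for better accuracy" step: feed these weights into the classifier instead of raw counts.]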
