Hi Neal,

Thanks for your input. I am doing the classification task offline, and I have to use Mahout and Hadoop for scalability anyway. Does the Perl implementation work with Mahout and Hadoop?
Thanks
Neil

On Tue, Sep 28, 2010 at 1:21 AM, Neal Richter <[email protected]> wrote:
> Neil,
>
> Is your classification task online or offline? I.e., will you need a
> classification for a piece of text live within some web service?
>
> IF OFFLINE:
>
> I've put up a very easy-to-use implementation of Naive Bayes here:
> http://github.com/nealrichter/ddj_naivebayes
>
> It's an extension of a Perl implementation from Dr. Dobb's Journal.
> The article is also a good reference for people not already familiar
> with the math. I suggest you experiment with this before attempting
> to scale up with Mahout + Hadoop. Note that this implementation does
> not do any TFIDF normalization as the Mahout ones do.
>
> The great thing about the above Naive Bayes is that 'training' the
> model is a trivial extension of the "word count" job of Hadoop 101.
> The output is "<word>, <label>, <count>".
>
> Obviously one should layer in TFIDF for better accuracy, once you
> understand the basics.
>
> IF ONLINE:
>
> If your application will require online classification of text,
> Mahout + Hadoop really only helps for the training phase... assuming
> your software can't wait minutes for an answer from Hadoop.
>
> For quick-and-dirty text classification I've simply used Solr:
>
> 1) Load your training examples as documents into Solr. A simple
> approach is one document per label.
> 2) Search the index with the text you wish to classify.
> 3) Come up with some mechanism to use the Solr scores to make the
> final decision.
> 4) Solr's full boosting syntax and fields are usable for more
> structured classifications.
>
> A system I wrote to do this has been live in EC2 for almost 2 years,
> doing about 50M classifications per day across 35+ topical labels.
> About 20B usages so far; it works fine and is accurate enough for our
> needs.
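[To make Neal's point concrete: the "word count" training he describes can be sketched in a few lines. This is a hypothetical Python illustration, not the actual ddj_naivebayes Perl code; names and the add-one smoothing are my assumptions.]

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count (word, label) pairs -- the 'word count' style training.
    The counts correspond to the "<word>, <label>, <count>" output."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one smoothing so unseen words don't zero out a label."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (made up for illustration)
examples = [
    ("great product works well", "positive"),
    ("love it great service", "positive"),
    ("terrible problem complaint", "negative"),
    ("bad service big problem", "negative"),
]
wc, lc = train(examples)
print(classify("great service", wc, lc))   # -> positive
print(classify("big complaint", wc, lc))   # -> negative
```

[The training step really is just a grouped word count, which is why it maps so naturally onto a Hadoop map-reduce job; only the classify step needs the probabilities.]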
>
> Here are some references on using TFIDF for text classification:
> http://www.google.com/search?q=tfidf+for+text+classification
>
> Thanks - Neal
>
> On Mon, Sep 27, 2010 at 11:53 AM, Neil Ghosh <[email protected]> wrote:
> > Hi Grant,
> >
> > Thanks so much for responding. You can reply to this on the mailing
> > list. I have changed my problem to a slightly more common one.
> >
> > I have already gone through the tutorial you wrote on the IBM site.
> > It was very good to start with. Thanks anyway.
> > To be specific, my problem is to classify a piece of text crawled
> > from the web into two classes:
> >
> > 1. It is positive feedback.
> > 2. It is negative feedback.
> >
> > I can use the Twenty Newsgroups example and create a model with some
> > text (maybe a large number of texts) by inputting the trainer with
> > these two labels. Should I leave everything to the trainer completely
> > like this?
> >
> > Or do I have the flexibility to give some other input specific to my
> > problem? For example, words like "problem", "complaint", etc. are
> > more likely to appear in a text containing a grievance.
> >
> > Please let me know if you have any ideas or need more info from my
> > side.
> >
> > Thanks
> > Neil
> >
> > On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
> >>
> >> > Are there any other examples/documents/references on how to use
> >> > Mahout for text classification?
> >> >
> >> > I went through and ran the following:
> >> >
> >> > 1. Wikipedia Bayes Example
> >> > <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html> -
> >> > Classify Wikipedia data.
> >> >
> >> > 2. Twenty Newsgroups
> >> > <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html> -
> >> > Classify the classic Twenty Newsgroups data.
> >> >
> >> > However, these two are not very definitive, and there isn't much
> >> > explanation for the examples. Please share if there is more
> >> > documentation.
> >>
> >> What kinds of problems are you looking to solve? In general, we
> >> don't have too much in the way of special things for text, other
> >> than various utilities for converting text into Mahout's vector
> >> format based on various weighting schemes. Both of those examples
> >> just take the text, convert it into vectors, and then either train
> >> or test on them. I would agree, though, that a good tutorial is
> >> needed. It's a bit out of date in terms of the actual commands, but
> >> I believe the concepts are still accurate:
> >> http://www.ibm.com/developerworks/java/library/j-mahout/
> >>
> >> See
> >> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground
> >> (and the creating-vectors section). Also see the Algorithms section.
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >>
> >
> > --
> > Thanks and Regards
> > Neil
> > http://neilghosh.com

--
Thanks and Regards
Neil
http://neilghosh.com
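[For reference, the TFIDF weighting that both Neal and the Mahout vector utilities mention can be sketched as below. This is a minimal illustration of the standard tf * log(N/df) formula, not Mahout's actual weighting code; the function name and example documents are made up.]

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by term frequency times inverse document
    frequency, so terms common across all documents are down-weighted."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                     # term -> number of docs containing it
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

docs = ["the product is great",
        "the product has a problem",
        "great service"]
vecs = tfidf(docs)
# "the" appears in two of the three docs, so in the second document it
# gets a lower weight than "problem", which appears in only one
```

[This is the "layer in TFIDF for better accuracy" step: feed these weights into the classifier instead of raw counts.]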
