On Sep 27, 2010, at 3:51 PM, Neal Richter wrote:

> Neil,
>
> Is your classification task online or offline? I.e., will you need a
> classification for a piece of text live within some web service?
>
> IF OFFLINE:
>
> I've put up a very easy-to-use implementation of Naive Bayes here:
> http://github.com/nealrichter/ddj_naivebayes
>
> It's an extension of a Perl implementation from Dr. Dobb's Journal.
> The article is a good reference as well for people not already
> familiar with the math. I suggest you experiment with this before
> attempting to scale up with Mahout+Hadoop.
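For anyone who wants to see the math in one screenful, here is a
from-scratch sketch of the same idea in Java. It is NOT the
ddj_naivebayes code (that repo is Perl); it is just the textbook
multinomial model with add-one smoothing, and its train() method is
essentially the per-label word counting Neal describes below.

    import java.util.*;

    // Textbook multinomial Naive Bayes (a sketch, not ddj_naivebayes).
    // train() does per-label word counting; classify() picks the label
    // maximizing log P(label) + sum_w log P(w|label), with add-one
    // (Laplace) smoothing so unseen words don't zero out a label.
    public class TinyNaiveBayes {
        private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // label -> word -> count
        private final Map<String, Integer> wordTotals = new HashMap<>();           // label -> total words
        private final Map<String, Integer> docCounts = new HashMap<>();            // label -> training docs
        private final Set<String> vocab = new HashSet<>();
        private int totalDocs = 0;

        public void train(String label, String text) {
            totalDocs++;
            docCounts.merge(label, 1, Integer::sum);
            Map<String, Integer> c = counts.computeIfAbsent(label, k -> new HashMap<>());
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                c.merge(w, 1, Integer::sum);
                wordTotals.merge(label, 1, Integer::sum);
                vocab.add(w);
            }
        }

        public String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : docCounts.keySet()) {
                double score = Math.log(docCounts.get(label) / (double) totalDocs);
                double denom = wordTotals.getOrDefault(label, 0) + vocab.size();
                Map<String, Integer> c = counts.getOrDefault(label, Collections.emptyMap());
                for (String w : text.toLowerCase().split("\\W+")) {
                    if (w.isEmpty()) continue;
                    score += Math.log((c.getOrDefault(w, 0) + 1) / denom);
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }

A couple of train("pos", ...) / train("neg", ...) calls and one
classify() call are enough to see it work; TFIDF and real tokenization
are the obvious next steps, as Neal says.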
Running a single-node, local version is trivial in Mahout. It might
take a little bit longer, but likely not all that much.

> Note that this implementation does not do any TFIDF normalization as
> the Mahout ones do.
>
> The great thing about the above Naive Bayes is that the 'training' of
> the model is a trivial extension of the "word count" job of Hadoop
> 101. Output is "<word>, <label>, <count>".
>
> Obviously one should layer in TFIDF for better accuracy, once you
> understand the basics.
>
> IF ONLINE:
>
> If your application will require online classification of text,
> Mahout+Hadoop really only helps for the training phase... assuming
> your software can't wait minutes for an answer from Hadoop.
>
> For quick-n-dirty text classification I've simply used Solr:
>
> 1) Load your training examples as documents into Solr. The simple
>    approach is one document per label.
> 2) Search the index with the text you wish to classify.
> 3) Come up with some mechanism to use the Solr scores to make final
>    decisions.
> 4) Full boosting syntax and fields from Solr are usable for more
>    structured classifications.
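Steps 2 and 3 are only a few lines with SolrJ. A minimal sketch, with
made-up field names (a "label" field holding the class, the default
search field holding the training text) and an arbitrary decision
threshold; CommonsHttpSolrServer is the SolrJ client class of this
era (later releases renamed it HttpSolrServer, then HttpSolrClient).

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    // Classify text by searching an index that holds one document per
    // label (step 1 above) and reading the top hits' scores (step 3).
    public class SolrClassifier {
        private final CommonsHttpSolrServer solr;

        public SolrClassifier(String url) throws Exception {
            this.solr = new CommonsHttpSolrServer(url);
        }

        public String classify(String text) throws Exception {
            // Strip Lucene query syntax characters so quotes, colons,
            // etc. in the input can't break the query.
            String cleaned = text.replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
            SolrQuery q = new SolrQuery(cleaned);
            q.set("fl", "label,score"); // ask Solr to return scores too
            q.setRows(2);
            QueryResponse rsp = solr.query(q);
            if (rsp.getResults().isEmpty()) return null;
            SolrDocument top = rsp.getResults().get(0);
            // One possible decision rule: accept the top label only if
            // it clearly beats the runner-up. The 1.2 margin is made up.
            if (rsp.getResults().size() > 1) {
                float s0 = (Float) top.getFieldValue("score");
                float s1 = (Float) rsp.getResults().get(1).getFieldValue("score");
                if (s0 < 1.2f * s1) return null; // too close to call
            }
            return (String) top.getFieldValue("label");
        }
    }

Skip the character stripping if you want step 4's boosting and field
syntax to pass through in the query.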
> A system I wrote to do this has been live in EC2 for almost 2 years,
> doing about 50M classifications per day across 35+ topical labels.
> About 20B usages so far; it works fine and is accurate enough for our
> needs.
>
> Here are some references on using TFIDF for text classification:
> http://www.google.com/search?q=tfidf+for+text+classification
>
> Thanks - Neal
>
> On Mon, Sep 27, 2010 at 11:53 AM, Neil Ghosh <[email protected]> wrote:
>> Hi Grant,
>>
>> Thanks so much for responding. You can reply to this on the mailing
>> list. I have changed my problem to a slightly more common one.
>>
>> I have already gone through the tutorial written by you on the IBM
>> site. It was very good to start with. Thanks anyway.
>>
>> To be specific, my problem is to classify a piece of text crawled
>> from the web into two classes:
>>
>> 1. It is +ve feedback.
>> 2. It is -ve feedback.
>>
>> I can use the Twenty Newsgroups example and create a model with some
>> text (maybe a large number of texts) by feeding the trainer these
>> two labels. Should I leave everything to the trainer completely like
>> this?
>>
>> Or do I have the flexibility to give some other input specific to my
>> problem? For example, that words like "problem" and "complaint" are
>> more likely to appear in a text containing a grievance.
>>
>> Please let me know if you have any ideas or need more info from my
>> side.
>>
>> Thanks
>> Neil
>>
>> On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
>>>
>>>> Are there any other examples/documents/references on how to use
>>>> Mahout for text classification?
>>>>
>>>> I went through and ran the following:
>>>>
>>>> 1. Wikipedia Bayes Example
>>>>    <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html>
>>>>    - Classify Wikipedia data.
>>>> 2. Twenty Newsgroups
>>>>    <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html>
>>>>    - Classify the classic Twenty Newsgroups data.
>>>>
>>>> However, these two are not very definitive and there isn't much
>>>> explanation for the examples. Please share if there is more
>>>> documentation.
>>>
>>> What kinds of problems are you looking to solve? In general, we
>>> don't have too much in the way of special things for text, other
>>> than various utilities for converting text into Mahout's vector
>>> format based on various weighting schemes. Both of those examples
>>> just convert the text into vectors and then either train or test on
>>> them. I would agree, though, that a good tutorial is needed. It's a
>>> bit out of date in terms of the actual commands, but I believe the
>>> concepts are still accurate:
>>> http://www.ibm.com/developerworks/java/library/j-mahout/
>>>
>>> See
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground
>>> (and the creating vectors section). Also see the Algorithms section.
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>
>> --
>> Thanks and Regards
>> Neil
>> http://neilghosh.com
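One footnote on the TFIDF weighting that comes up several times above:
the core formula is tiny. This is the textbook version; Mahout's
actual implementation layers normalization and smoothing on top, so
treat it as the idea rather than Mahout's exact code.

    // Textbook TFIDF: weight(t, d) = tf(t, d) * log(N / df(t)).
    public class Tfidf {
        // tf      = the term's count within one document
        // df      = how many documents contain the term
        // numDocs = total documents in the corpus
        public static double weight(int tf, int df, int numDocs) {
            return tf * Math.log(numDocs / (double) df);
        }

        public static void main(String[] args) {
            // A term occurring 3 times in a doc, in 10 of 1000 docs:
            System.out.println(weight(3, 10, 1000)); // ~13.8
        }
    }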
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search