Maybe you can get some feedback from Jason Rennie (one of the authors of the paper you linked) on your implementation - I seem to remember seeing some comments from him on this mailing list about a week ago.
Matt

On Tue, Mar 25, 2008 at 8:55 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
> Hi Isabel,
>
> On Tue, Mar 25, 2008 at 2:52 AM, Isabel Drost <[EMAIL PROTECTED]> wrote:
> >
> > On Monday 24 March 2008, Robin Anil wrote:
> >
> > > The Complement Naive Bayes classifier (coded up for this project) is then
> > > run on the retrieved document to do post-processing.
> >
> > The ideas presented in the slides look pretty interesting to me. Could you
> > please provide some pointers to information on the Complement Naive Bayes
> > classifier? What were the reasons you chose this classifier?
>
> Before going into Complement Naive Bayes, there are certain things to know
> about text classification. Given a good amount of data, as is the case with
> textual data, Naive Bayes surprisingly performs better than most other
> supervised learners. The reason, as I see it, is that Naive Bayes class
> margins are so bluntly defined that the chance of overfitting is rare. This
> is also the reason why, given the proper features, Naive Bayes doesn't
> measure up to other methods. So you may say Naive Bayes is a good classifier
> for textual data. Now Complement Naive Bayes does the reverse: instead of
> calculating which class fits the document best, it calculates which
> complement class least fits the document. It also removes the bias problem
> due to the prior probability term in the NB equation. You may be interested
> in reading the paper which talks more about it, here:
> <http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>. My
> BaseClassifier implementation reproduces the work there, but for the
> different classifiers (SpamDetection, Subjectivity, Polarity), all of them
> inherit the base classifier while the feature selection module is
> overloaded for each of them.
>
> As you can see, all of them except Polarity (classes are Pos, Neg, Neutral)
> are binary classifiers, where CNB is exactly the same as NB (just a -ve
> sign difference).
> But other things like normalization made a lot of difference in removing
> false positives and biased classes.
>
> > > If it's possible to have the classifier run along with Lucene and
> > > spit out sentences and add them to a field in real time, it would
> > > essentially enable this system to be online and allow for real-time
> > > queries.
> >
> > So what you are hoping for is a system that can crawl and answer queries
> > at the same time, integrating more and more information as it becomes
> > available, right?
>
> Yes and no.
> Yes, because the system needs to go through the index, get documents,
> process the sentences, and get all opinions, not necessarily the target.
> No, because the queries aren't fixed. If you disregard the TREC queries, say
> a person is sitting there asking for an opinion about a target. He may type
> "Nokia 6600" or "My left hand". Now, I would have to go through the DB,
> find everything which talks about Nokia and the other, and do
> post-processing if it's not yet processed. Another reason is that the
> ranking of the results becomes a problem. How do I say which among the 1000
> results gives the better opinion? The doc that talks more about the target,
> or the one which has more opinions about the target? Neither; we need to
> rank them based on the output of the classification algorithms.
>
> This is where I see the use of Mahout. Say we have the core Lucene
> architecture modded with Mahout. If I can give the results of a Mahout
> classifier to Lucene for the ranking function, based on subjectivity,
> polarity, etc., not only will it become easy to implement good IR systems
> for research, it can give rise to some real funky use cases for complex
> production IR systems.
>
> > > I would gladly answer any queries except results
> >
> > Hmm, so for this competition there is no sample dataset available to test
> > the performance of the algorithms against?
> > Sounds like there is no way to determine which of two competing solutions
> > is better except making two submissions...
>
> Well, throughout the year, competing researchers give one or two queries
> and hand-made results, which are compiled and tested against each other.
>
> > Isabel
> >
> > --
> > The ideal voice for radio may be defined as showing no substance, no sex,
> > no owner, and a message of importance for every housewife.  -- Harry V. Wade
> >
> >  |\      _,,,---,,_       Web: <http://www.isabel-drost.de>
> >  /,`.-'`'    -.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM: <xmpp://[EMAIL PROTECTED]>
>
> --
> Robin Anil
> 4th Year Dual Degree Student
> Department of Computer Science & Engineering
> IIT Kharagpur
>
> --------------------------------------------------------------------------------------------
> techdigger.wordpress.com
> A discursive take on the world around us
>
> www.minekey.com
> You Might Like This
>
> www.ithink.com
> Express Yourself
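[Editor's note] The Complement Naive Bayes idea Robin describes — score each class by how well its *complement* (the pooled counts of all other classes) fits the document, and pick the class whose complement fits least — can be sketched in a few lines. This is a toy illustration of the ICML 2003 paper's basic CNB estimate, not Robin's actual BaseClassifier; the names `train_cnb` and `classify_cnb` and the smoothing parameter `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_cnb(docs, alpha=1.0):
    """Estimate Complement Naive Bayes word weights.

    docs: list of (label, tokens) pairs. For each class c, the weight of
    word w is the smoothed log-probability of w under the *complement*
    of c, i.e. the pooled counts of every other class.
    (Hypothetical sketch, not the list author's implementation.)
    """
    vocab = set()
    class_counts = defaultdict(Counter)
    for label, tokens in docs:
        class_counts[label].update(tokens)
        vocab.update(tokens)
    weights = {}
    for c in class_counts:
        comp = Counter()  # pooled counts of every class except c
        for other, counts in class_counts.items():
            if other != c:
                comp.update(counts)
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {w: math.log((comp[w] + alpha) / denom)
                      for w in vocab}
    return weights

def classify_cnb(weights, tokens):
    """Pick the class whose complement *least* fits the document,
    i.e. the smallest summed complement log-probability."""
    scores = {c: sum(cw.get(t, 0.0) for t in tokens)
              for c, cw in weights.items()}
    return min(scores, key=scores.get)
```

Note that the paper also applies weight normalization (the "W" in WCNB), which matches Robin's remark that normalization made a big difference in removing biased classes; that step is omitted from this sketch.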