Maybe you can get some feedback from Jason Rennie (one of the authors of the paper you linked) on your implementation - I seem to remember seeing some comments from him on this mailing list about a week ago.
Matt

On Tue, Mar 25, 2008 at 8:55 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
> Hi Isabel,
>
> On Tue, Mar 25, 2008 at 2:52 AM, Isabel Drost <[EMAIL PROTECTED]> wrote:
> >
> > On Monday 24 March 2008, Robin Anil wrote:
> >
> > > The Complement Naive Bayes classifier (coded up for this project) is then
> > > run on the retrieved document to do post-processing.
> >
> > The ideas presented in the slides look pretty interesting to me. Could you
> > please provide some pointers to information on the Complement Naive Bayes
> > classifier? What were the reasons you chose this classifier?
>
> Before going into Complement Naive Bayes, there are certain things to know
> about text classification. Given a good amount of data, as is the case with
> textual data, Naive Bayes surprisingly performs better than most other
> supervised learners. The reason, as I see it, is that Naive Bayes class
> margins are so bluntly defined that the chance of overfitting is rare. This
> is also the reason why, given the proper features, Naive Bayes doesn't
> measure up to other methods. So you may say Naive Bayes is a good classifier
> for textual data. Now Complement Naive Bayes does the reverse: instead of
> calculating which class fits the document best, it calculates which
> complement class least fits the document. It also removes the bias problem
> due to the prior probability term in the NB equation. You may be interested
> in reading the paper which talks more about it, here:
> <http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>. My
> BaseClassifier implementation reproduces the work there, but for the
> different classifiers (SpamDetection, Subjectivity, Polarity), all of them
> inherit the base classifier while the feature selection module is
> overloaded for each of them.
>
> As you can see, all of them except Polarity (classes are Pos, Neg, Neutral)
> are binary classifiers, where CNB is exactly the same as NB (just a -ve
> sign difference).
> But other things like normalization made a lot of difference in removing
> false positives and biased classes.
>
> > > If it's possible to have the classifier run along with Lucene and
> > > spit out sentences and add them to a field in real time, it would
> > > essentially enable this system to be online and allow for real-time
> > > queries.
> >
> > So what you are hoping for is a system that can crawl and answer queries
> > at the same time, integrating more and more information as it becomes
> > available, right?
>
> Yes and no.
> Yes, because the system needs to go through the index, get documents,
> process the sentences, and get all opinions, not necessarily the target.
> No, because the queries aren't fixed. If you disregard the TREC queries, say
> a person is sitting there asking for an opinion about a target. He may type
> "Nokia 6600" or "My left hand". Now, I would have to go through the DB,
> find everything which talks about Nokia and the other, and do
> post-processing if it's not yet processed. Another reason is that the
> ranking of the results becomes a problem. How do I say which among the 1000
> results gives the better opinion? The doc that talks more about the target,
> or the one which has more opinions about the target? Neither; we need to
> rank them based on the output of the classification algorithms.
>
> This is where I see the use of Mahout. Say we have the core Lucene
> architecture modded with Mahout. If I can give the results of a Mahout
> classifier to Lucene for the ranking function, based on subjectivity,
> polarity, etc., not only will it become easy to implement good IR systems
> for research, it can give rise to some real funky use cases for complex
> production IR systems.
>
> > > I would gladly answer any queries except results
> >
> > Hmm, so for this competition there is no sample dataset available to test
> > the performance of the algorithms against?
> > Sounds like there is no way to determine which of two competing solutions
> > is better except making two submissions...
>
> Well, throughout the year, competing researchers give one or two queries
> and hand-made results, which are compiled and tested against each other.
>
> > Isabel
> >
> > --
> > The ideal voice for radio may be defined as showing no substance, no sex,
> > no owner, and a message of importance for every housewife.  -- Harry V. Wade
> >
> >  |\      _,,,---,,_       Web: <http://www.isabel-drost.de>
> >  /,`.-'`'    -.  ;-;;,_
> >  |,4-  ) )-,_..;\ (  `'-'
> > '---''(_/--'  `-'\_) (fL)  IM: <xmpp://[EMAIL PROTECTED]>
>
> --
> Robin Anil
> 4th Year Dual Degree Student
> Department of Computer Science & Engineering
> IIT Kharagpur
>
> --------------------------------------------------------------------------------------------
> techdigger.wordpress.com
> A discursive take on the world around us
>
> www.minekey.com
> You Might Like This
>
> www.ithink.com
> Express Yourself
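[Editor's note] The Complement Naive Bayes idea Robin describes — score each class by how well its *complement* (the pooled counts of all other classes) fits the document, and pick the class whose complement fits least — can be sketched in a few lines. This is a toy illustration of the ICML 2003 paper's basic CNB estimate, not Robin's actual BaseClassifier; the names `train_cnb` and `classify_cnb` and the smoothing parameter `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_cnb(docs, alpha=1.0):
    """Estimate Complement Naive Bayes word weights.

    docs: list of (label, tokens) pairs. For each class c, the weight of
    word w is the smoothed log-probability of w under the *complement*
    of c, i.e. the pooled counts of every other class.
    (Hypothetical sketch, not the list author's implementation.)
    """
    vocab = set()
    class_counts = defaultdict(Counter)
    for label, tokens in docs:
        class_counts[label].update(tokens)
        vocab.update(tokens)
    weights = {}
    for c in class_counts:
        comp = Counter()  # pooled counts of every class except c
        for other, counts in class_counts.items():
            if other != c:
                comp.update(counts)
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {w: math.log((comp[w] + alpha) / denom)
                      for w in vocab}
    return weights

def classify_cnb(weights, tokens):
    """Pick the class whose complement *least* fits the document,
    i.e. the smallest summed complement log-probability."""
    scores = {c: sum(cw.get(t, 0.0) for t in tokens)
              for c, cw in weights.items()}
    return min(scores, key=scores.get)
```

Note that the paper also applies weight normalization (the "W" in WCNB), which matches Robin's remark that normalization made a big difference in removing biased classes; that step is omitted from this sketch.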