You definitely cannot give it just one category, so you'll need to come up with examples that are unlikely to have that property. In the case of something like sarcasm, you might guess that there is a fairly low rate of sarcasm in the general population of tweets, so you can just grab a bunch of non-#sarcasm tweets and call them the non-sarcastic ones. It's okay if some of them are actually sarcastic -- as long as you have a good number of #sarcasm tweets, that will just be in the noise. (They are all noisy labels, actually.)
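As a minimal sketch of that setup (the class name, method, and category labels below are just illustrative, and I'm assuming you already have the two tweet collections in hand), you could write the positives and the noisy negatives into the one-document-per-line training format the OpenNLP doccat trainer reads, with the category as the first token on each line:

import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

public class SarcasmTrainingSetBuilder {

    // Write a two-category training file: one document per line,
    // category first, then the (whitespace-normalized) tweet text.
    public static void writeTrainingFile(List<String> sarcasmTweets,
                                         List<String> randomTweets,
                                         String outPath) throws IOException {
        try (PrintWriter out = new PrintWriter(outPath, "UTF-8")) {
            for (String tweet : sarcasmTweets) {
                // Drop the hashtag itself so the model has to learn from
                // the remaining words, not from the label we selected on.
                String text = tweet.replaceAll("(?i)#sarcasm", " ");
                out.println("sarcastic " + normalize(text));
            }
            for (String tweet : randomTweets) {
                // Random tweets stand in as (noisy) negatives: a few will
                // really be sarcastic, but with enough positives that's
                // tolerable.
                out.println("not_sarcastic " + normalize(tweet));
            }
        }
    }

    // Collapse whitespace so each document stays on a single line.
    private static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }
}

That file can then be fed to the usual doccat/maxent training pipeline just as if you had hand-labeled negatives.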
Other fancier things could be done, but I'd try the above first!

On Fri, May 6, 2011 at 4:00 PM, Daniel Frank <[email protected]> wrote:

> OK, so this question isn't necessarily directly related to OpenNLP usage,
> but it may be something worth picking your brains over.
>
> We currently employ OpenNLP for a number of categorization applications,
> almost always with two categories. Often this is to determine whether a
> document does or does not have the property X. Now, there are several more
> applications we have in mind for which we can easily determine whether a
> document has the property X, but not whether it *doesn't* have that
> property. Let me give an example: let's say I was trying to train a
> classifier that could detect sarcasm in tweets. Twitter users will
> sometimes add #sarcasm to a sarcastic tweet, and sometimes not. Thus, we
> could easily obtain part of a training set by simply collecting tweets
> with the #sarcasm hashtag. However, we could not automatically gather
> tweets that definitely did not contain sarcasm.
>
> Has anyone got any thoughts about how one might train a model with only
> training data for one category? I'm fairly certain that plugging it into
> the OpenNLP maxent classifier wouldn't produce any sensible results.
> Perhaps something to do with clustering? Or some variant of SVM where we
> could try to determine distance from a perceived 'center of mass' of a
> training set? Very curious to hear the group's thoughts, let me know if
> anything occurs to you guys. Cheers,
>
> Dan
>
> PS - I'm aware that there is previous
> work <http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf>
> on the sarcasm question - I was just using it as an example

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
