OK, so this question isn't necessarily directly related to OpenNLP usage, but it may be something worth picking your brains over.
We currently employ OpenNLP for a number of categorization applications, almost always with two categories. Often this is to determine whether a document does or does not have the property X. Now, there are several more applications we have in mind for which we can easily determine whether a document has the property X, but not whether it *doesn't* have that property.

Let me give an example: let's say I was trying to train a classifier that could detect sarcasm in tweets. Twitter users will sometimes add #sarcasm to a sarcastic tweet, and sometimes not. Thus, we could easily obtain part of a training set by simply collecting tweets with the #sarcasm hashtag. However, we could not automatically gather tweets that definitely did not contain sarcasm.

Has anyone got any thoughts about how one might train a model with only training data for one category? I'm fairly certain that plugging it into the OpenNLP maxent classifier wouldn't produce any sensible results. Perhaps something to do with clustering? Or some variant of SVM where we could try to determine distance from a perceived 'center of mass' of a training set?

Very curious to hear the group's thoughts, let me know if anything occurs to you guys.

Cheers,
Dan

PS - I'm aware that there is previous work <http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf> on the sarcasm question - I was just using it as an example.
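
PPS - To make the 'center of mass' idea a bit more concrete, here is a rough sketch in plain Java of what I have in mind. It does not use any OpenNLP calls. The class name, the whitespace tokenizer, and the idea of thresholding on cosine similarity are all placeholders I made up for illustration, and a simple centroid is only a stand-in for a proper one-class SVM, so treat it as a starting point for discussion rather than a working solution.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical one-class scorer: cosine similarity to the centroid of positive-only examples. */
final class CentroidOneClass {
    private final Map<String, Double> centroid = new HashMap<>();

    /** Lower-cased bag-of-words counts; whitespace tokenization is a placeholder assumption. */
    static Map<String, Double> vectorize(String text) {
        Map<String, Double> v = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) {
                v.merge(tok, 1.0, Double::sum);
            }
        }
        return v;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /** Builds the centroid by averaging the length-normalized vectors of the positive examples. */
    CentroidOneClass(List<String> positives) {
        for (String doc : positives) {
            Map<String, Double> v = vectorize(doc);
            double n = norm(v);
            if (n == 0.0) {
                continue; // skip empty documents
            }
            for (Map.Entry<String, Double> e : v.entrySet()) {
                centroid.merge(e.getKey(), e.getValue() / (n * positives.size()), Double::sum);
            }
        }
    }

    /** Cosine similarity between a document and the centroid; 0.0 if either is empty. */
    double score(String text) {
        Map<String, Double> v = vectorize(text);
        double denom = norm(v) * norm(centroid);
        if (denom == 0.0) {
            return 0.0;
        }
        double dot = 0.0;
        for (Map.Entry<String, Double> e : v.entrySet()) {
            dot += e.getValue() * centroid.getOrDefault(e.getKey(), 0.0);
        }
        return dot / denom;
    }

    /** Accepts a document as "has property X" when it is close enough to the centroid. */
    boolean accept(String text, double threshold) {
        return score(text) >= threshold;
    }
}
```

The threshold would have to be tuned somehow, for instance on a held-out slice of the #sarcasm tweets, which is exactly the part I'm unsure about when there are no confirmed negatives.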
