OK, so this question isn't necessarily directly related to OpenNLP usage,
but it may be something worth picking your brains over.

We currently employ OpenNLP for a number of categorization applications,
almost always with two categories. Often this is to determine whether a
document does or does not have the property X. Now, there are several more
applications we have in mind for which we can easily determine whether a
document has the property X, but not whether it *doesn't* have that
property. Let me give an example: let's say I was trying to train a
classifier that could detect sarcasm in tweets. Twitter users will sometimes
add #sarcasm to a sarcastic tweet, and sometimes not. Thus, we could easily
obtain part of a training set by simply collecting tweets with the #sarcasm
hashtag. However, we could not automatically gather tweets that definitely
did not contain sarcasm.

Has anyone got any thoughts about how one might train a model with only
training data for one category? I'm fairly certain that plugging it into the
OpenNLP maxent classifier wouldn't produce any sensible results. Perhaps
something to do with clustering? Or some variant of SVM where we could try
to determine distance from a perceived 'center of mass' of a training set?
Very curious to hear the group's thoughts, let me know if anything occurs to
you guys. Cheers,

Dan

PS - I'm aware that there is previous work
<http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf> on
the sarcasm question; I was just using it as an example.
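
PPS - To make the 'center of mass' idea above a bit more concrete, here is
a rough sketch in plain Python of what I have in mind. It is only an
illustration under my own assumptions, not anything OpenNLP provides: the
function names (vectorize, cosine_distance, train_centroid, looks_like_x),
the bag-of-words features, the cosine distance and the quantile-based
threshold are all choices I made up for the example. It averages the
positive documents into a centroid and accepts a new document if it lies
no further from the centroid than most of the training documents do.

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words counts, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    # An empty vector shares nothing with anything: treat it as maximally far.
    return 1.0 - dot / (na * nb) if na and nb else 1.0


def train_centroid(positive_docs, quantile=0.9):
    """Return (centroid, threshold) learned from positive examples only.

    The threshold is the distance below which roughly `quantile` of the
    training documents fall (an assumed heuristic, to be tuned).
    """
    vecs = [vectorize(d) for d in positive_docs]
    centroid = Counter()
    for v in vecs:
        for w, x in v.items():
            centroid[w] += x / len(vecs)
    dists = sorted(cosine_distance(v, centroid) for v in vecs)
    threshold = dists[min(len(dists) - 1, int(quantile * len(dists)))]
    return dict(centroid), threshold


def looks_like_x(doc, centroid, threshold):
    """True if the document is at least as close to the centroid as the cutoff."""
    return cosine_distance(vectorize(doc), centroid) <= threshold
```

In the sarcasm example one would presumably strip the #sarcasm tag itself
before training, otherwise the tag dominates the centroid. I have no idea
yet how well something this simple would separate the two classes in
practice, which is really what I'm asking about.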
