Hi,

I think your task is what is called 'one-class classification'. A typical
scenario is spam detection, where you can safely model spam messages but
cannot model 'not spam'.

You might want to look into outlier detection techniques. I am no expert
on them, but I've read that they can be quite successful.
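
To make the idea concrete, here is a rough sketch in plain Python (not
OpenNLP) of one simple outlier-style approach: build a bag-of-words centroid
from the positive examples only, then accept a new text if its cosine
similarity to that centroid is at least as high as that of the least typical
training examples. The names vectorize, cosine and CentroidOneClass, the
quantile parameter and the toy tweets are all made up for illustration; treat
it as a starting point under those assumptions, not a tested recipe.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts: lowercase the text and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class CentroidOneClass:
    """Hypothetical one-class model trained on positive examples only."""

    def fit(self, positives, quantile=0.1):
        if not positives:
            raise ValueError("need at least one positive example")
        vectors = [vectorize(t) for t in positives]
        # Summed counts stand in for the mean vector; cosine ignores scale.
        self.centroid = Counter()
        for vec in vectors:
            self.centroid.update(vec)
        # Threshold: the similarity at the given quantile of the training set,
        # so roughly `quantile` of the positives would be treated as outliers.
        sims = sorted(cosine(v, self.centroid) for v in vectors)
        index = min(int(quantile * len(sims)), len(sims) - 1)
        self.threshold = sims[index]
        return self

    def score(self, text):
        """Similarity of a new text to the centroid of the positives."""
        return cosine(vectorize(text), self.centroid)

    def predict(self, text):
        """True if the text looks like the training class, False otherwise."""
        return self.score(text) >= self.threshold


# Toy positives (assumed to have had the label hashtag stripped already).
positives = [
    "oh great another monday",
    "i just love waiting in line",
    "great my flight is delayed again",
    "i love it when my phone dies",
]
model = CentroidOneClass().fit(positives)
print(model.score("oh great my train is delayed"), model.threshold)
```

With real data you would want better features (n-grams, tf-idf weighting) and
a threshold tuned on held-out positives. A one-class SVM is the usual next
step up from this kind of centroid distance.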

Hope it helps.


________________________________________
From: [email protected] [[email protected]] On Behalf Of Daniel Frank 
[[email protected]]
Sent: 06 May 2011 22:00
To: [email protected]
Subject: Text Categorization With Only One Category

OK, so this question isn't necessarily directly related to OpenNLP usage,
but it may be something worth picking your brains over.

We currently employ OpenNLP for a number of categorization applications,
almost always with two categories. Often this is to determine whether a
document does or does not have the property X. Now, there are several more
applications we have in mind for which we can easily determine whether a
document has the property X, but not whether it *doesn't* have that
property. Let me give an example: let's say I was trying to train a
classifier that could detect sarcasm in tweets. Twitter users will sometimes
add #sarcasm to a sarcastic tweet, and sometimes not. Thus, we could easily
obtain part of a training set by simply collecting tweets with the #sarcasm
hashtag. However, we could not automatically gather tweets that definitely
did not contain sarcasm.

Has anyone got any thoughts about how one might train a model with only
training data for one category? I'm fairly certain that plugging it into the
OpenNLP maxent classifier wouldn't produce any sensible results. Perhaps
something to do with clustering? Or some variant of SVM where we could try
to determine distance from a perceived 'center of mass' of a training set?
Very curious to hear the group's thoughts, let me know if anything occurs to
you guys. Cheers,

Dan

PS - I'm aware that there is previous work
<http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf> on
the sarcasm question - I was just using it as an example
