I'm new to NLP, but I think OpenNLP will solve my problem. I'm trying to classify user-entered sentences about a prescribed situation into two sets of categories. One set relates to the content of the sentence (for instance, AboutBob, AboutWeather, etc.). The other set relates to the likely emotion and nature of the sentence (for instance, IsPraise, IsAssertion, IsInsult, etc.).
I plan on using the Document Categorizer; however, I have no idea how much training data I'll need, and I'll have to write the training data myself. Can you give me a ballpark range for the number of training sentences per category I should aim for in each corpus (in other words, what are the usual ranges for this kind of project)? Also, should I aim to include as many variations on the training sentences as possible? Right now I'm trying to estimate the amount of work required so I can roughly estimate the time I'll need to complete the project. -- Jonathan
