Hello All,

I appreciate the advice. I did try training a larger model (800ish samples, 8 categories) and it performed better. Still, if I type absolute nonsense like "asdfasdfsadf", the evaluation `opennlp Doccat <model>` must return a category -- so I'd like to be able to programmatically determine some confidence level. Perhaps I can reject all categories in the app if the confidence score is below a threshold? Is that possible right now?
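A minimal sketch of that thresholding idea, written against plain arrays so it runs without OpenNLP on the classpath. It assumes the per-category probabilities are available as a `double[]` (I believe `DocumentCategorizerME.categorize(...)` returns such an array and `getCategory(i)` maps an index to its name, but please check against the version in use). The class name, method name, category names, and the 0.5 cutoff are all hypothetical, chosen only for illustration.

```java
import java.util.Optional;

class ThresholdedCategory {

    /**
     * Returns the highest-probability category, or empty if even the best
     * category falls below the given threshold.
     *
     * categories[i] is assumed to be the name belonging to probs[i].
     */
    static Optional<String> bestAbove(String[] categories, double[] probs, double threshold) {
        if (categories.length != probs.length || probs.length == 0) {
            throw new IllegalArgumentException(
                "categories and probs must be non-empty and of equal length");
        }
        // Find the index of the largest probability.
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        // Reject every category when the winner is not confident enough.
        return probs[best] >= threshold
            ? Optional.of(categories[best])
            : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical category names and hand-written probabilities.
        String[] categories = {"Plumbing", "Lumber", "Paint"};

        double[] confident = {0.05, 0.10, 0.85};
        double[] uncertain = {0.36, 0.33, 0.31};

        System.out.println(bestAbove(categories, confident, 0.5));
        System.out.println(bestAbove(categories, uncertain, 0.5));
    }
}
```

One caveat: the probabilities are spread over the trained categories and sum to 1, so nonsense input is not guaranteed to produce a low maximum. The cutoff would need to be tuned on held-out test data rather than trusted as is.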
Thanks again for your help!

Patrick Baggett
Online Engineer - Search Team
e: [email protected]
p: +1 (214) 202-8964

-----Original Message-----
From: Mark G [mailto:[email protected]]
Sent: Monday, October 27, 2014 7:47 PM
To: [email protected]
Subject: Re: Getting started with OpenNLP

I think you bring up a good point inadvertently, I have run into this before, my use case was that I wanted a probability that the input text matched my samples for one class...sometimes you just need one.... I ended up just using a simple feature generator and just using a similarity measure. I can see a use case for a fuzzy scorer against a set of samples for only one category. I believe right now in the Doccat if you only have one category you always get a score of 1 for anything you pass in...regardless of how it matches any of the samples simply because it's the only one, which is really not so good.

On Mon, Oct 27, 2014 at 6:03 PM, Joern Kottmann <[email protected]> wrote:

> On Mon, 2014-10-27 at 19:26 +0000, [email protected] wrote:
> > So in other words, for this model, there is just one class (in a
> > more complex example, there would be a number of classes). I trained
> > the model and did some testing, but everything is classified as "MyClass".
>
> The model can only assign the classes it sees in the training data. If
> you only have one class in your training data, then that is the only
> class the model can assign. Actually the model always computes the
> probability for each class, and many applications then just look for
> the best class.
>
> We should probably add a warning to the trainer which says that
> training with only one class doesn't make sense.
>
> I suggest that you try to train with a couple of classes, but at least
> two.
>
> Here are two tips on how to create a model, maybe they are useful.
>
> - Make sure to use a good amount of training data. You probably need a
> few hundred samples to get a model that somehow works.
>
> - And to determine how well the model works you should prepare some
> test data to be able to evaluate on many samples and not just a few
> hand picked ones. This can be done with the evaluation tool.
>
> HTH,
> Jörn
