I think that if nothing in the input matches the model at all, every category will end up with the same score.
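
If the scores do come out (nearly) equal for input that matches nothing, then rejecting below a threshold, as Patrick suggests, should be workable in the application. Here is a minimal sketch in Java, assuming you already have the per-category probabilities in hand (in OpenNLP these would come from `DocumentCategorizerME.categorize(...)`, with the names from `getCategory(i)`). `ConfidenceFilter`, `bestAboveThreshold` and the 0.5 cutoff are hypothetical names and values of mine, not part of OpenNLP:

```java
import java.util.Optional;

/** Hypothetical helper (not an OpenNLP class): rejects low-confidence classifications. */
final class ConfidenceFilter {

    private ConfidenceFilter() {}

    /**
     * Returns the highest-scoring category, or Optional.empty() when even the
     * best probability is below minConfidence.
     */
    static Optional<String> bestAboveThreshold(String[] categories, double[] probs, double minConfidence) {
        if (categories.length != probs.length || probs.length == 0) {
            throw new IllegalArgumentException("categories and probs must be non-empty and the same length");
        }
        // Find the index of the largest probability.
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return probs[best] >= minConfidence ? Optional.of(categories[best]) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] cats = {"A", "B", "C", "D"};
        double[] uniform = {0.25, 0.25, 0.25, 0.25};   // nothing matched: all scores equal
        double[] confident = {0.05, 0.85, 0.05, 0.05}; // one clear winner

        System.out.println(bestAboveThreshold(cats, uniform, 0.5));   // Optional.empty
        System.out.println(bestAboveThreshold(cats, confident, 0.5)); // Optional[B]
    }
}
```

With 8 categories a uniform result would be 0.125 per category, so any cutoff comfortably above 1/N should reject it. The right cutoff is something to tune against held-out test data rather than guess.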

> On Oct 28, 2014, at 10:03 AM, <[email protected]> wrote:
>
> Hello All,
>
> I appreciate the advice. I did try training a larger model (800ish samples, 8
> categories) and it performed better. Still, if type absolute non-sense like
> "asdfasdfsadf", the evaluation `opennlp Doccat <model>` must return a
> category -- so I'd like to be able to programmatically determine some
> confidence level. Perhaps I can reject all categories in the app if the
> confidence score is below a threshold? Is that possible right now?
>
> Thanks again for your help!
>
> Patrick Baggett
> Online Engineer - Search Team
> e: [email protected]
> p: +1 (214) 202-8964
>
> -----Original Message-----
> From: Mark G [mailto:[email protected]]
> Sent: Monday, October 27, 2014 7:47 PM
> To: [email protected]
> Subject: Re: Getting started with OpenNLP
>
> I think you bring up a good point inadvertently, I have run into this
> before, my use case was that I wanted a probability that the input text
> matched my samples for one class...sometimes you just need one.... I ended up
> just using a simple feature generator and just using a similarity measure. I
> can see a use case for a fuzzy scorer against a set of samples for only one
> category. I believe right now in the Doccat if you only have one category you
> always get a score of 1 for anything you pass in...regardless of how it
> matches any of the samples simply because it's the only one, which is really
> not so good.
>
>> On Mon, Oct 27, 2014 at 6:03 PM, Joern Kottmann <[email protected]> wrote:
>>
>> On Mon, 2014-10-27 at 19:26 +0000, [email protected] wrote:
>>> So in other words, for this model, there is just one class (in a
>>> more complex example, there would be a number of classes). I trained
>>> the model and did some testing, but everything is classified as "MyClass".
>>
>> The model can only assign the classes it sees in the training data. If
>> you only have one class in your training data, then that is the only
>> class the model can assign. Actually the model always computes the
>> probability for each class, and many applications then just look for
>> the best class.
>>
>> We should probably add a warning to the trainer which says that
>> training with only one class doesn't make sense.
>>
>> I suggest that you try to train with a couple of classes, but at least
>> two.
>>
>> Here are two tips on how to create a model, maybe they are useful.
>>
>> - Make sure to use a good amount of training data. You probably need a
>>   few hundred samples to get a model that somehow works.
>>
>> - And to determine how well the model works you should prepare some
>>   test data to be able to evaluate on many samples and not just a few
>>   hand picked ones. This can be done with the evaluation tool.
>>
>> HTH,
>> Jörn
