Hi Jörn et al,

I've been looking at the Doccat module and I can see how it can be useful, but 
I'm having a bit of a hard time understanding a small detail. It seems to be 
unable to map to "no category". I've prepared a training file as follows:

MyClass Some Proper Noun 1
MyClass Some Proper Noun 2
MyClass Some Proper Noun 3
 ....
MyClass Some Proper Noun [n]

So in other words, for this model, there is just one class (in a more complex 
example, there would be a number of classes). I trained the model and did some 
testing, but everything is classified as "MyClass". Does it not have the 
ability to just say "I don't know"? For example, if I have a set of words 
w[0]..w[n] and I ask it to classify some word w* that is not equal to any 
word in w[0..n], must it return MyClass instead of "none of the above"?

With more categories, I expect this to be less of a problem, but ideally I'd 
like to be able to return any of the categories found _OR_ "none of the above". 
Any thoughts?
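
To make the question concrete, here is a minimal sketch of the behaviour I am 
after. It does not use the OpenNLP API: the class NoneOfTheAbove, the NONE 
label, the threshold values and the raw scores are all made up for 
illustration. It assumes (my understanding, not something I have confirmed) 
that the categorizer reports probabilities normalized over the categories seen 
in training. If that is right, a single-category model always ends up at 1.0, 
so a confidence cutoff can only help once there are at least two categories, 
for example an explicit "Other" class trained on negative examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
class NoneOfTheAbove {

    // Made-up label returned when no category is confident enough.
    public static final String NONE = "none of the above";

    // Turns raw non-negative scores into probabilities that sum to 1.
    public static Map<String, Double> normalize(Map<String, Double> scores) {
        double total = 0.0;
        for (double s : scores.values()) {
            total += s;
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), e.getValue() / total);
        }
        return probs;
    }

    // Returns the most probable category, or NONE if that probability
    // is below the given threshold.
    public static String classify(Map<String, Double> scores, double threshold) {
        String best = NONE;
        double bestProb = 0.0;
        for (Map.Entry<String, Double> e : normalize(scores).entrySet()) {
            if (e.getValue() > bestProb) {
                best = e.getKey();
                bestProb = e.getValue();
            }
        }
        return bestProb >= threshold ? best : NONE;
    }

    public static void main(String[] args) {
        // One category: however weak the raw score, it normalizes to 1.0,
        // so the threshold never rejects it.
        Map<String, Double> single = new LinkedHashMap<>();
        single.put("MyClass", 0.001);
        System.out.println(classify(single, 0.9));  // prints "MyClass"

        // Several categories with no clear winner: the cutoff applies.
        Map<String, Double> unsure = new LinkedHashMap<>();
        unsure.put("MyClass", 0.4);
        unsure.put("ClassA", 0.35);
        unsure.put("ClassB", 0.25);
        System.out.println(classify(unsure, 0.5));  // prints "none of the above"

        // Several categories with a clear winner.
        Map<String, Double> sure = new LinkedHashMap<>();
        sure.put("MyClass", 0.9);
        sure.put("Other", 0.1);
        System.out.println(classify(sure, 0.5));    // prints "MyClass"
    }
}
```

If doccat works differently from this assumption, the sketch at least shows the 
result I would like to get back: either one of the trained categories or an 
explicit "none of the above".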

Patrick Baggett
Online Engineer - Search Team
e: [email protected]
p: +1 (214) 202-8964

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: Monday, October 27, 2014 3:24 AM
To: [email protected]
Subject: Re: Getting started with OpenNLP

On 10/24/2014 06:20 PM, [email protected] wrote:
> Hello all,
>
> First off, thanks to all who contribute to this project! I've been tasked 
> with doing some research on Apache Stanbol, which uses OpenNLP, to see if it 
> can fill some roles in a few company projects. I've been reading about how to 
> train a model for named entity recognition and it seems like the simplest 
> case of "I have a list of n proper nouns, please just recognize them directly 
> and nothing else" isn't addressed in the documentation. Is this too simple a 
> use case? Would I be doing better to just use a simple substring match on a 
> phrase then? I would later like to extend the model to recognize things other 
> than just simple proper nouns, but for now, that is the simplest case I can 
> think of.

The name finder is intended to find entities which are embedded in a text, e.g. 
news articles, medical records or company filings. It can even recognize 
names which it hasn't seen before by evaluating the context the entity appears 
in.

If you just have a list of proper nouns you might be better off using the doccat 
package instead of the name finder. The doccat component assigns categories to 
the entire input text, whereas the name finder labels each input token 
individually.

HTH,
Jörn

