Xeon is not an ordinary word, so the model only finds Intel. Division and
Chief are probably treated as organization words, which would explain the
longer match in your second example.
You might get better results if you build your own organization training
set. The stock training sets are old, and names in the business world
change rapidly. Also, advertising text has its own terse syntax, while the
models are generally trained on more formal English.
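As a rough illustration of what a home-made training set involves: the OpenNLP name finder trainer reads one whitespace-tokenized sentence per line, with each mention wrapped in <START:organization> ... <END>. The class and method names below are hypothetical helpers written for this sketch, not part of OpenNLP, and the token indices are picked by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: wraps a token range in the
// <START:organization> ... <END> markup used by name finder training data.
final class TrainingLineSketch {

    // Marks tokens[start..end) as one organization mention.
    static String annotate(String[] tokens, int start, int end) {
        if (start < 0 || end > tokens.length || start >= end) {
            throw new IllegalArgumentException("invalid token range");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) out.add("<START:organization>");
            out.add(tokens[i]);
            if (i == end - 1) out.add("<END>");
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String[] tokens = "Find out how Intel Xeon processors help".split(" ");
        // Label only "Intel" (token index 3) as the organization.
        System.out.println(annotate(tokens, 3, 4));
        // prints: Find out how <START:organization> Intel <END> Xeon processors help
    }
}
```

You would need a good number of such lines, taken from the same kind of text you plan to run the model on, before training is likely to pay off.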
If you're working with tweets, CMU has a part-of-speech (POS) tagger built
for tweets. Cross-checking the name finder's output against its
noun/verb/etc. tags might help your results.
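One way the cross-check could look, as a minimal sketch: trim any non-noun tokens off the edges of a span the name finder returned. The tags here are hand-written Penn Treebank-style tags standing in for real tagger output (the tagger you use may have a different tag set), and the class and method names are made up for this example.

```java
import java.util.Arrays;

// Hypothetical post-filter, not part of OpenNLP or the CMU tagger.
final class PosCrossCheckSketch {

    // Treat Penn Treebank noun tags (NN, NNS, NNP, NNPS) as noun-like.
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    // Shrinks the token range [start, end) until both edges are nouns.
    // Returns the trimmed text, or "" if no noun remains in the range.
    static String trimToNouns(String[] tokens, String[] tags, int start, int end) {
        while (start < end && !isNoun(tags[start])) start++;
        while (end > start && !isNoun(tags[end - 1])) end--;
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = {"NYPD", "Intel", "Division", "Chief", "Lashes", "Out"};
        // Assumed tags for illustration only.
        String[] tags = {"NNP", "NNP", "NNP", "NNP", "VBZ", "RP"};
        // The name finder reported tokens 1..5: "Intel Division Chief Lashes".
        System.out.println(trimToNouns(tokens, tags, 1, 5));
        // prints: Intel Division Chief
    }
}
```

This only drops the stray verb. It does not decide whether "Intel" means the company here, which is where a better training set would have to help.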
On 09/13/2013 03:49 AM, Siva Sakthi wrote:
Hi,
We are using OpenNLP to find organizations (code below),
e.g.:
1. Find out how Intel Xeon processors help make #EMC number 1 in backup at
#IDF13 going on now in San Francisco. #Speed2Lead Protect your data
OpenNLP returns "Intel" for the above sentence.
2. NYPD Intel Division Chief Lashes Out At FBI Over Failed Terrorist Plot
http://t.co/V0XLKrp3TI
OpenNLP returns "Intel Division Chief Lashes".
Issue 1: I don't understand why it returns a composite string in the second
case instead of just "Intel".
Issue 2: The "Intel" in the second sentence does not really refer to Intel,
the company.
My code is as follows:
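import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public static String findOrg(String message) throws Exception {
    // Naive whitespace tokenization; punctuation stays attached to words.
    String[] words = message.split(" ");
    // Load the pre-trained organization model and close the stream afterwards.
    try (InputStream orgIs = new FileInputStream("en-ner-organization.bin")) {
        TokenNameFinderModel tnf = new TokenNameFinderModel(orgIs);
        NameFinderME nf = new NameFinderME(tnf);
        Span[] spans = nf.find(words);
        String[] names = Span.spansToStrings(spans, words);
        // Return one detected organization per line.
        StringBuilder sb = new StringBuilder();
        for (String name : names) {
            sb.append(name).append('\n');
        }
        return sb.toString();
    }
}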
Thanks,
Ss