Re: a call to arms (of sorts) ;)

James Kosin Thu, 30 Jun 2011 20:17:34 -0700

Jason,

-1, I don't think this characterizes OpenNLP correctly.

The coref functions extract content and information... and the namefinder could easily be trained to pick out e-mail addresses and other content.

Phone numbers would actually be more difficult, because in most cases there are no real clues that something really is a phone number... and it depends on the format, which is not standardized. E.g., a US number could be written as 1(757)555-1212, (757)555-1212, 757-555-1212, 1-757-555-1212, etc., and that doesn't include the international numbers or formats that also appear in the US.

E-mail addresses are a bit easier, with the '@' character and other requirements; but how many e-mail addresses are put into documents or articles on a regular basis? This would be more specialized, but very doable with the current architecture.
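
To make the format problem concrete, here is a rough sketch of a purely pattern-based approach. It is plain Python with the standard `re` module, not OpenNLP code. The names `US_PHONE`, `EMAIL`, `find_phones` and `find_emails` are made up for illustration, and the phone pattern is assumed to cover only the four US formats listed above. It would miss international numbers, and it would also match any other string that happens to have the same shape, which is the "no real clues" problem.

```python
import re

# Hypothetical pattern: covers only the four US formats mentioned above.
# International formats are deliberately out of scope.
US_PHONE = re.compile(
    r"""
    (?<!\d)                    # not preceded by another digit
    (?:1[-\s]?)?               # optional leading 1, optionally followed by - or a space
    (?:\(\d{3}\)\s?|\d{3}-)    # area code, either parenthesized or followed by a hyphen
    \d{3}-\d{4}                # seven-digit local number with a hyphen
    (?!\d)                     # not followed by another digit
    """,
    re.VERBOSE,
)

# Deliberately loose e-mail pattern keyed on the '@' character.
# It is an illustration, not a full address validator.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_phones(text):
    """Return every substring of text that matches the US_PHONE pattern."""
    return [m.group(0) for m in US_PHONE.finditer(text)]


def find_emails(text):
    """Return every substring of text that matches the EMAIL pattern."""
    return [m.group(0) for m in EMAIL.finditer(text)]


if __name__ == "__main__":
    sample = "Call 757-555-1212 or write to dev@example.org."
    print(find_phones(sample))
    print(find_emails(sample))
```

Each new format means another branch in the pattern, which is why a trained namefinder may be the more maintainable route if training data can be found.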

The real problem in all of these is finding and getting the training data for these types of information.


James

On 6/30/2011 9:30 PM, Jason Baldridge wrote:

Seth Grimes wrote the following article on Stanbol, OpenNLP and such:

http://www.cmswire.com/cms/enterprise-cms/iks-means-semantic-intelligence-for-content-management-but-will-it-survive-011848.php

I would highlight this paragraph on page 2:

OpenNLP has Apache incubation status, which is a sort-of provisional
acceptance by the Apache Software Foundation. It provides only basic NLP
functions. It doesn't support extraction of facts, events, or relationships,
nor sentiment or pattern-based information such as telephone numbers and
e-mail addresses. It appears, however, that FISE should be able to
accommodate other annotators, whether installed or invoked via calls to
entity-resolution Web services. There are many NLP engines that are more
advanced and capable than OpenNLP.

I guess a question is whether OpenNLP should be doing any of those things,
or just stay focussed on core NLP tasks that other software builds on.

Jason
