Jason, -1. I don't think this article classified OpenNLP correctly.
The coref functions extract content and information, and the name finder could easily be trained to pick out e-mail addresses and other content. Phone numbers would actually be more difficult, because in most cases there are no real clues that a string really is a phone number, and it depends on the format, which is not standardized. For example, a US number could be written as 1(757)555-1212, (757)555-1212, 757-555-1212, 1-757-555-1212, etc., and that doesn't include international numbers or the international formats that also appear in the US. E-mail addresses are a bit easier because of the '@' character and other requirements; but how many e-mail addresses are put into documents or articles on a regular basis? This would be more specialized, but very doable with the current architecture.
The real problem in all of these is finding and getting the training data for these types of information.
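To make the format problem concrete, here is a rough Python sketch of the pattern-based alternative (plain regular expressions, not OpenNLP code). The patterns and the names `US_PHONE`, `EMAIL`, `find_phones`, and `find_emails` are assumptions made up for illustration; they cover only the four US formats listed above and a deliberately simplified e-mail shape.

```python
import re

# Hypothetical pattern covering only the four US formats from this message:
# 1(757)555-1212, (757)555-1212, 757-555-1212, 1-757-555-1212
US_PHONE = re.compile(r"""
    (?<!\d)                        # do not start in the middle of a longer number
    (?:1[-\s]?)?                   # optional leading country code "1"
    (?:\(\d{3}\)\s?|\d{3}[-\s])    # area code: "(757)" or "757-"
    \d{3}-\d{4}                    # local number: "555-1212"
    (?!\d)                         # do not stop in the middle of a longer number
""", re.VERBOSE)

# Simplified e-mail shape: something@host.tld (far looser than the real rules)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_phones(text):
    """Return every substring that looks like one of the US formats above."""
    return [m.group(0) for m in US_PHONE.finditer(text)]


def find_emails(text):
    """Return every substring that looks like an e-mail address."""
    return [m.group(0) for m in EMAIL.finditer(text)]
```

With these patterns, a string such as `part no. 757-555-1212` would also be reported as a phone number, and an international number such as `+44 20 7946 0958` would be missed. That is the lack of context clues described above, and it is why a trained model with good training data may still be the better route.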
James

On 6/30/2011 9:30 PM, Jason Baldridge wrote:
Seth Grimes wrote the following article on Stanbol, OpenNLP and such:

http://www.cmswire.com/cms/enterprise-cms/iks-means-semantic-intelligence-for-content-management-but-will-it-survive-011848.php

I would highlight this paragraph on page 2:

OpenNLP has Apache incubation status, which is a sort-of provisional acceptance by the Apache Software Foundation. It provides only basic NLP functions. It doesn't support extraction of facts, events, or relationships, nor sentiment or pattern-based information such as telephone numbers and e-mail addresses. It appears, however, that FISE should be able to accommodate other annotators, whether installed or invoked via calls to entity-resolution Web services. There are many NLP engines that are more advanced and capable than OpenNLP.

I guess a question is whether OpenNLP should be doing any of those things, or just stay focussed on core NLP tasks that other software builds on.

Jason
