I've been working with OpenNLP sporadically over the years, and I am now 
upgrading to the current version. In doing so, I stumbled across some very odd 
(and undocumented) behavior.

Specifically, the Spans returned by NameFinderME.find() have start and end 
indexes that refer to tokens, not characters. OK - I can handle this. However, 
Span.getCoveredText(String text) is supposed to return the text covered by the 
span, i.e., the actual entity found. Instead, it passes those same start and 
end indexes (token indexes, not character offsets) to a substring operation, 
which produces incorrect results.
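
To make the two interpretations concrete, here is a minimal sketch in plain 
Java, with no OpenNLP dependency. The class and method names are hypothetical 
and only illustrate the difference; they are not OpenNLP's implementation.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: shows the two ways the same
// (start, end) pair can be interpreted.
class TokenSpanDemo {

    // Treats start/end as token indexes (end exclusive) and joins the tokens.
    static String coveredByTokens(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    // Treats the same start/end as character offsets (end exclusive).
    static String coveredByChars(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String[] tokens = {"Alice", "visited", "Paris", "yesterday", "."};
        String text = String.join(" ", tokens);

        System.out.println(coveredByTokens(tokens, 2, 3)); // Paris
        System.out.println(coveredByChars(text, 2, 3));    // i
    }
}
```

If I read the API correctly, Span.spansToStrings(Span[], String[]) performs 
the token-based join inside OpenNLP itself, so it may be the safer call for 
spans that come from the name finder.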

For instance, take the following (tokenized) sentence, using the standard 
English models:

10 people were killed in Orchard Road on 12 May 2014.

The name finder generates spans for a location (start=5, end=7) and a date 
(start=8, end=11). When you call getCoveredText on each of these spans, you 
would expect the following to be returned:

"Orchard Road" and "12 May 2014"

But instead, because it uses the token index as a character index, the 
following is actually returned:

"op" and "e w" (characters 5-6 and 8-10 of the sentence)
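
A possible workaround for the example above, sketched in plain Java: map each 
token to its character offsets in the original sentence, then take the 
substring from the first token's start to the last token's end. The class and 
method names are my own (hypothetical), and the sketch assumes the tokens 
appear in the text in order.

```java
// Hypothetical workaround, not part of OpenNLP: converts a token-index span
// into character offsets before taking the substring.
class TokenToCharSpans {

    // Locates each token in the text, scanning left to right, and returns
    // {begin, end} character offsets (end exclusive) for every token.
    static int[][] tokenOffsets(String text, String[] tokens) {
        int[][] offsets = new int[tokens.length][];
        int cursor = 0;
        for (int i = 0; i < tokens.length; i++) {
            int begin = text.indexOf(tokens[i], cursor);
            if (begin < 0) {
                throw new IllegalArgumentException("token not found: " + tokens[i]);
            }
            int end = begin + tokens[i].length();
            offsets[i] = new int[] {begin, end};
            cursor = end;
        }
        return offsets;
    }

    // start/end are token indexes (end exclusive), as in the spans above.
    static String coveredText(String text, String[] tokens, int start, int end) {
        int[][] offsets = tokenOffsets(text, tokens);
        return text.substring(offsets[start][0], offsets[end - 1][1]);
    }

    public static void main(String[] args) {
        String text = "10 people were killed in Orchard Road on 12 May 2014.";
        String[] tokens = {"10", "people", "were", "killed", "in", "Orchard",
                           "Road", "on", "12", "May", "2014", "."};

        System.out.println(coveredText(text, tokens, 5, 7));  // Orchard Road
        System.out.println(coveredText(text, tokens, 8, 11)); // 12 May 2014
        System.out.println(text.substring(5, 7));             // op
        System.out.println(text.substring(8, 11));            // e w
    }
}
```

The last two lines show what happens when the token indexes are used 
directly as character offsets on the same sentence.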

This seems to be an inconsistency, and should either be fixed or at least 
documented.


Edward Swing
Applied Research Technologist
Vision Systems + Technology, Inc., a SAS Company
6021 University Boulevard * Suite 360 * Ellicott City * Maryland * 21043
Tel: 410.418.5555 Ext: 919 * Fax: 410.418.8580
Email: [email protected]
Web: http://www.vsticorp.com
