Re: RegexNameFinder when entity spans multiple tokens

Jake Dodd Wed, 19 Nov 2014 14:59:11 -0800

Found the solution to this—for future reference:

The RegexNameFinder class takes a String[] of tokens, and joins the array of 
tokens with a single space, reconstructing the sentence. It also maps the 
indices of the tokens to locations in the reconstructed sentence. Then, it 
matches the patterns against the reconstructed sentence. If matches are found, 
it uses the start and end locations (in the sentence) to pull the token indices 
from the map.
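
The mechanism described above can be sketched roughly as follows. This is a minimal illustration built only on java.util.regex, not the actual OpenNLP source; the class name TokenSpanSketch and the method findTokenSpan are hypothetical, and the real RegexNameFinder may differ in details such as how it handles matches that do not line up with token boundaries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: join tokens with single spaces, remember where each
// token starts and ends in the joined sentence, then map a regex match back
// to token indices.
class TokenSpanSketch {

    /** Returns {firstToken, endTokenExclusive} for the first aligned match, or null. */
    static int[] findTokenSpan(String[] tokens, Pattern pattern) {
        Map<Integer, Integer> startToToken = new HashMap<>();
        Map<Integer, Integer> endToToken = new HashMap<>();
        StringBuilder sentence = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sentence.append(' ');                  // the single joining space
            }
            startToToken.put(sentence.length(), i);    // char offset where token i starts
            sentence.append(tokens[i]);
            endToToken.put(sentence.length(), i + 1);  // char offset where token i ends
        }

        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            Integer first = startToToken.get(m.start());
            Integer end = endToToken.get(m.end());
            // Keep only matches that begin and end exactly on token boundaries.
            if (first != null && end != null) {
                return new int[] {first, end};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] tokens = {"It", "cost", "$", "120", "billion"};
        int[] span = findTokenSpan(tokens, Pattern.compile("\\$ \\d+ billion"));
        System.out.println(span[0] + ".." + span[1]); // prints 2..5
    }
}
```

Here the joined sentence is "It cost $ 120 billion", and the match covers tokens 2 through 4 (end index 5 is exclusive).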


So, if the sentence “$120 billion” is tokenized as [“$”, “120”, “billion”], the 
reconstructed sentence will be “$ 120 billion”, with a space between “$” and 
“120”. The patterns in your RegexNameFinder need to account for this 
additional whitespace. YMMV.
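
To illustrate the difference, here is a small sketch that checks two patterns against the space-joined tokens. It uses only java.util.regex and String.join, not OpenNLP itself; the class MoneyPatternSketch, the two pattern constants, and the helper foundInJoinedTokens are hypothetical names chosen for this example, and the patterns are deliberately simplified.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: compare a pattern written for the raw text with one
// that tolerates the space introduced by joining the tokens.
class MoneyPatternSketch {

    // Written against the original text: no space between "$" and the amount.
    static final Pattern STRICT = Pattern.compile("\\$\\d+ billion");

    // Allows optional whitespace after "$", so it also fits the joined form.
    static final Pattern TOLERANT = Pattern.compile("\\$\\s?\\d+ billion");

    static boolean foundInJoinedTokens(String[] tokens, Pattern pattern) {
        String sentence = String.join(" ", tokens); // "$ 120 billion"
        return pattern.matcher(sentence).find();
    }

    public static void main(String[] args) {
        String[] tokens = {"$", "120", "billion"};
        System.out.println(foundInJoinedTokens(tokens, STRICT));   // prints false
        System.out.println(foundInJoinedTokens(tokens, TOLERANT)); // prints true
    }
}
```

Making the whitespace optional keeps the same pattern usable on both the raw text and the joined tokens.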

Cheers

Jake

> On Nov 19, 2014, at 2:08 PM, Jake Dodd <[email protected]> wrote:
> 
> Hi all,
> 
> I’m trying to implement a RegexNameFinder for money entities (to supplement 
> results from the default OpenNLP statistical model).
> 
> The money entities will span multiple tokens (for example, “$120 billion” is 
> tokenized as ‘$’, ‘120’, ‘billion’). I’ve verified that my regex pattern will 
> match the phrase “$120 billion”, but when used as a pattern in 
> RegexNameFinder, the name finder returns no results.
> 
> Do RegexNameFinders match named entities that span multiple tokens? Or are 
> they designed to find single-token named entities?
> 
> Cheers
> 
> Jake
