Re: OPENNLP-579

Jörn Kottmann Thu, 30 May 2013 08:48:03 -0700

We are now one iteration further. In this new version it is
possible to pass in a whole document at once, which leads
to the question of how we should handle this in OpenNLP in general.


To pass in a document the following information needs to be handed over:
- Sentences
- Tokens
- Names

And maybe the text as well, depending on whether the tokens are Spans or Strings.

If the component is stateless, all of this needs to be handed over in one method call; otherwise it could be handed over on a per-sentence basis (that's how coref does it).

In the DocumentNameFinder (never implemented, but the interface is defined) it is done like this:
Span[][] find(String tokens[][])

In my opinion that's not a nice solution. First, it requires that the input text gets split into Strings. Second, the returned Spans are hard to use: they are only meaningful within the context given by the returned array. Names which cross sentences are not possible.
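
To illustrate the point about context, here is a minimal sketch. It assumes a small stand-in Span class (not OpenNLP's own Span type), a made-up result array, and a hypothetical helper toDocumentSpans. It shows that a per-sentence span only becomes usable once the sentence index, and the token counts of all earlier sentences, are taken into account:

```java
import java.util.ArrayList;
import java.util.List;

class SentenceRelativeSpans {

    // Hypothetical stand-in for a span: start is inclusive, end is exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    // Rebases per-sentence token spans onto document-wide token indexes.
    static List<Span> toDocumentSpans(String[][] tokens, Span[][] names) {
        List<Span> result = new ArrayList<>();
        int offset = 0; // number of tokens in all previous sentences
        for (int s = 0; s < tokens.length; s++) {
            for (Span name : names[s]) {
                result.add(new Span(offset + name.start, offset + name.end));
            }
            offset += tokens[s].length;
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] tokens = {
            {"She", "moved", "to", "New", "York", "."},
            {"New", "York", "is", "large", "."}
        };
        // Made-up result in the shape of Span[][] find(String tokens[][]).
        Span[][] names = {
            {new Span(3, 5)},
            {new Span(0, 2)}
        };
        for (Span span : toDocumentSpans(tokens, names)) {
            System.out.println(span);
        }
    }
}
```

The second name is [0..2) inside its own sentence but [6..8) in the document. A caller who loses track of the outer array index cannot tell the two apart, and a name that starts in one sentence and ends in the next cannot be expressed at all.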


Another approach could be this:
Span[] find(String text, Span sentences[], Span tokens[])

Where the sentence and token offsets in the spans are character offsets, and
the returned spans are token offsets.
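
A rough sketch of how a caller might consume such a result, under stated assumptions: Span is a small stand-in class (not OpenNLP's own Span type), DocumentLevelFinder is only a hypothetical name for the proposed method shape, and whitespaceTokens and coveredText are illustrative helpers. With token spans carrying character offsets, a name returned as a token-index span can be mapped back to the original text directly:

```java
import java.util.ArrayList;
import java.util.List;

class TokenOffsetNames {

    // Hypothetical stand-in for a span: start is inclusive, end is exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical interface name; only the method shape comes from the proposal.
    interface DocumentLevelFinder {
        Span[] find(String text, Span[] sentences, Span[] tokens);
    }

    // Splits on whitespace and returns character-offset spans for each token.
    static Span[] whitespaceTokens(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Span(start, i));
            }
        }
        return spans.toArray(new Span[0]);
    }

    // Resolves a name given as a token-index span to the text it covers.
    static String coveredText(String text, Span[] tokens, Span name) {
        int from = tokens[name.start].start;   // first character of the first token
        int to = tokens[name.end - 1].end;     // end of the last token
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "She moved to New York . It is large .";
        Span[] tokens = whitespaceTokens(text);
        Span name = new Span(3, 5); // made-up finder result: tokens 3 and 4
        System.out.println(coveredText(text, tokens, name));
    }
}
```

Because the returned span indexes into one flat token array, no sentence index is needed to interpret it, and a span may run across a sentence boundary.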

It would probably be nicer to use token offsets for the sentences as well, but that's
currently incompatible with the sentence detector interface.
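
One possible way around the mismatch would be to convert the character-offset sentence spans into token-index spans before calling the component. The following is only a sketch: Span is a stand-in class, sentencesAsTokenSpans is a hypothetical helper, and it assumes both arrays are sorted, non-overlapping, and that no token crosses a sentence boundary.

```java
class SentenceTokenSpans {

    // Hypothetical stand-in for a span: start is inclusive, end is exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Maps character-offset sentence spans to spans of token indexes.
    static Span[] sentencesAsTokenSpans(Span[] sentences, Span[] tokens) {
        Span[] result = new Span[sentences.length];
        int t = 0;
        for (int s = 0; s < sentences.length; s++) {
            // Skip tokens that start before this sentence begins.
            while (t < tokens.length && tokens[t].start < sentences[s].start) {
                t++;
            }
            int first = t;
            // Consume tokens that end inside this sentence.
            while (t < tokens.length && tokens[t].end <= sentences[s].end) {
                t++;
            }
            result[s] = new Span(first, t);
        }
        return result;
    }

    public static void main(String[] args) {
        // Character offsets for "She moved to New York . It is large ."
        Span[] sentences = {new Span(0, 23), new Span(24, 37)};
        Span[] tokens = {
            new Span(0, 3), new Span(4, 9), new Span(10, 12), new Span(13, 16),
            new Span(17, 21), new Span(22, 23), new Span(24, 26), new Span(27, 29),
            new Span(30, 35), new Span(36, 37)
        };
        for (Span sentence : sentencesAsTokenSpans(sentences, tokens)) {
            System.out.println(sentence.start + " to " + sentence.end);
        }
    }
}
```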

Any opinions on how we should solve this?

Jörn

On 05/23/2013 03:04 PM, Jörn Kottmann wrote:

Hi all,

please have a look at
https://issues.apache.org/jira/browse/OPENNLP-579
It's about a contribution to link location entities to a geo name database. The component could later be extended to link other entity types as well to
a database or dictionary.

Thanks,
Jörn
