We are now one iteration further. In this new version it is possible
to pass in a whole document at once, which leads to the question of
how we should handle this in OpenNLP generally.
To pass in a document the following information needs to be handed over:
- Sentences
- Tokens
- Names
And possibly the text as well, depending on whether the tokens are
Spans or Strings.
If the component is stateless, all of this needs to be handed over in
one method call; otherwise it could be handed over on a per-sentence
basis (that's how coref does it).
In the DocumentNameFinder (never implemented, but the interface is
defined) it's done like this:
Span[][] find(String tokens[][])
In my opinion that's not a nice solution. First, it requires that the
input text gets split into Strings. Second, the returned Spans are hard
to use, because they are only meaningful within the context given by
the returned array. Names which cross sentences are not possible.
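
To illustrate the second point: a Span returned for sentence i only
indexes into tokens[i], so a caller has to add up the lengths of all
preceding sentences to get a document-wide token offset. A minimal
sketch, assuming a simplified stand-in Span class (not the real
opennlp.tools.util.Span) and a hypothetical helper named
toDocumentOffsets:

```java
import java.util.ArrayList;
import java.util.List;

class SentenceRelativeSpans {

    // Simplified stand-in for a span: [start, end), end exclusive.
    // This is an assumption for the example, not the OpenNLP class.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    // Hypothetical helper: shifts sentence-relative token spans to
    // document-wide token offsets.
    static List<Span> toDocumentOffsets(String[][] tokens, Span[][] names) {
        List<Span> result = new ArrayList<>();
        // Document-wide index of the first token of the current sentence.
        int sentenceStart = 0;
        for (int i = 0; i < tokens.length; i++) {
            for (Span name : names[i]) {
                result.add(new Span(sentenceStart + name.start,
                                    sentenceStart + name.end));
            }
            sentenceStart += tokens[i].length;
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] tokens = {
            {"Alice", "lives", "in", "Berlin", "."},
            {"She", "works", "on", "OpenNLP", "."}
        };
        // One array of spans per sentence, each relative to its own sentence.
        Span[][] names = {
            {new Span(0, 1), new Span(3, 4)},
            {new Span(3, 4)}
        };
        System.out.println(toDocumentOffsets(tokens, names));
    }
}
```

The span (3, 4) appears in both sentences but refers to different
tokens; only after the shift (the second one becomes 8 to 9) can the
spans be compared or stored without the surrounding array.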
Another approach could be this:
Span[] find(String text, Span sentences[], Span tokens[])
Here the sentence and token offsets in the spans are character offsets,
and the returned spans are token offsets.
It would probably be nicer to use token offsets for the sentences as
well, but that's currently incompatible with the sentence detector
interface.
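
The gap between the two kinds of offsets could be bridged by a small
conversion step. A rough sketch, assuming the sentence and token spans
are character offsets as in the signature above, both arrays are sorted
by position, and each token lies within one sentence. Span is again a
simplified stand-in, and sentencesAsTokenSpans is a hypothetical helper,
not part of OpenNLP:

```java
class CharToTokenOffsets {

    // Simplified stand-in for a span: [start, end), end exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    // Hypothetical helper: for each sentence given in character offsets,
    // returns a span of token indexes [firstToken, lastToken + 1).
    static Span[] sentencesAsTokenSpans(Span[] sentences, Span[] tokens) {
        Span[] result = new Span[sentences.length];
        int t = 0;
        for (int s = 0; s < sentences.length; s++) {
            // Skip tokens that start before this sentence, e.g. a token
            // that straddles the previous sentence boundary.
            while (t < tokens.length && tokens[t].start < sentences[s].start) {
                t++;
            }
            int first = t;
            // Take every token that ends inside this sentence.
            while (t < tokens.length && tokens[t].end <= sentences[s].end) {
                t++;
            }
            result[s] = new Span(first, t);
        }
        return result;
    }

    public static void main(String[] args) {
        // Text: "Alice lives in Berlin. She likes it."
        Span[] sentences = {new Span(0, 22), new Span(23, 36)};
        Span[] tokens = {
            new Span(0, 5), new Span(6, 11), new Span(12, 14),
            new Span(15, 21), new Span(21, 22),
            new Span(23, 26), new Span(27, 32), new Span(33, 35),
            new Span(35, 36)
        };
        for (Span sentence : sentencesAsTokenSpans(sentences, tokens)) {
            System.out.println(sentence);
        }
    }
}
```

With something like this the sentence detector could keep returning
character offsets, and the conversion would happen once before the
name finder is called.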
Any opinions on how we should solve this?
Jörn
On 05/23/2013 03:04 PM, Jörn Kottmann wrote:
Hi all,
please have a look at
https://issues.apache.org/jira/browse/OPENNLP-579
It's about a contribution to link location entities to a geo name
database; the component could later be extended to link other entity
types to a database or dictionary as well.
Thanks,
Jörn