I like the second approach:

    Span[] find(String text, Span sentences[], Span tokens[])

It looks like it would be easier to use. Maybe we could add a new tokenize
method to Tokenizer that takes the sentence offset and outputs spans with
this offset included.

I did not understand what you mean by using token offsets for the sentences.

On Thu, May 30, 2013 at 12:46 PM, Jörn Kottmann <[email protected]> wrote:

> We are now one iteration further. In this new version it is
> possible to pass in a whole document at once, which leads
> to the question of how we should handle this in OpenNLP generally.
>
> To pass in a document, the following information needs to be handed over:
> - Sentences
> - Tokens
> - Names
>
> And maybe the text, depending on whether the tokens are Spans or Strings.
>
> If the component is stateless, all of this needs to be handed over in one
> method call; otherwise it could be handed over on a per-sentence basis
> (that's how coref is doing it).
>
> In the DocumentNameFinder (never implemented, but the interface is defined)
> it's done like this:
> Span[][] find(String tokens[][])
>
> In my opinion that's not a nice solution. First, it requires that the input
> text gets split into Strings, and second, it's hard to use the returned
> Spans: they are only meaningful within the context which is given by the
> returned array. Names which cross sentences are not possible.
>
> Another approach could be this:
> Span[] find(String text, Span sentences[], Span tokens[])
>
> Where the sentence and token offsets in the spans are character offsets,
> and the returned spans are token offsets.
>
> It would probably be nicer to use token offsets for the sentences as well,
> but that's currently incompatible with the sentence detector interface.
>
> Any opinions on how we should solve this?
>
> Jörn
>
>
> On 05/23/2013 03:04 PM, Jörn Kottmann wrote:
>
>> Hi all,
>>
>> please have a look at
>> https://issues.apache.org/jira/browse/OPENNLP-579
>>
>> It's about a contribution to link location entities to a geo name database;
>> the component could later be extended to link other entity types as well
>> to a database or dictionary.
>>
>> Thanks,
>> Jörn
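
P.S. To make the tokenize-with-offset idea concrete, here is a rough sketch of what I have in mind. It is not existing OpenNLP code: SimpleSpan is a hypothetical stand-in for opennlp.tools.util.Span, the method names tokenizePos(text, sentence) and tokenizeDocument are made up for illustration, and the whitespace splitting is only a placeholder for a real tokenizer model. The point is only that each token span comes back as character offsets into the whole document, so the result can be passed along together with the text and the sentence spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for opennlp.tools.util.Span: [start, end) character offsets.
final class SimpleSpan {
    final int start; // inclusive
    final int end;   // exclusive

    SimpleSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // Text covered by this span in the given document.
    CharSequence coveredText(CharSequence text) {
        return text.subSequence(start, end);
    }
}

final class OffsetTokenizerSketch {

    // Assumed new method: tokenizes one sentence of the document and returns
    // token spans whose offsets are relative to the document, not the sentence.
    // Whitespace splitting is a placeholder for a real tokenizer.
    static SimpleSpan[] tokenizePos(String text, SimpleSpan sentence) {
        List<SimpleSpan> tokens = new ArrayList<>();
        int tokenStart = -1; // -1 means "not inside a token"
        for (int i = sentence.start; i < sentence.end; i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (tokenStart >= 0) {
                    tokens.add(new SimpleSpan(tokenStart, i));
                    tokenStart = -1;
                }
            } else if (tokenStart < 0) {
                tokenStart = i;
            }
        }
        if (tokenStart >= 0) {
            tokens.add(new SimpleSpan(tokenStart, sentence.end));
        }
        return tokens.toArray(new SimpleSpan[0]);
    }

    // Collects the tokens of all sentences into one document-level array.
    static SimpleSpan[] tokenizeDocument(String text, SimpleSpan[] sentences) {
        List<SimpleSpan> all = new ArrayList<>();
        for (SimpleSpan sentence : sentences) {
            for (SimpleSpan token : tokenizePos(text, sentence)) {
                all.add(token);
            }
        }
        return all.toArray(new SimpleSpan[0]);
    }

    public static void main(String[] args) {
        String text = "Anna lives in Berlin. She likes it.";
        // Sentence spans as character offsets, as a sentence detector would report them.
        SimpleSpan[] sentences = { new SimpleSpan(0, 21), new SimpleSpan(22, 35) };
        for (SimpleSpan token : tokenizeDocument(text, sentences)) {
            System.out.println(token.start + ".." + token.end + " " + token.coveredText(text));
        }
    }
}
```

With tokens in this form, the caller could hand text, sentences and tokens straight to a find(String text, Span sentences[], Span tokens[]) method without re-computing any offsets.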
