Hello all,
I have a question, and the documentation was not very helpful.
I want to extract the position of an entity relative to the sentence.
If I run the name finder I will get tokens and Spans.
The thing is, I thought the getStart and getEnd methods on the Spans returned
the character offsets where the entity begins and ends.
But it looks like they return the beginning token index and the end
token index instead.
So it seems that if you have a set of tokens of your sentence such as:
[Former, first, lady, Nancy, Reagan]
Then the span for the named entity Nancy Reagan would be Start = 3, End = 5.
Start = 3 is a zero-based index, so it points at the 4th token in that array,
which is Nancy.
End = 5 is exclusive, so the span covers the token range [3,5), that is,
tokens 3 and 4.
Therefore, if I were to calculate the position of Nancy Reagan, I would need
to get the text covered by the Span, which in this case
is made up of 2 tokens.
So if I do:
// Join the tokens covered by the span [getStart(), getEnd())
Span span = sentenceAnnotations.get(si).getSpan();
StringBuilder cb = new StringBuilder();
for (int ti = span.getStart(); ti < span.getEnd(); ti++) {
    if (cb.length() > 0) cb.append(" ");
    cb.append(tokens[ti]);
}
I get the value "Nancy Reagan" in the cb variable.
I could search the sentence for that string, but the search would fail if,
for some odd reason, there are 2 spaces between the tokens Nancy and Reagan.
Therefore, how can I get the start and end character offsets for the entity
"Nancy Reagan" in this sentence when there might be more
than one space between tokens?
What is a good approach to mark the position of an entity in the sentence?
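For what it is worth, here is the kind of workaround I have in mind. It is a rough sketch in plain Java, not a confirmed OpenNLP recipe. It walks the sentence once, finds each token starting from where the previous one ended, and records its character offsets, so extra spaces between tokens do not matter. The class TokenOffsets and the methods tokenCharOffsets and entityCharRange are names I made up for illustration. The sketch assumes the tokens appear in the sentence in order and unmodified. I also believe Tokenizer.tokenizePos(String) returns Spans with character offsets, which might make the scanning step unnecessary, but I have not verified that it fits this case.

```java
// Hypothetical helper, not part of OpenNLP: maps a token-index span to character offsets.
class TokenOffsets {

    /**
     * Returns {start, end} character offsets (end exclusive) for each token,
     * assuming the tokens occur in the sentence in order and unmodified.
     */
    static int[][] tokenCharOffsets(String sentence, String[] tokens) {
        int[][] offsets = new int[tokens.length][];
        int cursor = 0; // search resumes where the previous token ended
        for (int i = 0; i < tokens.length; i++) {
            int start = sentence.indexOf(tokens[i], cursor);
            if (start < 0) {
                throw new IllegalArgumentException("Token not found in order: " + tokens[i]);
            }
            int end = start + tokens[i].length();
            offsets[i] = new int[] {start, end};
            cursor = end;
        }
        return offsets;
    }

    /**
     * Converts a token span [tokenStart, tokenEnd) into a character range
     * {start, end} with an exclusive end.
     */
    static int[] entityCharRange(int[][] offsets, int tokenStart, int tokenEnd) {
        return new int[] {offsets[tokenStart][0], offsets[tokenEnd - 1][1]};
    }

    public static void main(String[] args) {
        // Note the two spaces between "Nancy" and "Reagan".
        String sentence = "Former first lady Nancy  Reagan";
        String[] tokens = {"Former", "first", "lady", "Nancy", "Reagan"};

        int[][] offsets = tokenCharOffsets(sentence, tokens);
        int[] range = entityCharRange(offsets, 3, 5); // token span from the name finder

        System.out.println(range[0] + ", " + range[1]);
        System.out.println(sentence.substring(range[0], range[1]));
    }
}
```

With the sentence above, the token span Start = 3, End = 5 maps to characters 18 to 31 (end exclusive), and the substring keeps the double space exactly as it appears in the original text.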
Thanks,
Carlos.