If you do not want to make an assumption about the annotation ordering, you
might need to traverse one iterator many times.
Let's say you need to iterate over the sentences, and over the tokens in
each sentence.
Then you have a sentence iterator, and for each sentence you need to
walk the token iterator once from start to end.
If the annotations are ordered, you can advance the sentence and token
iterators in lock-step. Tests at OpenNLP showed that this is much faster for
long to very long documents.
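A minimal sketch of the lock-step idea follows. It assumes plain character spans instead of real UIMA annotations, that both lists are sorted by begin offset, and that sentences do not overlap; the Span class and the method name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class LockStepSketch {

    /** Hypothetical stand-in for an annotation: a character span [begin, end). */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Groups tokens by covering sentence in a single pass.
     * Assumes both lists are sorted by begin offset and that sentences
     * do not overlap. The token cursor only moves forward, so each token
     * is visited once overall.
     */
    static List<List<Span>> tokensPerSentence(List<Span> sentences, List<Span> tokens) {
        List<List<Span>> result = new ArrayList<>();
        int t = 0; // shared token cursor, never rewound
        for (Span sentence : sentences) {
            List<Span> covered = new ArrayList<>();
            // Skip tokens that start before this sentence begins.
            while (t < tokens.size() && tokens.get(t).begin < sentence.begin) {
                t++;
            }
            // Collect tokens that lie fully inside the sentence.
            while (t < tokens.size() && tokens.get(t).end <= sentence.end) {
                covered.add(tokens.get(t));
                t++;
            }
            result.add(covered);
        }
        return result;
    }
}
```

If the ordering assumption does not hold, this sketch silently drops tokens, which is the same kind of trade-off as discussed for the index ordering.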
Jörn
On 08/09/2012 05:18 PM, Chen, Pei wrote:
To get all the BaseTokens for a particular sentence, if we use
.subiterator, the types have to be stored in the FS indexes in a certain order;
otherwise it could just return an empty list. This would require the users of
annotators to understand the ordering of types and have it preconfigured.
FSIterator<Annotation> tokensInSentenceIterator =
jcas.getAnnotationIndex(BaseToken.type).subiterator(sentence);
uimaFIT already provides a convenience method that seems to do something similar
and always returns the expected tokens. Does anyone know if this was
part of the motivation? Is the performance hit (if any) worth the ease of use?
Ex:
List<BaseToken> tokens = org.uimafit.util.JCasUtil.selectCovered(jCas,
BaseToken.class, sentence);
Another alternative is UIMA's FilteredIterator.
There are a few places that use subiterator in cTAKES, and it's tempting to use
uimaFIT's JCasUtil.selectCovered() instead... What do others think?
Background: This issue surfaced when we used the cTAKES GUI (which uses uimaFIT
to wire the components together instead of the Aggregate XML descriptor).
--Pei