Larry, The FSUtil method looks like a good option.
The reason why one method can be faster than another is because if you have a type 1 with N instances and a type 2 with M instances when checking for overlap you don’t have to check all M instances of type 2 but merely a fraction of M because of the way the index is sorted. This can reduce the problem to checking Nxm wherein m represents an average scalar subset, the percentage of which it represents being dependent on the size of N. The smaller the N the smaller the factor of M is required to check. If you choose a type with a much smaller N then the constant factor becomes smaller than choosing a type with a much larger N for which m approaches the NxM upper bound. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Apr 1, 2014, at 2:01 PM, Kline, Larry <larry.kl...@mckesson.com> wrote: > Thomas, > > Thanks for the suggestions. Before I read your email I happened on this > method, FSUtil.getAnnotationsIteratorInSpan(aJCas, annotationType, begin, > end). I pass it the typeId of the type I want to check and the begin and end > of the dictTerm's span. I need to run some tests to make sure it is doing > what I assume it is, but if it is correct, it has much better performance > than my previous code. Perhaps this is using the method you suggest. Yes, > my dictTerms are much more numerous than the filter types but currently > iterate over them one at a time in the caller, ask if any of the filter types > overlaps, and if so discard it. If it's not discarded I go on to do other > things with it. I could probably preprocess the dictTerms first by checking > them for overlap against the filter types and remove the matches. Why do you > think this would be faster? Isn't it MxN versus NxM? > > Thanks again, > Larry > > -----Original Message----- > From: Thomas Ginter [mailto:thomas.gin...@utah.edu] > Sent: Monday, March 31, 2014 12:56 PM > To: user@uima.apache.org > Subject: Re: FilteredIterator is very slow > > Larry, > > A faster way to get the list of types that you will skip would be to do the > following: > > FSIndex<TitlePersonHonorificAnnotation> titlePersonHAIndex = > aJCas.getAnnotationIndex(TitlePersonHonorificAnnotation.type); > > Doing this for each type will yield an index that points to just the > annotations in the CAS of each type you are interested in. From there you > can get an iterator reference ( titlePersonHAIndex.iterator() ) and either > traverse each one separately or else add them to a common Collection such as > an ArrayList and iterate through that. You could also take advantage of the > fact that the default index in UIMA sorts on ascending order on the begin > index and descending order on the ending index to stop once you have > traversed the list past the ending index of the dictTerm. > > An important design decision though would be to consider whether the dictTerm > annotations are much more numerous than the TitlePersonHonorificAnnotation, > MeasurementAnnotation, and ProgFactorTerm filtering annotation types. > Generally if the filter types are much more plentiful and the dictTerm type > was more rare then looking for overlapping filter types will yield fewer > iterations of your algorithm, however if there are a lot of dictTerm > occurrences and only a few of the filter types then it may be more efficient > to iterate through the filter types and eliminate dictTerms that overlap or > are covered. > > Thanks, > > Thomas Ginter > 801-448-7676 > thomas.gin...@utah.edu > > > > > On Mar 31, 2014, at 11:47 AM, Kline, Larry <larry.kl...@mckesson.com> wrote: > >> When I use a filtered FSIterator it's an order of magnitude slower than a >> non-filtered iterator. Here's my code: >> >> Create the iterator: >> private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) >> throws CASException { >> FSIterator<Annotation> it = >> aJCas.getAnnotationIndex().iterator(); >> FSTypeConstraint constraint = >> aJCas.getConstraintFactory().createTypeConstraint(); >> constraint.add((new >> TitlePersonHonorificAnnotation(aJCas)).getType()); >> constraint.add((new MeasurementAnnotation(aJCas)).getType()); >> constraint.add((new ProgFactorTerm(aJCas)).getType()); >> it = aJCas.createFilteredIterator(it, constraint); >> return it; >> } >> Use the iterator: >> public void process(JCas aJCas) throws AnalysisEngineProcessException { >> ... >> // The following is done in a loop >> if (shouldSkip(dictTerm, skipIter)) >> continue; >> ... >> } >> Here's the method called: >> private boolean shouldSkip(G2DictTerm dictTerm, FSIterator<Annotation> >> skipIter) throws CASException { >> boolean shouldSkip = false; >> skipIter.moveToFirst(); >> while (skipIter.hasNext()) { >> Annotation annotation = skipIter.next(); >> if (UIMAUtils.annotationsOverlap(dictTerm, annotation)) { >> shouldSkip = true; >> break; >> } >> } >> return shouldSkip; >> } >> >> If I change the method, createConstrainedIterator(), to this (that is, no >> constraints): >> private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) >> throws CASException { >> FSIterator<Annotation> it = >> aJCas.getAnnotationIndex().iterator(); >> return it; >> } >> >> It runs literally 10 times faster. Doing some profiling I see that all of >> the time is spent in the skipIter.moveToFirst() call. I also tried creating >> the filtered iterator each time anew in the shouldSkip() method instead of >> passing it in, but that has even slightly worse performance. >> >> Given this performance I suppose I should probably use a non-filtered >> iterator and just check for the types I'm interested in inside the loop. >> >> Any other suggestions welcome. >> >> Thanks, >> Larry Kline >> >> >