Larry,

The FSUtil method looks like a good option.  

The reason why one method can be faster than another is because if you have a 
type 1 with N instances and a type 2 with M instances when checking for overlap 
you don’t have to check all M instances of type 2 but merely a fraction of M 
because of the way the index is sorted.  This can reduce the problem to 
checking Nxm wherein m represents an average scalar subset, the percentage of 
which it represents being dependent on the size of N.  The smaller the N the 
smaller the factor of M is required to check.  If you choose a type with a much 
smaller N then the constant factor becomes smaller than choosing a type with a 
much larger N for which m approaches the NxM upper bound.

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu




On Apr 1, 2014, at 2:01 PM, Kline, Larry <larry.kl...@mckesson.com> wrote:

> Thomas,
> 
> Thanks for the suggestions.  Before I read your email I happened on this 
> method, FSUtil.getAnnotationsIteratorInSpan(aJCas, annotationType, begin, 
> end).  I pass it the typeId of the type I want to check and the begin and end 
> of the dictTerm's span.  I need to run some tests to make sure it is doing 
> what I assume it is, but if it is correct, it has much better performance 
> than my previous code.  Perhaps this is using the method you suggest.  Yes, 
> my dictTerms are much more numerous than the filter types but currently 
> iterate over them one at a time in the caller, ask if any of the filter types 
> overlaps, and if so discard it.  If it's not discarded I go on to do other 
> things with it.  I could probably preprocess the dictTerms first by checking 
> them for overlap against the filter types and remove the matches.  Why do you 
> think this would be faster?  Isn't it MxN versus NxM?
> 
> Thanks again,
> Larry
> 
> -----Original Message-----
> From: Thomas Ginter [mailto:thomas.gin...@utah.edu] 
> Sent: Monday, March 31, 2014 12:56 PM
> To: user@uima.apache.org
> Subject: Re: FilteredIterator is very slow
> 
> Larry,
> 
> A faster way to get the list of types that you will skip would be to do the 
> following:
> 
> FSIndex<TitlePersonHonorificAnnotation> titlePersonHAIndex = 
> aJCas.getAnnotationIndex(TitlePersonHonorificAnnotation.type);
> 
> Doing this for each type will yield an index that points to just the 
> annotations in the CAS of each type you are interested in.  From there you 
> can get an iterator reference ( titlePersonHAIndex.iterator() ) and either 
> traverse each one separately or else add them to a common Collection such as 
> an ArrayList and iterate through that.  You could also take advantage of the 
> fact that the default index in UIMA sorts on ascending order on the begin 
> index and descending order on the ending index to stop once you have 
> traversed the list past the ending index of the dictTerm.  
> 
> An important design decision though would be to consider whether the dictTerm 
> annotations are much more numerous than the TitlePersonHonorificAnnotation, 
> MeasurementAnnotation, and ProgFactorTerm filtering annotation types.  
> Generally if the filter types are much more plentiful and the dictTerm type 
> was more rare then looking for overlapping filter types will yield fewer 
> iterations of your algorithm, however if there are a lot of dictTerm 
> occurrences and only a few of the filter types then it may be more efficient 
> to iterate through the filter types and eliminate dictTerms that overlap or 
> are covered.  
> 
> Thanks,
> 
> Thomas Ginter
> 801-448-7676
> thomas.gin...@utah.edu
> 
> 
> 
> 
> On Mar 31, 2014, at 11:47 AM, Kline, Larry <larry.kl...@mckesson.com> wrote:
> 
>> When I use a filtered FSIterator it's an order of magnitude slower than a 
>> non-filtered iterator.  Here's my code:
>> 
>> Create the iterator:
>>      private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) 
>> throws CASException {
>>             FSIterator<Annotation> it = 
>> aJCas.getAnnotationIndex().iterator();
>>             FSTypeConstraint constraint = 
>> aJCas.getConstraintFactory().createTypeConstraint();
>>             constraint.add((new 
>> TitlePersonHonorificAnnotation(aJCas)).getType());
>>             constraint.add((new MeasurementAnnotation(aJCas)).getType());
>>             constraint.add((new ProgFactorTerm(aJCas)).getType());
>>             it = aJCas.createFilteredIterator(it, constraint);
>>             return it;
>>      }
>> Use the iterator:
>>      public void process(JCas aJCas) throws AnalysisEngineProcessException {
>>             ...
>> // The following is done in a loop
>>                          if (shouldSkip(dictTerm, skipIter))
>>                                 continue;
>>             ...
>>      }
>> Here's the method called:
>>      private boolean shouldSkip(G2DictTerm dictTerm, FSIterator<Annotation> 
>> skipIter) throws CASException {
>>             boolean shouldSkip = false;
>>             skipIter.moveToFirst();
>>             while (skipIter.hasNext()) {
>>                    Annotation annotation = skipIter.next();
>>                    if (UIMAUtils.annotationsOverlap(dictTerm, annotation)) {
>>                          shouldSkip = true;
>>                          break;
>>                    }
>>             }
>>             return shouldSkip;
>>      }
>> 
>> If I change the method, createConstrainedIterator(), to this (that is, no 
>> constraints):
>>      private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) 
>> throws CASException {
>>             FSIterator<Annotation> it = 
>> aJCas.getAnnotationIndex().iterator();
>>             return it;
>>      }
>> 
>> It runs literally 10 times faster.  Doing some profiling I see that all of 
>> the time is spent in the skipIter.moveToFirst() call.  I also tried creating 
>> the filtered iterator each time anew in the shouldSkip() method instead of 
>> passing it in, but that has even slightly worse performance.
>> 
>> Given this performance I suppose I should probably use a non-filtered 
>> iterator and just check for the types I'm interested in inside the loop.
>> 
>> Any other suggestions welcome.
>> 
>> Thanks,
>> Larry Kline
>> 
>> 
> 

Reply via email to