Re: Performance of the cleartk history module [EXTERNAL]

Finan, Sean Tue, 04 Jan 2022 10:47:52 -0800

Great question.

The package name "windowed" isn't very self-descriptive.  It contains yet 
another bit of code that I wrote as quickly as possible to help somebody with 
a problem in real time.
* There is only a 'procedural' difference between the two packages.  The 
models and methods are the same.


The assertion engine has a bunch of objects delegating to objects delegating to 
more objects.  Each object calls one or more JCasUtil.select() frequently for 
the same types.  They also redundantly call JCasUtil.selectCovered() and 
selectCovering() for the same types.

process( jcas ) {
  Collection<..> sentences = ...select(..);
  delegateA.run( sentences );
}
class DelegateA {
  void run( Collection<..> sentences ) {
    for ( Sentence sentence : sentences ) {
      // Looks up the covered tokens again for every sentence.
      Collection<Token> tokens = JCasUtil.selectCovered( jcas, Token.class, sentence );
      delegateB.use( tokens );
    }
  }
}
class DelegateB {
  void use( Collection<..> tokens ) {
    for ( Token token : tokens ) {
      // Looks up the covering sentence again, although the caller already had it.
      Collection<Sentence> sentence = JCasUtil.selectCovering( jcas, Sentence.class, token );
      ...
    }
  }
}

The above isn't an exact representation, but you get the point.
The problem with code like this is repeated traversal of the (object) array in 
the cas.  Every JCasUtil.select* pores through the whole thing.  For a small 
document with a small cas (or early in a pipeline), that array may be small and 
the traversal fast.  However, when people are (inadvisably) processing a single 
document that runs into the gigabyte range, repeatedly going through the cas 
takes a long time.

So, what I did was create a single container object that holds Collections of 
the types of interest and their covering relationships, populate all that stuff 
once per process( jcas ) and pass that container through to each delegate 
object.  Basically, a jcas lite.  The biggest culprit in the assertion engines 
was repeatedly iterating over the array for covered and covering windows, hence 
the subpackage name "windowed".
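
A minimal sketch of that idea, in plain Java with no UIMA dependency.  Span, 
Windows and selectCoveredNaive are hypothetical names made up for 
illustration, not the classes in the windowed package.  The sketch also 
assumes that sentences do not overlap and that both lists are sorted by begin 
offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class WindowedSketch {

   /** Hypothetical stand-in for an annotation: just a character span. */
   static final class Span {
      final int begin;
      final int end;

      Span( int begin, int end ) {
         this.begin = begin;
         this.end = end;
      }

      boolean covers( Span other ) {
         return begin <= other.begin && other.end <= end;
      }
   }

   /** Naive lookup: rescans every token on each call. */
   static List<Span> selectCoveredNaive( Span window, List<Span> tokens ) {
      List<Span> covered = new ArrayList<>();
      for ( Span token : tokens ) {
         if ( window.covers( token ) ) {
            covered.add( token );
         }
      }
      return covered;
   }

   /** Container built once per document and passed to every delegate. */
   static final class Windows {
      // Keyed by object identity, since Span does not override equals/hashCode.
      private final Map<Span, List<Span>> tokensBySentence = new IdentityHashMap<>();
      private final Map<Span, Span> sentenceByToken = new IdentityHashMap<>();

      /** Assumes both lists are sorted by begin and sentences do not overlap. */
      Windows( List<Span> sentences, List<Span> tokens ) {
         for ( Span sentence : sentences ) {
            tokensBySentence.put( sentence, new ArrayList<>() );
         }
         int s = 0;
         for ( Span token : tokens ) {
            // Skip sentences that end before this token starts.
            while ( s < sentences.size() && sentences.get( s ).end <= token.begin ) {
               s++;
            }
            if ( s < sentences.size() && sentences.get( s ).covers( token ) ) {
               Span sentence = sentences.get( s );
               tokensBySentence.get( sentence ).add( token );
               sentenceByToken.put( token, sentence );
            }
         }
      }

      /** Tokens covered by the sentence, without touching the token list again. */
      List<Span> covered( Span sentence ) {
         return tokensBySentence.getOrDefault( sentence, Collections.emptyList() );
      }

      /** The covering sentence, or null if the token lies outside all sentences. */
      Span covering( Span token ) {
         return sentenceByToken.get( token );
      }
   }

   public static void main( String[] args ) {
      List<Span> sentences = Arrays.asList( new Span( 0, 10 ), new Span( 11, 25 ) );
      List<Span> tokens = Arrays.asList(
            new Span( 0, 4 ), new Span( 5, 10 ), new Span( 11, 18 ), new Span( 19, 25 ) );
      Windows windows = new Windows( sentences, tokens );  // populated once
      for ( Span sentence : sentences ) {
         boolean same = windows.covered( sentence )
                               .equals( selectCoveredNaive( sentence, tokens ) );
         System.out.println( sentence.begin + "-" + sentence.end + " same as naive: " + same );
      }
   }
}
```

Each delegate would receive the Windows object instead of calling a select 
method itself, so the lookups become map reads.  The naive method is kept only 
to check that both routes return the same tokens.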

Is it faster for smaller docs?  Not so much.  Does it instantaneously process 
the Encyclopedia Britannica as one text?  Of course not.  Is it orders of 
magnitude faster on such onerous docs?  In my tests, yes.

Going through my delegating example above, the end delegate is the same.  Hence 
the processing is the same and repeatable.  In my tests on both small and 
gargantuan documents the windowed version and the original version produced the 
same output.

Sean


   


________________________________________
From: Peter Abramowitsch <[email protected]>
Sent: Tuesday, January 4, 2022 11:39 AM
To: [email protected]
Subject: Re: Performance of the cleartk history module [EXTERNAL]

* External Email - Caution *


Hi Sean
Ok..  I was confused whether I was meant to find it in the sources.
But while you're reading this, is there a brief way to describe the
difference between the older package

org.apache.ctakes.assertion.medfacts.cleartk;
and
org.apache.ctakes.assertion.medfacts.cleartk.windowed

Peter





On Tue, Jan 4, 2022 at 7:47 AM Finan, Sean <[email protected]>
wrote:

> Hi Peter,
>
> I created a second engine that just used text matching or regular
> expressions given the discovered events.  It also uses covering section
> types, formatted text and other things, but the text match might be the
> most impactful item.
>
> You are an accomplished developer so the email scratch below is for the
> benefit of others who search archives.
>
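> import java.util.Arrays;
> import org.apache.ctakes.typesystem.type.constants.CONST;
> import org.apache.ctakes.typesystem.type.textsem.EventMention;
> import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
> import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
> import org.apache.uima.fit.util.JCasUtil;
> import org.apache.uima.jcas.JCas;
>
> class LazyHistoryFinder extends JCasAnnotator_ImplBase {
>   private static final String[] HISTORY = { "history of", "h/o", "h / o" };
>
>   // True when the event's covered text begins with a history phrase.
>   private boolean isHistory( EventMention event ) {
>     String text = event.getCoveredText().toLowerCase();
>     return Arrays.stream( HISTORY ).anyMatch( text::startsWith );
>   }
>
>   @Override
>   public void process( JCas jcas ) throws AnalysisEngineProcessException {
>     JCasUtil.select( jcas, EventMention.class )
>             .stream()
>             .filter( this::isHistory )
>             .forEach( e -> e.setHistoryOf( CONST.NE_HISTORY_OF_PRESENT ) );
>   }
> }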
>
> It requires a stroll through the monstrous cas array and it certainly
> isn't sexy, but it gets the job done.
>
> Sean
>
>
> ________________________________________
> From: Peter Abramowitsch <[email protected]>
> Sent: Monday, January 3, 2022 10:23 PM
> To: [email protected]
> Subject: Re: Performance of the cleartk history module [EXTERNAL]
>
>
> Thanks Sean
>
> By "following engine", you mean a second instance of the history engine
> that uses only the event spans, or you modified the current one to traverse
> the event-span within the context window?    I see you made some source
> changes in that area and will check tomorrow.
>
> Peter
>
> On Mon, Jan 3, 2022 at 2:26 PM Finan, Sean <
> [email protected]>
> wrote:
>
> > Hi Peter,
> >
> > I have noticed this and just added a following engine that recognized
> > text within event spans.  It is a lazy solution, but it fit my needs and
> > available time.
> >
> > Sean
> > ________________________________________
> > From: Peter Abramowitsch <[email protected]>
> > Sent: Monday, January 3, 2022 5:03 PM
> > To: [email protected]
> > Subject: Performance of the cleartk history module [EXTERNAL]
> >
> >
> > Hi All
> >
> > I've noticed that the HistoryCleartkAnalysisEngine misses many common
> > forms of subject history including the obvious "h/o" prefix.  Looking
> > into the distribution, there's a model.jar and what appears to be a
> > weights file containing trigger words:
> > resources/org/apache/ctakes/assertion/models/history.txt
> > where h, o, / are all given their own weights.  But I'm not sure that
> > they're actually used in this way:  see below.  However, there's also a
> > tiny file:
> > /org/apache/ctakes/assertion/semantic_classes/history.txt
> > which does contain a few entries including "h/o" which I assume is used
> > for training but is never referred to anywhere.
> >
> > Here's the behavior I'm seeing:
> >
> > example input             | condition                                            | term found | history feature marked | range text
> > history of pregnancies    | "history of" included in the cu_term and prefterm    | yes        | no                     | history of pregnancies
> > history of adenopathy     | "history of" not included in the cu_term or prefterm | yes        | yes                    | adenopathy
> > H/O postpartum psychosis  | "h/o" not included in the prefterm or cu_term        | yes        | yes                    | postpartum psychosis
> > H/O: postpartum psychosis | "h/o" not included in the prefterm or cu_term        | yes        | no                     | postpartum psychosis
> > H/O pregnancies           | "h/o" included in the cu_term                        | yes        | no                     | h/o pregnancies
> >
> > You can see that it is quite perverse -  there is a pattern suggesting
> > that if the concept definition occupies the history words, then they
> > cannot be seen by the history annotation engine.
> >
> > Has anyone else noticed this - and have they done anything about it?
> >
> > Peter
> >
>
