RE: negation/uncertainty: pipeline runs very slowly [EXTERNAL] [SUSPICIOUS]

Finan, Sean Thu, 22 Jun 2017 09:07:39 -0700

Oops, I misquoted.
"Make it work, make it right, make it fast".

http://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast


Sean

-----Original Message-----
From: Finan, Sean [mailto:[email protected]] 
Sent: Thursday, June 22, 2017 12:04 PM
To: [email protected]
Cc: Miller, Timothy
Subject: RE: negation/uncertainty: pipeline runs very slowly [EXTERNAL] 
[SUSPICIOUS]

Hi Dima,

There is possibly one major cause for things slowing down on large documents:

There is some suboptimal looping in AssertioncCleartkAnalysisEngine.  For 
instance:
    Collection<IdentifiedAnnotation> entities = 
JCasUtil.select(identifiedAnnotationView, IdentifiedAnnotation.class);
    for (IdentifiedAnnotation identifiedAnnotation : entities)    {
                ...
              List<Sentence> sents = new 
ArrayList<>(JCasUtil.selectCovering(jCas, Sentence.class, 
entityOrEventMention.getBegin(), entityOrEventMention.getEnd()));

Where entityOrEventMention is an (unnecessary btw) rename of 
identifiedAnnotation.  This gets all of the ia, then for each one gets the 
sentence.  Because of the way casutil iterates through the fslist, starting at 
the beginning each time, it would be slightly faster to get the sentences first 
and iterate over them.  It would be much faster to put identifiedAnnotations in 
a list and then mark a "previous index" while iterating through sentences, 
instead of repeated calls to jcasutil.  Something like:
prevIndex = 0;
foreach sentence {
   for ( i=prevIndex; i<aiList.size(); i++ ) {
      ai = aiList.get( i );
     while ( ai.getEnd() <= sentence.getEnd() ) {
        if ( ai.getBegin() >= sentence.getBegin() ) {
          sentenceAis.add( ai );
       }
     }
     prevIndex = i;
     break;
   }
   ----  handle sentence using sentenceAis (within sentence bounds) ---
   sentenceAis.clear();
}

I haven't run the code above, but I think it should speed up the process.  The 
same can be done for AssertionCuePhraseAnnotation.  Every selectCovered(..) 
call to jcasutil starts iterating at the beginning of the fslist.  If the 
fslist is long (eg big doc) then this will be slow.
The same thing goes for getting the BaseTokens between cue and ai using 
selectBetween(..).  Though there may be none, the entire fslist is iterated 
until ai.getEnd() is reached.

So, replace the repeated jcasutil.select*(..) calls with self-aware list 
iteration.
AssertionCleartkAnalysisEngine:
356 : IdentifiedAnnotation
399 : Sentence
422 : AssertionCuePhraseAnnotation
426 : BaseToken

I should note that there are probably many other places in ctakes where we 
could make similar improvements and thank you for bringing it to my attention - 
this class just happens to be a notable example but I know for certain that I 
have relied upon repeated jcasutil.select*(..) calls for coding expediency.  
"Make it work, then make it better."  Sometimes we don't get a chance to hit 
that second act.

If you get some time, please make the single change with ai only and see if 
there is a speed improvement.  If there is then we know that we are on the 
right track.  If so, or if you find other improvements, can you check in the 
changes?  Otherwise just let me know where you have success and maybe I can 
build on it.

Thanks,
Sean

-----Original Message-----
From: Dligach, Dmitriy [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 3:18 PM
To: [email protected]
Cc: Miller, Timothy
Subject: Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]

Sean, thanks for your comments. You are right. The slowdown doesn’t have 
anything to do with documentID.

I am now convinced that the slowdown has to do with the Polarity annotator. The 
reason you and others haven’t seen this in other pipelines is that you’ve 
probably been processing relatively small files. 

I am processing MIMIC patient files, which typically have thousands of words. I 
just tried to process 300 files from the THYME corpus (where the files have 
hundreds of words) and the slowdown was barely noticeable. When running the 
same pipeline on the MIMIC files, the slowdown becomes very noticeable.


Dima



> On Jun 5, 2017, at 10:42, Finan, Sean <[email protected]> 
> wrote:
> 
> Hi Dima,
> 
> It looks like the UriCollectionReader that you are using never sets a 
> document id (type DocumentID) in the cas.  However, this shouldn't be a 
> problem as each document will be assigned a unique id "UnknownDocument"{###} 
> where {###} is a number incremented per new document with an unknown id.  The 
> message that you are seeing is just a warning.  The code fetching the 
> documentID and creating a default are very simple and should not take any 
> real processing time.
> 
> The call to get document id is the very first line in 
> AssertionCleartkAnalysisEngine:
>  @Override
>  public void process(JCas jCas) throws AnalysisEngineProcessException 
> {
>    String documentId = DocumentIDAnnotationUtil.getDocumentID(jCas);
> 
> So, the slowdown occurring after the warning message leads me to believe that 
> the problem lies later in the process ...
> 
> My suggestion is that you put a breakpoint there and run your pipeline 
> through a debugger.  Optionally, there are a couple of log.debug messages in 
> that class, so you could change the granularity of your log4j and see if you 
> can narrow down the problem.  Add more debug statements if it helps.
> 
> At any rate, I have not seen this problem in other pipelines.
> 
> Sean
> 
> -----Original Message-----
> From: Dligach, Dmitriy [mailto:[email protected]]
> Sent: Wednesday, May 24, 2017 10:34 AM
> To: cTAKES Developer list
> Subject: negation/uncertainty: pipeline runs very slowly
> 
> Dear cTAKES developers,
> 
> I am observing something strange. As soon as I add at the end of my pipeline 
> the uncertainty/negation AEs:
> 
> aggregateBuilder.add(
> PolarityCleartkAnalysisEngine.createAnnotatorDescription() ); 
> aggregateBuilder.add(
> UncertaintyCleartkAnalysisEngine.createAnnotatorDescription() );
> 
> the pipeline becomes 10-20 times slower. I just confirmed this again. As soon 
> as I remove these two AEs at the end of my pipeline, it runs very fast again.
> 
> It seems to get stuck often right after it outputs this warning:
> WARN DocumentIDAnnotationUtil - Unable to find DocumentIDAnnotation
> 
> If I remove the two AEs, this warning disappears.
> 
> The full pipeline is here:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmitri
> ydligach_ctakes-2Dmisc_blob_master_src_main_java_org_apache_ctakes_pip
> elines_UmlsLookupPipeline.java&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14
> JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=cQRgT9
> lMipJUOQCu86lnRETbYFVC0C5yfMl2r5u0lNs&s=fnshTyx1ruwH-8ktFPX4JeX-7PVWpl
> biPO2RYdGSI9E&e=
> 
> Any clues?
> 
> Thank you very much,
> 
> Dima
> 
> 
>

RE: negation/uncertainty: pipeline runs very slowly [EXTERNAL] [SUSPICIOUS]

Reply via email to