Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]

Dligach, Dmitriy Fri, 30 Jun 2017 08:01:35 -0700

Hi Sean,

First of all, thank you Sean, Steve, and Tim for giving this some thought. I 
definitely agree that the problem lies in this line:


List<Sentence> sents = new ArrayList<>(JCasUtil.selectCovering(jCas, 
Sentence.class, entityOrEventMention.getBegin(), 
entityOrEventMention.getEnd()));

The negation AE runs fine on shorter documents but as soon as I try to run it 
on large documents, which have LOTS of sentences, it becomes extremely slow.
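
A minimal sketch of one possible workaround, untested against cTAKES itself: 
build a sentence lookup once per document and binary-search it for each 
mention, instead of scanning the sentences again on every call. The Span and 
SentenceLookup classes below are hypothetical stand-ins, not cTAKES or uimaFIT 
API, and the sketch assumes sentences are sorted by begin offset and do not 
overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UIMA annotation: just a pair of character offsets.
final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical helper (not part of cTAKES or uimaFIT).
// Assumption: the sentences passed in are sorted by begin and do not overlap.
final class SentenceLookup {
    private final List<Span> sentences;

    SentenceLookup(List<Span> sortedSentences) {
        this.sentences = new ArrayList<>(sortedSentences);
    }

    // Returns the sentence covering [begin, end], or null if none does.
    Span covering(int begin, int end) {
        int lo = 0;
        int hi = sentences.size() - 1;
        Span candidate = null;
        // Binary search for the last sentence that starts at or before the mention.
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Span s = sentences.get(mid);
            if (s.begin <= begin) {
                candidate = s;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // With non-overlapping sentences, only that candidate can cover the mention.
        return (candidate != null && candidate.end >= end) ? candidate : null;
    }
}
```

With something like this, the per-mention cost is logarithmic in the number of 
sentences rather than linear. uimaFIT's JCasUtil.indexCovering may serve the 
same purpose inside the actual AE, though I have not checked how it behaves 
on documents of this size.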

I am sorry I haven’t been able to try the proposed solutions. I may have a 
little time after the long weekend. 

In the meantime, I created a JIRA issue: 
https://issues.apache.org/jira/browse/CTAKES-449

Thanks again for your help.

Dima



> On Jun 30, 2017, at 07:26, Finan, Sean <[email protected]> 
> wrote:
> 
> Hi Dima,
> Have you had a chance to play with the proposed solutions?  If not then let 
> us know and somebody will eventually get to it.
> Meanwhile, would you mind submitting a ticket on jira?
> Thanks,
> Sean
> 
> -----Original Message-----
> From: Dligach, Dmitriy [mailto:[email protected]] 
> Sent: Wednesday, June 21, 2017 3:18 PM
> To: [email protected]
> Cc: Miller, Timothy
> Subject: Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]
> 
> Sean, thanks for your comments. You are right. The slowdown doesn’t have 
> anything to do with documentID.
> 
> I am now convinced that the slowdown has to do with the Polarity annotator. 
> The reason you and others haven’t seen this in other pipelines is that you’ve 
> probably been processing relatively small files. 
> 
> I am processing MIMIC patient files, which typically have thousands of words. 
> I just tried to process 300 files from the THYME corpus (where the files have 
> hundreds of words) and the slowdown was barely noticeable. When running the 
> same pipeline on the MIMIC files, the slowdown becomes very noticeable.
> 
> 
> Dima
> 
> 
> 
>> On Jun 5, 2017, at 10:42, Finan, Sean <[email protected]> 
>> wrote:
>> 
>> Hi Dima,
>> 
>> It looks like the UriCollectionReader that you are using never sets a 
>> document id (type DocumentID) in the cas.  However, this shouldn't be a 
>> problem as each document will be assigned a unique id "UnknownDocument"{###} 
>> where {###} is a number incremented per new document with an unknown id.  
>> The message that you are seeing is just a warning.  The code that fetches 
>> the documentID and creates a default is very simple and should not take any 
>> real processing time.
>> 
>> The call to get document id is the very first line in 
>> AssertionCleartkAnalysisEngine:
>> @Override
>> public void process(JCas jCas) throws AnalysisEngineProcessException  
>> {
>>   String documentId = DocumentIDAnnotationUtil.getDocumentID(jCas);
>> 
>> So, the slowdown occurring after the warning message leads me to believe 
>> that the problem lies later in the process ...
>> 
>> My suggestion is that you put a breakpoint there and run your pipeline 
>> through a debugger.  Optionally, there are a couple of log.debug messages in 
>> that class, so you could change the granularity of your log4j and see if you 
>> can narrow down the problem.  Add more debug statements if it helps.
>> 
>> At any rate, I have not seen this problem in other pipelines.
>> 
>> Sean
>> 
>> -----Original Message-----
>> From: Dligach, Dmitriy [mailto:[email protected]]
>> Sent: Wednesday, May 24, 2017 10:34 AM
>> To: cTAKES Developer list
>> Subject: negation/uncertainty: pipeline runs very slowly
>> 
>> Dear cTAKES developers,
>> 
>> I am observing something strange. As soon as I add the uncertainty/negation 
>> AEs at the end of my pipeline:
>> 
>> aggregateBuilder.add( 
>> PolarityCleartkAnalysisEngine.createAnnotatorDescription() ); 
>> aggregateBuilder.add( 
>> UncertaintyCleartkAnalysisEngine.createAnnotatorDescription() );
>> 
>> the pipeline becomes 10-20 times slower. I just confirmed this again. As 
>> soon as I remove these two AEs at the end of my pipeline, it runs very fast 
>> again.
>> 
>> It seems to get stuck often right after it outputs this warning:
>> WARN DocumentIDAnnotationUtil - Unable to find DocumentIDAnnotation
>> 
>> If I remove the two AEs, this warning disappears.
>> 
>> The full pipeline is here:
>> https://github.com/dmitriydligach/ctakes-misc/blob/master/src/main/java/org/apache/ctakes/pipelines/UmlsLookupPipeline.java
>> 
>> Any clues?
>> 
>> Thank you very much,
>> 
>> Dima
>> 
>> 
>> 
>
