Hi Sean,

First of all, thank you Sean, Steve, and Tim for giving this a thought. I definitely agree that the problem lies in this line:

List<Sentence> sents = new ArrayList<>(JCasUtil.selectCovering(jCas, Sentence.class, entityOrEventMention.getBegin(), entityOrEventMention.getEnd()));

The negation AE runs fine on shorter documents but as soon as I try to run it on large documents, which have LOTS of sentences, it becomes extremely slow.

I am sorry I haven’t been able to try the proposed solutions. I may have a little time after the long weekend. In the meantime, I created a JIRA issue:

https://issues.apache.org/jira/browse/CTAKES-449

Thanks again for your help.

Dima

> On Jun 30, 2017, at 07:26, Finan, Sean <[email protected]> wrote:
>
> Hi Dima,
> Have you had a chance to play with the proposed solutions? If not then let us know and somebody will eventually get to it.
> Meanwhile, would you mind submitting a tar on jira?
> Thanks,
> Sean
>
> -----Original Message-----
> From: Dligach, Dmitriy [mailto:[email protected]]
> Sent: Wednesday, June 21, 2017 3:18 PM
> To: [email protected]
> Cc: Miller, Timothy
> Subject: Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]
>
> Sean, thanks for your comments. You are right. The slowdown doesn’t have anything to do with documentID.
>
> I am now convinced that the slowdown has to do with the Polarity annotator. The reason you and others haven’t seen this in other pipelines is that you’ve probably been processing relatively small files.
>
> I am processing MIMIC patient files, which typically have thousands of words. I just tried to process 300 files from the THYME corpus (where the files have hundreds of words) and the slowdown was barely noticeable. When running the same pipeline on the MIMIC files, the slowdown becomes very noticeable.
>
> Dima
>
>> On Jun 5, 2017, at 10:42, Finan, Sean <[email protected]> wrote:
>>
>> Hi Dima,
>>
>> It looks like the UriCollectionReader that you are using never sets a document id (type DocumentID) in the cas.
>> However, this shouldn't be a problem as each document will be assigned a unique id "UnknownDocument"{###} where {###} is a number incremented per new document with an unknown id. The message that you are seeing is just a warning. The code fetching the documentID and creating a default are very simple and should not take any real processing time.
>>
>> The call to get document id is the very first line in AssertionCleartkAnalysisEngine:
>> @Override
>> public void process(JCas jCas) throws AnalysisEngineProcessException
>> {
>> String documentId = DocumentIDAnnotationUtil.getDocumentID(jCas);
>>
>> So, the slowdown occurring after the warning message leads me to believe that the problem lies later in the process ...
>>
>> My suggestion is that you put a breakpoint there and run your pipeline through a debugger. Optionally, there are a couple of log.debug messages in that class, so you could change the granularity of your log4j and see if you can narrow down the problem. Add more debug statements if it helps.
>>
>> At any rate, I have not seen this problem in other pipelines.
>>
>> Sean
>>
>> -----Original Message-----
>> From: Dligach, Dmitriy [mailto:[email protected]]
>> Sent: Wednesday, May 24, 2017 10:34 AM
>> To: cTAKES Developer list
>> Subject: negation/uncertainty: pipeline runs very slowly
>>
>> Dear cTAKES developers,
>>
>> I am observing something strange. As soon as I add at the end of my pipeline the uncertainty/negation AEs:
>>
>> aggregateBuilder.add( PolarityCleartkAnalysisEngine.createAnnotatorDescription() );
>> aggregateBuilder.add( UncertaintyCleartkAnalysisEngine.createAnnotatorDescription() );
>>
>> the pipeline becomes 10-20 times slower. I just confirmed this again. As soon as I remove these two AEs at the end of my pipeline, it runs very fast again.
>>
>> It seems to get stuck often right after it outputs this warning:
>> WARN DocumentIDAnnotationUtil - Unable to find DocumentIDAnnotation
>>
>> If I remove the two AEs, this warning disappears.
>>
>> The full pipeline is here:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmitriydligach_ctakes-2Dmisc_blob_master_src_main_java_org_apache_ctakes_pipelines_UmlsLookupPipeline.java&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=cQRgT9lMipJUOQCu86lnRETbYFVC0C5yfMl2r5u0lNs&s=fnshTyx1ruwH-8ktFPX4JeX-7PVWplbiPO2RYdGSI9E&e=
>>
>> Any clues?
>>
>> Thank you very much,
>>
>> Dima
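
Note on the selectCovering call quoted at the top of this thread: the sketch below is not cTAKES or uimaFIT code. It is a plain-Java illustration, under stated assumptions, of why looking up the covering sentence once per mention can get expensive on long documents. If each lookup scans the sentence list, total work grows roughly as mentions times sentences; if the sentences are sorted and non-overlapping (an assumption here), a binary search per mention is much cheaper. The Span class, both method names, and the synthetic document are hypothetical. uimaFIT also provides JCasUtil.indexCovering, which as far as I know builds a covering map in one pass and may be the more natural fix inside the annotator.

```java
import java.util.ArrayList;
import java.util.List;

public class CoveringLookupSketch {

    /** Hypothetical stand-in for a UIMA annotation: character offsets only. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Linear scan per query: total cost grows as mentions x sentences. */
    static Span coveringByScan(List<Span> sentences, Span mention) {
        for (Span s : sentences) {
            if (s.begin <= mention.begin && mention.end <= s.end) {
                return s;
            }
        }
        return null;
    }

    /**
     * Binary search per query. Assumes sentences are sorted by begin offset
     * and do not overlap, so at most one sentence can cover a mention.
     */
    static Span coveringByBinarySearch(List<Span> sortedSentences, Span mention) {
        int lo = 0;
        int hi = sortedSentences.size() - 1;
        Span candidate = null;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Span s = sortedSentences.get(mid);
            if (s.begin <= mention.begin) {
                candidate = s;   // last sentence starting at or before the mention
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return (candidate != null && mention.end <= candidate.end) ? candidate : null;
    }

    public static void main(String[] args) {
        // Synthetic document: 20,000 sentences of 20 characters each.
        List<Span> sentences = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            sentences.add(new Span(i * 20, i * 20 + 19));
        }
        // One short mention inside every tenth sentence; check both lookups agree.
        int mismatches = 0;
        for (int i = 0; i < 20_000; i += 10) {
            Span mention = new Span(i * 20 + 3, i * 20 + 8);
            if (coveringByScan(sentences, mention) != coveringByBinarySearch(sentences, mention)) {
                mismatches++;
            }
        }
        System.out.println("mismatches: " + mismatches);
    }
}
```

Timing the two loops separately on a synthetic document of MIMIC-like size would be one way to check whether this lookup alone accounts for the slowdown described above.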
