Hey Tim,

I have recently been testing with the "smoker" notes in ctakes-examples-res, and your new sentence detector (Lumpy) has definitely been the way to go for those notes, which have random CR/LF breaks within sentences. It is great that we have some example notes in cTAKES that can show off your work.

Cheers,
Sean

-----Original Message-----
From: Dligach, Dmitriy [mailto:ddlig...@luc.edu]
Sent: Friday, June 30, 2017 11:03 AM
To: dev@ctakes.apache.org
Subject: Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]

Hi Tim,

Good point, but I happen to be using the ctakes-core sentence detector.

Dima

> On Jun 23, 2017, at 06:31, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
>
> Something I just thought of is that if you are using the new (beta) sentence
> detector trained on Mimic, it is a bit of a "lumper" rather than a
> "splitter," meaning it is more likely to miss a sentence break and make
> longer sentences, sometimes absurdly long if there are no clear cues. I know
> that will slow down the constituency parser and dependency parser, but not
> sure why it would only slow down when negation processing is added. So, not a
> solution but something to keep in mind while debugging, especially if it
> interacts with Steve and Sean's feedback.
> Tim
>
> ________________________________________
> From: Dligach, Dmitriy <ddlig...@luc.edu>
> Sent: Wednesday, June 21, 2017 9:18 PM
> To: dev@ctakes.apache.org
> Cc: Miller, Timothy
> Subject: Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]
>
> Sean, thanks for your comments. You are right. The slowdown doesn't have
> anything to do with documentID.
>
> I am now convinced that the slowdown has to do with the Polarity annotator.
> The reason you and others haven't seen this in other pipelines is that you've
> probably been processing relatively small files.
>
> I am processing MIMIC patient files, which typically have thousands of words.
> I just tried to process 300 files from the THYME corpus (where the files have
> hundreds of words) and the slowdown was barely noticeable. When running the
> same pipeline on the MIMIC files, the slowdown becomes very noticeable.
>
> Dima
>
>> On Jun 5, 2017, at 10:42, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>>
>> Hi Dima,
>>
>> It looks like the UriCollectionReader that you are using never sets a
>> document id (type DocumentID) in the cas. However, this shouldn't be a
>> problem as each document will be assigned a unique id "UnknownDocument"{###}
>> where {###} is a number incremented per new document with an unknown id.
>> The message that you are seeing is just a warning. The code fetching the
>> documentID and creating a default is very simple and should not take any
>> real processing time.
>>
>> The call to get document id is the very first line in
>> AssertionCleartkAnalysisEngine:
>>    @Override
>>    public void process(JCas jCas) throws AnalysisEngineProcessException
>>    {
>>       String documentId = DocumentIDAnnotationUtil.getDocumentID(jCas);
>>
>> So, the slowdown occurring after the warning message leads me to believe
>> that the problem lies later in the process ...
>>
>> My suggestion is that you put a breakpoint there and run your pipeline
>> through a debugger. Optionally, there are a couple of log.debug messages in
>> that class, so you could change the granularity of your log4j and see if you
>> can narrow down the problem. Add more debug statements if it helps.
>>
>> At any rate, I have not seen this problem in other pipelines.
>>
>> Sean
>>
>> -----Original Message-----
>> From: Dligach, Dmitriy [mailto:ddlig...@luc.edu]
>> Sent: Wednesday, May 24, 2017 10:34 AM
>> To: cTAKES Developer list
>> Subject: negation/uncertainty: pipeline runs very slowly
>>
>> Dear cTAKES developers,
>>
>> I am observing something strange. As soon as I add at the end of my pipeline
>> the uncertainty/negation AEs:
>>
>>    aggregateBuilder.add( PolarityCleartkAnalysisEngine.createAnnotatorDescription() );
>>    aggregateBuilder.add( UncertaintyCleartkAnalysisEngine.createAnnotatorDescription() );
>>
>> the pipeline becomes 10-20 times slower. I just confirmed this again.
>> As soon as I remove these two AEs at the end of my pipeline, it runs very fast
>> again.
>>
>> It seems to get stuck often right after it outputs this warning:
>> WARN DocumentIDAnnotationUtil - Unable to find DocumentIDAnnotation
>>
>> If I remove the two AEs, this warning disappears.
>>
>> The full pipeline is here:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmitriydligach_ctakes-2Dmisc_blob_master_src_main_java_org_apache_ctakes_pipelines_UmlsLookupPipeline.java&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=cQRgT9lMipJUOQCu86lnRETbYFVC0C5yfMl2r5u0lNs&s=fnshTyx1ruwH-8ktFPX4JeX-7PVWplbiPO2RYdGSI9E&e=
>>
>> Any clues?
>>
>> Thank you very much,
>>
>> Dima
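
The thread narrows the slowdown to a stage whose cost grows with document length (hundreds of words in THYME versus thousands in MIMIC). One way to check that kind of hypothesis without a debugger is to time each stage separately on short and long inputs. The following is only a minimal sketch under stated assumptions: `StageTimer`, `fakeDocument`, `countWords`, and `countEqualPairs` are hypothetical names, not cTAKES classes, and the two stages are toy stand-ins (one linear, one quadratic in word count) rather than the real Polarity or Uncertainty annotators.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical helper (not a cTAKES class): sums wall-clock time per named stage. */
class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs one stage on one document and adds the elapsed time to that stage's total. */
    <T> void time(String stage, Consumer<T> step, T document) {
        long start = System.nanoTime();
        step.accept(document);
        nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
    }

    /** Total nanoseconds recorded for a stage, or 0 if the stage never ran. */
    long totalNanos(String stage) {
        return nanosByStage.getOrDefault(stage, 0L);
    }

    /** Builds a synthetic document with the given number of distinct words. */
    static String fakeDocument(int words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words; i++) {
            sb.append("word").append(i).append(' ');
        }
        return sb.toString().trim();
    }

    /** Toy linear-time stage: counts whitespace-separated words. */
    static int countWords(String text) {
        return text.split("\\s+").length;
    }

    /** Toy quadratic-time stage: compares every word with every word. */
    static long countEqualPairs(String text) {
        String[] words = text.split("\\s+");
        long matches = 0;
        for (String a : words) {
            for (String b : words) {
                if (a.equals(b)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Compare a "hundreds of words" document with a "thousands of words" one.
        for (int words : new int[] {200, 2000}) {
            StageTimer timer = new StageTimer();
            String doc = fakeDocument(words);
            timer.time("linear", StageTimer::countWords, doc);
            timer.time("quadratic", StageTimer::countEqualPairs, doc);
            System.out.printf("%d words: linear=%d us, quadratic=%d us%n", words,
                    timer.totalNanos("linear") / 1_000,
                    timer.totalNanos("quadratic") / 1_000);
        }
    }
}
```

In a real pipeline the same pattern would wrap each analysis engine's per-document call. A stage whose share of the total time rises sharply between short and long documents is the one worth profiling further.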