Something I just thought of: if you are using the new (beta) sentence 
detector trained on MIMIC, it is a bit of a "lumper" rather than a "splitter," 
meaning it is more likely to miss a sentence break and produce longer 
sentences, sometimes absurdly long if there are no clear cues. I know that will 
slow down the constituency parser and dependency parser, but I am not sure why 
it would only slow things down when negation processing is added. So, not a 
solution, but something to keep in mind while debugging, especially if it 
interacts with Steve and Sean's feedback.
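
One way to check whether over-long sentences are in play is to count tokens 
per sentence and flag the outliers. The sketch below is plain Java, not cTAKES 
code: the class name LongSentenceCheck, the method findLongSentences, and the 
threshold are all hypothetical, and it assumes the sentence texts have already 
been pulled out of the CAS into a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of cTAKES: flags suspiciously long sentences.
class LongSentenceCheck {

    // Returns the indices of sentences with more than maxTokens
    // whitespace-separated tokens.
    public static List<Integer> findLongSentences(List<String> sentences, int maxTokens) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            String text = sentences.get(i).trim();
            // An empty string would otherwise split into one empty token.
            int tokens = text.isEmpty() ? 0 : text.split("\\s+").length;
            if (tokens > maxTokens) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Toy input; in practice these would be the detected sentence texts.
        List<String> sentences = Arrays.asList(
                "No fever.",
                "Patient denies chest pain shortness of breath nausea vomiting");
        // The threshold of 5 is an arbitrary value chosen for this example.
        System.out.println(findLongSentences(sentences, 5));
    }
}
```

If many sentences in the MIMIC notes are flagged while few in the THYME notes 
are, that would be a hint (not proof) that sentence length is a factor.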
Tim


________________________________________
From: Dligach, Dmitriy <ddlig...@luc.edu>
Sent: Wednesday, June 21, 2017 9:18 PM
To: dev@ctakes.apache.org
Cc: Miller, Timothy
Subject: Re: negation/uncertainty: pipeline runs very slowly [EXTERNAL]

Sean, thanks for your comments. You are right. The slowdown doesn’t have 
anything to do with documentID.

I am now convinced that the slowdown has to do with the Polarity annotator. The 
reason you and others haven’t seen this in other pipelines is that you’ve 
probably been processing relatively small files.

I am processing MIMIC patient files, which typically have thousands of words. I 
just tried to process 300 files from the THYME corpus (where the files have 
hundreds of words) and the slowdown was barely noticeable. When running the 
same pipeline on the MIMIC files, the slowdown becomes very noticeable.
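
A rough way to test the document-size hypothesis is to time a processing step 
per document and normalize by length. The sketch below is plain Java under 
stated assumptions: DocTimer and millisPerThousandChars are hypothetical names, 
and the Consumer stands in for whatever runs the pipeline (or one annotator) on 
a single document. It is not cTAKES or UIMA API.

```java
import java.util.function.Consumer;

// Hypothetical timing helper, not part of cTAKES or UIMA.
class DocTimer {

    // Runs the step once on the text and returns elapsed milliseconds
    // per 1000 characters of input.
    public static double millisPerThousandChars(String text, Consumer<String> step) {
        long start = System.nanoTime();
        step.accept(text);
        long elapsedNanos = System.nanoTime() - start;
        double millis = elapsedNanos / 1_000_000.0;
        // Guard against division by zero for empty documents.
        return millis / Math.max(1, text.length()) * 1000.0;
    }

    public static void main(String[] args) {
        // Placeholder step; replace with a call that processes one document.
        Consumer<String> step = text -> text.split("\\s+");
        System.out.println(millisPerThousandChars("a short clinical note", step));
    }
}
```

If the normalized time stays roughly flat between short and long documents, 
the step scales about linearly; if it climbs with document length, the step is 
doing more than linear work, which would match a slowdown that only shows up 
on large files. Single measurements are noisy, so averaging over many documents 
is advisable.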


Dima



> On Jun 5, 2017, at 10:42, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
>
> Hi Dima,
>
> It looks like the UriCollectionReader that you are using never sets a 
> document id (type DocumentID) in the cas.  However, this shouldn't be a 
> problem as each document will be assigned a unique id "UnknownDocument"{###} 
> where {###} is a number incremented per new document with an unknown id.  The 
> message that you are seeing is just a warning.  The code that fetches the 
> documentID and creates a default is very simple and should not take any 
> real processing time.
>
> The call to get document id is the very first line in 
> AssertionCleartkAnalysisEngine:
>  @Override
>  public void process(JCas jCas) throws AnalysisEngineProcessException
>  {
>    String documentId = DocumentIDAnnotationUtil.getDocumentID(jCas);
>
> So, the slowdown occurring after the warning message leads me to believe that 
> the problem lies later in the process ...
>
> My suggestion is that you put a breakpoint there and run your pipeline 
> through a debugger.  Alternatively, there are a couple of log.debug messages 
> in that class, so you could lower your log4j level to DEBUG and see if you 
> can narrow down the problem.  Add more debug statements if it helps.
>
> At any rate, I have not seen this problem in other pipelines.
>
> Sean
>
> -----Original Message-----
> From: Dligach, Dmitriy [mailto:ddlig...@luc.edu]
> Sent: Wednesday, May 24, 2017 10:34 AM
> To: cTAKES Developer list
> Subject: negation/uncertainty: pipeline runs very slowly
>
> Dear cTAKES developers,
>
> I am observing something strange. As soon as I add at the end of my pipeline 
> the uncertainty/negation AEs:
>
> aggregateBuilder.add( PolarityCleartkAnalysisEngine.createAnnotatorDescription() );
> aggregateBuilder.add( UncertaintyCleartkAnalysisEngine.createAnnotatorDescription() );
>
> the pipeline becomes 10-20 times slower. I just confirmed this again. As soon 
> as I remove these two AEs at the end of my pipeline, it runs very fast again.
>
> It seems to get stuck often right after it outputs this warning:
> WARN DocumentIDAnnotationUtil - Unable to find DocumentIDAnnotation
>
> If I remove the two AEs, this warning disappears.
>
> The full pipeline is here:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmitriydligach_ctakes-2Dmisc_blob_master_src_main_java_org_apache_ctakes_pipelines_UmlsLookupPipeline.java&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=cQRgT9lMipJUOQCu86lnRETbYFVC0C5yfMl2r5u0lNs&s=fnshTyx1ruwH-8ktFPX4JeX-7PVWplbiPO2RYdGSI9E&e=
>
> Any clues?
>
> Thank you very much,
>
> Dima
>
>
>
