Re: neural negation model in ctakes [EXTERNAL] [SUSPICIOUS]

2021-01-25 Thread Finan, Sean
Hi Tim,

This is really exciting!  

Just having this code available to use as a template is extremely useful.

Cheers,
Sean

From: Miller, Timothy 
Sent: Sunday, January 24, 2021 11:08 AM
To: dev@ctakes.apache.org
Subject: neural negation model in ctakes [EXTERNAL] [SUSPICIOUS]


Hi all,
I just checked in a usable proof of concept for a neural (RoBERTa-based, to be 
specific) negation classifier. A small piece of Python code (using FastAPI) 
sets up a REST interface that runs the classifier:
ctakes-assertion/src/main/python/negation_rest.py

It runs a default model that I trained and uploaded to the Hugging Face model 
hub. The model downloads automatically the first time the server is run.

There is a startup script there too:
ctakes-assertion/src/main/python/start_negation_rest.sh

The idea is to run this on whatever machine has the appropriate GPU resources. 
It creates three REST endpoints that mirror the UIMA workflow:
/negation/initialize  -- to load the model (takes longer the first time as it 
will download)
/negation/process -- to classify the data and return negation values
/negation/collection_process_complete -- to unload the model

The UIMA analysis engine sits in:
ctakes-assertion/src/main/java/org/apache/ctakes/assertion/ae/PolarityBertRestAnnotator.java

The main work here is converting the cTAKES entities/events into a simpler data 
structure that gets sent to the python REST server, making the REST call, and 
then converting the classifier output into the polarity property.
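
To make the lifecycle and the payload conversion concrete, here is a minimal 
Python sketch. It is not the code in negation_rest.py. The class name, the 
payload shape (sentence text plus the entity's character offsets), the -1/1 
polarity values, and the keyword rule standing in for the RoBERTa classifier 
are all assumptions made for illustration. In a real server each method would 
sit behind the corresponding /negation/... route.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityMention:
    """Hypothetical payload: a sentence plus the entity's character span."""
    sentence: str
    begin: int
    end: int


class NegationService:
    """Mirrors the initialize / process / collection_process_complete steps."""

    def __init__(self) -> None:
        # Stand-in for the loaded model; None means "not loaded".
        self._cues: Optional[List[str]] = None

    def initialize(self) -> None:
        # A real server would download (on first run) and load the model here.
        self._cues = ["no", "not", "denies", "without"]

    def process(self, mentions: List[EntityMention]) -> List[int]:
        # Returns one value per mention: -1 = negated, 1 = not negated
        # (assumed encoding for this sketch).
        if self._cues is None:
            raise RuntimeError("call initialize() before process()")
        results = []
        for m in mentions:
            # Placeholder rule, NOT the neural model: look for a cue word
            # anywhere in the text to the left of the entity span.
            left_context = m.sentence[: m.begin].lower().split()
            negated = any(tok.strip(".,;:") in self._cues for tok in left_context)
            results.append(-1 if negated else 1)
        return results

    def collection_process_complete(self) -> None:
        # A real server would release the model and free GPU memory here.
        self._cues = None


if __name__ == "__main__":
    service = NegationService()
    service.initialize()
    batch = [
        EntityMention("Patient denies chest pain.", 15, 25),
        EntityMention("Patient reports chest pain.", 16, 26),
    ]
    print(service.process(batch))
    service.collection_process_complete()
```

Sending a whole batch of mentions per call, rather than one request per 
entity, keeps the number of REST round trips low, which matters when the 
model runs on a separate GPU machine.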

Performance:
The accuracy of this classifier is much better in my testing. I hope this 
also makes improving the performance easier, since an upgrade could be as 
simple as changing the model string so that it grabs a new model from the 
model hub.

The speed is marginally slower if we do a 1-for-1 swap, but that's a little 
misleading, because we currently run 2 parsers to generate features for the 
default ML negation module. If we don't need those parsers, we can 
dramatically cut the processing time even with the neural negation module. I 
tested this with the Python code running on a machine with a 1070 Ti. Going 
forward, if we want to scale, the goal should be to have the neural call do 
several things in a single pass, especially if we are using large transformer 
models. But this single-task proof of concept will hopefully make it easier 
for other folks to do that if they wish.

FYI, another way of doing this is to use Python libraries like cassis and 
have Python functions act essentially as UIMA AEs. I think there will be a 
place for both approaches, and I'm not trying to wall off work in that 
direction.

Tim



Re: performance report [EXTERNAL]

2021-01-25 Thread Finan, Sean
Hi Greg, Peter,

I believe that the performance report comes from a CollectionProcessingEngine 
(CPE) 
https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/collection/CollectionProcessingEngine.html
  

I think that UIMA's CPE GUI runs the pipeline through a CPE - hence the tool's 
name, but that may have changed in recent years.

The PipelineBuilder class in ctakes.core used by the PiperFileRunner could be 
changed to use this style of running a single-threaded pipeline - right now it 
uses a simpler UIMAFit method.
The code changes are relatively minor, but obviously significant testing would 
be required.  The ctakes PipelineBuilder does use a CPE for multi-threaded 
pipelines, so there has already been some testing on that front.

You can look at the ctakes PipelineBuilder run() method.  If you get rid of the 
if (threadCount==1) {..} else {   branch, the CPE will always be used.  Then 
just add a cpe.getPerformanceReport() call after cpe.process() and you should 
have a ProcessTrace object.  This is where my guessing ends, as I have never 
used a ProcessTrace and don't know exactly what to ask of it.

I hope that is a decent start,
Sean

From: Greg Silverman 
Sent: Saturday, January 23, 2021 3:01 PM
To: dev@ctakes.apache.org
Subject: Re: performance report [EXTERNAL]


Hi Peter,
I have no doubt about performance differences regarding variance between
note styles and pipeline components.

We're looking for a way to benchmark the standard/non-customized pipeline
performance for processing a largish set of identical notes using several
clinical NLP annotators (specifically, ctakes, biomedicus, metamap and
clamp). At the command line, both metamap and biomedicus output a standard
performance report with total timings and the details for each specific
pipeline component. I assume there is a way to enable the performance
report output available in the GUI version of ctakes at the command line -
which is what I'm really interested in.

We're fine with information at a very coarse level, since we're interested
in a particular note type, so the aforementioned report should be
sufficient. I'm just wondering how to enable it using the standard pipeline
in cTAKES.

Thanks!

Greg--



On Sat, Jan 23, 2021 at 12:26 PM Peter Abramowitsch 
wrote:

> Hi Greg,
>
> I've found that note styles differ so much in ways that affect performance,
> and pipeline configurations interact in so many ways, that really the only
> way to get a sense of performance is either at a very coarse level,
> measuring process time across large collections of varied notes, or at a
> very granular level, using something like jvisualvm.  Using the latter I saw
> some surprising things, some of which I was able to tackle with minor
> software changes, while others are deep in UIMA utilities used by cTAKES.
> The biggest factor in my experience, after processing millions of notes, is
> notes that have reached about 5k in size AND are missing punctuation.  At
> around this size begins a geometric rise in the complexity of internal
> structures that depend on sentences, and a serious elevation of processing
> time.
>
> Peter
>
> Sent from my iPad
>
> > On Jan 23, 2021, at 18:09, Greg Silverman  wrote:
> >
> > I found this:
> > https://medium.com/@felix_chan/install-apache-ctakes-924c40967ce2 , which
> > states: "A performance report is generated when the process is done."
> >
> > However, we are running this from the command line and no such report is
> > being generated.
> >
> > Thanks!
> >
> >> On Sat, Jan 23, 2021 at 11:05 AM Greg Silverman  wrote:
> >>
> >> Hi all,
> >> Is there a way to easily generate a performance report similar to the
> >> one generated by MetaMap (with timings for each task, etc.)?
> >>
> >> Thanks in advance!
> >>
> >> Greg--
> >>
> >> --
> >> Greg M. Silverman
> >> Senior Systems Developer
> >> NLP/IE
> >> Department of Surgery
> >> University of Minnesota
> >> g...@umn.edu
> >>
> >>
> >
> > --
> > Greg M. Silverman
> > Senior Systems Developer
> > NLP/IE
>

Re: performance report [EXTERNAL]

2021-01-25 Thread Peter Abramowitsch
Thanks Sean.  The CPE ProcessTrace object was something I wasn't familiar
with.

Definitely, though, the piper file runner, by default,  should be as
lightweight and simple as possible.  Other options for threading or for
tracing should be injected or layered in without modifying default
behavior.  It is currently very stable.  In my alternative threading model
it runs thirty or more pipeline instances for weeks in a single process
under very heavy stress.

Peter



Re: performance report [EXTERNAL]

2021-01-25 Thread Greg Silverman
Hi Sean,
Thanks! I'll give it a whirl and let you know how it works out.

Best!


Re: performance report [EXTERNAL]

2021-01-25 Thread Peter Abramowitsch
Great, thanks Greg.  I'd like to see the kind of stats that are available
beyond what one can scrape from log4j.

Peter
