Question about piper system

2018-04-27 Thread Peter Abramowitsch
I've been using cTAKES for several years, coding my own pipelines and
configuring XML files manually, but decided to try the piper system for
fun.  However, it hasn't been that easy: working from the documentation
with the Creator GUI, a lot of permutations don't seem to work, at least
without knowing each component's idiosyncrasies.  I was wondering whether
there is some unofficial documentation to supplement what is on your site.

(Unfortunately I have changed email addresses due to a change of work, so I
don't have the archive of previous suggestions where this may have already
been covered.)

1.  Using the Piper Creator GUI, what does it mean when I include a
component, for instance XMIWriter2, and on validation it shows up in red
even though it has no unfulfilled mandatory parameters?

2.  If I use one of the example piper files from the distribution, it
runs.  But I notice they contain no output components, and if I "add" any
XMI output method, it fails with an exception like this:

MESSAGE LOCALIZATION FAILED: Can't find resource for bundle
java.util.PropertyResourceBundle, key Not AnalysisComponent
org.apache.ctakes.core.cc.CasConsumer.

(Many other components produce similar error messages when I add them.)

Only one of the pretty-print methods works with "add".

NB: the writeXmis shortcut does work.
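A minimal piper using the writeXmis shortcut might look like the sketch
below; the reader and pipeline names are assumptions drawn from the 4.0
example pipers and may differ in your distribution:

```
// Minimal piper sketch -- reader and pipeline names are assumptions
// taken from the 4.0 example pipers; adjust to your build.
reader FilesInDirectoryReader InputDirectory=/path/to/notes
load DefaultTokenizerPipeline
// shortcut form, instead of trying to "add" an XMI writer component:
writeXmis /path/to/output
```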

3.  Has the JsonCasSerializer been packaged for the Piper?


Jcasgen reports API incompatibility error on cTakes trunk

2018-05-05 Thread Peter Abramowitsch
Hello.  I just checked out the trunk into Eclipse with the Maven &
Subversive plugins.  The build stumbles doing the JCasGen step of every
project that has a type system.  There seems to be a version
incompatibility:

Execution default of goal
org.apache.uima:jcasgen-maven-plugin:2.10.2:generate failed: Unable to load
the mojo 'generate' in the plugin
'org.apache.uima:jcasgen-maven-plugin:2.10.2' due to an API
incompatibility:
org.codehaus.plexus.component.repository.exception.ComponentLookupException:
org/apache/uima/tools/jcasgen/maven/JCasGenMojo : Unsupported major.minor
version 51.0

Also, if I try to Maven-update the whole project I get:

An internal error occurred during: "Updating Maven Project".
java.lang.NullPointerException

It seems to be related to ctakes-temporal as I can update all of the others
individually.


Can anyone shed any light on either of these issues?


Re: Trunk Build, Jcasgen reports API incompatibility

2018-05-06 Thread Peter Abramowitsch
Hello.  Here's a bit more information on the problems I was having
compiling the trunk from source.  I've had more success outside Eclipse,
but am still having issues.

*Outside of Eclipse*: If I do a mvn clean install of the whole of cTAKES,
JCasGen does run normally and the typesystem jar is generated.  All the
projects build and test, *with the exception of ctakes-temporal* (see
attached compile output).  The errors start with this:

[ERROR]
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/ae/TemporalRelationExtractorAnnotator.java:[102,27]
[unchecked] unchecked generic array creation for varargs parameter of type
RelationFeaturesExtractor[]

I tried putting in some SuppressWarnings... but got into more trouble.


*Inside of Eclipse*, if I do the type-system clean/install on its own,
JCasGen still fails with this:

Execution default of goal
org.apache.uima:jcasgen-maven-plugin:2.10.2:generate failed: Unable to load
the mojo 'generate' in the plugin
'org.apache.uima:jcasgen-maven-plugin:2.10.2' due to an API
incompatibility:
org.codehaus.plexus.component.repository.exception.ComponentLookupException:
org/apache/uima/tools/jcasgen/maven/JCasGenMojo : Unsupported major.minor
version 51.0

And therefore every project depending on the typesystem fails.


Platform:
Java jdk1.8.0_25
Eclipse Luna Service Release 2 (4.4.2), Build id: 20150219-0600
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 3:52.371s
[INFO] Finished at: Sun May 06 13:38:14 PDT 2018
[INFO] Final Memory: 108M/434M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:2.4:compile (default-compile) on 
project ctakes-temporal: Compilation failure: Compilation failure:
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/ae/TemporalRelationExtractorAnnotator.java:[102,27]
 [unchecked] unchecked generic array creation for varargs parameter of type 
RelationFeaturesExtractor[]
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[899,45]
 [unchecked] unchecked generic array creation for varargs parameter of type 
Collection[]
[ERROR] where CAP#1 is a fresh type-variable:
[ERROR] CAP#1 extends TOP from capture of ? extends TOP
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[899,45]
 error: incompatible types: Collection cannot be converted to TOP
[ERROR] where CAP#1 is a fresh type-variable:
[ERROR] CAP#1 extends TOP from capture of ? extends TOP
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[949,45]
 [unchecked] unchecked generic array creation for varargs parameter of type 
Collection[]
[ERROR] where CAP#1 is a fresh type-variable:
[ERROR] CAP#1 extends TOP from capture of ? extends TOP
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[949,45]
 error: incompatible types: Collection cannot be converted to TOP
[ERROR] where CAP#1 is a fresh type-variable:
[ERROR] CAP#1 extends TOP from capture of ? extends TOP
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/ae/ConstituencyBasedTimeAnnotator.java:[185,26]
 [unchecked] unchecked call to 
CleartkExtractor(Class,FeatureExtractor1,Context...) as a 
member of the raw type CleartkExtractor
[ERROR] where SEARCH_T is a type-variable:
[ERROR] SEARCH_T extends Annotation declared in class CleartkExtractor
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/ae/ConstituencyBasedTimeAnnotator.java:[187,24]
 [unchecked] unchecked call to 
CleartkExtractor(Class,FeatureExtractor1,Context...) as a 
member of the raw type CleartkExtractor
[ERROR] where SEARCH_T is a type-variable:
[ERROR] SEARCH_T extends Annotation declared in class CleartkExtractor
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/ae/ConstituencyBasedTimeAnnotator.java:[232,47]
 [unchecked] unchecked call to extract(JCas,T) as a member of the raw type 
FeatureExtractor1
[ERROR] where T is a type-variable:
[ERROR] T extends Annotation declared in interface FeatureExtractor1
[ERROR] 
/Users/peterabramowitsch/hbm-java/apache-ctakes-4/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/ae/ConstituencyBasedTimeAnnotator.java:[232,21]
 [unchecked] unchecked method invocation: method addAll in class ArrayList is 
ap

Getting around issue ctakes-509 compile error

2018-05-29 Thread Peter Abramowitsch
Hello all,
I found out how to correct the compile error in the ctakes-temporal
subproject in 4.0.1.  Having done so, the build now proceeds to the tests,
where a few of them fall over.  As I am not an official contributor, I'll
leave it to the author to make the changes.

At lines 867 and 917 one needs to cast the result of JCasUtil.select()
to a TOP before cloning it:

for ( TOP annotation : Lists.newArrayList( (TOP) JCasUtil.select(
goldView, annotationClass ) ) ) {


Re: Regarding Apached cTakes

2018-05-29 Thread Peter Abramowitsch
That exception is often a sign that it doesn't see your UMLS username and
password.  You need to supply them through the environment, via a piper
file if you're using one, or directly in the XML descriptor for the
dictionary lookup.  Check the user guide for help, or exclude the lookup
altogether.
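A piper-file sketch of the credentials route; the property names
umlsUser/umlsPass and the DefaultFastPipeline piper are assumptions based
on 4.0-era configuration, so verify them against your install:

```
// Sketch: supply UMLS credentials before loading the dictionary lookup.
// Property names are assumptions -- check your dictionary descriptor.
set umlsUser=MY_UMLS_USERNAME
set umlsPass=MY_UMLS_PASSWORD
load DefaultFastPipeline
```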

On Tue, May 29, 2018, 11:32 PM Ankit Bisht 
wrote:

> Hello,
>
> I am trying to use Apache Ctakes. I have installed it but when I am trying
> to open it, I am getting an error that
> org.apache.uima.resource.resourceinitializationexception: Initialization of
> annotator class
> "org.apached.ctakes.dictionary.lookup2.ae.DefaultJCastermAnnotaor" failed
>
> I am following the cTAKES 4.0 User Install Guide but still facing some
> problems.
>
> Regards
> Ankit
>


for sean, chenpei or james-masanz

2018-05-30 Thread Peter Abramowitsch
Just a quick question -- is there any skew between the 4.0.0 binary
release and the 4.0.0 source fetched from SVN?  When I build a binary
distribution from the source and then start to load any piper file or
hard-coded pipeline, I get the error below; when I use the binary
distribution from the cTAKES site, it works fine.

Exception in thread "main" java.lang.ArrayStoreException:
sun.reflect.annotation.EnumConstantNotPresentExceptionProxy
at
sun.reflect.annotation.AnnotationParser.parseEnumArray(AnnotationParser.java:744)
at
sun.reflect.annotation.AnnotationParser.parseArray(AnnotationParser.java:533)
at
sun.reflect.annotation.AnnotationParser.parseMemberValue(AnnotationParser.java:355)
at
sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:286)
at
sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)
at
sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)
at java.lang.Class.createAnnotationData(Class.java:3513)
at java.lang.Class.annotationData(Class.java:3502)
at java.lang.Class.getAnnotation(Class.java:3407)
at
org.apache.uima.fit.internal.ReflectionUtil.isAnnotationPresent(ReflectionUtil.java:168)
at
org.apache.uima.fit.internal.ReflectionUtil.getInheritableAnnotation(ReflectionUtil.java:121)
at
org.apache.uima.fit.factory.ResourceMetaDataFactory.configureResourceMetaData(ResourceMetaDataFactory.java:46)
at
org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:799)
at
org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:662)
at
org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:599)
at
org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:456)
at
org.apache.ctakes.core.pipeline.PipelineBuilder.reader(PipelineBuilder.java:96)

The sizes of the classes within some libraries seem to differ between the
distributed 4.0.0 jars and the ones I built from the 4.0.0 source.


Re: for sean, chenpei or james-masanz [EXTERNAL]

2018-05-31 Thread Peter Abramowitsch
Thanks Sean

A week or so ago I also found and reported some code in ctakes-temporal
that didn't compile without two minor changes, yet the repo doesn't look
to have changed since April 2017.  Could it be that the binary release was
built in an environment that had been tweaked in some way, and those
changes hadn't made it back into the repository?  I'm assuming that
compatibility is still with Java 1.8 and above.

Regards. Peter

On Thu, May 31, 2018, 2:09 PM Finan, Sean 
wrote:

> There shouldn't be any difference, but it looks like you may have found
> one.
> ________
> From: Peter Abramowitsch 
> Sent: Wednesday, May 30, 2018 7:22 PM
> To: dev@ctakes.apache.org
> Subject: for sean, chenpei or james-masanz [EXTERNAL]
>
> Just a quick question --  Is there any skew between the 4.0.0 binary
> release and the 4.0.0 source fetched from SVN?   When I build a binary
> distribution from the source, and then start to load up any piper file or
> hard coded pipeline, I get the error below, and when I use the binary
> distribution from the ctakes site it works fine.
>
> Exception in thread "main" java.lang.ArrayStoreException:
> sun.reflect.annotation.EnumConstantNotPresentExceptionProxy
> at
>
> sun.reflect.annotation.AnnotationParser.parseEnumArray(AnnotationParser.java:744)
> at
>
> sun.reflect.annotation.AnnotationParser.parseArray(AnnotationParser.java:533)
> at
>
> sun.reflect.annotation.AnnotationParser.parseMemberValue(AnnotationParser.java:355)
> at
>
> sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:286)
> at
>
> sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)
> at
>
> sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)
> at java.lang.Class.createAnnotationData(Class.java:3513)
> at java.lang.Class.annotationData(Class.java:3502)
> at java.lang.Class.getAnnotation(Class.java:3407)
> at
>
> org.apache.uima.fit.internal.ReflectionUtil.isAnnotationPresent(ReflectionUtil.java:168)
> at
>
> org.apache.uima.fit.internal.ReflectionUtil.getInheritableAnnotation(ReflectionUtil.java:121)
> at
>
> org.apache.uima.fit.factory.ResourceMetaDataFactory.configureResourceMetaData(ResourceMetaDataFactory.java:46)
> at
>
> org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:799)
> at
>
> org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:662)
> at
>
> org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:599)
> at
>
> org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription(CollectionReaderFactory.java:456)
> at
>
> org.apache.ctakes.core.pipeline.PipelineBuilder.reader(PipelineBuilder.java:96)
>
> The sizes of the classes within some libraries seem different between the
> distributed 4.0.0 jars and the ones I built from 4.0.0 source.
>


Re: Regarding Apached cTakes

2018-05-31 Thread Peter Abramowitsch
Hi Ankit

It looks like your answer is near the bottom of the stack trace you sent:
the machine is out of memory.  In general with these traces, the root
cause of an error will be near the bottom; as it is propagated up through
the many layers of the application, the messages become more generic.

So in your case the root cause is:
> Caused by: org.hsqldb.HsqlException: error in script file line: 1141005 
> java.lang.OutOfMemoryError: GC overhead limit exceeded

In general the cTAKES scripts launch the app with parameters that specify
the memory needed, but if these are left out, or the machine itself doesn't
have the available resources, you will get this error.
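The wrapper-first, root-cause-last ordering is simply how Java prints
chained exceptions.  A tiny self-contained example (unrelated to cTAKES)
shows it: each layer wraps the one below, and printStackTrace() prints the
outermost wrapper first with "Caused by:" lines descending to the root.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class CauseChain {
    // Build a three-level chain resembling the one in the user's trace:
    // an init failure wrapping a script error wrapping a simulated OOME.
    static String buildTrace() {
        Throwable root = new OutOfMemoryError("GC overhead limit exceeded");
        Exception mid = new RuntimeException("error in script file", root);
        Exception top = new IllegalStateException(
                "Initialization of annotator failed", mid);
        StringWriter sw = new StringWriter();
        top.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    public static void main(String[] args) {
        String trace = buildTrace();
        // The wrapper's message appears before the root cause's message.
        System.out.println(trace.indexOf("Initialization of annotator failed")
                < trace.indexOf("GC overhead limit exceeded"));  // prints "true"
    }
}
```

Reading such a trace bottom-up therefore lands on the real problem first.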

- Peter

Sent from my iPad

> On Jun 1, 2018, at 02:51, Ankit Bisht  wrote:
> 
> Hello Rehan,
> 
> Are you talking about the below messages? I took them from CAS visual 
> debugger tool = > Tools => View log file .  I have also attached the snapshot 
> of exception that I am getting. Please let me know if you need anything else.
> 
> 10:58:42.492 - 1: org.apache.uima.tools.cvd.MainFrame.handleException(526): 
> SEVERE: Initialization of annotator class 
> "org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator" failed.  
> (Descriptor: 
> file:/C:/Users/ankit/Desktop/apache-ctakes-4.0.0/desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml)
> org.apache.uima.resource.ResourceInitializationException: Initialization of 
> annotator class 
> "org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator" failed.  
> (Descriptor: 
> file:/C:/Users/ankit/Desktop/apache-ctakes-4.0.0/desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml)
>   at 
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:271)
>   at 
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:170)
>   at 
> org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
>   at 
> org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
>   at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
>   at 
> org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:407)
>   at 
> org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:256)
>   at 
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:429)
>   at 
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:373)
>   at 
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:186)
>   at 
> org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
>   at 
> org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
>   at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
>   at 
> org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:371)
>   at org.apache.uima.tools.cvd.MainFrame.setupAE(MainFrame.java:1484)
>   at 
> org.apache.uima.tools.cvd.MainFrame.loadAEDescriptor(MainFrame.java:476)
>   at org.apache.uima.tools.cvd.CVD.main(CVD.java:164)
> Caused by: org.apache.uima.resource.ResourceInitializationException: MESSAGE 
> LOCALIZATION FAILED: Can't find resource for bundle 
> java.util.PropertyResourceBundle, key Could not construct 
> org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary
>   at 
> org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.initialize(AbstractJCasTermAnnotator.java:131)
>   at 
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:266)
>   ... 16 more
> Caused by: 
> org.apache.uima.analysis_engine.annotator.AnnotatorContextException: MESSAGE 
> LOCALIZATION FAILED: Can't find resource for bundle 
> java.util.PropertyResourceBundle, key Could not construct 
> org.apache.ctakes.dictionary.lookup2.dictionary.UmlsJdbcRareWordDictionary
>   at 
> org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDictionary(DictionaryDescriptorParser.java:199)
>   at 
> org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDictionaries(DictionaryDescriptorParser.java:156)
>   at 
> org.apache.ctakes.dictionary.lookup2.dictionary.DictionaryDescriptorParser.parseDescriptor(DictionaryDescriptorParser.java:128)
>   at 
> org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.initialize(AbstractJCasTermAnnotator.java:129)
>   ... 17 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.re

Re: Regarding Apached cTakes

2018-06-01 Thread Peter Abramowitsch
That's good.  But the fact that the stack trace shows you're out of memory
probably indicates that the JVM memory-related args are not being passed
successfully.  They should be set to -Xms512M -Xmx3g or greater.

Peter

On Fri, Jun 1, 2018 at 5:03 PM, Ankit Bisht 
wrote:

> Hello Peter,
>
> Thanks for showing your concern on this issue. The machine on which I am
> running cTakes has  32 GB of Ram and 100 GB of Memory left.
>
> -Ankit
>
> On Fri, Jun 1, 2018 at 2:31 AM, Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hi Ankit
> >
> > It looks like your answer is near the bottom of the stacktrace you sent.
> > The machine is out of memory.   In general with these traces the root
> cause
> > of an error will be near the bottom, but as it is propagated up the many
> > layers of the application, the messages become more generic.
> >
> > So in your case the root cause is:
> > > Caused by: org.hsqldb.HsqlException: error in script file line: 1141005
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> > In general Ctakes scripts launch the app with parameters that specify the
> > memory needed, but if either these are left out, or the machine itself
> > doesn’t have the available resources, you will get this error.
> >
> > - Peter
> >
> > Sent from my iPad
> >
> > > On Jun 1, 2018, at 02:51, Ankit Bisht 
> > wrote:
> > >
> > > Hello Rehan,
> > >
> > > Are you talking about the below messages? I took them from CAS visual
> > debugger tool = > Tools => View log file .  I have also attached the
> > snapshot of exception that I am getting. Please let me know if you need
> > anything else.
> > >
> > > 10:58:42.492 - 1: org.apache.uima.tools.cvd.
> > MainFrame.handleException(526): SEVERE: Initialization of annotator
> class
> > "org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator"
> > failed.  (Descriptor: file:/C:/Users/ankit/Desktop/
> > apache-ctakes-4.0.0/desc/ctakes-dictionary-lookup-fast/
> > desc/analysis_engine/UmlsLookupAnnotator.xml)
> > > org.apache.uima.resource.ResourceInitializationException:
> > Initialization of annotator class "org.apache.ctakes.dictionary.
> lookup2.ae
> > .DefaultJCasTermAnnotator" failed.  (Descriptor:
> > file:/C:/Users/ankit/Desktop/apache-ctakes-4.0.0/desc/
> > ctakes-dictionary-lookup-fast/desc/analysis_engine/
> > UmlsLookupAnnotator.xml)
> > >   at org.apache.uima.analysis_engine.impl.
> > PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(
> > PrimitiveAnalysisEngine_impl.java:271)
> > >   at org.apache.uima.analysis_engine.impl.
> > PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.
> > java:170)
> > >   at org.apache.uima.impl.AnalysisEngineFactory_impl.
> > produceResource(AnalysisEngineFactory_impl.java:94)
> > >   at org.apache.uima.impl.CompositeResourceFactory_impl.
> > produceResource(CompositeResourceFactory_impl.java:62)
> > >   at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> > java:279)
> > >   at org.apache.uima.UIMAFramework.produceAnalysisEngine(
> > UIMAFramework.java:407)
> > >   at org.apache.uima.analysis_engine.asb.impl.ASB_impl.
> > setup(ASB_impl.java:256)
> > >   at org.apache.uima.analysis_engine.impl.
> > AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.
> > java:429)
> > >   at org.apache.uima.analysis_engine.impl.
> > AggregateAnalysisEngine_impl.initializeAggregateAnalysisEng
> > ine(AggregateAnalysisEngine_impl.java:373)
> > >   at org.apache.uima.analysis_engine.impl.
> > AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.
> > java:186)
> > >   at org.apache.uima.impl.AnalysisEngineFactory_impl.
> > produceResource(AnalysisEngineFactory_impl.java:94)
> > >   at org.apache.uima.impl.CompositeResourceFactory_impl.
> > produceResource(CompositeResourceFactory_impl.java:62)
> > >   at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> > java:279)
> > >   at org.apache.uima.UIMAFramework.produceAnalysisEngine(
> > UIMAFramework.java:371)
> > >   at org.apache.uima.tools.cvd.MainFrame.setupAE(MainFrame.
> > java:1484)
> > >   at org.apache.uima.tools.cvd.MainFrame.loadAEDescriptor(
> > MainFrame.java:476)
> > >   at org.apache.uima.tools.cvd.CVD.main(CVD.java:164)
> > > Caused by: org.apache.uima.resource.ResourceInit

Re: Regarding Apached cTakes

2018-06-01 Thread Peter Abramowitsch
The best thing would be for you to look at some of the other launch scripts in 
the bin directory of a release.  There you’ll find examples of passing both 
memory and UMLS parameters.
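For anyone reading the archive later, a sketch of what such a launch
command looks like.  CTAKES_HOME, the classpath layout, the property names
(ctakes.umlsuser / ctakes.umlspw), and the PiperFileRunner class are
assumptions modeled on the 4.0 scripts; compare against the real scripts
in bin/ before using.  The command is only echoed here (dry run):

```shell
#!/bin/sh
# Sketch of a cTAKES launch command with memory and UMLS parameters.
# Paths, property names, and the runner class are assumptions -- check
# the real scripts in the bin/ directory of your release.
CTAKES_HOME="${CTAKES_HOME:-/path/to/apache-ctakes-4.0.0}"

CMD="java -cp $CTAKES_HOME/desc:$CTAKES_HOME/resources:$CTAKES_HOME/lib/* \
  -Dctakes.umlsuser=MY_UMLS_USER -Dctakes.umlspw=MY_UMLS_PASS \
  -Xms512M -Xmx3g \
  org.apache.ctakes.core.pipeline.PiperFileRunner -p MyPipeline.piper"

echo "$CMD"    # dry run: remove this echo (and the quotes) to launch
```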

Sent from my iPad

> On Jun 1, 2018, at 18:36, Ankit Bisht  wrote:
> 
> Thanks Peter. I have not done that before. How could it be done?
> Would it be done like this:
> java -Xms512M -Xmx3g ClassName?
> 
> -Ankit
> 
> On Fri, Jun 1, 2018 at 12:05 PM, Peter Abramowitsch > wrote:
> 
>> That's good.  But the fact that the stack trace shows you're out of memory,
>> probably indicates that the JVM memory related args are not successfully
>> being passed.
>> They should be  set to -Xms512M -Xmx3g  or greater.
>> 
>> Peter
>> 
>> On Fri, Jun 1, 2018 at 5:03 PM, Ankit Bisht 
>> wrote:
>> 
>>> Hello Peter,
>>> 
>>> Thanks for showing your concern on this issue. The machine on which I am
>>> running cTakes has  32 GB of Ram and 100 GB of Memory left.
>>> 
>>> -Ankit
>>> 
>>> On Fri, Jun 1, 2018 at 2:31 AM, Peter Abramowitsch <
>>> pabramowit...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Ankit
>>>> 
>>>> It looks like your answer is near the bottom of the stacktrace you
>> sent.
>>>> The machine is out of memory.   In general with these traces the root
>>> cause
>>>> of an error will be near the bottom, but as it is propagated up the
>> many
>>>> layers of the application, the messages become more generic.
>>>> 
>>>> So in your case the root cause is:
>>>>> Caused by: org.hsqldb.HsqlException: error in script file line:
>> 1141005
>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>> 
>>>> In general Ctakes scripts launch the app with parameters that specify
>> the
>>>> memory needed, but if either these are left out, or the machine itself
>>>> doesn’t have the available resources, you will get this error.
>>>> 
>>>> - Peter
>>>> 
>>>> Sent from my iPad
>>>> 
>>>>> On Jun 1, 2018, at 02:51, Ankit Bisht 
>>>> wrote:
>>>>> 
>>>>> Hello Rehan,
>>>>> 
>>>>> Are you talking about the below messages? I took them from CAS visual
>>>> debugger tool = > Tools => View log file .  I have also attached the
>>>> snapshot of exception that I am getting. Please let me know if you need
>>>> anything else.
>>>>> 
>>>>> 10:58:42.492 - 1: org.apache.uima.tools.cvd.
>>>> MainFrame.handleException(526): SEVERE: Initialization of annotator
>>> class
>>>> "org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator"
>>>> failed.  (Descriptor: file:/C:/Users/ankit/Desktop/
>>>> apache-ctakes-4.0.0/desc/ctakes-dictionary-lookup-fast/
>>>> desc/analysis_engine/UmlsLookupAnnotator.xml)
>>>>> org.apache.uima.resource.ResourceInitializationException:
>>>> Initialization of annotator class "org.apache.ctakes.dictionary.
>>> lookup2.ae
>>>> .DefaultJCasTermAnnotator" failed.  (Descriptor:
>>>> file:/C:/Users/ankit/Desktop/apache-ctakes-4.0.0/desc/
>>>> ctakes-dictionary-lookup-fast/desc/analysis_engine/
>>>> UmlsLookupAnnotator.xml)
>>>>>  at org.apache.uima.analysis_engine.impl.
>>>> PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(
>>>> PrimitiveAnalysisEngine_impl.java:271)
>>>>>  at org.apache.uima.analysis_engine.impl.
>>>> PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.
>>>> java:170)
>>>>>  at org.apache.uima.impl.AnalysisEngineFactory_impl.
>>>> produceResource(AnalysisEngineFactory_impl.java:94)
>>>>>  at org.apache.uima.impl.CompositeResourceFactory_impl.
>>>> produceResource(CompositeResourceFactory_impl.java:62)
>>>>>  at org.apache.uima.UIMAFramework.
>> produceResource(UIMAFramework.
>>>> java:279)
>>>>>  at org.apache.uima.UIMAFramework.produceAnalysisEngine(
>>>> UIMAFramework.java:407)
>>>>>  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.
>>>> setup(ASB_impl.java:256)
>>>>>  at org.apache.uima.analysis_engine.impl.
>>>> AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.
>>>> java:429)
>>>>>  at 

Re: Could not find or load main class org.apache.uima.tools.cpm.CpmFrame

2018-06-01 Thread Peter Abramowitsch
Hi Barbara, thanks for sharing that.

Have you been able to build a running version from the trunk source in
Eclipse?  I've been having compile problems in ctakes-temporal and an API
mismatch in the JCasGen module when I run it from the Maven system in
Eclipse, so instead I need to build the type system from the command line.
Even then, the binary distribution created by the 4.0.0 source build does
not behave like the binary distribution off the cTAKES site.  Have you
noticed that?

Sent from my iPad

> On Jun 1, 2018, at 23:13, Barbara Moloney  
> wrote:
> 
> I found the solution at
> https://stackoverflow.com/questions/30386833/uima-example-in-eclipse-not-working
> with
> some modification. I needed to create a UIMA user library in Eclipse for
> each ctakes run configuration by adding external Jars from UIMA_HOME\lib.
> 
> Barbara
> 
> 
> On 1 May 2018 at 11:28, Barbara Moloney 
> wrote:
> 
>> Hi
>> I'm new to cTAKES (and Java 1.8) and trying to get run configurations to
>> work. I have downloaded the source code from URL as specified in the
>> Developer Install instructions for cTAKES 4.0. When I try to run the
>> configuration within Eclipse (eg UIMA_CPE_GUI) I get an error message
>> 
>> Error: Could not find or load main class org.apache.uima.tools.cpm.CpmF
>> rame
>> 
>> I'm suspecting the problem may be in my path and/or classpath variables.
>> 
>> I have another Eclipse workspace for UIMA and the run configurations are
>> working.
>> 
>> Can anyone help please.
>> 
>> Barbara
>> 
>> 
> 


Re: Run cTAKES continuously

2018-06-12 Thread Peter Abramowitsch
The best solution would be to put it in a server framework.  I was not able
to get the EpsilonTeam server to work, but there's another tiny server
version written in Scala which you can try.  I ended up doing one using the
Spark REST framework.   You can build a non server / non UI version which
does run at the command line by coding it up (in Java) to create the
pipeline or using a piper, then create a jCas which you use/reset/reuse

The core of it would be a loop like this

jcas.setDocumentText(note.getFree_text());
_aae.process(jcas);

On Tue, Jun 12, 2018 at 8:05 PM, Ted Pikul  wrote:

> Hi- I’ve been able to successfully run cTAKES from the command line as
> documented here:
> https://cwiki.apache.org/confluence/display/ctakes/
> default+clinical+pipeline
>
> This works great, but each time it runs it has to make the database
> connection using jdbc and load the model, which takes 15 seconds or so.
>
> Is there another script besides the runClinicalPipeline.sh that I can run
> to just keep this running and send new notes to it rather than getting the
> db connection and loading the model each time?
>
> I know there is the cTAKES rest server project:
> https://github.com/GoTeamEpsilon/ctakes-rest-service which I think might
> do
> what I’m looking to do.  but as it’s still in alpha stage, especially the
> docker piece of it, and I don’t really need a server I can just run from
> command line, I’m not sure this is the right solution for me.
>
> I tried looking at how the runctakesCVD.sh script works, as it does what I
> need but with the CVD UI, but I couldn’t quite figure it out from looking
> at the UIMA code.
>
> Any guidance here is greatly appreciated. Thank you
>


Re: Run cTAKES continuously

2018-06-12 Thread Peter Abramowitsch
Sorry, the mail sent before I was ready.  This is pseudocode:

   // _aae is your analysis engine (there could be multiple)
   while (more notes) {
      jcas.setDocumentText(note.getFree_text());
      _aae.process(jcas);
      // do something with the jcas contents here
      jcas.reset();
   }
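Fleshed out a little, still as pseudocode: the engine/CAS lifecycle below
uses UIMA API names (produceAnalysisEngine, newJCas,
collectionProcessComplete), but the Note type and the way the description
is built are assumptions, so treat this as a sketch rather than working
code.

```java
// Pseudocode: build the engine once, then reuse a single CAS per note.
AnalysisEngineDescription aaeDesc = ...;  // from a piper or built by hand
AnalysisEngine _aae = UIMAFramework.produceAnalysisEngine(aaeDesc);
JCas jcas = _aae.newJCas();               // create once, outside the loop

for (Note note : notes) {                 // Note is your own record type
    jcas.setDocumentText(note.getFree_text());
    _aae.process(jcas);                   // run the full pipeline
    // read annotations out of the jcas here, e.g. JCasUtil.select(...)
    jcas.reset();                         // clear the CAS for the next note
}
_aae.collectionProcessComplete();         // flush any batch consumers
```

This way the expensive dictionary/model loading happens once at startup
instead of once per note.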

On Tue, Jun 12, 2018 at 11:06 PM, Peter Abramowitsch <
pabramowit...@gmail.com> wrote:

> The best solution would be to put it in a server framework.  I was not
> able to get the EpsilonTeam server to work, but there's another tiny server
> version written in Scala which you can try.  I ended up doing one using the
> Spark REST framework.   You can build a non server / non UI version which
> does run at the command line by coding it up (in Java) to create the
> pipeline or using a piper, then create a jCas which you use/reset/reuse
>
> The core of it would be a loop like this
>
> jcas.setDocumentText(note.getFree_text());
> _aae.process(jcas);
>
> On Tue, Jun 12, 2018 at 8:05 PM, Ted Pikul  wrote:
>
>> Hi- I’ve been able to successfully run cTAKES from the command line as
>> documented here:
>> https://cwiki.apache.org/confluence/display/ctakes/default+
>> clinical+pipeline
>>
>> This works great, but each time it runs it has to make the database
>> connection using jdbc and load the model, which takes 15 seconds or so.
>>
>> Is there another script besides the runClinicalPipeline.sh that I can run
>> to just keep this running and send new notes to it rather than getting the
>> db connection and loading the model each time?
>>
>> I know there is the cTAKES rest server project:
>> https://github.com/GoTeamEpsilon/ctakes-rest-service which I think might
>> do
>> what I’m looking to do.  but as it’s still in alpha stage, especially the
>> docker piece of it, and I don’t really need a server I can just run from
>> command line, I’m not sure this is the right solution for me.
>>
>> I tried looking at how the runctakesCVD.sh script works, as it does what I
>> need but with the CVD UI, but I couldn’t quite figure it out from looking
>> at the UIMA code.
>>
>> Any guidance here is greatly appreciated. Thank you
>>
>
>


Re: Run cTAKES continuously

2018-06-14 Thread Peter Abramowitsch
Hi Gandhi

I also had difficulty building it, and when it finally built, after I
tweaked some code in ctakes-temporal, it just returned null content.   I
figured it was in a state of flux.

But additionally, as I discussed with Matthew Vita, one of your colleagues,
the project could be more useful if there were a way of building and
developing with it outside docker as well.  As currently packaged, it seems
optimized for a one-off fetch and build from start to finish.  At the
moment, delegating all the heavy lifting to the shell script that
docker-compose invokes means that for each trivial change, the system has
to re-do a large amount of downloading and building.


On Thu, Jun 14, 2018 at 4:53 PM, Gandhi Rajan Natarajan <
gandhi.natara...@arisglobal.com> wrote:

> Hi Ted,
>
> Try building 'ctakes-web-rest' module in https://github.com/
> GoTeamEpsilon/ctakes-rest-service . Please let me know what's the issue
> you are facing.
>
> Please have a look at readme file once. You need to have all the ctakes
> dependency jars before building this module.
>
> Regards,
> Gandhi
>
>
> -Original Message-
> From: Ted Pikul [mailto:tedpik...@gmail.com]
> Sent: Thursday, June 14, 2018 8:12 PM
> To: dev@ctakes.apache.org
> Subject: Re: Run cTAKES continuously
>
> Thank you Peter and Gandhi.
>
> I’ve not been able to get the ctakes-rest-service to run successfully
> (using docker). I’m also not sure it supports UMLS credentials and it looks
> like the UMLS database is a local copy although I could be misunderstanding
> that. Due to the license confusion around running a local copy of UMLS
> database I’d rather just avoid that.
>
> I’ll try the implementation suggested by Peter.
>
> I also found that Tika has a cTAKES REST API, but unfortunately it loads
> the model on each request.
>
> On Wednesday, June 13, 2018, Gandhi Rajan Natarajan <
> gandhi.natara...@arisglobal.com> wrote:
>
> > Hi Ted,
> >
> > The implementation suggested by Peter is already available in
> > https://github.com/GoTeamEpsilon/ctakes-rest-service/tree/master/ctake
> > s-
> > web-rest
> >
> > Building this project will give you a WAR file which you need to
> > deploy in Tomcat.
> >
> > Regards,
> > Gandhi
> >
> > -Original Message-
> > From: Peter Abramowitsch [mailto:pabramowit...@gmail.com]
> > Sent: Wednesday, June 13, 2018 2:40 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Run cTAKES continuously
> >
> > Sorry, the mail was sent before I was ready. This is pseudocode:
> >
> >    // _aae is your analysis engine (there could be multiple)
> >    while (more notes) {
> >        jcas.setDocumentText(note.getFree_text());
> >        _aae.process(jcas);
> >        // do something with the jcas contents here
> >        jcas.reset();
> >    }
> >
> > On Tue, Jun 12, 2018 at 11:06 PM, Peter Abramowitsch <
> > pabramowit...@gmail.com> wrote:
> >
> > > The best solution would be to put it in a server framework.  I was
> > > not able to get the EpsilonTeam server to work, but there's another
> > > tiny server version written in Scala which you can try.  I ended up
> > > doing one
> > using the
> > > Spark REST framework.   You can build a non server / non UI version
> which
> > > does run at the command line by coding it up (in Java) to create the
> > > pipeline or using a piper, then create a jCas which you
> > > use/reset/reuse
> > >
> > > The core of it would be a loop like this
> > >
> > > jcas.setDocumentText(note.getFree_text());
> > > _aae.process(jcas);
> > >
> > > On Tue, Jun 12, 2018 at 8:05 PM, Ted Pikul 
> wrote:
> > >
> > >> Hi- I’ve been able to successfully run cTAKES from the command line
> > >> as documented here:
> > >> https://cwiki.apache.org/confluence/display/ctakes/default+
> > >> clinical+pipeline
> > >>
> > >> This works great, but each time it runs it has to make the database
> > >> connection using jdbc and load the model, which takes 15 seconds or
> so.
> > >>
> > >> Is there another script besides the runClinicalPipeline.sh that I
> > >> can run to just keep this running and send new notes to it rather
> > >> than getting the db connection and loading the model each time?
> > >>
> > >> I know there is the cTAKES rest server project:
> > >> https://github.com/GoTeamEpsilon/ctakes-rest-service which I think
> > >> might do what I

Licensing, API, And Multiple Users -- attn Sean Finan

2018-07-20 Thread Peter Abramowitsch
Hello

I'm working as a consultant to a large academic institution, implementing a
customized version of CTAKES as a layer in a larger warehouse of clinical
data.   Although the institution is singular, there could be multiple
"virtual" users - call them departments or projects that are having data
processed on their behalf.

While I have created my own REST implementation, I've seen that the API
versions that are referenced on the CTAKES website also just do a single
UMLS authentication at startup.   So different consumers can, in fact, have
their documents processed under the auspices of a single UMLS identity.

I can think of several architectural solutions to this issue, with various
impacts on performance, but I'd like to know what is the official position
on offering cTAKES as a service in an academic setting.  Would the rules be
different for a service that, for example, exposed pure MetaMap?   There
are some pages online that refer to this situation.

Peter


Re: Need Assistance !!!

2018-07-24 Thread Peter Abramowitsch
This is a sophisticated and subtle arena of work to which many people have
contributed, so your questions cannot be answered in just a few
sentences.  A couple of articles that might help:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528047/
http://dms.data.jhu.edu/data-management-resources/publish-and-share/de-identify-human-subjects-data/applications-to-assist-in-de-identification-of-human-subjects-data/

Because you didn't mention HIPAA, you need to check the national legal
environment in which you will be using these notes and see what types of
identifying information pertain to your use case.  If it is HIPAA, you can
find specific guidance on attributes that should be dummied, excised, or
obfuscated.

On Tue, Jul 24, 2018 at 12:36 PM, RAJAT TANWAR 
wrote:

> I was trying to de-identify my clinical notes and i came across cTakes. I
> am very impressed with the demo of cTakes.
>
> I want to install it and De-identify my clinical notes.
>
> Could i be  provided with some assistance on how to approach it.
>
> Thanks and Regards
>


Re: Confusion regarding licensing of UMLS

2018-08-11 Thread Peter Abramowitsch
Hi, yes there are a lot of details and use cases that are not documented,
but there are quite a few bits of information if you take the time.

To give some quick "answers" to your questions...  (I put that in quotes
because these are interpretations and not legal opinions)

You do not need a UMLS license if you don't include the UMLS lookup
mechanism.  In this case only the Apache license probably applies:
http://www.apache.org/licenses/

But if you do use UMLS, start by reading
https://www.nlm.nih.gov/databases/umls.html
https://uts.nlm.nih.gov/license.html
https://uts.nlm.nih.gov//help/license/validateumlsuserhelp.html

There are options for situations where you would centralize a ctakes
service.  Some of the provisions have to do with the fact that
SNOMED vocabulary details are included in the UMLS output, and the
SNOMED organizations have specific licensing requirements that you can see
on their site.

I am nearly positive that you cannot use the UMLS mechanism in a for-profit
setting without an explicit license from SNOMED.  Also, regardless of
whether the service as you deliver it is centralized, as I understand the
language and diagrams on these sites, each and every end user needs to
have, and actively validate a current UMLS license.

I've been involved in something similar to what you describe, but in a
non-commercial setting, and it does validate every end user.

Peter






On Sat, Aug 11, 2018 at 7:45 AM Abhishek De 
wrote:

> Hi all,
>
> I am Abhishek De, a newcomer to the CTAKES community. I had worked with
> CTAKES over 4 years ago, but now I am trying to incorporate it into my Java
> application as a senior developer, so I need to know a lot of things about
> it. For starters, I can see that even the pipeline without UMLS needs UMLS
> credentials in order to run. So I have the following questions:
>
> 1. Is there a certain fee which is required in order to apply for a UMLS
> license?
>
> 2. My organization is building this pharmacovigilance app which will
> internally use CTAKES for its client. Is it okay for only my parent
> organization to have the UMLS license or do the client needs to have it
> too? If I expose the application as a service, with the servers residing in
> our organization's premises, would the client still need to have a licnse?
>
> 3. As there is precious little documentation on CTAKES, and many links
> there are now dead including the ones giving instructions on how to install
> dictionaries for RxNorm or SNOMED-CT, could anyone please help me in this
> regard?
>
> 4. Also, I would like to know if there are any small dictionaries which
> could be used free of charge or licensing, for my PoC.
>
> Thanks and Regards,
> Abhishek De
>


Re: Confusion regarding licensing of UMLS

2018-08-11 Thread Peter Abramowitsch
I would suggest looking into the piper system - the scripts piperCreator
and piperRunner so you can construct/run pipelines with the bits you want.
Also look for the piper samples scattered throughout the resources area.
But assembling pipelines using Java is simple too.   You just need to learn
the flow and the upstream dependencies of the parsers and annotators.
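For a concrete starting point, a minimal piper file might look like the sketch below. This is an assumption-laden example, not one of the shipped pipers: the readFiles/load/add/writeXmis commands follow the 4.0 piper conventions, the directory paths are placeholders, and DefaultJCasTermAnnotator is the dictionary-lookup engine mentioned elsewhere in this thread — check the piper files under the resources area for the authoritative syntax.

```
// read plain-text notes from a directory (placeholder path)
readFiles /data/notes

// tokenization, sentence detection, POS tagging, chunking
load DefaultTokenizerPipeline

// fast dictionary lookup (requires valid UMLS credentials)
add DefaultJCasTermAnnotator

// the writeXmis shortcut noted elsewhere in this archive
writeXmis /data/output
```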

Peter

On Sat, Aug 11, 2018 at 3:28 PM Abhishek De 
wrote:

> Hi Peter,
>
> As you said, even if I don't use the UMLS dictionaries, I should be able to
> run the pipeline, right? But the class ClinicalPipelineFactory in the
> ctakes-clinical-pipeline module, which initializes multiple pipelines for
> these various use cases, only defines a getDefaultPipeline() and
> getFastPipeline() which have some annotating components in them. The
> getDefaultPipeline() method defines a UmlsDictionaryLookupAnnotator in its
> workflow. Even the fast pipeline defines a DefaultJCasTermAnnotator in its
> workflow. When I try to run the sample main() method with either of these
> pipelines, they throw an error requiring the UMLS credentials. I mean, the
> ClinicalPipelineWithUmls class is already present for handling the UMLS
> path. Why do the rest of the pipelines also require UMLS authentication?
> Please suggest a pipeline which I might run and get some annotations
> identified from the text without UMLS.
>
> Thanks and Regards,
> Abhishek
>
> On Sat, Aug 11, 2018 at 1:36 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hi, yes there are a lot of details and use cases that are not documented,
> > but there are quite a few bits of information if you take the time.
> >
> > To give some quick "answers" to your questions...  (I put that in quotes
> > because these are interpretations and not legal opinions)
> >
> > You do not need a UMLS license if you don't include the UMLS lookup
> > mechanism.In this case only the Apache license probably applies
> > http://www.apache.org/licenses/
> >
> > But if you do use UMLS, start by reading
> > https://www.nlm.nih.gov/databases/umls.html
> > https://uts.nlm.nih.gov/license.html
> > https://uts.nlm.nih.gov//help/license/validateumlsuserhelp.html
> >
> > There are options for situations where you would centralize a ctakes
> > service.  Some of the provisions have to do with the fact that
> > that SNOMED vocabulary details are included in the UMLS output, and the
> > SNOMED organizations have specific licensing requirements that you can
> see
> > on their site.
> >
> > I am nearly positive that you cannot use the UMLS mechanism in a
> for-profit
> > setting without an explicit license from SNOMED.  Also, regardless of
> > whether the service as you deliver it is centralized, as I understand the
> > language and diagrams on these sites, each and every end user needs to
> > have, and actively validate a current UMLS license.
> >
> > I've been involved in something similar to what you describe, but in a
> > non-commercial setting, and it does validate every end user.
> >
> > Peter
> >
> >
> >
> >
> >
> >
> > On Sat, Aug 11, 2018 at 7:45 AM Abhishek De 
> > wrote:
> >
> > > Hi all,
> > >
> > > I am Abhishek De, a newcomer to the CTAKES community. I had worked with
> > > CTAKES over 4 years ago, but now I am trying to incorporate it into my
> > Java
> > > application as a senior developer, so I need to know a lot of things
> > about
> > > it. For starters, I can see that even the pipeline without UMLS needs
> > UMLS
> > > credentials in order to run. So I have the following questions:
> > >
> > > 1. Is there a certain fee which is required in order to apply for a
> UMLS
> > > license?
> > >
> > > 2. My organization is building this pharmacovigilance app which will
> > > internally use CTAKES for its client. Is it okay for only my parent
> > > organization to have the UMLS license or do the client needs to have it
> > > too? If I expose the application as a service, with the servers
> residing
> > in
> > > our organization's premises, would the client still need to have a
> > licnse?
> > >
> > > 3. As there is precious little documentation on CTAKES, and many links
> > > there are now dead including the ones giving instructions on how to
> > install
> > > dictionaries for RxNorm or SNOMED-CT, could anyone please help me in
> this
> > > regard?
> > >
> > > 4. Also, I would like to know if there are any small dictionaries which
> > > could be used free of charge or licensing, for my PoC.
> > >
> > > Thanks and Regards,
> > > Abhishek De
> > >
> >
>


Re: Confusion regarding licensing of UMLS

2018-08-12 Thread Peter Abramowitsch
r/pipeline/TsChunkerSubPipe.piper



On Sun, Aug 12, 2018 at 6:04 AM Abhishek De 
wrote:

> Hi Peter,
>
> I found the PiperCreator class in the ctakes-gui module which is opening up
> the Pipeline Fabricator GUI with a lot of pipe bits which are seeming to be
> overwhelming!! If I could get the piper samples which you are talking
> about, I could have got a taste of how this thing works. But the resources
> for this module does not contain those, nor does the ctakes-resources
> separate download which contains the dictionaries. Also, you said "script"
> and not "class". That makes me think you are speaking of something else
> entirely, since there is no class named PiperRunner. So could you please
> point me where to download these scripts and the resources?
>
> Thanks and Regards,
> Abhishek
>
> On Sat, Aug 11, 2018 at 7:10 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > I would suggest looking into the piper system - the scripts piperCreator
> > and piperRunner so you can construct/run pipelines with the bits you
> want.
> > Also look for the piper samples scattered throughout the resources area.
> > But assembling pipelines using Java is simple too.   You just need to
> learn
> > the flow and the upstream dependencies of the parsers and annotators.
> >
> > Peter
> >
> > On Sat, Aug 11, 2018 at 3:28 PM Abhishek De 
> > wrote:
> >
> > > Hi Peter,
> > >
> > > As you said, even if I don't use the UMLS dictionaries, I should be
> able
> > to
> > > run the pipeline, right? But the class ClinicalPipelineFactory in the
> > > ctakes-clinical-pipeline module, which initializes multiple pipelines
> for
> > > these various use cases, only defines a getDefaultPipeline() and
> > > getFastPipeline() which have some annotating components in them. The
> > > getDefaultPipeline() method defines a UmlsDictionaryLookupAnnotator in
> > its
> > > workflow. Even the fast pipeline defines a DefaultJCasTermAnnotator in
> > its
> > > workflow. When I try to run the sample main() method with either of
> these
> > > pipelines, they throw an error requiring the UMLS credentials. I mean,
> > the
> > > ClinicalPipelineWithUmls class is already present for handling the UMLS
> > > path. Why do the rest of the pipelines also require UMLS
> authentication?
> > > Please suggest a pipeline which I might run and get some annotations
> > > identified from the text without UMLS.
> > >
> > > Thanks and Regards,
> > > Abhishek
> > >
> > > On Sat, Aug 11, 2018 at 1:36 PM Peter Abramowitsch <
> > > pabramowit...@gmail.com>
> > > wrote:
> > >
> > > > Hi, yes there are a lot of details and use cases that are not
> > documented,
> > > > but there are quite a few bits of information if you take the time.
> > > >
> > > > To give some quick "answers" to your questions...  (I put that in
> > quotes
> > > > because these are interpretations and not legal opinions)
> > > >
> > > > You do not need a UMLS license if you don't include the UMLS lookup
> > > > mechanism.In this case only the Apache license probably applies
> > > > http://www.apache.org/licenses/
> > > >
> > > > But if you do use UMLS, start by reading
> > > > https://www.nlm.nih.gov/databases/umls.html
> > > > https://uts.nlm.nih.gov/license.html
> > > > https://uts.nlm.nih.gov//help/license/validateumlsuserhelp.html
> > > >
> > > > There are options for situations where you would centralize a ctakes
> > > > service.  Some of the provisions have to do with the fact that
> > > > that SNOMED vocabulary details are included in the UMLS output, and
> the
> > > > SNOMED organizations have specific licensing requirements that you
> can
> > > see
> > > > on their site.
> > > >
> > > > I am nearly positive that you cannot use the UMLS mechanism in a
> > > for-profit
> > > > setting without an explicit license from SNOMED.  Also, regardless of
> > > > whether the service as you deliver it is centralized, as I understand
> > the
> > > > language and diagrams on these sites, each and every end user needs
> to
> > > > have, and actively validate a current UMLS license.
> > > >
> > > > I've been involved in something similar to what you describe, but in
> a
> > > > non-commercial setti

Re: Confusion regarding licensing of UMLS

2018-08-14 Thread Peter Abramowitsch
gine_impl.java:429)
> at
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:373)
> at
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:186)
> at
> org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
> at
> org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:331)
> at
> org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:448)
> at
> org.apache.uima.fit.factory.AnalysisEngineFactory.createEngine(AnalysisEngineFactory.java:205)
> at
> org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:227)
> at
> org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:260)
> at
> org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.main(ClinicalPipelineFactory.java:169)
> Caused by: java.lang.NullPointerException
> at java.io.File.<init>(File.java:277)
> at
> org.apache.ctakes.assertion.medfacts.AssertionAnalysisEngine.initialize(AssertionAnalysisEngine.java:89)
> at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:266)
> ... 26 more
> 
> Upon tracing the root of the problem, I found that the
> AssertionAnalysisEngine tries to initialize certain model resources, and
> tries to obtain their file paths from a UimaContext:
> 
> String assertionModelResourceKey = "assertionModelResource";
> String assertionModelFilePath =
>         getContext().getResourceFilePath(assertionModelResourceKey);
> assertionModelFile = new File(assertionModelFilePath);
> 
> I'm thinking that perhaps this context itself is not getting resolved. I
> tried to search for some answers on the web, like this:
> 
> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201804.mbox/%3ccao1jdhhf0uo3edyxotwkwwm00+wwpwyjveaqgde35opmwng...@mail.gmail.com%3E
> 
> However, here, the file was being directly asked for by name, and I don't
> have a concrete filename to search for. The ctakes-assertion-res module
> contains some model files, but I can't correlate them to the files being
> sought here. Please help.
> 
> Thanks and Regards,
> Abhishek
> 
> On Sun, Aug 12, 2018 at 10:27 PM Peter Abramowitsch 
> wrote:
> 
>> Hi Abhishek, not meaning to offend, but I think you need to spend more time
>> on the wiki site, a lot of what you are asking about is documented there
>> 
>> 
>> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+User+Install+Guide#cTAKES4.0UserInstallGuide-cTAKESPipelineFabricatorGUI(CreatingPiperFiles)
>> 
>> If you have the binary release, you'll find the launch scripts in ./bin,
>> if you have source you'll find it all in
>> ./ctakes-distribution/src/main/bin
>> 
>> The piper files are sprinkled throughout based on the modules that use them
>> and are grouped under a common resource folder in the binary release
>> 
>> 
>> ./ctakes/trunk/ctakes-assertion-res/target/classes/org/apache/ctakes/assertion/pipeline/TsAttributeCleartkSubPipe.piper
>> 
>> ./ctakes/trunk/ctakes-assertion-res/target/classes/org/apache/ctakes/assertion/pipeline/AssertionSubPipe.piper
>> 
>> ./ctakes/trunk/ctakes-assertion-res/target/classes/org/apache/ctakes/assertion/pipeline/AttributeCleartkSubPipe.piper
>> 
>> ./ctakes/trunk/ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/pipeline/TsAttributeCleartkSubPipe.piper
>> 
>> ./ctakes/trunk/ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/pipeline/AssertionSubPipe.piper
>> 
>> ./ctakes/trunk/ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/pipeline/AttributeCleartkSubPipe.piper
>> 
>> ./ctakes/trunk/ctakes-core-res/target/classes/org/apache/ctakes/core/pipeline/DefaultTokenizerPipeline.piper
>> 
>> ./ctakes/trunk/ctakes-core-res/target/classes/org/apache/ctakes/core/pipeline/FullTokenizerPipeline.piper
>> 
>> ./ctakes/trunk/ctakes-core-res/target/classes/org/apache/ctakes/core/pipeline/TsDefaultTokenizerPipeline.piper
>> 
>> ./ctakes/trunk/ctakes-core-res/target/classes/org/apache/ctakes/core/pipeline/TsFullTokenizerPipeline.piper
>> 
>> ./ctakes/trunk/ctakes-core-res/src/main/resources/org/apache/ctakes/core/pipeline/DefaultTokenizerPipeline.piper
>> 
>> ./ctakes/trunk/ctakes-core-res/src/main/resources/org/apach

Re: Confusion regarding licensing of UMLS [EXTERNAL]

2018-08-15 Thread Peter Abramowitsch
Thanks Sean.  I too think something is broken with the Assertion Engine
resource lookup.  There are aspects of the Assertion package that I found
useful in the past.  The Zoner for instance.

On Wed, Aug 15, 2018, 3:18 PM Finan, Sean 
wrote:

> Hi Abhishek,
>
> There is a very short introduction to building pipelines in ctakes 4.0 in
> a pamphlet that can be found at
> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0 , section
> "Documentation", second link.  Pamphlet section 5.2, page 5.
>
> There is a runnable example that prints discovered mentions and properties
> using a code-built pipeline.  ctakes-examples project, class
> ProcessDirBuilderRunner.  It uses the newer Cleartk engines instead of the
> Assertion engines.
>
> Sean
> 
> From: Abhishek De 
> Sent: Tuesday, August 14, 2018 11:23 PM
> To: dev@ctakes.apache.org
> Subject: Re: Confusion regarding licensing of UMLS [EXTERNAL]
>
> Hi Peter,
>
> Thanks for all the help you provided me even from your vacation. If someone
> else is following this thread, please help me in this regard. I had hoped
> that 4 years later, there would be some decent literature as well as
> tutorials on the web, but sadly, it isn't the case. So could anyone please
> help me in this regard?
>
> Thanks and Regards,
> Abhishek
>
> On Wed, Aug 15, 2018 at 2:37 AM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hi Abhishek,
> >
> > I sympathize with your problem, and believe I’ve had the same one with
> the
> > Assertion Engine.  I think there are indeed resource lookup issues and
> have
> > had the same ones including that engine in a piper file.  Creating a
> > resource description from a class only works for engine sources that have
> > specific constant paths defined, and even then the resources need to be
> > present at those paths.   However, I’m on vacation at the moment, and so
> I
> > really don’t have the time or sources to help you.  Perhaps someone else
> on
> > this mailing list can.
> >
> > Peter
> >
> > Sent from my iPad
> >
> > > On Aug 14, 2018, at 05:16, Abhishek De 
> wrote:
> > >
> > > Hi Peter,
> > >
> > > Thanks a lot for bearing with me. Actually, I'm working on a tight
> > > deadline, and going through the entire codebase and the wiki
> meticulously
> > > would take me over a month!! I'm sincerely hoping you would understand
> > and
> > > bear with me.
> > >
> > > As an update, I had applied for a UMLS license which got rejected,
> don't
> > > know why!! I have re-applied for it, fingers crossed, this time, may
> the
> > > UMLS gods accept my prayer!!
> > >
> > > My sole aim, for now is to get a pipeline working from Java code, whose
> > > output I can see on the console window. I had a look at this for the
> > order
> > > in which the components are to be set up for a working pipeline:
> > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_CTAKES_cTAKES-2B4.0-2BComponent-2BUse-2BGuide&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=paY_XjRKMyWf-1kr-_W9Z2jh1wQvymH2PIyF4_tBow0&s=tED25FbW6NUUIcd3GeshS4b_FhLDAbcTqMf7lwnAVOw&e=
> > >
> > > As assertion comes first, I attempted to attack it first. I took the
> > > TokenProcessingPipeline in ClinicalPipelineFactory and added the
> > > AssertionAnalysisEngine to it. However, since the class was missing
> > > a createAnnotatorDescription() method like the others, which is
> required
> > by
> > > the builder, I copied and pasted it there.
> > >
> > > public static AnalysisEngineDescription createAnnotatorDescription()
> > throws
> > > ResourceInitializationException{
> > >  return
> > >
> >
> AnalysisEngineFactory.createEngineDescription(AssertionAnalysisEngine.class);
> > > }
> > >
> > > The TokenProcessingPipeline now looked like this:
> > >
> > > public static AnalysisEngineDescription getTokenProcessingPipeline()
> > throws
> > > ResourceInitializationException, MalformedURLException {
> > >  AggregateBuilder builder = new AggregateBuilder();
> > >  builder.add(AssertionAnalysisEngine.createAnnotatorDescription());
> > >  builder.add( SimpleSegmentAnnotator.createAnnotatorDescription()
> );
> > >  builder.add( SentenceDetector.createAnnotatorDesc

Compiling Ctakes from Subversion

2018-11-11 Thread Peter Abramowitsch
I cannot find a configuration that permits a complete compile of the SVN
4.0.1 version of the code.

Is anyone having this problem?

When compiling using Java 1.8.x you get compile errors in ctakes-temporal.
There are various rather oblique assignments between generic types, which
produce errors of this sort:

[ERROR]
/Users/peterabramowitsch/projects/apache/ctakes/trunk2/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[899,45]
[unchecked] unchecked generic array creation for varargs parameter of type
Collection[]
[ERROR]   where CAP#1 is a fresh type-variable:
CAP#1 extends TOP from capture of ? extends TOP

Then if you try to compile it using Java 1.9.x, ctakes-temporal compiles but
now you have a problem with ctakes-ytex, which depends on some JAXB modules
for the Hibernate schema generation.
Root cause:

Caused by: java.lang.ClassNotFoundException:
javax.activation.MimeTypeParseException
at
org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy.loadClass(SelfFirstStrategy.java:50)
at
org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:244)
at
org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:230)
... 48 more

These are no longer included by default with the JDK.  Following help found
on the net, I added this dependency to that project's pom:


<dependency>
    <groupId>javax.activation</groupId>
    <artifactId>activation</artifactId>
    <version>1.1.1</version>
</dependency>


But the problem did not go away.

*So my question is this:  Has anybody been able to compile the current SVN
version from top to bottom, and if so, exactly what environment are you
using to do it?*
*Java version?  Mods to POM files, source files, etc.*

Peter


Re: Compiling Ctakes from Subversion

2018-11-12 Thread Peter Abramowitsch
Thanks Gandhi

Turns out, after a bunch of tests, that the older version of JDK 8 I was
using (update 25) must not have been able to link to the generic classes it
had generated.  I was prepared to find that my problem was a contaminated
Maven cache or some other error of mine, but after a lot of testing, I
narrowed it down to just the Java compiler version.  On the latest version
of 8 (jdk1.8.0_192) it does compile straight through.

Perhaps someone with write access to the Wiki pages should note it in the
developer instructions:
that it needs to be a late version of Java 8 (and not Java 9 or later,
until changes are made to the POM dependencies in ytex that bring
java.activation back in).
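One way a build could make that requirement explicit is a requireJavaVersion rule in the parent pom. A sketch only — this plugin configuration is not in the current cTAKES poms; the version-range syntax is maven-enforcer's, and the lower bound reflects the 1.8.0_171 that was reported working in this thread:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.4.1</version>
  <executions>
    <execution>
      <id>enforce-jdk</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireJavaVersion>
            <!-- a late Java 8 only: 1.8.0_171+ reported working; Java 9+ breaks ytex -->
            <version>[1.8.0-171,1.9)</version>
          </requireJavaVersion>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```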

Peter

On Mon, Nov 12, 2018 at 7:17 AM Gandhi Rajan Natarajan <
gandhi.natara...@arisglobal.com> wrote:

> Hi Peter,
>
> I use Java 1.8.0_171. Checked out the source code from -
> https://svn.apache.org/repos/asf/ctakes/trunk
>
> I don’t get any compilation issue in ctakes-temporal module running the
> command - " mvn clean install"
>
> Regards,
> Gandhi
>
> -Original Message-
> From: Peter Abramowitsch 
> Sent: Sunday, November 11, 2018 4:50 PM
> To: dev@ctakes.apache.org
> Subject: Compiling Ctakes from Subversion
>
> I cannot find a configuration that permits a complete compile of the SVN
> 4.0.1 version of the code.
>
> Is anyone having this problem?
>
> When compiling using JAVA 1.8.x  you get compile errors in ctakes-temporal
> There are various rather oblique assignments between Generic types, which
> produce errors of this sort.
>
> [ERROR]
>
> /Users/peterabramowitsch/projects/apache/ctakes/trunk2/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[899,45]
> [unchecked] unchecked generic array creation for varargs parameter of type
> Collection[]
> [ERROR]   where CAP#1 is a fresh type-variable:
> CAP#1 extends TOP from capture of ? extends TOP
>
> Then if you try to compile it using JAVA 1.9.x ctakes-temporal compiles
> but now you have a problem with ctakes-ytex which depends on some jaxb
> modules for the hibernate schema generation.
> root cause:
>
> Caused by: java.lang.ClassNotFoundException:
> javax.activation.MimeTypeParseException
> at
>
> org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy.loadClass(SelfFirstStrategy.java:50)
> at
>
> org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:244)
> at
>
> org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:230)
> ... 48 more
>
> These are no longer included by default with the JDK.  Following advice
> found on the net, I added this dependency to that project's pom.
>
> <dependency>
>   <groupId>javax.activation</groupId>
>   <artifactId>activation</artifactId>
>   <version>1.1.1</version>
> </dependency>
>
> But the problem did not go away.
>
> *So my question is this:   Has anybody been able to compile the current SVN
> version from top to bottom, and if so, exactly what environment are you
> using to do it*
> *Java Version?Mods to  POM files, Source Files etc.*
>
> Peter
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you are not the named addressee you should not disseminate, distribute
> or copy this e-mail. Please notify the sender or system manager by email
> immediately if you have received this e-mail by mistake and delete this
> e-mail from your system. If you are not the intended recipient you are
> notified that disclosing, copying, distributing or taking any action in
> reliance on the contents of this information is strictly prohibited and
> against the law.
>


Re: Compiling Ctakes from Subversion [EXTERNAL]

2018-11-12 Thread Peter Abramowitsch
I feel ashamed to say: update 25.  I tend to do updates infrequently and
jump forward many versions, to create longer periods of stability.  It just
happened that some new code in ctakes-temporal 4.0.1 no longer compiled
under that version.  Do you think it will soon be time to move forward off
of eight?

Peter

On Mon, Nov 12, 2018, 4:29 PM Finan, Sean wrote:

> Hi all,
>
> I have been using 111 - which turned 2 years old last month.
>
> When we first ported ctakes 4.0 to java8 I was using java 8 update 60 -
> now over 3 years old.
>
> Peter,
> what version were you using?
>
> Thanks,
> Sean
> ____
> From: Peter Abramowitsch 
> Sent: Monday, November 12, 2018 9:55 AM
> To: dev@ctakes.apache.org
> Subject: Re: Compiling Ctakes from Subversion [EXTERNAL]
>
> Thanks Gandhi
>
> It turns out, after a bunch of tests, that the older version of jdk8 I was
> using (update 25) must not have been able to link to the generic classes it
> had generated.  I was prepared to find that my problem was a contaminated
> maven cache or some other error of mine, but after a lot of testing I
> narrowed it down to just the java compiler version.  On the latest version
> of 8 (jdk1.8.0_192) it compiles straight through.
>
> Perhaps someone with write access to the Wiki pages should note it in the
> developer instructions: the build needs a late version of Java 8 (and not
> Java 9 or later, until changes are made to the POM dependencies in ytex
> that bring back in java.activation).
>
> Peter
>
> On Mon, Nov 12, 2018 at 7:17 AM Gandhi Rajan Natarajan <
> gandhi.natara...@arisglobal.com> wrote:
>
> > Hi Peter,
> >
> > I use Java 1.8.0_171. Checked out the source code from -
> >
> > https://svn.apache.org/repos/asf/ctakes/trunk
> >
> > I don’t get any compilation issue in ctakes-temporal module running the
> > command - " mvn clean install"
> >
> > Regards,
> > Gandhi
> >
> > -Original Message-
> > From: Peter Abramowitsch 
> > Sent: Sunday, November 11, 2018 4:50 PM
> > To: dev@ctakes.apache.org
> > Subject: Compiling Ctakes from Subversion
> >
> > I cannot find a configuration which permits a complete compile of the
> SVN
> > 4.0.1 version of the code
> >
> > Is anyone having this problem?
> >
> > When compiling using Java 1.8.x you get compile errors in
> > ctakes-temporal.
> > There are various rather oblique assignments between generic types, which
> > produce errors of this sort.
> >
> > [ERROR]
> >
> >
> /Users/peterabramowitsch/projects/apache/ctakes/trunk2/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/eval/Evaluation_ImplBase.java:[899,45]
> > [unchecked] unchecked generic array creation for varargs parameter of
> type
> > Collection[]
> > [ERROR]   where CAP#1 is a fresh type-variable:
> > CAP#1 extends TOP from capture of ? extends TOP
> >
> > Then if you try to compile it using JAVA 1.9.x ctakes-temporal compiles
> > but now you have a problem with ctakes-ytex which depends on some jaxb
> > modules for the hibernate schema generation.
> > root cause:
> >
> > Caused by: java.lang.ClassNotFoundException:
> > javax.activation.MimeTypeParseException
> > at
> >
> >
> org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy.loadClass(SelfFirstStrategy.java:50)
> > at
> >
> >
> org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:244)
> > at
> >
> >
> org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:230)
> > ... 48 more
> >
> > These are no longer included by default with the JDK.  Following advice
> > found on the net, I added this dependency to that project's pom.
> >
> > <dependency>
> >   <groupId>javax.activation</groupId>
> >   <artifactId>activation</artifactId>
> >   <version>1.1.1</version>
> > </dependency>
> >
> > But the problem did not go away.
> >
> > *So my question is this:   Has anybody been able to compile the current
> SVN
> > version from top to bottom, and if so, exactly what environment are you
> > using to do it*
> > *Java Version?Mods to  POM files, Source Files etc.*
> >
> > Peter
> >
>


Re: uima-as examples

2019-01-17 Thread Peter Abramowitsch
I used a completely different approach that allows parallel but not async
processing: multiple [analysis engine + CAS] pair objects pre-instantiated
into a threadsafe pool running behind a web service interface.  We can
fully saturate a single ctakes server process using multiple client
processes talking to that API, each working synchronously, arriving at an
overall speed of 10-15 6K notes per second on a single server process.

I haven't used AS but it looks as if that middleware could have too many
moving parts for our needs.  They would generate many wakeups and context
switches adding undesired latency as a request makes its way to the
server.   I'm assuming that in AS, the broker and the MQ are separate
processes and not just in-process subsystems to the ctakes server process.
Is that right?

On Thu, Jan 17, 2019 at 4:09 PM Greg Silverman  wrote:

> Anyone out there developed a pipeline using UIMA-AS, as opposed to the
> CPE/CPM file reader?
>
> Thanks in advance!
>
> Greg--
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE 
> Cardiovascular Informatics 
> University of Minnesota
> g...@umn.edu
>
>  ›  evaluate-it.org  ‹
>


Re: Newb question

2019-02-19 Thread Peter Abramowitsch
Probably most or even all of them are linked to a few specific errors in
your compilation environment, but you'd have to post a small sampling of
the errors to get any useful help.   Clearly it is something very basic
that is awry.Try grabbing a few errors from one of the core modules and
posting here.

Peter

On Mon, Feb 18, 2019 at 3:42 PM Karel Lahmy  wrote:

> Hello,
>
>
> I've gone over the steps for checking out the sources using Eclipse as
> described here:
>
> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+Developer+Install+Guide
> BUT I am getting tons of compilation errors (13,882 to be exact...), any
> idea what can be the issue?
>
>
> Thanks,
>
> Karel
>


Re: Negation

2019-03-01 Thread Peter Abramowitsch
In an IdentifiedAnnotation, the attribute "polarity"  reflects the negation
value.
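As a toy illustration of that convention (the `Mention` class below is a
stand-in I invented; real code would call `getPolarity()` on
`IdentifiedAnnotation` instances selected from the JCas, where -1 marks a
negated mention):

```java
import java.util.ArrayList;
import java.util.List;

public class PolarityDemo {

    /** Stand-in for cTAKES' IdentifiedAnnotation; only the fields this demo needs. */
    static class Mention {
        final String text;
        final int polarity; // -1 = negated, 1 = affirmed (cTAKES' convention)
        Mention(String text, int polarity) {
            this.text = text;
            this.polarity = polarity;
        }
    }

    /** Collects the covered text of every negated mention. */
    static List<String> negated(List<Mention> mentions) {
        List<String> out = new ArrayList<>();
        for (Mention m : mentions) {
            if (m.polarity == -1) {
                out.add(m.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Mention> mentions = new ArrayList<>();
        mentions.add(new Mention("chest pain", -1));  // e.g. from "denies chest pain"
        mentions.add(new Mention("hypertension", 1));
        System.out.println(negated(mentions)); // prints [chest pain]
    }
}
```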

On Fri, Mar 1, 2019 at 5:59 PM Greg Silverman  wrote:

> I found this:
> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Assertion
> ... however, I don't see any attributes or types regarding negation.
>
> On Fri, Mar 1, 2019 at 5:41 PM Greg Silverman  wrote:
>
> > I'm extracting CUI/concepts from our annotations and would like any
> > negations associated with these. We're running the default pipeline, but
> I
> > can't see anything in the XMI output that resembles a negation for the
> > UmlsConcepts. Are negation annotators not part of the default pipeline?
> >
> > Thanks!
> >
> > Greg--
> >
> > --
> > Greg M. Silverman
> > Senior Systems Developer
> > NLP/IE 
> > University of Minnesota
> > g...@umn.edu
> >
> >  ›  evaluate-it.org  ‹
> >
>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE 
> University of Minnesota
> g...@umn.edu
>
>  ›  evaluate-it.org  ‹
>


Re: ctake web service and multi-threading

2019-03-09 Thread Peter Abramowitsch
After seeing some of the thread contention issues last year, I started from 
scratch to create a pipeline pool that sizes itself according to the memory 
that’s available.  Each instance contains the complete pipeline including the 
Term Annotator and a re-settable JCas object.   I don’t use any of the thread 
constructs in piper files - to not confuse the issue.  All of this is accessed 
via a web service with a multi threaded dispatcher (SparkJava).  This 
implementation does allow each of the pool members to initialize its pipeline 
serially though.  It only starts handling requests when all members are ready.

It seems to be completely stable - running for hours with as many clients as 
the memory can handle.   Looking at performance stats, one can see that, single 
threaded, there are latencies introduced by dictionary lookups, and these 
‘holes’ are filled when multi threading, allowing a much greater ability to 
saturate the cpu.  After a certain point, of course, when all cores are 
saturated, the performance curve flattens out. But there are no exceptions.   

Introducing a wait  in the pool’s getMember() allows the system to deal 
effectively with overloading - ie, what to do when all members are busy: there 
are more requests than the pool-size will handle.  
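The pool-with-wait idea can be sketched with plain JDK primitives.
Everything here is illustrative (the `Member` stand-in, the 30-second
timeout, the class names); the real pool members would each hold a full
analysis engine plus a resettable CAS:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PipelinePool {

    /** Stand-in for an [analysis engine + resettable CAS] pair. */
    static class Member {
        String process(String note) {
            return "processed:" + note.length() + " chars";
        }
    }

    private final BlockingQueue<Member> idle;

    PipelinePool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        // build every member serially, up front; serve no requests until all are ready
        for (int i = 0; i < size; i++) {
            idle.add(new Member());
        }
    }

    /** Borrow a member, process one document, and always return the member. */
    String run(String note) {
        final Member member;
        try {
            member = idle.poll(30, TimeUnit.SECONDS); // block while all members are busy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pipeline", e);
        }
        if (member == null) {
            throw new IllegalStateException("pool exhausted; shed this request");
        }
        try {
            return member.process(note);
        } finally {
            idle.offer(member); // return the member even if processing threw
        }
    }

    public static void main(String[] args) {
        PipelinePool pool = new PipelinePool(4);
        System.out.println(pool.run("He denies chest pain."));
    }
}
```

The bounded `poll` is the overload behavior described above: when more
requests arrive than the pool size can absorb, callers wait and eventually
fail instead of piling onto a busy pipeline.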

I don’t think there would be a problem to have pool members with differently 
configured pipelines, but I never tried that.  And I suppose it could be that 
one of the annotators in the community might not work if it had any statics.  
But I’m guessing that they would primarily be configuration items set in the 
annotator’s initialization phase.   

Long story short - with a bit of extra packaging, you can get it to work 
without tampering with the core code.

Peter
Sent from my iPad

> On Mar 8, 2019, at 08:23, Jeffrey Miller  wrote:
> 
> Is there any known reason that you can't create a pipeline pool, but keep
> everything in the same process? Is it safe to load multiple pipelines in
> the same process as long as only one thread can access each one at a time
> (we plan to use this in a Spark pipeline). One caveat I have noticed- it
> seems like if I use the thread safe components to build a pipeline pool,
> only one dictionary for the DefaultJCasTermAnnotator can be loaded per
> process. For example, I was trying to take advantage of the ability to
> switch pipelines via a query parameter that is suggested at in the code for
> the rest service. The two pipelines used different ontology dictionaries,
> but it seemed like with the thread safe components it must have reduced
> the DefaultJCasTermAnnotator to a singleton object in memory, because it
> only used the first dictionary instantiated. Either way, given how Sean
> described how the thread safe components worked above, you probably
> wouldn't want to use them in a pipeline pool, assuming that the problems
> with threading was limited to multiple threads access the same pipeline at
> the same time, and not having multiple pipelines loaded into memory each
> accessed by only a single thread.
> 
> On Fri, Mar 8, 2019 at 11:06 AM Kathy Ferro 
> wrote:
> 
>> I thought about creating a queue that acts as traffic cop.  Only the
>> traffic cop calls the WS.  I also want to test multiple WS running on
>> different port.  Traffic cop calls which every WS is available and keep
>> track of WS statuses.  With all this processing going, it might kill the
>> power for blocks.
>> 
>> On Fri, Mar 8, 2019 at 10:34 AM Finan, Sean <
>> sean.fi...@childrens.harvard.edu> wrote:
>> 
>>> Hi all,
>>> 
>>> I guess that a quick test could be run with a multi-threaded pipeline.
>>> Tim, for some reason I recall you checking in one with a dockerfile.
>> Maybe
>>> not, and it might not be the default in the service.  Anyway, you could
>> set
>>> the procs to something like 50 and throw 50 users at it.  It definitely
>>> does not scale anything close to linearly.  ctakes aes aren't build for
>>> thread-safety, so they are all wrapped with locks and there is a lot of
>>> thread contention.  However, running such a test might indicate the
>> source
>>> of the problem.
>>> 
>>> The other option is to create a queue that collects post calls and doles
>>> them out serially to a single pipeline.  User #50 would probably not
>>> appreciate it though ...
>>> 
>>> From: gandhi rajan 
>>> Sent: Friday, March 8, 2019 10:02 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: ctake web service [EXTERNAL]
>>> 
>>> Hi Kathy,
>>> 
>>> I guess the initializations happens in post construct method. So if we
>>> could synchronize that I feel we can get away from the problem.
>>> Unfortunately I m not able to tet this as my setup is gone with my old
>> job.
>>> Try it out.
>>> 
>>> Regards,
>>> Gandhi.
>>> 
 On Friday, March 8, 2019, Kathy Ferro  wrote:
 
 Tim,
 
 Thanks for reply.  I'm continuing the research.  With all the layers
>> that
 wrap around this, you would thin

Re: ctake web service [EXTERNAL]

2019-03-09 Thread Peter Abramowitsch
I haven't made our code available, sorry.  Not sure if I can.  But from my
description, you should find it pretty easy.  I started by extending a
GenericObjectPool and going from there. As each instance is instantiated, I
re-read the piper file - creating a new engine which is assigned to a Pool
member.

Peter

On Sat, Mar 9, 2019 at 9:20 AM Jeffrey Miller  wrote:

> Thanks for your response Sean- we are still working on this (and have some
> things to look into given your last response), but I will share details
> when we have it working. We are still deciding on whether to use Spark or
> Apache Beam.
>
> Just to clarify my previous confusion, I assumed the TS wrappers were so
> you could avoid creating multiple pipelines and just run one instance of
> the pipeline with a separate JCAS per thread. I thought the main motivation
> behind that would be to avoid loading >1 dictionaries into memory, for
> example. But it sounds like I was mistaken. With respect to sharing
> resources, are static variables the main concern? Do you know if this is a
> problem for any of the annotators in the default clinical pipeline (the
> regular components, not the thread safe ones)? From Peter's response (I am
> not sure if that split off into another forum thread because the subject
> changed), it sounds like it may not be a problem? I'd like to really
> understand thread-safe with respect to core cTAKES components (with the
> caveat that community-created annotators could be implemented in any number
> of ways, making it hard to declare cTAKES is "thread-safe"). I'd be happy
> to contribute documentation back to the wiki once I feel I have a solid
> grasp on it.
>
> Peter- have you made your pipeline pool code available anywhere?
>
> On Fri, Mar 8, 2019 at 12:49 PM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > Hi all,
> >
> > >Is there any known reason that you can't create a pipeline pool, but
> keep
> > everything in the same process?
> > -- No, but ...
> > > Is it safe to load multiple pipelines in
> > the same process as long as only one thread can access each one at a time
> > (we plan to use this in a Spark pipeline).
> > -- If you are talking about oob ctakes being the process, only a single
> > pipeline will run on multiple threads.  The threads will share resources,
> > static variables, etc. and the  pipeline will give you terrible results
> and
> > very quickly crash.  That is why I wrote the thread-safe wrappers.
> > -- That being said, supposedly you can configure spark to handle this by
> > keeping everything contained in a unique copy per thread.  Sort of like
> > ThreadLocal (I think), but more effective on a full-pipeline level.
> >
> > > it must have reduced the DefaultJCasTermAnnotator to a singleton object
> > in memory.
> > -- Yes.  The thread-safe pipeline is not meant to have siblings in the
> > same process - the wrappers can only do so much.  That being said, I am
> > pretty sure that the Default... is thread-safe so it doesn't actually
> need
> > the wrapper.  Regardless, the rest of the pipeline would crash.
> >
> > Jeff, can you share information about your efforts on spark?  If we could
> > get that working and in standard ctakes it would be fantastic.
> >
> > I hope that this information is useful.
> >
> > Sean
> >
> >
> >
> > 
> > From: Jeffrey Miller 
> > Sent: Friday, March 8, 2019 11:23 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: ctake web service [EXTERNAL]
> >
> > Is there any known reason that you can't create a pipeline pool, but keep
> > everything in the same process? Is it safe to load multiple pipelines in
> > the same process as long as only one thread can access each one at a time
> > (we plan to use this in a Spark pipeline). One caveat I have noticed- it
> > seems like if I use the thread safe components to build a pipeline pool,
> > only one dictionary for the DefaultJCasTermAnnotator can be loaded per
> > process. For example, I was trying to take advantage of the ability to
> > switch pipelines via a query parameter that is suggested at in the code
> for
> > the rest service. The two pipelines used different ontology dictionaries,
> > but it seemed like with the thread safe components it must have reduced
> > the DefaultJCasTermAnnotator to a singleton object in memory, because it
> > only used the first dictionary instantiated. Either way, given how Sean
> > described how the thread safe components worked above, you probably
> > wouldn't want to use them in a pipeline pool, assuming that the problems
> > with threading was limited to multiple threads access the same pipeline
> at
> > the same time, and not having multiple pipelines loaded into memory each
> > accessed by only a single thread.
> >
> > On Fri, Mar 8, 2019 at 11:06 AM Kathy Ferro 
> > wrote:
> >
> > > I thought about creating a queue that acts as traffic cop.  Only the
> > > traffic cop calls the WS.  I also want to test multiple WS running on
> > 

Re: Threading and cTAKES (on Spark) [EXTERNAL]

2019-03-28 Thread Peter Abramowitsch
Actually  my implementation does not share a single pipeline across
threads, it creates a set of separate pipelines.  I found that once the
code is in memory, it actually does not take long to instantiate many
pipelines.  Each one is attached to a thread safe pool object that also
hosts a re-settable jCas.  When a request arrives on a thread, one of these
pipeline-jcas pairs is activated and assigned to a document.   Typically
each pool object needs about 1.7G.  On a multi core machine we can run as
many parallel threads as we have memory and send the processor idle time
down to 10% or less.   Since it doesn't rely on the annotators being thread
safe, I can use any of them.  Where they might have class variables - these
are usually for configuration only, and by instantiating all of them ahead
of time on a single thread, they are safely initialized.  The multi
threading only happens at document processing time.  We've run high
intensity sessions with many threads for 12-15 hours and never seen any
conflicts.

On Thu, Mar 28, 2019 at 9:20 PM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Jeff,
>
> > 1) do you think it might not crash yet produce unreliable results when
> using the components in the DefaultClinicalPipeline?
>
> -- I am pretty certain that you would get unreliable results.  I seem to
> recall attempts with the default pipeline crashing, but with a small corpus
> one could get lucky.
>
> > 2) Do you have any more information about [Spark]
>
> -- No, not really.  I don't work with it, I am just regurgitating from
> memory things read or heard.
>
> > 3) In the TS pipelines, what does the "threads" keyword ...
>
> -- "threads" specifies how many threads share a single pipeline.
> -- All annotators in this pipeline must be thread-safe.
> -- It is up to that single instance of a pipeline to be thread safe.
> "threads" does not enforce anything.
> -- "threads n" will attempt to process a maximum of n documents
> simultaneously on a pipeline.
> -- "threads n" works by running the single pipeline on n threads and
> running a single document through the pipeline on each thread.
> -- It is entirely up to the pipeline to determine the concurrency of
> processing documents.
> -- The more thread-safe annotators that don't require locking, the more
> utilized the threads will be.
>
> I hope that makes sense.
>
>
>
> 
> From: Jeffrey Miller 
> Sent: Thursday, March 28, 2019 3:51 PM
> To: dev@ctakes.apache.org
> Subject: Threading and cTAKES (on Spark) [EXTERNAL]
>
> Hi,
>
> I am following up on a discussion previously in the "re: ctakes web
> service" thread from this month. Apologies if I summarize anyone's comments
> incorrectly. Sean had commented that it would not be advisable to create a
> pool of pipelines and dispatch 1 per thread in the same process because the
> individual AEs have static variables and resources that would be shared
> across instances. I can comment that anecdotally, we have not seen crashes
> when doing this (but we have seen crashes when we are trying to share 1
> pipeline across > 1 thread). Nevertheless, I cannot guarantee that the
> annotations are happening correctly all the time or that we might not
> occasionally get unlucky and enter into a race condition. It also sounds
> like from Peter's comment in the previous thread,
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.apache.org_thread.html_93da8248b03b1c59135fb9b4030b0546a4631ec32d6f5c779d2821cc-40-253Cdev.ctakes.apache.org-253E&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=uYabaJeyLV-qVc3xJyB-6w9LVawSFytQEU37NnkdHV0&s=bwkSz7ZhmUnXJZmcm7zVEKuaMpsv_IH-Xs-UYZU3u3M&e=
> that a pipeline pool across multiple threads has been stable for his work.
> I have a couple of questions:
>
> 1) Does anyone else have experience with this? Sean, from your comments
> before, do you think it might not crash yet produce unreliable results when
> using the components in the DefaultClinicalPipeline?
>
> 2) Sean, you commented before
>
> > That being said, supposedly you can configure Spark to handle this by
> keeping everything contained in a unique copy per thread.  Sort of like
> ThreadLocal (I think), but more effective on a full-pipeline level.
>
> Do you have any more information about this- we are currently looking into
> it, and it looks like it should be possible to limit each executor (JVM) to
> a single thread, but I was wondering if you had any references to the
> ThreadLocal-style setup or knew anyone else that had tried it.
>
> 3) In the TS pipelines, what does the "threads" keyword in the piper file
> actually enforce? Is it the number of threads it will allow you to share
> the pipeline with or does it automatically create a threaded pipeline for
> you?
>
> Thanks!
> Jeff
>


Re: Reading clinical notes for specific predictions

2019-05-08 Thread Peter Abramowitsch
Hi Sekhar

The predictions item in your list of objectives is very tricky and cTakes,
or indeed any software system will only get you part of the way there.  CDS
(clinical decision support)  researchers have been on this path for many
years and it is clear that even a hybrid human/computational system is
limited in its accuracy & predictive ability.  And with medicine, a miss is
as good as a mile - as the saying goes.

As to your vocabularies question - if you don't already know the SNOMED
clinical ontology, and RxNorm resources I suggest you have a look.  cTakes
can fish out the appropriate CUIs and SNOMED term ids, and the ontologies
will help  you draw the lateral links through common parents - or in your
specific example, therapeutic classes.

- Peter

On Tue, May 7, 2019 at 6:47 PM Hari, Sekhar  wrote:

> Hi there -
>
> I'm trying to predict a few things from clinical notes as follows:
>
>
> 1.   Look at the notes and discharge summaries, and predict the
> re-admissions data, cardiac arrests, diabetes, and pre-term birth.
>
> 2.   Understand the vocabulary of doctors and pharmacies. For example,
> recognize that Tylenol and Acetaminophen refer to the same item. Have a
> good understanding of body parts and diseases. The vocabulary is
> domain-specific.
>
> 3.   The data is loaded from Cerner and EPIC.
>
> Can somebody help with suggestions on the list of pipelines that can be
> used to achieve (1) and (2) above? Should I also develop a machine-learning
> model along with cTAKES to get the desired results?
>
> Thanks
> Sekhar Hari | AI Program Lead | Health Sciences R&D | Asia Pacific
> Solutions Delivery Center
> +91 814 7027 779 (C)
>
>


Re: Reading clinical notes for specific predictions

2019-05-08 Thread Peter Abramowitsch
Ctakes can detect many forms for each "identified annotation", and it can
be trained with further dictionary development, to handle more acronyms,
more idiosyncratic speech etc, but it is not perfect.  For example, It is
not that good at the moment, for detection of temporal context to
distinguish medical history from current observation or from family
history.  It is better,  but still not perfect at detecting negated
concepts which can occur in many different forms depending on the
linguistic patterns of the specific physicians whose notes you are
reading.  What makes clinical notes particularly tricky as an NLP task is
that physicians are rushed - they abbreviate, they misspell, they create
staccato phrases instead of sentences, etc.  It is not like parsing
well-formed published text.

I have not tried all the available annotators, so you may want to
experiment and see what works best for you.

I hope you were joking about "a couple of algorithms".  Prediction is one
of the problems that has been addressed by thousands of highly trained
experts in diagnostics and clinical informatics.  I have found interesting
work that was done years ago by some people working on Inference engines
using Deontic logic.   Prediction is only partially an information handling
problem -- it is also a capture problem.   It is something that only highly
trained observers can get right part of the time, and by observations that
do not always become part of the clinical record.   If you want to know
more about what I'm talking about, try reading "Cutting for Stone" .   It
is written by one of the world's most distinguished diagnosticians, Abraham
Verghese who now teaches at Stanford University.

Peter

On Wed, May 8, 2019 at 2:10 PM Hari, Sekhar  wrote:

> Thanks Peter for your insights. Agree, this kind of predictions will need
> a couple of algorithms to be trained and work together to get to level of
> acceptable accuracy. I'm familiar with the RXNORM and SNOMED contents; but
> will dig deeper.
>
> Do you know if cTAKES can identify events such as "cardiac arrest",
> "diabetes" and "pre-term birth"? Likely these are mentioned with different
> text representations in the clinical notes.
>
> Thanks
> Sekhar Hari | AI Program Lead | Health Sciences R&D | Asia Pacific
> Solutions Delivery Center
> +91 814 7027 779 (C)
>
> -Original Message-
> From: Peter Abramowitsch 
> Sent: Wednesday, May 8, 2019 2:50 PM
> To: dev@ctakes.apache.org
> Subject: Re: Reading clinical notes for specific predictions
>
> Hi Sekhar
>
> The predictions item in your list of objectives is very tricky and cTakes,
> or indeed any software system will only get you part of the way there.  CDS
> (clinical decision support)  researchers have been on this path for many
> years and it is clear that even a hybrid human/computational system is
> limited in its accuracy & predictive ability.  And with medicine, a miss is
> as good as a mile - as the saying goes.
>
> As to your vocabularies question - if you don't already know the SNOMED
> clinical ontology, and RxNorm resources I suggest you have a look.  cTakes
> can fish out the appropriate CUIs and SNOMED term ids, and the ontologies
> will help  you draw the lateral links through common parents - or in your
> specific example, therapeutic classes.
>
> - Peter
>
> On Tue, May 7, 2019 at 6:47 PM Hari, Sekhar  wrote:
>
> > Hi there -
> >
> > I'm trying to predict a few things from clinical notes as follows:
> >
> >
> > 1.   Look at the notes and discharge summaries, and predict the
> > re-admissions data, cardiac arrests, diabetes, and pre-term birth.
> >
> > 2.   Understand the vocabulary of doctors and pharmacies. For
> example,
> > recognize that Tylenol and Acetaminophen refer to the same item. Have
> > a good understanding of body parts and diseases. The vocabulary is
> > domain-specific.
> >
> > 3.   The data is loaded from Cerner and EPIC.
> >
> > Can somebody help with suggestions on the list of pipelines that can
> > be used to achieve (1) and (2) above? Should I also develop a
> > machine-learning model along with cTAKES to get the desired results?
> >
> > Thanks
> > Sekhar Hari | AI Program Lead | Health Sciences R&D | Asia Pacific
> > Solutions Delivery Center
> > +91 814 7027 779 (C)
> >
> >
>


Re: acronyms/abbreviations [EXTERNAL]

2019-05-17 Thread Peter Abramowitsch
Seems like some kind of simple heuristic should work: isn't it just a
case of looking at the in/out text offsets of the source text for an
identified annotation and then comparing that with the canonical text of
the CUI or SNOMED ID?   If the source text is just a few characters (say,
fewer than 5) and the Levenshtein distance between it and the canonical
text is greater than the length of the source text, you're pretty sure to
have an acronym.

For instance, if cTakes finds "MI" and assigns SNOMED 22298006 or CUI
C0027051 with canonical text "Myocardial Infarction", then with the
in/out offsets into the text you should be able to run this heuristic.

The problem (and I see this in my work) is that many acronyms have multiple
meanings.  Thus, you may accurately be able to tell that your identified
concept came from an acronym, but it was the wrong concept!!
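A plain-Java sketch of that heuristic (the 5-character cutoff and the
length threshold are the illustrative values from the text, not tuned
ones):

```java
public class AcronymHeuristic {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Short covered span whose distance to the canonical text exceeds the span's own length. */
    static boolean looksLikeAcronym(String coveredText, String canonicalText) {
        return coveredText.length() < 5
                && levenshtein(coveredText.toLowerCase(), canonicalText.toLowerCase())
                   > coveredText.length();
    }

    public static void main(String[] args) {
        // "MI" vs. "Myocardial Infarction": short span, large distance -> acronym
        System.out.println(looksLikeAcronym("MI", "Myocardial Infarction")); // prints true
        // "pain" vs. "Pain": span matches the canonical text -> not an acronym
        System.out.println(looksLikeAcronym("pain", "Pain")); // prints false
    }
}
```

As noted above, this only tells you a span *was* an acronym; it cannot tell
you whether the concept assigned to it was the right expansion.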

Peter

On Thu, May 16, 2019 at 4:31 AM Greg Silverman  wrote:

> Got it!
>
> Yes, I understand the formidability, given the need for disambiguation,
> etc. Was just curious if this existed.
>
> Thanks!
>
>
> On Wed, May 15, 2019 at 9:11 PM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > Hi Greg,
> >
> > Ok, that gives me a great vector toward addressing your needs.
> >
> > I don't know of any ctakes components that indicate whether or not
> > discovered concepts come from acronyms, abbreviations or -replete- text
> > mentions.
> >
> > There should be something that does that.   Open source >  Any
> > champions available?
> >
> > Right now no abbreviation or metonym information is provided in the
> > standard components.If it can be extruded from source then it should
> be
> > provided.
> >
> > If anybody has such a component, please let us know !   This is a
> > formidable (imio) nlp problem, so call your kudos with a solution!
> >
> > Sean
> >
> > 
> > From: Greg Silverman 
> > Sent: Wednesday, May 15, 2019 9:21 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: acronyms/abbreviations [EXTERNAL]
> >
> > I'm just wondering how acronyms are identified as acronyms in cTAKES (for
> > example, in MetaMap, there is an attribute in the Document annotation
> with
> > ids of where they are in the Utterance annotation; and in BioMedICUS,
> there
> > is an acronym annotation type, etc.). From examining the XMI CAS, it is
> not
> > obvious.
> >
> > We're extracting the desired annotations from the XMI CAS using a custom
> > Groovy client.
> >
> > Thanks!
> >
> > On Wed, May 15, 2019 at 7:43 PM Finan, Sean <
> > sean.fi...@childrens.harvard.edu> wrote:
> >
> > > Hi Greg,
> > >
> > > What exactly do you need ?
> > >
> > > There are a lot of output components that can produce different formats
> > > containing various types of information.
> > >
> > > Do you prefer to parse ml ?  Or is columnized text output ok?  Does
> this
> > > go to a post-processing engine or a human user?
> > >
> > > Thanks,
> > >
> > > Sean
> > > 
> > > From: Greg Silverman 
> > > Sent: Wednesday, May 15, 2019 7:09 PM
> > > To: dev@ctakes.apache.org
> > > Subject: acronyms/abbreviations [EXTERNAL]
> > >
> > > How can I get these from the XMI annotations?
> > >
> > > Thanks!
> > >
> > > Greg--
> > >
> > > --
> > > Greg M. Silverman
> > > Senior Systems Developer
> > > NLP/IE <
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__healthinformatics.umn.edu_research_nlpie-2Dgroup&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=Fj9pHse59o_GfrCnR_sqZ7ibEmMju2GDRj6hmEg5s9U&s=taqRUWLVp4l5699x1GSXNfIK6WkZXiAgKnA3CPmlfWk&e=
> > > >
> > > University of Minnesota
> > > g...@umn.edu
> > >
> > >  ›  evaluate-it.org  ‹
> > >
> >
> >
> > --
> > Greg M. Silverman
> > Senior Systems Developer
> > NLP/IE <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__healthinformatics.umn.edu_research_nlpie-2Dgroup&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=DSQkibRULBYY2ijgCfGWGPmrKD7gdrLjBbvnTbXozsA&s=pTRmMExWf-ju3IjLOdTelulzu0JW399BumarcAx5tRw&e=
> > >
> > University of Minnesota
> > g...@umn.edu
> >
> >  ›  evaluate-it.org  ‹
> >
>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE 
> University of Minnesota
> g...@umn.edu
>
>  ›  evaluate-it.org  ‹
>


Re: acronyms/abbreviations [EXTERNAL]

2019-05-18 Thread Peter Abramowitsch
Greg,  Thanks for these links.  I really enjoy discussions of this kind and
am glad to see that someone is trying these knowledge-based approaches and
reporting back.  I've played with the Wordnet APIs and believe that it is
possible to use the hyper/hypo-nym constructs to help score different
interpretations of ambiguous terms.  Additionally, I think Ngram fitting
can be used to help rate the relevance of one definition over another.
But I'd bet that the effectiveness of these approaches is highly dependent
on grammatically complete and correct text.   Clinical notes are another
thing.

I had a perfect example of this problem the other day.   A note stating
something like "nursing care resumed after 12pm".  Ctakes had tagged this
with both lactation-related and nursing-service-related CUIs.  But the
patient was an elderly man.  Clearly the context was not to be found in the
grammar but in the clinical setting.   Thus there is a kind of meta context
(patient's age, gender, disease state) that could also contribute to
disambiguation.  This could be achieved by ML methods trained on marked-up
notes... very labor intensive, or by some kind of rules mechanism, but that
would also be labor intensive - a never-to-be-finished effort.  These might
require the creation of an instant/lightweight VMR to structure the
contextual elements from the note that the scoring mechanism would reason
over.   But I'd prefer a Campari and soda.
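That meta-context idea could be prototyped as a trivial post-lookup rule
pass. Everything below is hypothetical: the CUI keys and the
PatientContext fields are illustrative stand-ins, not cTAKES or UMLS
artifacts:

```python
from dataclasses import dataclass

@dataclass
class PatientContext:
    age: int
    sex: str  # "M" or "F"

# Hypothetical rule table: CUI -> predicate saying when it is implausible.
# A real system would need a curated knowledge source, not hand-written rules.
IMPLAUSIBLE = {
    "CUI_LACTATION": lambda ctx: ctx.sex == "M",
}

def filter_cuis(cuis: list, ctx: PatientContext) -> list:
    """Drop candidate CUIs that the rule table marks implausible."""
    return [c for c in cuis
            if not IMPLAUSIBLE.get(c, lambda _ctx: False)(ctx)]
```

For the elderly-man note above,
`filter_cuis(["CUI_LACTATION", "CUI_NURSING_SERVICE"], PatientContext(82, "M"))`
keeps only the nursing-service reading.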



On Sat, May 18, 2019 at 3:24 AM Greg Silverman  wrote:

> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111590/
>
> On Fri, May 17, 2019 at 8:23 PM Greg Silverman  wrote:
>
> > Yes, and regarding your last paragraph: This is where disambiguation
> comes
> > into play. Here is one method:
> >
> https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume23/montoyo05a-html/node9.html
> >
> > I'm not sure how either MetaMap or BioMedICUS do disambiguation, but
> since
> > are both open source, they would be potential resources..
> >
> > Greg--
> >
> > On Fri, May 17, 2019 at 2:17 AM Peter Abramowitsch <
> > pabramowit...@gmail.com> wrote:
> >
> >> Seems like some kind of simple heuristic should work:Isn't it just a
> >> case of looking at the in/out text offsets of the source text for an
> >> identified annotation and then comparing that with the canonical text of
> >> the CUI or SnomedID.   If the source text is just a few of characters
> (say
> >> less than 5) and the Levenstein difference between it and the canonical
> >> text is > than the length of the source text,  you're pretty sure to
> have
> >> an acronym.
> >>
> >> For instance if cTakes finds   "MI" and assigns SNOMED  22298006 or CUI
> >> C0027051 with canonical text "Myocardial Infarction"*, *then with the
> >> in/out offsets into the text you should be able to run this heuristic
> >>
> >> The problem (and I see this in my work) is that many acronyms have
> >> multiple
> >> meanings.  Thus, you may accurately be able to tell that your identified
> >> concept came from an acronym, but it was the wrong concept!!
> >>
> >> Peter
> >>
> >> On Thu, May 16, 2019 at 4:31 AM Greg Silverman  wrote:
> >>
> >> > Got it!
> >> >
> >> > Yes, I understand the formidability, given the need for
> disambiguation,
> >> > etc. Was just curious if this existed.
> >> >
> >> > Thanks!
> >> >
> >> >
> >> > On Wed, May 15, 2019 at 9:11 PM Finan, Sean <
> >> > sean.fi...@childrens.harvard.edu> wrote:
> >> >
> >> > > Hi Greg,
> >> > >
> >> > > Ok, that gives me a great vector toward addressing your needs.
> >> > >
> >> > > I don't know of any ctakes components that indicate whether or not
> >> > > discovered concepts come from acronyms, abbreviations or -replete-
> >> text
> >> > > mentions.
> >> > >
> >> > > There should be something that does that.   Open source >  Any
> >> > > champions available?
> >> > >
> >> > > Right now no abbreviation or metonym information is provided in the
> >> > > standard components.If it can be extruded from source then it
> >> should
> >> > be
> >> > > provided.
> >> > >
> >> > > If anybody has such a component, please let us know !   This is a
> >> > > formidable (imio) nlp problem, so call your kudos with a solution

Re: acronyms/abbreviations [EXTERNAL]

2019-05-19 Thread Peter Abramowitsch
OMG,  I hadn't even thought of "ephemeral vocabulary".  Great example!

Peter

On Sun, May 19, 2019 at 6:05 PM Greg Silverman  wrote:

> Peter,
> You'll like this example then from a manuscript we submitted to MedInfo:
> "It is important to point out that while some system annotation types
> scored really well using the geometric mean method to identify best-at-task
> annotation systems,  on examination, since our method was unable to provide
> lexical disambiguation of terms, there were some misclassifications. An
> example was for the entity Speed of Vehicle where the system cTAKES perform
> very well with the MedicationsMention annotation type. On further
> examination, the terms that provided a match were “speed” and “mph,” which
> have different contextual meanings from those having to do with physical
> measurement with respect to velocity.  In this case, “speed” and “mph” are
> common street drugs..."
>
> Greg--
>
>
> On Sat, May 18, 2019 at 3:12 AM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Greg,  Thanks for these links.  I really enjoy discussions of this kind
> and
> > am glad to see that someone is trying these knowledge based approaches
> and
> > reporting back.  I've played with the Wordnet APIs and believe that it is
> > possible to use the hyper/hypo-nym constructs to help score different
> > interpretations of ambiguous terms.  Additionally, I think Ngram fitting
> > can be used to help rate the relevance of one definition over another.
> > But I'd bet that the effectiveness these approaches is highly dependent
> on
> > grammatically complete and correct text.   Clinical notes are another
> > thing.
> >
> > I had a perfect example of this problem the other day.   A note stating
> > something like "nursing care resumed after 12pm".  Ctakes had tagged this
> > with both lactation-related and nursing-service-related CUIs.  But the
> > patient was an elderly man.  Clearly the context was not to be found in
> the
> > grammar but in the clinical settingThus there is a kind of meta
> context
> > (patient's age, gender, disease state) that could also contribute to
> > disambiguation.  This could be achieved by ML methods trained on marked
> up
> > notes... very labor intensive, or by some kind of rules mechanism, but
> that
> > would also be labor intensive - a never-to-be-finished effort.  These
> might
> > require the creation of an instant/lightweight VMR to structure the
> > contextual elements from the note that the scoring mechanism would reason
> > over.But I'd prefer a Campari and soda.
> >
> >
> >
> > On Sat, May 18, 2019 at 3:24 AM Greg Silverman  wrote:
> >
> > > https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111590/
> > >
> > > On Fri, May 17, 2019 at 8:23 PM Greg Silverman  wrote:
> > >
> > > > Yes, and regarding your last paragraph: This is where disambiguation
> > > comes
> > > > into play. Here is one method:
> > > >
> > >
> >
> https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume23/montoyo05a-html/node9.html
> > > >
> > > > I'm not sure how either MetaMap or BioMedICUS do disambiguation, but
> > > since
> > > > are both open source, they would be potential resources..
> > > >
> > > > Greg--
> > > >
> > > > On Fri, May 17, 2019 at 2:17 AM Peter Abramowitsch <
> > > > pabramowit...@gmail.com> wrote:
> > > >
> > > >> Seems like some kind of simple heuristic should work:Isn't it
> > just a
> > > >> case of looking at the in/out text offsets of the source text for an
> > > >> identified annotation and then comparing that with the canonical
> text
> > of
> > > >> the CUI or SnomedID.   If the source text is just a few of
> characters
> > > (say
> > > >> less than 5) and the Levenstein difference between it and the
> > canonical
> > > >> text is > than the length of the source text,  you're pretty sure to
> > > have
> > > >> an acronym.
> > > >>
> > > >> For instance if cTakes finds   "MI" and assigns SNOMED  22298006 or
> > CUI
> > > >> C0027051 with canonical text "Myocardial Infarction"*, *then with
> the
> > > >> in/out offsets into the text you should be able to run this
> heuristic
> > > >>
> > > >> The problem (and I see this in my work) is tha

Re: Drugs' Primary Compound ID

2019-05-20 Thread Peter Abramowitsch
I used to work for a division of Hearst that also owns the company First
Databank.  They have an electronic compendium of information about every
drug where you can find out its generic and proprietary forms, its primary
ingredient(s), its therapeutic class, forms, dosages, side effects, disease
indications, etc.  Much of this you can now get from RxNorm, I think.
The subscription fee for FDB is pretty high, but the information is very
well curated.

Peter

On Mon, May 20, 2019 at 5:02 PM Hari, Sekhar  wrote:

> Hi -
>
> My question is a little different, and I'm OK if there is a way to solve
> this puzzle either through cTAKES, OR, through UMLS lookups, OR, through
> lookups in other published databases. At this time, I really don't know if
> this can be solved through Machine Learning algorithms.
>
> Problem:
> I've been asked to find out if the following is possible:
> "Given a pharma regulatory document (say a searchable PDF document)
> related to drug(s), predict the corresponding 'Primary Compound ID'.
>
> The format of a primary compound ID could be - < name>>-<>-<>.
>
> To make the scenario easier, I'll consider the following case:
> Primary Compound ID: CNTO148.
> This is a deviation to the above format. If we split this ID, it would
> represent CNTO as the pharma company (Centocor Biotech, Inc). I don't know
> what the number 148 represent.
>
> However, CNTO148 is the pre-marketing name given during clinical trial
> phases. It's actual trademark is "SIMPONI" and the International
> Non-proprietary name (INN) is "Golimumab". The condition mentioned for this
> drug is 'Rheumatoid Arthritis'
>
> Question:
> Using cTAKES if I could identify the product as "SIMPONI" and the
> indication as 'Rheumatoid Arthritis', is there a way to identify or derive
> its 'Primary Compound ID' - in this case CNTO148 - (or sometimes called as
> 'Controlling Product') through some mechanism?
>
> My analysis:
> If I query the ClinicalTrials.gov data using the drug name, I'm able to
> find the corresponding 'Primary Compound ID' that was used during clinical
> study. But this ID is not available for all drug products from
> ClinicalTrials.gov database. I'm looking at a consistent way to derive the
> 'Primary Compound ID' if these IDs are registered anywhere.
>
> Other questions:
> What meaning does the abbreviations used in 'Primary Compound ID' contain
> (three or two letters abbreviation in the format defined above)?
> Some example abbreviations (there are many more):
>
> * AAB
>
> * AC
>
> * AN
>
> * AAA
>
> * AAC
>
> * AMK
>
> * ZBR
>
> * AER
>
> * AEN
>
> Is there a vocabulary where these are listed that I could study?
>
> Thanks
> Sekhar Hari | AI Program Lead | Health Sciences R&D | Asia Pacific
> Solutions Delivery Center
> +91 814 7027 779 (C)
>


Re: Accessing the External Resource from the UimaContext without Using XML descriptor [EXTERNAL]

2019-06-29 Thread Peter Abramowitsch
I've been wondering whether Levenshtein distance or Soundex have any
potential in the cTakes pipeline. For example, if, after failing the
dictionary lookup, one used something like CSpell to find a potential
concept, but then used one of these linguistic similarity methods to
quantify the difference between it and the source over the text range and
turn that into a confidence value, would it help mitigate overfitting?  I
guess the answer would depend on how often radically different concepts can
differ by a single character.  Another factor, as was hinted at above, is
that spelling issues in consumer-provided text are completely different in
character from those of the rushed clinician, and these may require
completely different solutions.
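As a concrete reference point, the classic American Soundex code mentioned
above is only a few lines, and two plausible misspellings of the same
clinical term collapse to one code (a generic sketch, not a cTAKES
component):

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits.
    H and W are transparent (they do not reset the previous code);
    vowels reset it, so the same code across a vowel counts twice."""
    codes = {c: str(d) for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    first = word[0]
    digits = []
    prev = codes.get(first)
    for c in word[1:]:
        if c in "HW":
            continue            # transparent: keep previous code
        code = codes.get(c)
        if code is None:
            prev = None         # vowel: reset
            continue
        if code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]
```

For example, `soundex("aneurism") == soundex("aneurysm")` (both "A562"),
so a Soundex match could feed a confidence score even when the exact
dictionary lookup fails.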

On Fri, Jun 28, 2019 at 6:34 AM Remy Sanouillet 
wrote:

> Hi Siamak,
>
> I agree with Sean. Spelling correction in NLP is a bit of a tar baby. We
> attempted to integrate CSpell (
> https://lsg3.nlm.nih.gov/Specialist/Summary/cSpell.html) to improve
> recall.
> Unfortunately we had to take if out because the overfitting affected
> precision and increased ambiguity too much.
>
>Remy
>
> On Fri, Jun 28, 2019 at 5:20 AM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > Hi Siamak,
> >
> > The problem of misspelled terms is a big one.  I have read about
> > approaches taken by others for research, but nothing has been implemented
> > for ctakes.
> >
> > The only thing that has been done on my projects is addition to the
> > dictionary of common misspellings for a directed project.  For instance,
> in
> > a project specifically addressing brain aneurysms I added to the
> (project)
> > dictionary misspellings like "aneurism", "anurism" and "anurysm".  I
> didn't
> > worry about misspellings for terms that didn't apply to the project; I
> > didn't bother adding things like "skelatal" for "skeletal" because I
> didn't
> > really care if that term was missed.
> >
> > Sean
> > 
> > From: Siamak Barzegar 
> > Sent: Friday, June 28, 2019 6:12 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Accessing the External Resource from the UimaContext without
> > Using XML descriptor [EXTERNAL]
> >
> > Dear Sean,
> >
> > Thank you very much for your help.
> > As you suggested, I use "BsvRareWordDictionary" and create a BSV file for
> > my small lexicon.
> > I am using it in the Spanish medical documents. As you know medical
> > documents have a lot of typos.  I was wondering to know is there any
> > dictionary lookup in cTAKES or another component from other projects that
> > can detect these small typos?
> > for example, if we have this work in dictionary file:
> > C001|T01|Fumador 2 paq*ue*tes
> >
> > And in the document, we have "fumador 2 paq*eu*tes". Is there any way to
> be
> > able to annotate this typo word as well?
> >
> > With Best Wishes,
> > Siamak
> >
> >
> >
> > On Tue, 25 Jun 2019 at 18:38, Finan, Sean <
> > sean.fi...@childrens.harvard.edu>
> > wrote:
> >
> > > Ah.
> > >
> > > You are trying to use an old annotator.  It was never updated to be a
> > > uimafit component and I think that it may not work with the
> > PipelineBuilder.
> > > Newer annotators have (for the most part) simpler interfaces and do not
> > > require explicit specification of resources, resource types, etc.
> > >
> > > You have several options (worst to best):
> > > 1.  Don't use PipelineBuilder
> > > 2.  Wrap the older annotator in a uimafit-compatible component
> > > 3.  Make a method that generates a description:
> > >  UmlsDictionaryLookupAnnotator does this in a method named
> > > createAnnotatorDescription()
> > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__svn.apache.org_repos_asf_ctakes_trunk_ctakes-2Ddictionary-2Dlookup_src_main_java_org_apache_ctakes_dictionary_lookup_ae_UmlsDictionaryLookupAnnotator.java&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aNXh5Gc3ezd0x905RnW8e_Qa2SPMb_NqsaOGDBxoOh8&s=2RzyJ7sX-k2SpTfrXvoZLi3rJwdUer1mNva_-a78bGc&e=
> > > -- Create the description and use the PIpelineBuilder
> addDescription(..)
> > > method.
> > > 4.  Use the newer fast dictionary instead of the old one.
> > > -- The basic equivalent of the old *CSV annotator is
> > > BsvRareWordDictionary.  It takes a single parameter "bsvPath".  Instead
> > of
> > > comma-separated values it wants Bar-separated values in the format
> > > Cui|Synonym or Cui|Tui|Synonym
> > > -- One misconception that people seem to have is that the "fast"
> > > dictionary is faster but less accurate.  Actually, it is faster and
> more
> > > accurate.  Speed was the greater difference and that name stuck.
> > >
> > > There may be other solutions, but those are what come to mind right
> now.
> > >
> > > Sean
> > > 
> > > From: Siamak Barzegar 
> > > Sent: Tuesday, June 25, 2019 11:46 AM
> > > To: dev@ctakes.apache.org
> > > Subject: Re: Accessing the External R

Re: Averbis released type systems as open source

2019-07-08 Thread Peter Abramowitsch
Thank you for this news!
Will annotators using these type systems also be open sourced?  And if so,
for which languages?

Peter

On Mon, Jul 8, 2019 at 8:25 AM Peter Klügl  wrote:

> Hi,
>
>
> Averbis released a selection of type systems as open source (Apache
> License 2.0) which also cover many use cases in clinical and medical
> information extraction.
>
>
> I know that cTAKES has of course well-established type systems, but a
> comparison of different approaches to model information can lead to
> fruitful discussions IMHO. Feel free to contact me :-) (or talk to me
> @ACL2019_Italy)
>
>
>
> https://github.com/averbis/core-typesystems
>
> Types for linguistic preprocessing, generic types like 'Concept' as well
> as domain-independent entities like 'Date' or 'Measurement'.
>
>
> https://github.com/averbis/health-typesystems
>
> Types for specific medical entities and relationships like 'Diagnosis',
> 'Medication' and also for more specialized types like 'HLA' or
> 'VisualAcuity'.
>
>
> https://github.com/averbis/pharma-typesystems
>
> Types mainly for IDMP.
>
>
> Artifacts with descriptions, types.txt, JCas cover classes and some
> generated asciidoc are available at maven central.
>
>
> Best,
>
>
> Peter
>
> --
> Dr. Peter Klügl
> R&D Text Mining/Machine Learning
>
> Averbis GmbH
> Salzstr. 15
> 79098 Freiburg
> Germany
>
> Fon: +49 761 708 394 0
> Fax: +49 761 708 394 10
> Email: peter.klu...@averbis.com
> Web: https://averbis.com
>
> Headquarters: Freiburg im Breisgau
> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>
>


Re: Struggling initializing [EXTERNAL]

2019-08-12 Thread Peter Abramowitsch
Hi Sebastien

Ctakes is a large package with many moving parts and a long history.  It
would be impossible to answer your questions in a single email and it has
been packaged in many different ways by various users.

If you haven't done it already, I suggest going back to basics:
downloading and configuring just the stable binary version exactly as
documented, and running the simplest possible configuration, just to get off
the ground.

Make sure your UMLS user credentials are present in your environment.

Copy
./resources/org/apache/ctakes/clinical/pipeline/DefaultFastPipeline.piper
into a work area.
Create infiles and outfiles sub-folders in that work area.

Put a single note into the infiles folder.

Source the ./bin/runPiperCreator script, and have it read in that copy of
DefaultFastPipeline.piper.   Become familiar with that.

Insert a reader entry at the top - a simple folder reader.   You can
copy/paste this if you want - and modify the input path:


reader org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader
InputDirectory=

and add a simple writer at the bottom - an XMI writer.  You can copy/paste
this if you want - and modify the output path:

add org.apache.ctakes.core.cc.XmiWriterCasConsumerCtakes
OutputDirectory=
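Assembled, the modified copy of DefaultFastPipeline.piper then looks
roughly like this (the paths are placeholders, and the middle lines are
whatever the original file already contained):

```
reader org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader InputDirectory=/path/to/infiles

//  ... original contents of DefaultFastPipeline.piper ...

add org.apache.ctakes.core.cc.XmiWriterCasConsumerCtakes OutputDirectory=/path/to/outfiles
```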

Now just validate it from within the piperCreator application using the
YELLOW button at upper right.  When valid, you may get a green button to run
it, but if not:

(I find that sometimes the run button remains disabled even though the
piper is valid)

source ./bin/runPiperFile.sh -p 

If it worked, you will see an XMI file in the output file folder.  This is
not the only way to visualize the output, but it is a simple, tested
mechanism.

If it finishes with a non-zero length XMI you have a basis for learning the
system and later on, thinking about expanding your scope.  If not, there's
still something amiss with your configuration.   Next time, send us the
entire error trace, not just the line at which the exception message
appears.

Everything is customizable, including how the piper can be invoked,  but
before you start, get the simple case running first.

- Peter

On Mon, Aug 12, 2019 at 3:43 PM Sébastien Boussard  wrote:

> I would also like to add, I do have the dictionaries.
>
> On Mon, Aug 12, 2019 at 3:39 PM Sébastien Boussard 
> wrote:
>
> > Thank you, everyone, for looking at this,
> > My project is to understand how to use ctakes well and make it flexible
> > for it to be used for everyone else in our lab when I leave. The first
> > thing I wanted to do was just be able to get inputs and outputs. As I
> > understand it more, I want to be able to transform it after. What are the
> > full capabilities of piper files? When would it be advantageous to just
> use
> > that over what I was doing before?
> >
> > Thank you,
> > Sebastien Boussard
> >
> > On Mon, Aug 12, 2019 at 8:21 AM Finan, Sean <
> > sean.fi...@childrens.harvard.edu> wrote:
> >
> >> Hi all,
> >>
> >> I think that there are a lot of things going on here.
> >>
> >> Jeff's question is on point - do you actually have the dictionary?
> >>
> >> I think that doing all of this with code is unnecessary.
> >> - I don't see anything in the code that cannot be done in a piper file.
> >> - Piper files can set the collection reader.   Use the "reader" command.
> >> For your use, that would be " reader LinesFromFileCollectionReader
> >> InputFileName= "
> >> - Piper files can load other piper files.  Use the "load" command.
> >> For your use, that would be " load DefaultFastPipeline "
> >>   https://cwiki.apache.org/confluence/display/CTAKES/Piper+Files
> >> So, instead of writing and debugging code, you can create a 2 line piper
> >> file and just run it using
> org.apache.ctakes.core.pipeline.PiperFileRunner
> >>
> >> "  PiperFileRunner
> >>  -p 
> >> -o 
> >> --user 
> >> --pass  "
> >>
> >> Or if you really want to run the piper from code then you can do so, but
> >> I would rely more upon the piper such as in the examples code
> >> HelloWorldPiperRunner.java
> >>
> >> I would just use a piper file.  If you want to get fancy, then instead
> of
> >> explicitly specifying the InputFileName in the piper, use the "cli"
> command
> >> in the piper.
> >> " cli InputFileName=in "
> >> Then you can remove the specification from the piper command ( simplify
> >> it to " reader LinesFromFileCollectionReader " )
> >> and your PiperFileRunner would be the same as above but with "--in
> >>  " added.
> >> Then you can change the input using the command line instead of
> >> constantly editing the piper.
> >>
> >> Besides the obvious simplicity for the user of only using a piper file,
> >> it should be easier for others to assist with problems as they do not
> need
> >> to go through your code.
> >>
> >> I have to ask why you are using LinesFromCollectionReader ?  It treats
> >> each line like a different document.
> >> Your first attempt points to "right_knee_arthroscopy" in the example
> >> notes.
> >> This would give you two outpu

Re: Relating MeasurementAnnotations to other IdentifiedAnnotations

2019-08-20 Thread Peter Abramowitsch
Hi Jeff

I've experimented with three approaches.

One is the LabValueFinder which is included in the cTakes release - it
looks specifically for values associated with LabMentions.  It also
has an "eager" mode where it converts some MedicationMentions into
LabMentions when the context seems right - O2, Sodium, etc.   I can't say
it works all that well, and it is not capable of handling many different
semantic forms of the name/value association.  It is also too eager,
sometimes creating LabMentions out of Medications when it shouldn't.

Another approach was to use something like Stanford's TokensRegex, which
allows you to construct regex-like rules where the segments are not strings
but tokens whose attributes, like POS and NER tags, you can query.   For
cTakes I had to adapt a UIMA package that must have been someone's thesis
project from the University of Nantes:

Copyright 2015 - CNRS (Centre National de Recherche Scientifique)
package fr.univnantes.lina.uima.tkregex

What I have is not ready for prime time and is still very rough.  It works
well, but only for a limited set of rules.

I used it to create a vitals detector.  Here's a snippet of the rules that
this package loads in at runtime, which creates an annotation called WGT
given these matchers:
matcher NUM: [ postag == "CD" ];
matcher BE: [ lemma == "be" | lemma == "at"];
matcher WT: /(?i)^wt|^weight/;
matcher WUOM: /(?i)^kg|^lb|^pounds/;
term "WGT": WT BE? SYM? NUM WUOM;

The last approach was a home-built mechanism using the ConllDependencyNode
collection and the RelationArguments to detect the same connection between
certain typed pairs of IdentifiedAnnotations.

The problem is, I've always been in prototyping mode and never had time to
push these methods to production-ready status.
Peter
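As a crude baseline next to the three approaches above, pairing each
measurement with the nearest entity that follows it can be done on offsets
alone; plain tuples stand in for cTAKES annotations in this sketch:

```python
def attach_measurements(measurements, entities):
    """Link each (begin, end, text) measurement to the nearest
    (begin, end, text) entity starting at or after the measurement's end.
    Returns (measurement_text, entity_text) pairs."""
    links = []
    for m_begin, m_end, m_text in measurements:
        following = [e for e in entities if e[0] >= m_end]
        if following:
            nearest = min(following, key=lambda e: e[0] - m_end)
            links.append((m_text, nearest[2]))
    return links
```

For "2 mm incision", `attach_measurements([(0, 4, "2 mm")], [(5, 13, "incision")])`
yields `[("2 mm", "incision")]` - a long way from real dependency parsing,
but enough to test the idea.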

On Tue, Aug 20, 2019 at 1:15 PM Jeffrey Miller  wrote:

> Hi,
>
> Is there any configuration or component in cTAKES that can be used to
> attribute a measurement annotation to another annotation that it applies
> to? For example, for "2 mm incision" where we relate "2 mm" to "incision"?
> It looks like there might be a roundabout way to find the head of the span
> of the MeasurementAnnotation in the output of the dependency parser, but I
> was wondering if this has been explored before? Perhaps the
> RelationExtractor component?
>
> I also have another more general question if anyone can help- how does the
> structure of the cTAKES type system effect how cTAKES works? I am looking
> for a general intuition of how the structure of the typesystem drives the
> larger cTAKES architecture?
>
> Thanks!
> Jeff
>


Re: PiperFileRunner Error [EXTERNAL]

2019-10-11 Thread Peter Abramowitsch
It runs under Java 9, and possibly 11, if built with Java 8.  With a bunch of
hasty changes I got it to compile and run under Java 9, but it wasn't worth the
effort.  With Java 11, bigger changes were introduced in the packaging of
modules, which are no longer completely synonymous with jar files.  It would
take some work to create a release-generating codebase in Java 11.

Sent from my iPad

> On Oct 11, 2019, at 11:07, Finan, Sean  
> wrote:
> 
> Hi Carolina,
> 
> ctakes is written for java 8.  I don't know if anybody has tested any higher 
> version.  It looks like you are compiling with java version 11.
> 
> Are you using Apache maven to build ctakes?  ctakes is modular and uses 
> https://maven.apache.org/.
> 
> The most complete instructions of which I am aware are here:
> 
> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+Developer+Install+Guide#cTAKES4.0DeveloperInstallGuide-Build
> 
> Sean
> 
> 
> From: Cervantes, Carolina 
> Sent: Friday, October 11, 2019 1:55 PM
> To: dev@ctakes.apache.org
> Subject: PiperFileRunner Error [EXTERNAL]
> 
> Hello,
> 
> I am trying to run the HelloWorld piper file and when I try to do it via 
> terminal I have been getting this error:
> 
> Error: Could not find or load main class 
> org.apache.ctakes.core.pipeline.PiperFileRunner
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.ctakes.core.pipeline.PiperFileRunner
> 
> When I try to run it through IntelliJ I get this error:
> 
> Information:java: Errors occurred while compiling module 'ctakes-utils'
> Information:javac 11.0.2 was used to compile java sources
> Information:10/11/2019 11:53 AM - Compilation completed with 2 errors and 0 
> warnings in 8 s 947 ms
> C:/Users/Zach Barrett/Documents/Computer Science/Junior/Comp 398/ctakes/ctakes-utils/src/main/java/org/apache/ctakes/utils/env/EnvironmentVariable.java
>Error:Error:line (21)java: package jdk.nashorn.internal.ir.annotations 
> does not exist
>Error:Error:line (24)java: cannot find symbol
>  symbol: class Immutable
> 
> Any help would be appreciated!


Re: cTAKES handling HL7 CDA Level 1 [EXTERNAL] [SUSPICIOUS]

2019-12-18 Thread Peter Abramowitsch
The problem could be a non-UTF-8 BOM character as the first character in the
file.  Try opening the XML file in a Unicode-agnostic editor that allows
for different encodings, and then re-write it in US-ASCII.

https://en.wikipedia.org/wiki/Byte_order_mark

Peter
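Checking for (and stripping) a BOM is easy to script; a minimal sketch in
Python, independent of cTAKES:

```python
import codecs

# BOMs that commonly trip up strict XML parsers ("Content is not
# allowed in prolog" is the typical xerces symptom).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def strip_bom(raw: bytes) -> bytes:
    """Return the bytes with any leading byte-order mark removed."""
    for bom in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):]
    return raw
```

Reading a suspect file with `strip_bom(open(path, "rb").read())` before
handing it to the parser sidesteps the prolog error.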

On Wed, Dec 18, 2019 at 11:31 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Sorry - I missed this:
> > I'm using the two CDA files that come with the cTAKES package
> (testpatient_cn_2.xml and testpatient_cn_1.xml compatible with
> NotesIIST_RTF.DTD
>
> Those files -should- be ok as they were originally used to test the CDA
> workflow.
>
> The code for CdaCasInitializer and ClinicalNotePreProcessor hasn't changed
> since 2015.
>
> The actual error is coming from the 3rd party xml parser (xerces):
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1;
> Content is not allowed in prolog.
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown
> Source)
>
> I am not sure what would be causing this.
>
> I don't run CDA, so I can't speak to the operational status of those
> components or the pipeline in general.
>
> Does anybody else out there use CDA?
>
> Sean
>
>
> 
> From: Finan, Sean 
> Sent: Wednesday, December 18, 2019 2:22 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTAKES handling HL7 CDA Level 1 [EXTERNAL] [SUSPICIOUS]
>
> * External Email - Caution *
>
>
> Hi Masoud,
>
> I am not an xml expert, so take this with a grain of salt.
>
> I think that something is wrong/unmatched with the first line of your xml
> document.
> Make sure that the first line is something like:
> <?xml version="1.0" encoding="UTF-8"?>
>
> Sean
>
> 
> From: Masoud Rouhizadeh 
> Sent: Wednesday, December 18, 2019 1:47 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTAKES handling HL7 CDA Level 1 [EXTERNAL]
>
> * External Email - Caution *
>
>
> Hi all,
>
> I'm using the cTAKES *User* installation to process CDA documents via
> AggregateCdaProcessor.xml and AggregateCdaUMLSProcessor.xml, located in
> /desc/ctakes-clinical-pipeline/desc/analysis_engine/
>
> My script to call this is
>
> java -Dctakes.umlsuser= -Dctakes.umlspw= -cp
> $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/
> -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms2g -Xmx3g
> org.apache.ctakes.core.cpe.CmdLineCpeRunner
> $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/test_cda_masoud.xml
>
> test_cda_masoud.xml has a proper path to CDA input and output. I'm using
> the two CDA files that come with the cTAKES package (testpatient_cn_2.xml
> and testpatient_cn_1.xml compatible with NotesIIST_RTF.DTD).
>
> Unfortunately, it seems that CdaCasInitializer cannot run, and I get the
> attached errors. I get the same errors when using the GUI with
> AggregateCdaProcessor AE
>
> - Am I missing something obvious?
> - Does cTAKES *User* installation handle CDA documents?
> - Is org.apache.ctakes.core.cpe.CmdLineCpeRunner an appropriate pipeline
> for CdaCasInitializer?
>
> Thank you so much for your help in advance.
>
> Masoud
>
>
>
>
>
>
>
> On 11/8/19, 8:30 AM, "Finan, Sean" 
> wrote:
>
>
> Hi Masoud,
>
> I think that the CdaCasInitializer is at least 10 years old.  I would
> not expect it to conform to any recent standards.
>
> Does anybody else have a reader or transformer that can handle HL7 CDA
> r2?
>
> Sean
>
> p.s.
> If anybody is involved with HL7 International, you may want to get
> some movement on addressing the typo on the page header(s):
>
> Section 1a: Clinical Document Architcture (CDA®)
>
> 
> From: Masoud Rouhizadeh 
> Sent: Thursday, November 7, 2019 5:59 PM
> To: dev@ctakes.apache.org
> Subject: cTAKES handling HL7 CDA Level 1 [EXTERNAL]
>
> Dear cTAKES developer mailing list,
>
> We have been working on a project at Hopkins for converting
> Epic-generated RTF notes into Clinical Document Architecture Level One.
>
> We have been using HL7 CDA® Release 2 Schema, and now we plan to use
> cTAKES for concept extraction from those documents. The CDA Schema and
> examples can be found here
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.hl7.org_implement_standards_product-5Fbrief.cfm-3Fproduct-5Fid-3D7&d=DwIGaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=h8q4BiKKL6eDBOGEta7gcpkDGIx5xFPlGrNfUPlzBuc&s=l8HjgDHeywmdkSUkOJBGWNLpJ-bPlw7Lmgzh02w8k2s&e=
>
> In the cTAKES documentation, I see that CdaCasInitializer "does not
> handle all CDA documents. The CDA document must conform to the DTD
> resources/cda/NotesIIST_RTF.DTD."
>
> Has anyone tested and evaluated cTAKES ability to consume HL7 CDA
> Level 1 Release 2 documents?
>
> Thank you,
> Masoud
>
> 
> Masoud Rouhizadeh, PhD
> Faculty - Division of Health Science Informatics (DHSI)
> NLP Lead - Institute for Clinical and Translational Resea

Re: using UMLS Metathesaurus in cTAKES offline

2020-05-23 Thread Peter Abramowitsch
Having the data is not synonymous with umls authentication. Out of the box,
you do need internet connectivity for the authentication to take place.
 It will happen once during startup. That will be sufficient for as long as
the instance is running.

The authentication is really meant to be umls' way to measure usage as much
as it is a permission scheme.

There is a mechanism for the authentication to be proxied through a
different url, which can be built upon to create something like you're
thinking of.  I've used that, but for a different purpose.

But in these days of ever diminishing government, it's valuable for the nlm
to have those authentication hits.

Peter

On Sat, May 23, 2020, 6:30 PM Akram  wrote:

> I want to use cTAKES offline
>
> I am using command line
>
> run\runClinicalPipeline  -i E:\cTAKES\files\MedReps\Input  --xmiOut
> E:\cTAKES\files\MedReps\Output  --user myuser  --pass mypassword
>
> This is my piper file
>
> load DefaultTokenizerPipeline.piper
>
> add DefaultJCasTermAnnotator
>
> load AttributeCleartkSubPipe.piper
>
> writeXmis
>
>
> I thought UMLS would be accessed once and all needed files downloaded, so
> the next time it would not need the internet to access **UMLS**
>
> but I was wrong.
>
> When I work offline, cTAKES fails in its attempt to access UMLS and
> gives an error.
>
> I found that UMLS offers to download its data, so I did
>
> I downloaded **umls-2020AA-full.zip**
>
> I extracted Metathesaurus using MetamorphoSys and added it to
>
> E:\cTAKES\resources\org\apache\ctakes\dictionary\lookup\umls2020aa
>
> It is a huge folder 30GB+ full of .RRF files but did not work
>
> Not sure where the problem is
>
> do I have to change pipers?
>
> do I have to change the command?
>
> do I have to change files in the folder umls2020aa?
>
>
> How to fix these problems to use cTAKES offline?
>
>


Re: using UMLS Metathesaurus in cTAKES offline [EXTERNAL]

2020-05-25 Thread Peter Abramowitsch
Hi Akram

It's not a matter of instructions.  It's not been made easy because
obviously we don't want people to do it.  It's a matter of writing code
that serves as an intermediary between ctakes and the UMLS authentication
service or a simulated version thereof.  As an example, I have a use case
where the instance of ctakes is buried deep in a PHI-protected environment
where we have forbidden web services to connect directly with the outside
world.  So I've written a proxy mechanism that stands half in the protected
perimeter and half out.  But we're still authenticating.

Perhaps you can figure out a strategy for authenticating and creating a
time-limited token that can be taken offline.   The hook to do something
like that is in the System property -Dctakes.umlsaddr
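As a very rough illustration of that intermediary idea (this is not Peter's actual code, and it assumes, hypothetically, that the address supplied via -Dctakes.umlsaddr is an HTTP endpoint your own service can answer), a toy stand-in might look like:

```python
# Toy sketch of an authentication intermediary: a tiny HTTP service that an
# offline-facing ctakes instance could hypothetically be pointed at.
# Everything here (port, reply body, validation logic) is illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer

class AuthProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real intermediary would validate or forward credentials here,
        # e.g. check a cached, time-limited token before answering.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"true")

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve(port: int = 8080) -> HTTPServer:
    """Build (but do not start) the toy auth endpoint on localhost."""
    return HTTPServer(("127.0.0.1", port), AuthProxyHandler)
```

The real work, of course, is in what the handler does: forwarding to the genuine UMLS service from inside a protected perimeter, or minting a time-limited token, as the text above suggests.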

Peter



On Mon, May 25, 2020 at 10:56 AM Akram  wrote:

>  Thanks Sean & Peter
> I totally appreciate what NLM and cTAKES are doing.
> you said "It is possible to use ctakes and the UMLS dictionary completely
> offline"
> How can I use them offline? Is there any document that shows how? If not,
> could you please help me with instructions.
> Best Regards
>
> On Monday, 25 May 2020, 11:26:04 pm AEST, Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
>  Peter is absolutely correct.
>
> It is possible to use ctakes and the UMLS dictionary completely offline,
> but it isn't recommended for regular use.  If you have any way to connect
> to the internet please use the standard methods.
>
> Many years ago the initial creators of ctakes negotiated with the NLM to
> enable the unique manner in which ctakes uses the UMLS.  At the time it had
> never been done and the UMLS could not be redistributed by outside
> agencies.  A legal (and amicable) partnership between the NLM and ctakes is
> absolutely necessary, and upholding our side of the agreement is how we
> make that happen.
>
> NLM maintains the umls and without proof of importance this maintenance
> would cease.
>
> NLM grants are one mechanism by which ctakes development gets funding, so
> it really is important that they know who is using the UMLS and how
> frequently.  There is no detriment to providing them this information.  You
> will never be charged for use, no matter how heavy it may be.
>
> The NLM sends annual requests for users to complete a survey.  It is
> extremely important that you complete the survey and indicate that you use
> the UMLS for NLP and ctakes.  The NLM and other agencies fund projects in
> part upon user-base.  The larger the user base of ctakes, the greater the
> chance of funding for its development.  Any funding in development turns
> into more accurate annotation engines, more capabilities and simpler usage
> for everybody.
>
> Of course, private funding would also help ...
>
> Sean
>
>
> 
> From: Peter Abramowitsch 
> Sent: Sunday, May 24, 2020 2:49 AM
> To: dev@ctakes.apache.org; Akram
> Subject: Re: using UMLS Metathesaurus in cTAKES offline [EXTERNAL]
>
> * External Email - Caution *
>
>
> Having the data is not synonymous with umls authentication. Out of the box,
> you do need internet connectivity for the authentication to take place.
>  It will happen once during startup. That will be sufficient for as long as
> the instance is running.
>
> The authentication is really meant to be umls' way to measure usage as much
> as it is a permission scheme.
>
> There is a mechanism for the authentication to be proxied through a
> different url which can be built onto to create something like you're
> thinking of. I've used that, but for a different purpose.
>
> But in these days of ever diminishing government, it's valuable for the nlm
> to have those authentication hits.
>
> Peter
>
> On Sat, May 23, 2020, 6:30 PM Akram  wrote:
>
> > I want to use cTAKES offline
> >
> > I am using command line
> >
> >run\runClinicalPipeline  -i E:\cTAKES\files\MedReps\Input  --xmiOut
> > E:\cTAKES\files\MedReps\Output  --user myuser  --pass mypassword
> >
> > This is my piper file
> >
> > load DefaultTokenizerPipeline.piper
> >
> > add DefaultJCasTermAnnotator
> >
> > load AttributeCleartkSubPipe.piper
> >
> > writeXmis
> >
> >
> > I thought UMLS will be accessed once and download all needed files so the
> > next time it does not need the internet to access **UMLS**
> >
> > but I was wrong.
> >
> > When I work offline cTAKES does not work in attempt to access UMLS and
> > gives error.
> >
> > I found that UMLS offers to download its data, so I did
> >

Re: Building a new custom dictionary or Updating/Adding values to the existing dictionary in cTAKES

2020-05-29 Thread Peter Abramowitsch
I'm using the UMLS fast dictionary out of the box and mammography certainly
appears:

{
  "_type": "UmlsConcept",
  "codingScheme": "SNOMEDCT_US",
  "code": "71651007",
  "score": 0.0,
  "disambiguated": false,
  "cui": "C0024671",
  "tui": "T060",
  "preferredText": "Mammography"
},

The problem with pap smear is not that a concept isn't found, but that PAP
is also an acronym for something else: Prostatic acid phosphatase
{
  "_type": "UmlsConcept",
  "codingScheme": "SNOMEDCT_US",
  "code": "59518007",
  "score": 0.0,
  "disambiguated": false,
  "cui": "C0523444",
  "tui": "T059",
  "preferredText": "Prostatic acid phosphatase measurement"
}

Oddly enough I can't get it to recognize any of its forms except for
"cervical smear test"





On Fri, May 29, 2020 at 8:54 AM Remy Sanouillet 
wrote:

> Hello Abad,
>
> The short answer is, yes, the sno_rx_16ab can be "hacked". A couple of
> caveats are that any mistake can stop all recognition and you will lose all
> your mods on updates. So an additional dictionary is a recommended approach.
>
> There are two cases. EIther the CUI you are adding already exists and you
> are just adding a synonym. In that case, you only need to add one line:
>
>> INSERT INTO CUI_TERMS VALUES(CUI,RINDEX,TCOUNT,TEXT,RWORD)
>
> where:
>
>- CUI is the cui, nuf'said
>- TEXT is the tokenized lowercase string for the entry. In your case
>'pap smear'. Most punctuation is a separate token. Single quotes are
>escaped by doubling them
>- RWORD is the one token in TEXT that is the most indicative (least
>common) which will be used as the index in the lookup. In your case
>probably 'pap' since it is not as common as 'smear'
>- RINDEX is the index of RWORD in TEXT. First token is 0 which is the
>case for 'pap'
>- TCOUNT is the token count for TEXT. In your case, 2
>
> So you would want to add:
>
>> INSERT INTO CUI_TERMS VALUES(200845,0,2,'pap smear','pap')
>>
>
>  If the entry is a non-existing one, you will need to add a few more
> lines. Their positions are unimportant as long as they are below the header
> lines (below the final "SET SCHEMA PUBLIC" line).
>
>1. INSERT INTO TUI VALUES(CUI,TUI)
>One line for each TUI in the taxonomy
>2. INSERT INTO SNOMEDCT_US VALUES(CUI,SNOMED)
>assuming you are adding a SNOMED
>3. INSERT INTO PREFTERM VALUES(CUI,PREFTERM)
>where PREFTERM is the pretty string to describe the entry. It need not
>correspond to any indexed entry. It is used for display once the lookup has
>been successful.
>
> That's it. Use at your own discretion. No guarantees.
>
>
> *Rémy Sanouillet*
> NLP Engineer
> re...@foreseemed.com 
>
>
> [image: cid:347EAEF1-26E8-42CB-BAE3-6CB228301B15]
> ForeSee Medical, Inc.
> 12555 High Bluff Drive, Suite 100
> San Diego, CA 92130
>
> NOTICE: This e-mail message and all attachments transmitted with it are
> intended solely for the use of the addressee and may contain legally
> privileged and confidential information. If the reader of this message is
> not the intended recipient, or an employee or agent responsible for
> delivering this message to the intended recipient, you are hereby notified
> that any dissemination, distribution, copying, or other use of this message
> or its attachments is strictly prohibited. If you have received this
> message in error, please notify the sender immediately by replying to this
> message and please delete it from your computer.
>
>
> On Fri, May 29, 2020 at 7:34 AM  wrote:
>
>> Hi Team,
>>
>>
>>
>> We set up cTAKES4.0.0 as our NLP engine for our profile recently . We
>> have faced situations where some of the expected tokens are not picked up
>> by cTAKES during clinical text extraction. So our first thought process was
>> to identify where the dictionary is configured and how that can be updated.
>> After some code analysis  it was found that the dictionary is configured in
>> the  below path under ctakes/resources for sources RxNorm and SNOMEDCT_US
>>
>>
>>
>> We were able to open the hsqldb using the hsql db gui and found out that
>> some of our required entries are already there . So if I come specifically
>> to our current problem. The  Pap Smear and Mamogram are two clinical terms
>> which are not currently recognized by cTAKES in our profile.
>>
>> ·   If I look into the .script file , Pap Smear and
>> Mammogram/Mammography is already present in the .script file and in the
>> respective tables. PFB a snapshot as below
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> But still this was not recogonised by cTAKES. I see there are some
>> filters working on top of the available entries in dictionary(ctakes-gui
>> and ctake-gui-res). Will that be because of these filters the tokens are
>> not recognized as expected. Could you pls. share us what exactly these
>> filters do. This will help us in future also when we are trying to add new
>> terms into the dictionary
>>
>>
>>
>>
>>
>
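The CUI_TERMS bookkeeping described in Remy's reply above (rare-word index, token count, single quotes doubled) can be scripted.  The helper below is a sketch that assumes simple whitespace tokenization; real cTAKES tokenization also splits most punctuation into separate tokens, so pre-tokenize accordingly:

```python
# Sketch: build a CUI_TERMS INSERT line for the fast-lookup hsqldb script.
# Assumes whitespace tokenization; the caller picks the rarest token.
def cui_terms_insert(cui: int, text: str, rare_word: str) -> str:
    tokens = text.lower().split()
    rindex = tokens.index(rare_word)              # index of the rare word
    tcount = len(tokens)                          # token count
    escaped = " ".join(tokens).replace("'", "''") # double single quotes
    rw = rare_word.replace("'", "''")
    return (f"INSERT INTO CUI_TERMS VALUES({cui},{rindex},{tcount},"
            f"'{escaped}','{rw}')")

# Example from the thread:
# cui_terms_insert(200845, "Pap Smear", "pap")
# -> INSERT INTO CUI_TERMS VALUES(200845,0,2,'pap smear','pap')
```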

Re: Missing Medication Frequency and Allergy attributes from MedicationMention

2020-06-05 Thread Peter Abramowitsch
Some granular areas are unfinished in cTakes, and in these cases the
attributes mentioned are just placeholders for functionality that needs to
be filled in.  I can't speak specifically to Medication Freq/Dose/Route, but
much work is left to be done and contributed throughout the system.  Bodysite
is another one of these, as are conditionality and confidence.  In some cases
you will never find them populated; in others you'll find that values can
only be detected in a small number of contexts.

Unless an army of highly qualified developers and informaticists with free
time materializes to take it much further, cTakes will always be a work in
progress.  But many of us have already found it to be highly effective in
its current form, and some have made private customizations to suit our own
needs.

Peter

On Fri, Jun 5, 2020 at 2:58 AM Honey gandhi
 wrote:

> Hi
>
> We are exploring ctakes capabilities to use it as our NLP engine to parse
> clinical data.
>
> Though we are able to parse the data at high level. We are not able to get
> values for medication frequency, duration, allergy and other related
> specifications.
> It should have ideally populated values for ‘MedicationFrequency',
> ‘MedicationAllergy' and other related fields in ‘MedicationMention’
>
> I have also tried including the RelationSubPipe.piper file from
> ctakes-relation-extractor in my Full.piper file in the ctakes-web-rest
> module.  But I don’t see any difference this made, as I am still not able
> to figure out the relation among a medication entity and its frequency,
> dosage etc.
>
> We are relatively new to this. Please advise on how to proceed further.
>
>
> Thanks,
> Honey G.


Re: Missing Medication Frequency and Allergy attributes from MedicationMention

2020-06-06 Thread Peter Abramowitsch
Hi Honey

I am using the anatomical site mention in my pipeline, but at the moment,
the results are extremely poor.  To be honest though, I have not had time
to see how it is implemented.   The results fall into three categories

missing   "2cm lesion on left foot"   result: (null)
obvious   "hip fracture"              result: (hip structure)
wrong     "cervical cancer"           result: (neck)   yes, it actually gave
me that value!

That last example gives a hint as to the implementation and its
limitations.
One day if I have time, I will have a look

Peter

On Fri, Jun 5, 2020 at 11:54 PM Honey gandhi
 wrote:

> Is there any other way to find relationship between medication and its
> dose/route/frequency or between anatomical site and its sign symptoms?
>
> Thanks,
> Honey G.
>
> > On 06-Jun-2020, at 12:09 PM, Peter Abramowitsch 
> wrote:
> >
> > Some granular areas are unfinished in cTakes and in these cases,
> attributes
> > mentioned are just placeholders for functionality that needs to be filled
> > in.   I can't speak specifically to Medication Freq/Dose/Route, but much
> > work is left to be done and contributed throughout the system.  Bodysite
> is
> > another one of these.  Or conditionality and confidence.  In some cases
> you
> > will never find them populated, or in others you'll find that values can
> > only be detected in a small number of contexts.
> >
> > Unless an army of highly qualified developers and informaticists with
> free
> > time materializes to take it much further, cTakes will always be a work
> in
> > progress.  But many of us have already found it to be highly effective in
> > its current form, and some have made private customizations to suit our
> own
> > needs.
> >
> > Peter
> >
> > On Fri, Jun 5, 2020 at 2:58 AM Honey gandhi
> >  wrote:
> >
> >> Hi
> >>
> >> We are exploring ctakes capabilities to use it as our NLP engine to
> parse
> >> clinical data.
> >>
> >> Though we are able to parse the data at high level. We are not able to
> get
> >> values for medication frequency, duration, allergy and other related
> >> specifications.
> >> It should have ideally populated values for ‘MedicationFrequency',
> >> ‘MedicationAllergy' and other related fields in ‘MedicationMention’
> >>
> >> I have also tried including RelationSubPipe.piper file  from
> >> cakes-relation-extractor to my Full.piper files in cakes-web-rest
> module.
> >> But I don’t see any difference this made as I am yet not able to figure
> >> out the relation among medication entity and its frequency, dosage etc.
> >>
> >> We are relatively new to this. Please advise on how to proceed further.
> >>
> >>
> >> Thanks,
> >> Honey G.
>


Adding Bsv Custom Dict via piper

2020-06-11 Thread Peter Abramowitsch
Hi Sean

Not to take up your time, but it seems like there's no PipeBit associated
with the BsvRareWord classes.  I did it years ago by messing with the desc
files and adding them to an AnalysisEngine, but is there a way now in a
piper file to add that dictionary to the fast UMLS lookup?

Peter


Re: Adding Bsv Custom Dict via piper

2020-06-11 Thread Peter Abramowitsch
Hi Sean.

False alarm.  Adding the Bsv dicts was much easier than I thought.

Peter

On Thu, Jun 11, 2020 at 2:21 PM Peter Abramowitsch 
wrote:

> Hi Sean
>
> Not to take up your time, but it seems like there's no PipeBit associated
> with the BsvRareWord classes, I did it years ago by messing with the desc
> files and adding them to an AnalysisEngine, but is there a way now in a
> piper file to add that dictionary to the fast UMLS lookup?
>
> Peter
>
>
>


Re: Fw: ApacheCon 2020

2020-07-06 Thread Peter Abramowitsch
Hi Sean

I'm asking my team's manager to see if we can present.  I work as Architect
/ cTakes Implementer with a team at the UCSF Bakar Institute for
computational health sciences.

Peter

On Mon, Jul 6, 2020 at 6:20 AM Finan, Sean 
wrote:

> I can't believe that I forgot to mention ...
>
>
> There will also be a presentation (maybe two?) by a group that has adapted
> ctakes to work with two other languages.  They have also integrated ctakes
> with other tools such as FreeLing and HeidelTime.  So cool ...
>
>
> Cheers,
>
> Sean
>
>
> 
> From: Finan, Sean
> Sent: Monday, July 6, 2020 9:08 AM
> To: dev@ctakes.apache.org; u...@ctakes.apache.org
> Subject: ApacheCon 2020
>
>
> Hi all,
>
>
> The ctakes representation at ApacheCon 2020 is looking good!​
>
>
> ApacheCon 2020 runs September 29 through October 1.
>
> Submission runs through Sunday, July 12.  Technically it is 8:00 a.m.
> Eastern time Monday, but please don't procrastinate.
>
> Registration is free.
>
>
> I am excited to announce that we have three groups interested in giving
> presentations on their configuration and use of ctakes at a large scale!
>
> We also have a presentation on the installation of the ctakes Rest service
> using the ctakes-rest module!
>
>
> Knowledge on these topics is always extremely valuable to our users, and I
> for one really want to see how sites use ctakes when given different
> resources, requirements and restrictions.  Because of that, I am trying to
> put together (technology allowing) a roundtable discussion with those
> presenters.  That should be of value to every user no matter what your
> situation.
>
>
> We still need more presentations!  To encourage you, here is a little
> information:
>
>
> 1.  What you do is interesting!  If you think that nobody out there cares
> about what you've done and how, then you probably aren't fully aware of how
> large and diverse our user base really is.  People want to know about
> things like integration, customization, clinical specialty application,
> augmentation and favorite capability fascination.
>
> 2.  Submission is very simple.  This is not like a scientific conference
> that requires a complete paper describing your work.  You only need to
> submit a blurb that loosely covers your topic and major talking point(s).
> Half a dozen sentences will suffice.  In fact, what I sent last week (far
> below) could pass muster for a submission.  Go for something that will be
> on a brochure / schedule.
>
> 3.  The audience is made up of people just like you.  Developers,
> Bioinformaticians, IT Specialists, Students, Medical Researchers, AI
> Explorers and far more Hackers than Rock Stars.
>
> 4.  Slick presentation skills are not necessary.  Don't worry if you have
> never spoken to a room full of listeners.  Don't worry if English isn't
> your first language.  Don't worry if your slides are "sloppy".  Your
> presentation will not be graded.
>
> 5.  You don't need to prepare your whole talk before submitting.Idea
> now, details later.
>
> 6.  Registration is FREE.
>
>
> Right now the speaking time is anything up to 50 minutes.  If you don't
> want to present a full 50 minutes then that is ok ... The rest can be
> filled with extra question/answer or somebody else may fill the remaining
> time with a presentation on a similar topic.
>
>
> I am going to put together a lightning round.  If you think that you can
> cover some material in five to fifteen minutes then this is for you!
> Lightning rounds can be fun as you can make an impact with two or three
> slides and barely enough speaking to run out of breath.  This is really a
> free-for-all.  You can pack the time with data, give a short demonstration,
> compare using ctakes to breaking a mustang, or even do some on-topic
> (ctakes, nlp, AI, bioinformatics) stand up.  Anything goes.  This was an
> interesting (full) talk last year:
> https://aceu19.apachecon.com/session/confessions-middle-aged-coder-turned-gravel-grinder.
>  If you want to be in the lightning round, just write me a couple of
> sentences on your strike and I will put together the full submission for
> ApacheCon.  Does it get any easier?
>
>
> I will present one or two things, but to maximize impact I would like to
> know what most interests / would help all of you.  So, please write me a
> topic or two that would best apply to your work.
>
>
> Some links ...
>
>
> ApacheCon Home Page:  https://www.apachecon.com/
>
> ApacheCon Registration: https://hopin.to/events/apachecon-home
>
> ApacheCon Submission:  https://acna2020.jamhosted.net/cfp.html
>
>
> Lastly, so that we don't crash a server, I would like to have a rough head
> count for attendance estimation.  If you think that you will watch any
> presentation of ctakes then please send me ( seanfi...@apache.org ) an
> email with the subject "Attend" and "+1" in the body.
>
>
> Cheers,
>
> Sean
>
>
> 
> From: Finan, Sean
> Sent: Monday, June 29,

obtaining the UMLS set for the dictionary creator

2020-07-22 Thread Peter Abramowitsch
I've been trying to discover what directory the dictionary creator wants to
see for "UMLS Installation".  Although I have the dictionary and lvg
folders inside ctakes-4.0-resources-bin.zip, I don't see any folder that
the dictionary creator is happy with.

When I look for RRF files in my running installation, the only ones I see
are in the ytex folder, and they're tiny.  Of course the Snomed a-b folder
is there.

I remember downloading the UMLS back in 2014 and creating a dictionary
then, but I cannot find a link to the UMLS install I need for the
dictionary creator now.

Can you help?

I've tried every folder from here to the top of the
ctakes-4.0-resources-bin tree.



Re: obtaining the UMLS set for the dictionary creator

2020-07-22 Thread Peter Abramowitsch
Thank you.  That's helpful.  The last time I did this years ago, the tool
wasn't available and I just remember downloading some huge files.  It's
only mentioned in passing in two Ctakes Wiki pages - and not in association
with the dictionary creator.  whereas the resources file, which is
mentioned there has just the pre-built fast lookup.



Peter

On Wed, Jul 22, 2020 at 2:39 PM Thomas W Loehfelm
 wrote:

> If you have downloaded and created a local UMLS installation using the
> MetamorphoSys tool you should end up with a directory structure something
> like:
> ../../umls/2020AA/META/{all of the various .RRF files}
>
> In that case, you want to point the DictionaryCreator tool to the 2020AA
> directory, i.e. the one that contains the META directory.
>
> From: Peter Abramowitsch 
> Reply-To: "dev@ctakes.apache.org" 
> Date: Wednesday, July 22, 2020 at 2:17 PM
> To: "dev@ctakes.apache.org" 
> Subject: obtaining the UMLS set for the dictionary creator
>
> I've been trying to discover what directory the dictionary creator wants
> to see for "UMLS Installation"  Although I have the dictionary and lvg
> folders, inside ctakes-4.0-resources-bin.zip I don't see any folder that
> the dictionary creator is happy with.
>
> When I look for  RRF files in my running installation, the only ones I see
> are in the ytex folder and they're tiny.   Of course the Snomed a-b folder
> is there,
>
> I remember downloading the UMLS back 2014 and creating a dictionary then,
> I cannot find a link to the umls install I need for the dictionary creator
> now?
>
> Can you help?
>
> I've tried every folder from here to the top of the
> ctakes-4.0-resources-bin tree.
>
> **CONFIDENTIALITY NOTICE** This e-mail communication and any attachments
> are for the sole use of the intended recipient and may contain information
> that is confidential and privileged under state and federal privacy laws.
> If you received this e-mail in error, be aware that any unauthorized use,
> disclosure, copying, or distribution is strictly prohibited. If you
> received this e-mail in error, please contact the sender immediately and
> destroy/delete all copies of this message.
>


Re: obtaining the UMLS set for the dictionary creator

2020-07-22 Thread Peter Abramowitsch
Thanks Gandhi
I really appreciate the details.

P.

On Wed, Jul 22, 2020 at 10:18 PM gandhi rajan 
wrote:

> Hi Peter,
>
> To add more info, step by step info on UMLS installation is available in
> the link -
>
> http://blog.appliedinformaticsinc.com/getting-started-with-metamorphosys-the-umls-installation-tool/
>
> Once installed, refer -
> https://cwiki.apache.org/confluence/display/CTAKES/Dictionary+Creator+GUI
>  and Select a *UMLS installation* directory as
> //META (in my case
> umls_root_dir\2020AA\META).
>
> After dictionary creation, the SQL scripts and custom dictionary XML will
> be available under
> \resources\org\apache\ctakes\dictionary\lookup\fast
> folder.
>
> Hope it helps.
>
> On Thu, Jul 23, 2020 at 3:16 AM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Thank you.  That's helpful.  The last time I did this years ago, the tool
> > wasn't available and I just remember downloading some huge files.  It's
> > only mentioned in passing in two Ctakes Wiki pages - and not in
> association
> > with the dictionary creator.  whereas the resources file, which is
> > mentioned there has just the pre-built fast lookup.
> >
> >
> >
> > Peter
> >
> > On Wed, Jul 22, 2020 at 2:39 PM Thomas W Loehfelm
> >  wrote:
> >
> > > If you have downloaded and created a local UMLS installation using the
> > > MetamorphoSys tool you should end up with a directory structure
> something
> > > like:
> > > ../../umls/2020AA/META/{all of the various .RRF files}
> > >
> > > In that case, you want to point the DictionaryCreator tool to the
> 2020AA
> > > directory, i.e. the one that contains the META directory.
> > >
> > > From: Peter Abramowitsch 
> > > Reply-To: "dev@ctakes.apache.org" 
> > > Date: Wednesday, July 22, 2020 at 2:17 PM
> > > To: "dev@ctakes.apache.org" 
> > > Subject: obtaining the UMLS set for the dictionary creator
> > >
> > > I've been trying to discover what directory the dictionary creator
> wants
> > > to see for "UMLS Installation"  Although I have the dictionary and
> > lvg
> > > folders, inside ctakes-4.0-resources-bin.zip I don't see any folder
> that
> > > the dictionary creator is happy with.
> > >
> > > When I look for  RRF files in my running installation, the only ones I
> > see
> > > are in the ytex folder and they're tiny.   Of course the Snomed a-b
> > folder
> > > is there,
> > >
> > > I remember downloading the UMLS back 2014 and creating a dictionary
> then,
> > > I cannot find a link to the umls install I need for the dictionary
> > creator
> > > now?
> > >
> > > Can you help?
> > >
> > > I've tried every folder from here to the top of the
> > > ctakes-4.0-resources-bin tree.
> > >
> > > **CONFIDENTIALITY NOTICE** This e-mail communication and any
> attachments
> > > are for the sole use of the intended recipient and may contain
> > information
> > > that is confidential and privileged under state and federal privacy
> laws.
> > > If you received this e-mail in error, be aware that any unauthorized
> use,
> > > disclosure, copying, or distribution is strictly prohibited. If you
> > > received this e-mail in error, please contact the sender immediately
> and
> > > destroy/delete all copies of this message.
> > >
> >
>
>
> --
> Regards,
> Gandhi
>
> "The best way to find urself is to lose urself in the service of others
> !!!"
>


Re: Clarification regarding NegationFSM

2020-07-23 Thread Peter Abramowitsch
Check and see if the identified annotation you get for "Smoking status: N"
without your change is actually "Non Smoker" with polarity 1.
Nonsmoker is a separate concept from a Smoker with polarity -1.  Instead
of looking at the range text, check the canonical text of the concept you have.
Having said that, there are many issues with negation in all of the
negation annotators.  Some are too eager, others are too cautious.

Peter
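
The distinction Peter draws can be illustrated with a toy sketch (the annotation records below are invented; in cTAKES you would read the covered span, the concept's preferred text, and the polarity feature from the CAS):

```python
# Made-up annotation records illustrating the point: judge negation by the
# concept's canonical/preferred text plus its polarity, not by the span text.
annotations = [
    {"covered": "Smoking status: N", "preferred": "Non-smoker", "polarity": 1},
    {"covered": "smoker",            "preferred": "Smoker",     "polarity": -1},
]
for ann in annotations:
    negated = ann["polarity"] == -1
    print(f"{ann['preferred']}: negated={negated}")
```

Here "Smoking status: N" maps to a distinct Non-smoker concept with polarity 1, so checking only the covered text would be misleading.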

On Thu, Jul 23, 2020 at 10:17 AM Sreejith Pk  wrote:

> Hi Team,
>
> We are using cTAKES 4.0.0 as the NLP engine in our application. I have
> added ContextAnnotator to the pipeline to achieve correct Polarity to the
> tokens.
> After analysing the ContextAnnotator code, I understand that negation
> determining condition is written in NegationFSM class.
> In my requirement, I have a sentence "Smoking status: N"  and I want to set
> polarity -1 to the token "Smoking" because of the occurrence of "N". To
> achieve the same, I have tried adding "N" to the existing HashSet
> in NegationFSM constructor like iv_negVerbsSet.add("N"); But it seems,
> polarity of the word token "Smoking" is still  coming as 1.
> With the same configuration set if I pass "Smoking status: denies", I am
> getting the polarity of token "Smoking" as -1. Kindly help.
>
> Thanks & Regards
> Sreejith
>


Re: Clarification regarding NegationFSM [EXTERNAL]

2020-07-24 Thread Peter Abramowitsch
Thanks Sean.  I didn't know about that annotator.

On Fri, Jul 24, 2020, 3:51 AM Finan, Sean 
wrote:

> Hi Sreejith,
>
> Without seeing an example of text I can't say whether my next words will
> help you or not.
>
> If you are using trunk then you should have access to two 'new' annotation
> engines in ctakes-core.
> ListAnnotator- Annotates formatted List Sections by detecting them
> using Regular Expressions provided in an input File.
> ListEntryNegator  - Checks List Entries for negation, which may be
> exhibited differently from unstructured negation.
>
> ListAnnotator can use any list of regular expressions in a file.  The
> default file is in ctakes-core-res, called DefaultListRegex.bsv
> The format for each line in the regex list is
> NAME||LIST_REGEX||ENTRY_SEPARATOR_REGEX   where
> NAME - name of list type.  Can be anything.
> LIST_REGEX   - some regular expression for which a block of text will
> match a list in its entirety.
> ENTRY_SEPARATOR_REGEX   - some regular expression for which text within
> the entire list will match a single list entry.
> For instance, the List
> Smoker Status: N
> Drinking Status: Y
> Pregnant: N/A
> A -simple- line in the regex file could be
> ColonizedList||(?:^(?:[^\r\n:]+:[^\r\n:]+)+\r?\n){2,}||(?:^(?:[^\r\n:]+:[^\r\n:]+)+\r?\n)
> Notice that each item is separated by two bar characters "||".
>
> The file of regular expressions can be changed using the LIST_TYPES_PATH
> parameter.
>
> ListEntryNegator will iterate through each ListEntry in the cas and use a
> regular expression to determine whether or not items in the list should be
> negated.
> Right now that regex is hard-coded in the class.  There should probably be
> a mechanism to overwrite it.  ": N" is not in there.   Also, only
> Disease/Disorders and Sign/Symptom mentions in the ListEntry are negated.
>  You would need to add SmokingStatusAnnotation as a negatable.
>
> I don't know if any of this is helpful, but I thought that I would throw
> it out there.
>
> Sean
> 
> From: Sreejith Pk 
> Sent: Friday, July 24, 2020 4:09 AM
> To: dev@ctakes.apache.org
> Subject: Re: Clarification regarding NegationFSM [EXTERNAL]
>
> * External Email - Caution *
>
>
> Hi Peter, Thanks a lot for the reply.
>
> Let me elaborate more on the changes I have done so far. I have added
> KuRuleBasedClassifierAnnotator to the pipeline inorder to fetch Smoking
> related keywords from the document. I have
> modified KuRuleBasedClassifierAnnotator in such a way that it will iterate
> through the identified tokens and if the token matches any smoking related
> word which are configured inside a keyword.txt file. The identified tokens
> will be then set to SmokerNamedEntityAnnotation and thus can be read from
> the output XMI.
> Here in my scenario, the sentence I am passing to cTAKES is "Smoking
> status: N". As Smoking is configured inside keywords.txt, it will be coming
> as the output node in SmokerNamedEntityAnnotation. Its polarity only I am
> parsing in my parser logic. Here polarity of SmokerNamedEntityAnnotation
> - "Smoking" token is coming as 1 instead of expected -1
> (NB: I have removed ":" from the NamedEntityContextAnalizer.java - boundary
> words set)
>
> Thanks and Regards,
> Sreejith
>
>
> On Thu, Jul 23, 2020 at 11:20 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Check and see if the identified annotation you get for "Smoking status:
> N"
> > without your change is actually "Non Smoker" with polarity 1.
> > Nonsmoker is a separate concept, from a Smoker with polarity -1.  Instead
> > of looking at range text, check the canonical text for the concept you
> > have.
> > Having said that, there are many issues with negation in all of the
> > negation annotators.  Some are too eager, others are too cautious.
> >
> > Peter
> >
> > On Thu, Jul 23, 2020 at 10:17 AM Sreejith Pk  wrote:
> >
> > > Hi Team,
> > >
> > > We are using cTAKES 4.0.0 as the NLP engine in our application. I have
> > > added ContextAnnotator to the pipeline to achieve correct Polarity to
> the
> > > tokens.
> > > After analysing the ContextAnnotator code, I understand that negation
> > > determining condition is written in NegationFSM class.
> > > In my requirement, I have a sentence "Smoking status: N"  and I want to
> > set
> > > polarity -1 to the token "Smoking" because of the occurrence of "N". To
> > > achieve the same, I have tried adding "N" to the existing HashSet
> > > in NegationFSM constructor like iv_negVerbsSet.add("N"); But it seems,
> > > polarity of the word token "Smoking" is still  coming as 1.
> > > With the same configuration set if I pass "Smoking status: denies", I
> am
> > > getting the polarity of token "Smoking" as -1. Kindly help.
> > >
> > > Thanks & Regards
> > > Sreejith
> > >
> >
>
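
Sean's NAME||LIST_REGEX||ENTRY_SEPARATOR_REGEX format can be sanity-checked with a quick sketch. The regexes below are simplified for illustration and are not the shipped DefaultListRegex.bsv contents:

```python
import re

# One hypothetical line in the BSV regex-list format described above:
# NAME || regex matching the whole list || regex matching one list entry.
bsv_line = (r"ColonizedList"
            r"||(?:^[^\r\n:]+:[^\r\n:]+\r?\n?){2,}"
            r"||^[^\r\n:]+:[^\r\n:]+$")

name, list_regex, entry_regex = bsv_line.split("||")

note = "Smoker Status: N\nDrinking Status: Y\nPregnant: N/A\n"

# First find the whole list block, then split it into entries.
list_match = re.search(list_regex, note, re.MULTILINE)
entries = re.findall(entry_regex, list_match.group(0), re.MULTILINE)
print(name, entries)
```

Run against Sean's three-line sample, this finds one list block containing three entries ("Smoker Status: N", "Drinking Status: Y", "Pregnant: N/A").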


Problem trying to load a custom dictionary

2020-07-30 Thread Peter Abramowitsch
A couple of questions about installing a custom dictionary in lookup fast.
I hope I'm not too far off the track.

I've used the dictionary creator with a UMLS install to create the
dictionary script, prop file, and xml file in my ctakes resources tree

1.  Do I have to manually run this script to load all its SQL statements
into hsqldb, or is it executed by the cTAKES program when it encounters
the XML descriptor?
If manual running is needed, are there instructions on how and where to
load the script?

2.  The xml file generated by the dictionary creator contains lines with
duplicate names-- see yellow highlight.  Is this  correct?
my_dict_v1Terms
 

3.  Try as I might I cannot get ctakes to load anything other than
sno_rx.   I'm using a piper file with an entry looking like
  
  add org.apache.ctakes.dictionary.lookup2.ae.OverlapJCasTermAnnotator
DictionaryDescriptor=org/apache/ctakes/dictionary/lookup/fast/my_dict_v1.xml
 
Not sure if they're looked at any more but I also changed these xml files
under desc as well.


desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsOverlapLookupAnnotator.xml

desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml

But I can't even get it to fail to try to load mine.
my log looks like this

  30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Exclusion
tagset loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB VBD
VBG VBN VBP VBZ WDT WP WPS WRB
  30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Using minimum
term text span: 3
  30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Using
Dictionary Descriptor: org/apache/ctakes/dictionary/lookup/fast/
sno_rx_16ab.xml


Any suggestions?

Regards,  Peter


Re: Problem trying to load a custom dictionary

2020-07-30 Thread Peter Abramowitsch
Thanks Jeff

That worked!

Seems like something that should get fixed in the PiperCreator and in the
documentation.

After a lifetime of assuming that every mistake is my own error, the last
thing I would have expected was a generator of incorrect params.

Peter

On Thu, Jul 30, 2020 at 4:59 PM Jeffrey Miller  wrote:

> Peter,
>
> 1) This is loaded by cTAKES, you don't need to manually create the
> database.
> 2) I can't see the highlights here, but I think that file should be okay as
> created by the GUI.
> 3) I think the parameter name to configure your dictionary location is
> LookupXml instead of DictionaryDescriptor
>
> Jeff
>
> On Thu, Jul 30, 2020 at 6:49 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > A couple of questions about installing a custom dictionary in lookup
> fast.
> > I hope I'm not too far off the track.
> >
> > I've used the dictionary creator with a UMLS install to create the
> > dictionary script, prop file, and xml file in my ctakes resources tree
> >
> > 1.  Do I have to manually run this script to execute all its SQL
> statements
> > into hsqldb or is this executed by the cTakes program when it encounters
> > the XML descriptor?
> > If manual running is needed, are there instructions on how and where to
> > load the script?
> >
> > 2.  The xml file generated by the dictionary creator contains lines with
> > duplicate names-- see yellow highlight.  Is this  correct?
> > my_dict_v1Terms
> >   >
> value="jdbc:hsqldb:file:resources/org/apache/ctakes/dictionary/lookup/fast/
> > my_dict_v1/my_dict_v1"/>
> >
> > 3.  Try as I might I cannot get ctakes to load anything other than
> > sno_rx.   I'm using a piper file with an entry looking like
> >   
> >   add org.apache.ctakes.dictionary.lookup2.ae
> .OverlapJCasTermAnnotator
> >
> >
> DictionaryDescriptor=org/apache/ctakes/dictionary/lookup/fast/my_dict_v1.xml
> >  
> > Not sure if they're looked at any more but I also changed these xml files
> > under desc as well.
> >
> >
> >
> >
> desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsOverlapLookupAnnotator.xml
> >
> >
> >
> desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
> >
> > But I can't even get it to fail to try to load mine.
> > my log looks like this
> >
> >   30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Exclusion
> > tagset loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB VBD
> > VBG VBN VBP VBZ WDT WP WPS WRB
> >   30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Using
> minimum
> > term text span: 3
> >   30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Using
> > Dictionary Descriptor: org/apache/ctakes/dictionary/lookup/fast/
> > sno_rx_16ab.xml
> >
> >
> > Any suggestions?
> >
> > Regards,  Peter
> >
>
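
For anyone hitting the same symptom from the archive: per Jeff's answer above, the working piper entry uses the LookupXml parameter rather than the deprecated DictionaryDescriptor name (the dictionary path is Peter's example name, not a shipped resource):

```text
// piper entry (illustrative path from this thread)
add org.apache.ctakes.dictionary.lookup2.ae.OverlapJCasTermAnnotator LookupXml=org/apache/ctakes/dictionary/lookup/fast/my_dict_v1.xml
```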


Re: Problem trying to load a custom dictionary [EXTERNAL]

2020-07-31 Thread Peter Abramowitsch
I could do it while the experience is fresh, although I only know the happy
path and not the deeper details in this area of the suite.
If you want me to, let me know how to get editing privileges on the Wiki.

Peter

On Fri, Jul 31, 2020 at 4:28 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Obviously Jeff is correct in all of his answers.  Thank you Jeff!
>
> One comment: DictionaryDescriptor is a deprecated parameter name that is
> picked up by the piper creator when it inspects the code.  However, I am
> not sure why the deprecated parameter name isn't working ...
>
> The wiki needs additional and more thorough information.  If anybody can
> volunteer to work on it I (and future users) would really appreciate it!
>
> Thanks,
> Sean
>
>
> ____
> From: Peter Abramowitsch 
> Sent: Thursday, July 30, 2020 9:02 PM
> To: dev@ctakes.apache.org
> Subject: Re: Problem trying to load a custom dictionary [EXTERNAL]
>
> * External Email - Caution *
>
>
> Thanks Jeff
>
> That worked!
>
> Seems like something that should get fixed in the PiperCreator and in the
> documentation.
>
> With a life of assuming that every mistake is my own error, the last thing
> I would have expected was
> a generator of incorrect params.
>
> Peter
>
> On Thu, Jul 30, 2020 at 4:59 PM Jeffrey Miller  wrote:
>
> > Peter,
> >
> > 1) This is loaded by cTAKES, you don't need to manually create the
> > database.
> > 2) I can't see the highlights here, but I think that file should be okay
> as
> > created by the GUI.
> > 3) I think the parameter name to configure your dictionary location is
> > LookupXml instead of DictionaryDescriptor
> >
> > Jeff
> >
> > On Thu, Jul 30, 2020 at 6:49 PM Peter Abramowitsch <
> > pabramowit...@gmail.com>
> > wrote:
> >
> > > A couple of questions about installing a custom dictionary in lookup
> > fast.
> > > I hope I'm not too far off the track.
> > >
> > > I've used the dictionary creator with a UMLS install to create the
> > > dictionary script, prop file, and xml file in my ctakes resources tree
> > >
> > > 1.  Do I have to manually run this script to execute all its SQL
> > statements
> > > into hsqldb or is this executed by the cTakes program when it
> encounters
> > > the XML descriptor?
> > > If manual running is needed, are there instructions on how and where to
> > > load the script?
> > >
> > > 2.  The xml file generated by the dictionary creator contains lines
> with
> > > duplicate names-- see yellow highlight.  Is this  correct?
> > > my_dict_v1Terms
> > >   > >
> >
> value="jdbc:hsqldb:file:resources/org/apache/ctakes/dictionary/lookup/fast/
> > > my_dict_v1/my_dict_v1"/>
> > >
> > > 3.  Try as I might I cannot get ctakes to load anything other than
> > > sno_rx.   I'm using a piper file with an entry looking like
> > >   
> > >   add org.apache.ctakes.dictionary.lookup2.ae
> > .OverlapJCasTermAnnotator
> > >
> > >
> >
> DictionaryDescriptor=org/apache/ctakes/dictionary/lookup/fast/my_dict_v1.xml
> > >  
> > > Not sure if they're looked at any more but I also changed these xml
> files
> > > under desc as well.
> > >
> > >
> > >
> > >
> >
> desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsOverlapLookupAnnotator.xml
> > >
> > >
> > >
> >
> desc/ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
> > >
> > > But I can't even get it to fail to try to load mine.
> > > my log looks like this
> > >
> > >   30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Exclusion
> > > tagset loaded: CC CD DT EX IN LS MD PDT POS PP PP$ PRP PRP$ RP TO VB
> VBD
> > > VBG VBN VBP VBZ WDT WP WPS WRB
> > >   30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Using
> > minimum
> > > term text span: 3
> > >   30 Jul 2020 15:43:37  INFO AbstractJCasTermAnnotator - Using
> > > Dictionary Descriptor: org/apache/ctakes/dictionary/lookup/fast/
> > > sno_rx_16ab.xml
> > >
> > >
> > > Any suggestions?
> > >
> > > Regards,  Peter
> > >
> >
>


Re: Problem trying to load a custom dictionary [EXTERNAL]

2020-07-31 Thread Peter Abramowitsch
Thank you Jeff and Gandhi for the offers of help.  I'm not trying to renege
on my offer, but as I have only done this once, I wonder whether your
combined experience makes it more appropriate for one of you to do the
documentation and for me to review it, rather than the other way round --
especially if Jeff has actually written up the basis for the enhancement.

However, I'm willing to give it a shot if neither of you wants to take the
reins.

Peter



On Fri, Jul 31, 2020 at 7:39 AM Jeffrey Miller  wrote:

> I can help with this as well. I have some documentation that I have written
> for myself that would probably be useful. I've tried to keep a list of
> useful forum posts that contain information that could probably be more
> prominently displayed on the wiki.
>
> On Fri, Jul 31, 2020 at 10:34 AM gandhi rajan 
> wrote:
>
> > Hi Peter,
> >
> > We can work together on this if you are interested.
> >
> > On Fri, Jul 31, 2020 at 7:44 PM Peter Abramowitsch <
> > pabramowit...@gmail.com>
> > wrote:
> >
> > > I could do it while the experience is fresh, although I only know the
> > happy
> > > path and not the deeper details in this area of the suite
> > > If you want me to, let me know how to get editing privileges on the
> Wiki.
> > >
> > > Peter
> > >
> > > On Fri, Jul 31, 2020 at 4:28 AM Finan, Sean <
> > > sean.fi...@childrens.harvard.edu> wrote:
> > >
> > > > Obviously Jeff is correct in all of his answers.  Thank you Jeff!
> > > >
> > > > One comment: DictionaryDescriptor is a deprecated parameter name that
> > is
> > > > picked up by the piper creator when it inspects the code.  However, I
> > am
> > > > not sure why the deprecated parameter name isn't working ...
> > > >
> > > > The wiki needs additional and more thorough information.  If anybody
> > can
> > > > volunteer to work on it I (and future users) would really appreciate
> > it!
> > > >
> > > > Thanks,
> > > > Sean
> > > >
> > > >
> > > > 
> > > > From: Peter Abramowitsch 
> > > > Sent: Thursday, July 30, 2020 9:02 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Re: Problem trying to load a custom dictionary [EXTERNAL]
> > > >
> > > > * External Email - Caution *
> > > >
> > > >
> > > > Thanks Jeff
> > > >
> > > > That worked!
> > > >
> > > > Seems like something that should get fixed in the PiperCreator and in
> > the
> > > > documentation.
> > > >
> > > > With a life of assuming that every mistake is my own error, the last
> > > thing
> > > > I would have expected was
> > > > a generator of incorrect params.
> > > >
> > > > Peter
> > > >
> > > > On Thu, Jul 30, 2020 at 4:59 PM Jeffrey Miller 
> > > wrote:
> > > >
> > > > > Peter,
> > > > >
> > > > > 1) This is loaded by cTAKES, you don't need to manually create the
> > > > > database.
> > > > > 2) I can't see the highlights here, but I think that file should be
> > > okay
> > > > as
> > > > > created by the GUI.
> > > > > 3) I think the parameter name to configure your dictionary location
> > is
> > > > > LookupXml instead of DictionaryDescriptor
> > > > >
> > > > > Jeff
> > > > >
> > > > > On Thu, Jul 30, 2020 at 6:49 PM Peter Abramowitsch <
> > > > > pabramowit...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > A couple of questions about installing a custom dictionary in
> > lookup
> > > > > fast.
> > > > > > I hope I'm not too far off the track.
> > > > > >
> > > > > > I've used the dictionary creator with a UMLS install to create
> the
> > > > > > dictionary script, prop file, and xml file in my ctakes resources
> > > tree
> > > > > >
> > > > > > 1.  Do I have to manually run this script to execute all its SQL
> > > > > statements
> > > > > > into hsqldb or is this executed by the cTakes program when it
> > > > encounters
> > > > > > the XML descriptor?
> > >

With custom dictionary - over-eager resolution of acronyms

2020-08-01 Thread Peter Abramowitsch
Hi All

Having created a new dictionary from the 2020AA UMLS and added Genes and
Receptors to the dictionary-creator's default selections, I have a curious
problem where cTAKES now assigns the most bizarre acronyms to ordinary
words used in POS contexts where it shouldn't find Mentions.

Here are two examples:

1.   soft (in "soft tissue...")
becomes   "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND
HYPOTRICHOSIS SYNDROME",

2.   bed in ("The wound bed was...")
becomes  "BORNHOLM EYE DISEASE"

I have not changed the TermConsumer type in the descriptor XML.

Are the DictionaryCreator's defaults the equivalent of the default sno_rx
that's delivered with the app?

Attached is the vocab subsets list I used


Peter


Re: With custom dictionary - over-eager resolution of acronyms

2020-08-01 Thread Peter Abramowitsch
Hi Jeff thanks for your suggestions,

I spent some time in the script file and sure enough,  my 2020 UMLS
extraction actually has these two entries:

INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, FACIAL
DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')

It's unbelievable.  The UMLS entry has got to be wrong, or I'm missing
something that says it only applies (as an acronym) if it's capitalized.

In sno_rx there is neither a CUI 3542022 nor a definition of "soft" as a
solitary word, nor even a mention of ONYCHODYSPLASIA or HYPOTRICHOSIS.

In any case, I would have thought that cTAKES only creates an event
mention from a term tagged in an NN or NP slot, not an ADJ as in "soft tissue".
Anyway, thanks!  Now I will keep poking around.

Peter

On Sat, Aug 1, 2020 at 5:06 PM Jeffrey Miller  wrote:

> Sorry, I meant suggest to search for 'soft' in the dictionary file not
> 'short'
>
> grep -i ,\'soft\', *.script
>
> On Sat, Aug 1, 2020 at 7:47 PM Jeffrey Miller  wrote:
>
> > Hi Peter,
> >
> > To my knowledge, there isn't any drastic difference in the behavior of
> the
> > dictionary gui creator and the way the sno_rx dictionary was created. I
> > originally thought there was, but I realized the difference was that I
> had
> > not installed all of UMLS to my machine (just the vocabularies I was
> > interested in) and I was missing synonyms. The first thing I would check,
> > are you able to find a matching entry in the .script file for your ctakes
> > dictionary when you do this:
> >
> > grep -i ,\'short\', *.script
> >
> > That would confirm whether or not you have a term in your dictionary made
> > up only of 'short' and whether it mapped to the CUI equal to "SHORT
> > STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS
> SYNDROME".
> > If it's not in there, something else is going on. You could do the same
> for
> > 'bed'.
> >
> > If not, another thing I might check is that I noticed you are using
> > the OverlapJCasTermAnnotator in your prior e-mail. I don't have much
> > experience with it, and I don't think it should cause this behavior, but
> I
> > wonder if that could be making the difference (as compared
> > to DefaultJCasTermAnnotator).
> >
> > Jeff
> >
> > On Sat, Aug 1, 2020 at 5:27 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> > wrote:
> >
> >>
> >> Hi All
> >>
> >> Having created a new dictionary from the 2020AA UMLS and added Genes and
> >> Receptors to the dictionary-creator's default selections, I have a
> curious
> >> problem where cTakes now assigns the most bizarre acronyms to ordinary
> >> words used in POS contexts where it shouldn't  find Mentions.
> >>
> >> Here are two examples:
> >>
> >> 1.   soft (in "soft tissue...")
> >> becomes   "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND
> >> HYPOTRICHOSIS SYNDROME",
> >>
> >> 2.   bed in ("The wound bed was...")
> >> becomes  "BORNHOLM EYE DISEASE"
> >>
> >> I have not changed the TermConsumer type in the descriptor XML.
> >>
> >> Are the DictionaryCreator's defaults, the equivalent to the default
> >> sno_rx that's delivered with the app?
> >>
> >> Attached is the vocab subsets list I used
> >>
> >>
> >> Peter
> >>
> >>
> >>
>


Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL]

2020-08-02 Thread Peter Abramowitsch
Many thanks Sean and Jeff.  You guys must be both on the East Coast, because my 
coffee has only just kicked in enough to digest your lucid replies.   Super 
helpful information.  It sounds like the quick and dirty solution is to rebuild 
the dictionary without the OMIM and MTH vocabularies.  So it’s not a case of a 
CUI being remapped - but that it’s being layered onto by a particular 
vocabulary adding a synonym (which in this case is probably very rarely used). 

One question related to that - are the vocabulary & tui selections that one 
finds as defaults in the dictionary creator something set by the creator as a 
ctakes optimization, or are the defaults governed by information the creator 
reads from the UMLS release? 

And thanks for mentioning the capitalization project.  I had been looking in 
vain for that functionality which I had assumed was already there.  You can 
tell that these are still my first experiences with dictionary building.

I appreciate how difficult it is to find the time to build enhancements to the 
product when one is so busy just using it.   There’s an enhancement I’ve been 
prototyping for months which brings in some functionality from the Stanford NLP 
project, but I just don’t have the time or energy to productize it.  It would have 
two completely different applications: a superior way of finding the values of 
findings, and a way of validating/pruning the polarity status of concepts that 
are in a semi-grammatical or improperly punctuated sentence, such as “Denies 
headache, abdominal pain, temperature normal”.

Maybe one day

Thanks again
Peter

Sent from my iPad

> On Aug 2, 2020, at 06:25, Finan, Sean  
> wrote:
> 
> Hi Peter,
> 
> I would guess that you are seeing things like "SOFT" because your new 
> dictionary has a vocabulary that was not included in sno_rx_16ab.
> I don't remember if OMIM (which has the 'SOFT' synonym) was included in 
> sno_rx_16ab.  Probably not, omim is a more -specialized- vocabulary for 
> genetics.
> 
> The term is only in the omim (and mth) vocabularies in the 2016AB umls 
> release. 
>   
> https://uts.nlm.nih.gov/metathesaurus.html#C3542022;0;1;CUI;2016AB;WORD;CUI;*;
>   
> 
> The term is in snomed in umls 2020AA, but only with the expanded full-text 
> synonym.  It still has the abbreviation from omim.  
> https://uts.nlm.nih.gov/metathesaurus.html#SHORT%20STATURE,%20ONYCHODYSPLASIA,%20FACIAL%20DYSMORPHISM,%20AND%20HYPOTRICHOSIS;0;1;TERM;2020AA;WORD;TERM;*;
> 
> As for finding terms in adjectives, the default parts of speech(pos) that are 
> checked for terms are:
> VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB
> 
> You can see what these are here: 
> https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
> 
> You can override this list.  In your piper file, set the variable 
> "exclusionTags"
> 
> // Default excluded parts of speech, plus various forms of adjective.
> set 
> exclusionTags="ADJ,JJ,JJR,JJS,VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"
> 
> //  Annotate concepts based upon default algorithms.
> add DefaultJCasTermAnnotator
> 
> 
> You'll notice that I threw in 'ADJ' for good measure.  It should not break 
> anything.  
> 
> I have modified this list many times for various projects.  In one I allow 
> verbs for lookup.  For those notes the value of the true positives outweighed 
> the increased false negatives.  In another I actually empty the entire list 
> to allow everything (set exclusionTags="").  I did this because there is a 
> lot of structured text in lists and tables, but the pos tagger is trying to 
> resolve prose text.  The pos assigned on the structured text is all over the 
> place, and terms are missed left and right.
> 
> So ... last but definitely not least, case-sensitivity.
> I started working on this a while ago, but right now it sits unfinished.
> 
> There is an additional table in the dictionary database, in which all 
> synonyms are all upper-case.
> This second table is created with synonyms that exist in the umls as all 
> upper-case.
> The first  "classic" table is created using ONLY synonyms from the umls that 
> are lower and/or mixed case. 
> 
> When the annotator engine iterates over the text, it checks one table 
> (classic) or the other (caps) depending upon the case of the text in the note.
> 
> It sounds like minor work, but it requires a new engine, new dictionary, and 
> new dictionary creator.  None of this is difficult, but it requires time.
> 
> Anyway, I hope that some of this helps.
> 
> Sean
> 
> 
> 
> From: Peter Abramowitsch 
&
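
Sean's exclusionTags setting (quoted above) amounts to a part-of-speech filter applied before dictionary lookup. A minimal sketch of the idea, with invented tokens and tags for illustration:

```python
# Toy POS filter mimicking the exclusionTags idea: tokens whose tag is in
# the exclusion set are never offered to the dictionary lookup.
exclusion_tags = set("ADJ,JJ,JJR,JJS,VB,VBD,VBG,VBN,VBP,VBZ".split(","))
tokens = [("soft", "JJ"), ("tissue", "NN"), ("denies", "VBZ"), ("headache", "NN")]
candidates = [word for word, tag in tokens if tag not in exclusion_tags]
print(candidates)  # ['tissue', 'headache']
```

With adjectives excluded, "soft" in "soft tissue" is never looked up, which is exactly the fix Sean suggests for the SOFT acronym problem.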

Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL] [EXTERNAL]

2020-08-02 Thread Peter Abramowitsch
>It would have two completely different applications:  a superior way of
finding the values of findings and a way of validating/pruning the polarity
status of concepts that are in an semi-grammatical or improperly punctuated
sentence
-- Cool.  I expect to see it by end of business tomorrow.

... if only.

Peter

On Sun, Aug 2, 2020 at 10:46 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> For Peter and Jeff:
>
> > are the vocabulary & tui selections that one finds as defaults in the
> dictionary creator something set by the creator as a ctakes optimization
> -- Good question, and the answer is "no."  Those vocabularies and semantic
> types were chosen simply because they contain clinical terms of interest to
> previously done national studies.  The other semantic types and vocabulary
> terms, while present in notes, are often not of interest to "standard"
> clinical studies.  Adding more terms from other vocabularies and semantic
> types should not slow down processing to any noticeable degree.
> > are the defaults governed by information the creator reads from the UMLS
> release
> -- As far as I know there are no recommendations of this sort made by the
> NLM.
>
> >It would have two completely different applications:  a superior way of
> finding the values of findings and a way of validating/pruning the polarity
> status of concepts that are in an semi-grammatical or improperly punctuated
> sentence
> -- Cool.  I expect to see it by end of business tomorrow.
>
> >I recently created a dictionary based off of UMLS 2020AA and did not see
> 'bed' or 'soft' mapped as synonyms to those terms in my .script file. They
> are there, but mapped to other cuis (for example, the cui for an actual bed
> from SNOMED). I think the difference is that I select all of the available
> TUIs on the right and when I do that 'bed' and 'soft' get assigned to a
> different CUIs (with TUIs of "manufactured object" and "quantitative
> concept" respectively) and the CUI synonyms for the more clinical TUIs are
> skipped. I selected all the TUIs because the defaults seemed to be missing
> some things people might be interested in, but I did not expect the
> behavior where it would change how identical terms from other TUIs get
> included (maybe this is some kind of WSD?)
> -- Yes, there is some horribly simple "WSD" being done before the
> dictionary is written.
> What you are seeing is that SOFT only exists as two synonym entries under
> "Short Stature ...", while it exists as 2++ synonym entries for "bed"
> and/or it is the preferred text for "bed" (probably not), or something like
> that.
>
> >but I imagine it could cause other misses.
> -- True.  It is really difficult to make the perfect dictionary for any
> purpose.  So, we just go for the best coverage and fewest extraneous
> entries - or fewest frequently discovered extraneous entries.  "Bed" may
> not be a problem for notes on outpatient visits.  For inpatient notes it
> would be a different story.
>
> And of course, once you get a great set of terms, you get to play with the
> valid parts of speech.  You decide on grabbing every term or only the
> longest overlapping terms.  Allow discontinuous spans or require continuous
> spans.
>
> Fun.
>
>
>
> 
> From: Peter Abramowitsch 
> Sent: Sunday, August 2, 2020 12:14 PM
> To: dev@ctakes.apache.org
> Subject: Re: With custom dictionary - over-eager resolution of acronyms
> [EXTERNAL] [EXTERNAL]
>
> * External Email - Caution *
>
>
> Many thanks Sean and Jeff.  You guys must be both on the East Coast,
> because my coffee has only just kicked in enough to digest your lucid
> replies.   Super helpful information.  It sounds like the quick and dirty
> solution is to rebuild the dictionary without the OMIM and MTH
> vocabularies.  So it’s not a case of a CUI being remapped - but that it’s
> being layered onto by a particular vocabulary adding a synonym (which in
> this case is probably very rarely used)
>
> One question related to that - are the vocabulary & tui selections that
> one finds as defaults in the dictionary creator something set by the
> creator as a ctakes optimization, or are the defaults governed by
> information the creator reads from the UMLS release?
>
> And thanks for mentioning the capitalization project.  I had been looking
> in vain for that functionality which I had assumed was already there.  You
> can tell that these are still my first experiences with dictionary building.
>
> I appreciate how difficult it is to find the time to build enhancements to
> t

RE Tuning custom dictionary recommendations

2020-08-04 Thread Peter Abramowitsch
Hi Jeff et al

To take up the thread from a few days ago: where a simple English word such
as bed, soft, or shop also maps to a legitimate but rarely used acronym and
shows up with the same POS as a potentially interesting entity, what is the
mechanism you would use to disambiguate?

This problem only started once I constructed a SNO+RX+HGNC dictionary
from the 2020AA UMLS dump.  Adding more TUIs where a more conventional
word sense of the target word occurs does not fix this problem.

For instance, why does the sno_rx dictionary not contain this disease which
aliases to  "bed" ?

ucsf_dict_v1 $ grep 3159311 *.script
*INSERT INTO CUI_TERMS VALUES(3159311,0,1,'bed','bed')*
INSERT INTO CUI_TERMS VALUES(3159311,5,8,'myopia , high , with
nonprogressive cone dysfunction','nonprogressive')
INSERT INTO CUI_TERMS VALUES(3159311,0,3,'bornholm eye disease','bornholm')
INSERT INTO CUI_TERMS VALUES(3159311,5,6,'x-linked cone dysfunction
syndrome with myopia','myopia')
INSERT INTO TUI VALUES(3159311,47)
*INSERT INTO PREFTERM VALUES(3159311,'BORNHOLM EYE DISEASE')*
INSERT INTO SNOMEDCT_US VALUES(3159311,718718009)


sno_rx_16ab $ grep 3159311 *.script
nada

Solutions good or evil?

   - Strip the relevant lines out of the dict.script file?
   - Blacklist the text?
   - Add to my stopCUI list (a little feature I added)?
   - Some other configuration I don't know about?
   For instance, is there a CUI:ACRONYM table?
   I'm tempted to create one.  This would require the matching term to be
   present in upper case.

Peter
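Of the "good or evil" options above, the first - stripping the relevant lines
out of the dict.script file - is easy to script.  A minimal sketch; the
blacklist contents are only examples, and the INSERT column layout is
inferred from the grep output above:

```python
# Sketch: filter CUI_TERMS rows whose indexed text exactly matches a
# blacklisted word out of an HSQLDB dict .script file.  The INSERT format
# is taken from the grep output above; BLACKLIST contents are examples.
import re

BLACKLIST = {"bed", "soft", "shop"}  # lowercase standalone words to drop

ROW = re.compile(r"INSERT INTO CUI_TERMS VALUES\(\d+,\d+,\d+,'([^']*)','[^']*'\)")

def strip_terms(lines):
    """Return (kept_lines, removed_count) for one pass over a script file."""
    kept, removed = [], 0
    for line in lines:
        m = ROW.match(line.strip())
        if m and m.group(1) in BLACKLIST:
            removed += 1          # drop the blacklisted synonym row
        else:
            kept.append(line)     # keep everything else untouched
    return kept, removed
```

Run over the .script file's lines and write the kept lines back out; rows for
longer phrases that merely contain a blacklisted word are left alone, since
the match is against the full synonym text.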


Re: RE Tuning custom dictionary recommendations

2020-08-04 Thread Peter Abramowitsch
OK, thanks Jeff.  I'm glad I wasn't missing something important.

There already is a blacklist mechanism which suppresses identification
of specific text by clinical domain.
Looking at the code, it collects entries like
cTakesSemanticCode,texta,textb,textc
NE_TYPE_ID_DRUG, jasmine, coriander, bleach
There's a case-sensitive list and a case-insensitive one.

So I will try that.
In one of my examples, I'll say that 'bed' is not a disorder, while 'BED'
could be one.



On Tue, Aug 4, 2020 at 2:12 PM Jeffrey Miller  wrote:

> Hi Peter,
>
> To your question about sno_rx_16ab I suspect that the CUI is new since
> 2016, or if it existed in UMLS back then, it was not associated with a term
> in snomed or rxnorm at that time.
>
> To those solutions, if you are able to use the trunk I know Sean said there
> was a suppression text feature, otherwise in the past I have removed the
> lines from the .script file
>
> I definitely think the acronym case sensitive feature would be great.
>
> Jeff


Re: RE Tuning custom dictionary recommendations

2020-08-04 Thread Peter Abramowitsch
Blacklist format
Actually I got it inverted; it's:

semantic_code1, semantic_code2,...|text1
semantic_code1, semantic_code2,...|text2

Peter
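A minimal sketch of a parser for the blacklist format above; the function
names are hypothetical, not the actual cTAKES API:

```python
# Sketch: parse blacklist lines of the form
#   semantic_code1, semantic_code2,...|text
# into a lookup table, then test suppression for a (code, text) pair.
# Function names are hypothetical, not the cTAKES implementation.
def load_blacklist(lines):
    """Map each suppressed text to the set of semantic codes it is blocked for."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blanks and malformed entries
        codes, text = line.split("|", 1)
        table[text.strip()] = {c.strip() for c in codes.split(",") if c.strip()}
    return table

def is_blacklisted(table, semantic_code, text):
    """True if this text is suppressed for this semantic code."""
    return semantic_code in table.get(text, set())
```

The same text can thus stay valid in one clinical domain while being
suppressed in others.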



Re: RE Tuning custom dictionary recommendations

2020-08-04 Thread Peter Abramowitsch
From codebase 4.0.1:
org.apache.ctakes.dictionary.lookup2.consumer.DefaultTermConsumer  line 98,
but you'll see it referenced everywhere in the file.

Oddly enough, it is not abstracted out so that it could be reused in the
PrecisionTermConsumer.
I'm just testing it.

Peter


On Tue, Aug 4, 2020 at 4:41 PM Jeffrey Miller  wrote:

> Where in the source code is this feature implemented?
>


Re: RE Tuning custom dictionary recommendations

2020-08-04 Thread Peter Abramowitsch
It works!

I've added one entry, "bed", to the case-sensitive version with my new
dictionary, blacklisting it from disorders, S&S, procedures, etc.
When I spell it bed, it is blacklisted.
When I spell it BED, I get:

"ontologyConceptArr": [{
  "_type": "UmlsConcept",
  "codingScheme": "SNOMEDCT_US",
  "code": "718718009",
  "score": 0.0,
  "disambiguated": false,
  "cui": "C3159311",
  "tui": "T047",
  "preferredText": "BORNHOLM EYE DISEASE"
}],
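The case-sensitive / case-insensitive split being exercised here can be
sketched as follows (the function name is hypothetical, not the actual
cTAKES implementation):

```python
# Sketch: a term is suppressed if it appears verbatim in the
# case-sensitive list, or in the case-insensitive list after lowercasing.
# List contents mirror the 'bed' vs 'BED' example from the thread.
def suppressed(term, case_sensitive, case_insensitive):
    """True if this surface form should be dropped by the blacklist."""
    return term in case_sensitive or term.lower() in case_insensitive

CASE_SENSITIVE = {"bed"}   # lowercase 'bed' is never a disorder
CASE_INSENSITIVE = set()   # nothing here, so 'BED' survives lookup
```

With these lists, "bed" is suppressed while "BED" passes through and can
still resolve to Bornholm eye disease, matching the behavior shown above.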



The 2020 UMLS dictionary and our default SNO_RX

2020-08-05 Thread Peter Abramowitsch
Hi All

I've been setting up a custom dictionary using UMLS, with the goal of simply
adding a comprehensive genetic vocabulary (HGNC) to the latest UMLS SNOMED
and RXNORM vocabularies, in the hope of getting somewhere close to the
cTakes default dictionary again.

However, there are changes to concept vocabularies in UMLS 2020AA that
affect the ability of cTakes to work well with older notes, and possibly
with the note-writing practices of older physicians and labs.  Some tried
and true acronyms such as WBC for leukocytes, RBC, and EOS (eosinophil
count) are no longer part of SNOMED.  Probably this is because the
components of these parameters are now broken out into more granular
types.  Another possibility is that a few of these acronyms now overlap
the names of genes; EOS is one of them.  This is just speculation.

In order to have these common parameters re-included via their common lab
acronyms, it is necessary to add another common US vocabulary such as
HL7-V3.0 or NCI_CDISC.  Of course one can remap back into SNOMED by adding
insert statements into the dictionary script, but it might be a
non-scalable exercise.

So my point here is that if, one day, we plan to create a new cTakes
release, and with it a new UMLS lookup, we may need to consider adding a
third basic vocabulary to our current set of two.

Thoughts?
Peter
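The manual remapping mentioned above - adding insert statements into the
dictionary script - can at least be generated programmatically.  A minimal
sketch, assuming the column order (CUI, rare-word index, token count, text,
rare word) inferred from the grep output elsewhere in this thread; the
example values are illustrative, not verified mappings:

```python
# Sketch: emit a CUI_TERMS insert line for a synonym being re-added by
# hand.  Column order is an assumption based on the grep output in this
# thread: (CUI, rare-word index, token count, text, rare word).
def cui_terms_insert(cui, rare_word_index, token_count, text, rare_word):
    """Render one CUI_TERMS INSERT row in the dict .script format."""
    return "INSERT INTO CUI_TERMS VALUES(%d,%d,%d,'%s','%s')" % (
        cui, rare_word_index, token_count, text, rare_word)
```

Each re-added acronym would still need its CUI, token count, and rare-word
index checked against the release; as noted, doing this per-acronym does
not scale.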


Re: The 2020 UMLS dictionary and our default SNO_RX

2020-08-05 Thread Peter Abramowitsch
Hi Jeff

I thought I did load them all, but I'll go back and check.

When looking at my gene issue, the result is that the lookup arbitrarily
(seemingly, anyway) flips between one vocabulary and another when there are
overlaps between vocabularies.  I.e., I see that vocabs A and B both contain
geneX and geneY.  Neither of these is in SNOMED.  So in my output, I get one
of the genes associated with vocab A and the other with vocab B.  When I
remove vocab B, then obviously both are associated with vocab A - which is
what I wanted.

If, for you, WBC is showing up as an anatomical location rather than a
T059, then it's probably not getting the correct SNOMED code either.
Wouldn't that be a problem for your researchers?

Peter

On Wed, Aug 5, 2020 at 5:37 PM Jeffrey Miller  wrote:

> Hi Peter,
>
> If I create a dictionary using UMLS 2020aa with just snomed and rxnorm my
> cTAKES dictionary still seems to have a CUI associated with the string
> 'wbc' that links to the snomed term for Leukocyte (Cell). It is not mapping
> to a lab result TUI, but rather an anatomical site, but it seems to be the
> same CUI that 'wbc' resolves to in sno_rx_16ab. Maybe HGNC is conflicting
> with that too?
>
> Just to double check, when you installed UMLS through Metamorphosys, did
> you install all of the available vocabularies?
>
> Jeff


Re: The 2020 UMLS dictionary and our default SNO_RX

2020-08-06 Thread Peter Abramowitsch
Hi Jeff

You are absolutely right: when I use sno_rx with the term WBC in a simple
context, it is not showing up as a T059.  I was surprised about that.

I was wrong about the term I was looking at.  Here's the scenario that did
change:

Text context
afebrile, but has elevated WBC count;

*Using sno_rx*
canonical text:  White blood cell count increased (lab result)
CUI: C0750426,
location:  Leukocytes,
location_snomed: 52501007
range_text:  elevated WBC count,
vocab_term: 414478003,
vocab_type: SNOMEDCT_US
...other params.

*Using new dict based on 2020AA*
Missing:

Reason:
*grep elevated newdict_750426*
INSERT INTO CUI_TERMS VALUES(750426,0,4,'elevated white blood
count','elevated')
INSERT INTO CUI_TERMS VALUES(750426,0,5,'elevated white blood cell
count','elevated')
*grep elevated olddict_750426*
INSERT INTO CUI_TERMS VALUES(750426,0,4,'elevated white blood
count','elevated')
INSERT INTO CUI_TERMS VALUES(750426,1,3,'elevated wbc count','wbc')
<--  missing
INSERT INTO CUI_TERMS VALUES(750426,0,5,'elevated white blood cell
count','elevated')

So back to your recommendation on using MMSYS

You chose the ACTIVE_SUBSETS option - right?
And on the Sources to Exclude/Include page, do you deselect all sources to
exclude?
Have you tweaked the precedence of subsets or do you leave the default
order alone?

Many thanks,
Peter
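The comparison being done by hand with grep above can be scripted across a
whole CUI.  A sketch, assuming the CUI_TERMS layout shown in the grep
output:

```python
# Sketch: compare the synonym rows two dict .script files hold for one
# CUI, to spot entries (like 'elevated wbc count') a rebuild dropped.
# The CUI_TERMS layout is taken from the grep output above.
import re

ROW = re.compile(r"INSERT INTO CUI_TERMS VALUES\((\d+),\d+,\d+,'([^']*)'")

def synonyms_for(script_text, cui):
    """All synonym strings a script file maps to the given CUI."""
    return {m.group(2) for m in ROW.finditer(script_text)
            if int(m.group(1)) == cui}

def dropped_synonyms(old_text, new_text, cui):
    """Synonyms present in the old dictionary but missing from the new one."""
    return synonyms_for(old_text, cui) - synonyms_for(new_text, cui)
```

Running this over every CUI of interest gives a quick inventory of which
surface forms disappeared between dictionary builds.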

On Thu, Aug 6, 2020 at 8:11 AM Jeffrey Miller  wrote:

> Peter,
>
> I have experienced similar issues with how text spans translate to
> different CUIs depending on the included vocabularies as well. I had a
> similar conversation with Sean on the dev forum last year I believe.
>
> I do not believe the behavior of 'wbc' has changed- if I run the clinical
> pipeline with sno_rx_16ab dictionary, it is tagged as an
> AnatomicalSiteMention. Are you seeing something different?
>
> Jeff

Re: The 2020 UMLS dictionary and our default SNO_RX

2020-08-07 Thread Peter Abramowitsch
Hi Jeff

Many thanks for all your suggestions.

Things have settled down now.  The blacklist feature has been very useful
for suppressing "false" acronym detection, and I will add back a few
synonyms to the dict script that have gone away.  I also added some
post-processing code (that might be useful for others?): when a range maps
to two or more concepts in different semantic domains, I set the confidence
level of each to 0.5 - like the gene CAD and the acronym CAD, for example.

Peter
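The post-processing described above can be sketched roughly as follows; the
annotation structure here is a stand-in, not the actual cTAKES type system:

```python
# Sketch: when one text span maps to concepts in two or more semantic
# domains (e.g. gene CAD vs. disorder CAD), mark every mapping for that
# span as ambiguous by setting its confidence to 0.5.
from collections import defaultdict

def downweight_ambiguous(annotations):
    """annotations: dicts with 'begin', 'end', 'domain', 'confidence'."""
    by_span = defaultdict(list)
    for a in annotations:
        by_span[(a["begin"], a["end"])].append(a)   # group by text range
    for group in by_span.values():
        if len({a["domain"] for a in group}) > 1:   # cross-domain clash
            for a in group:
                a["confidence"] = 0.5
    return annotations
```

Downstream consumers can then treat confidence 0.5 as "needs word-sense
review" instead of silently picking one reading.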

On Fri, Aug 7, 2020 at 6:29 AM Jeffrey Miller  wrote:

> Hi Peter,
>
> Yes, I've chosen active subsets then I think I actually choose the select
> sources to exclude option, but I don't believe that should matter. I leave
> the precedence defaults alone.
>
> Jeff
Need a little more help on dictionaries

2020-08-13 Thread Peter Abramowitsch
Hi All

I'm able to create a subset with the UMLS mmsys tool, use the dictionary
creator on the full UMLS release, and create, install, and tweak the
scripts, adding or removing aliases, etc.  My goal is simply to add HUGO
gene terms to SNOMED and RXNORM.

However I must be missing some bit of information on the use of mmsys or
the dictionary creator, because some very common terms are missing from my
dictionary but present in the released sno_rx

As an example, take the acronym SOB.
In mmsys, the term SOB is present in my subset, and it is mapped into
SNOMED with the expected CUI 13404 and the same SNOMED IDs as sno_rx.
I see the cui_tui mapping it into the correct TUI for a finding:  INSERT
INTO TUI VALUES(13404,184)
I see the CUI and the preferred term "dyspnea" in my *script file, and I
can resolve it in a note using the default consumer, obtaining the
correct SNOMED ID.
I see lots of cui_term entries for the same CUI, and I can resolve them
too.  But SOB is not present in my cui terms.
How did it get into sno_rx?

So either I am not using one of the tools correctly, or, in creating
SNO_RX, someone added SOB by hand rather than using the creator.  And
if they did, they have probably also made other tweaks.

Sean, Gandhi, or Jeff -
Can you explain this?

Peter
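One way to check what mmsys actually emitted is to scan MRCONSO.RRF
directly for the sources carrying a given string.  A sketch, using the
documented MRCONSO column order (SAB is the 12th field, STR the 15th) -
verify the positions against your release:

```python
# Sketch: list the source vocabularies (SAB) whose rows in MRCONSO.RRF
# carry a given string, e.g. to check whether "SOB" has a SNOMEDCT_US
# entry at all or only appears via other vocabularies.  Field positions
# follow the documented MRCONSO column order; the sample rows in the
# test below are fabricated for illustration.
def sources_for_string(lines, target):
    """Set of SAB values whose STR matches target (case-insensitive)."""
    sabs = set()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > 14 and fields[14].lower() == target.lower():
            sabs.add(fields[11])  # SAB: source vocabulary abbreviation
    return sabs
```

If SNOMEDCT_US is absent from the result for "SOB", the term could only
have entered sno_rx through another vocabulary or by hand.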


Re: Need a little more help on dictionaries

2020-08-14 Thread Peter Abramowitsch
Hi Gandhi.  Yes, I added SNOMED, RXNORM, and HUGO (HGNC), so yes, I have
thousands of SNOMED associations in my script file.  But some, like my
example, aren't there.  The missing ones tend to be the short acronyms
rather than the template phrases.  But they're present in mmsys with the
SNOMED mapping.

I tried to follow Jeff's suggestion of adding more likely vocabularies to
these three, but then I got inconsistent results.  Some terms that had
shown up with SNOMED codes started being reported in other vocabs.

It may or may not be connected, but could you explain the function/behavior
of the source/destination checkboxes on the dictionary creator?

Peter

On Fri, Aug 14, 2020, 6:45 AM gandhi rajan  wrote:

> Hi Peter,
>
> However I must be missing some bit of information on the use of mmsys or
> the dictionary creator, because some very common terms are missing from my
> dictionary but present in the released sno_rx
>
> >>> Do you mean the entries are missing in your database? When I tried the
> latest UMLS installation I dont see snomed dictionary terms getting added
> by default. Did you selected snomed dictionary in dictionary GUI or it
> showed up in GUI?
>
>
> --
> Regards,
> Gandhi
>
> "The best way to find urself is to lose urself in the service of others
> !!!"
>


Re: Need a little more help on dictionaries [EXTERNAL]

2020-08-14 Thread Peter Abramowitsch
Thanks Sean.  Interesting.  I will have a look. Then have a look in the
creator code.


  I should work through emails from older to newer.  Just responded to
Gandhi.

On Fri, Aug 14, 2020, 4:53 AM Finan, Sean 
wrote:

> Hi Peter,
>
> I don't have an answer but I do have a question:
>
> In your mrconso.rrf, do you see a snomed line item for "SOB" or only "SOB
> -Shortness of breath" ?
>
> I think that the simple "SOB" and "sob" entries might be from other
> vocabularies.
>
> There is (was?) logic in the dictionary creator to multiply things like
> "SOB - Shortness of breath", "SOB (Shortness of breath)"  etc. and create 3
> synonym entries: full, left and right.  There is a requirement that the
> left side be all caps and a fitting acronym for the right side.  However, I
> vacillated on the correctness of this behavior as almost all terms already
> had the 3 entries.  I am not sure what the current version of the creator
> does.
>
> Dictionary creation is indeed a touchy operation.
>
> Sean
> 
> From: Peter Abramowitsch 
> Sent: Thursday, August 13, 2020 11:57 PM
> To: dev@ctakes.apache.org
> Subject: Need a little more help on dictionaries [EXTERNAL]
>


Re: Need a little more help on dictionaries [EXTERNAL]

2020-08-14 Thread Peter Abramowitsch
Thanks Sean  ... now I'm going to jog your memory:

I quickly went through the dictionary code.  You were right.  There was a
class AutoTermExtractor in org.apache.ctakes.gui.dictionary.umls which
looks like it did what you said.  But all of it is all commented out.

Then there's another bit of code with a function extractAbbreviations() in
UmlsTermUtil, and this one relies on externalized files including this
one:  default/RightAbbreviations.txt.  And this file contains (SOB), one of
the abbreviations I was looking for.

Now this file seems to exist in multiple versions

cogitext:trunk-java8 peterabramowitsch$ find . -name "RightAbbreviations.txt" -exec wc -l {} \;
1178 ./ctakes-gui-res/target/classes/org/apache/ctakes/gui/dictionary/data/default/RightAbbreviations.txt
   0 ./ctakes-gui-res/target/classes/org/apache/ctakes/gui/dictionary/data/small/RightAbbreviations.txt
   8 ./ctakes-gui-res/target/classes/org/apache/ctakes/gui/dictionary/data/tim/RightAbbreviations.txt
   0 ./ctakes-gui-res/target/classes/org/apache/ctakes/gui/dictionary/data/tiny/RightAbbreviations.txt

Does this jog your memory enough to fill in the history and tell me what I
need to do?

Peter




Re: Need a little more help on dictionaries [EXTERNAL]

2020-08-14 Thread Peter Abramowitsch
Hi Sean

I think I found the answer, and I have one question.

In the dictionary creator, the hardwired dir is "tiny", which in fact has an
empty file for those abbreviations.

In DictionaryBuilder.java:

static private final String DEFAULT_DATA_DIR =
"org/apache/ctakes/gui/dictionary/data/tiny";
...
final UmlsTermUtil umlsTermUtil = new UmlsTermUtil( DEFAULT_DATA_DIR );

The command line args are not used in this application, nor are sysprops or
environment vars, so there's no way to change it short of recompiling.

So the question is:  do you know why the empty version is the default?

Peter





Re: Need a little more help on dictionaries [EXTERNAL]

2020-08-14 Thread Peter Abramowitsch
Hurray!
Finally an explanation that makes sense.  I just couldn't figure out how
you could have made sno_rx with that dictionary creator.   Clearly, those
helper files represent a LOT of work.

I have locally modified the dictionary creator code to look for the system
property ctakes.dictgui_helperdata as a way to point it to another of those
directories.  I don't have check-in privileges so will keep it private for
now.

Many thanks for your help.

Peter

On Fri, Aug 14, 2020 at 9:51 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Peter,
>
> shining a flashlight back into the dark ages ...
>
> You have found the advanced configuration directories!
>
> Those actually precede the gui dictionary creator and were a big part of
> formatting with the previous cli dictionary creator.  The cli was versatile
> but not simple.  The default collection of configuration files for the cli
> had a lot more going on.
>
> I think that I made "tiny/" directory the default for the gui because it
> didn't do as much manipulation and I wanted things to be a greater 1:1
> match with the source.
>
> I obviously used something other than the simple "tiny/" configuration
> when I made sno_rx_16ab.   I remember running repeated tests on some
> corpora as well as manually inspecting the produced databases.
>
> I can't believe that I had forgotten all of this.
>
> You should be able to mix and match files from the different configuration
> directories and just throw them into your own directory (or tiny/) then
> point DEFAULT_.. to your directory and recompile.
>
>
> Sean
>
> 
> From: Peter Abramowitsch 
> Sent: Friday, August 14, 2020 12:22 PM
> To: dev@ctakes.apache.org
> Subject: Re: Need a little more help on dictionaries [EXTERNAL]
>
> * External Email - Caution *
>
>
> Hi Sean
>
> I think I found the answer, and I have one question.
>
> In dictionary creator, the hardwired dir is "tiny" that in fact has an
> empty file for those abbreviations
>
> In DictionaryBuilder.java:
>
> *static private final String DEFAULT_DATA_DIR =
> "org/apache/ctakes/gui/dictionary/data/tiny";*
> *...*
> *final UmlsTermUtil umlsTermUtil = new UmlsTermUtil( DEFAULT_DATA_DIR );*
>
> The command line args are not used in this application, neither are
> sysprops or environment vars so there's no way to change it short of
> recompiling.
>
> So the question is:  do you know why the empty version is the default?
>
> Peter
>
>
>
> On Fri, Aug 14, 2020 at 4:53 AM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > Hi Peter,
> >
> > I don't have an answer but I do have a question:
> >
> > In your mrconso.rrf, do you see a snomed line item for "SOB" or only "SOB
> > -Shortness of breath" ?
> >
> > I think that the simple "SOB" and "sob" entries might be from other
> > vocabularies.
> >
> > There is (was?) logic in the dictionary creator to multiply things like
> > "SOB - Shortness of breath", "SOB (Shortness of breath)"  etc. and
> create 3
> > synonym entries: full, left and right.  There is a requirement that the
> > left side be all caps and a fitting acronym for the right side.
> However, I
> > vacillated on the correctness of this behavior as almost all terms
> already
> > had the 3 entries.  I am not sure what the current version of the creator
> > does.
> >
> > Dictionary creation is indeed a touchy operation.
> >
> > Sean
> > 
> > From: Peter Abramowitsch 
> > Sent: Thursday, August 13, 2020 11:57 PM
> > To: dev@ctakes.apache.org
> > Subject: Need a little more help on dictionaries [EXTERNAL]
> >
> > * External Email - Caution *
> >
> >
> > Hi All
> >
> > I'm able to create a subset with the UMLS mmsys tool, use the dictionary
> > creator on the full UMLS release, create, install and tweak the scripts
> > adding or removing aliases etc.  My goal is simply to add HUGO gene terms
> > to SNOMED and RXNORM.
> >
> > However I must be missing some bit of information on the use of mmsys or
> > the dictionary creator, because some very common terms are missing from
> my
> > dictionary but present in the released sno_rx
> >
> > As an example, the acronym SOB
> > in mmsys, the term SOB is present in my subset, and it is mapped into
> > SNOMED with the expected CUI 13404 and SNOMEDIDs same as sno_rx
> > I see the cui_tui mapping it into the correct

Re: Need a little more help on dictionaries [EXTERNAL]

2020-08-14 Thread Peter Abramowitsch
Thanks Sean.

In no way was the comment "explanation that makes sense" about you!  I
apologize if it sounded like that.

It is so funny, because in a former company where I was architect, many
years ago,  Oacis Healthcare (which implemented one of the first HL7
databases and gateways) there was another Sean, and this one too, held the
accumulated memory and wisdom about a vital chunk of historical software.
Everyone bombarded him with questions all day long because he was the one
true source.  At the end of the day, his exhaustion was total.

My statement was rhetorical; I was wracking my brain for an explanation I had
possibly missed.

Peter

On Fri, Aug 14, 2020 at 10:27 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

>
> >Finally an explanation that makes sense.
> -- It frequently takes a while to get one of those out of me ...
>
> > I don't have check-in privileges so will keep it private for
> now.
> -- We shall have to do something about that.
>
> Cheers,
> Sean
>
> 
> From: Peter Abramowitsch 
> Sent: Friday, August 14, 2020 1:17 PM
> To: dev@ctakes.apache.org
> Subject: Re: Need a little more help on dictionaries [EXTERNAL]
>
> * External Email - Caution *
>
>
> Hurray!
> Finally an explanation that makes sense.  I just couldn't figure out how
> you could have made sno_rx with that dictionary creator.   Clearly, those
> helper files represent a LOT of work.
>
> I have locally modified the dictionary creator code to look for the system
> property ctakes.dictgui_helperdata as a way to point it to another of those
> directories.  I don't have check-in privileges so will keep it private for
> now.
>
> Many thanks for your help.
>
> Peter
>
> On Fri, Aug 14, 2020 at 9:51 AM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > Hi Peter,
> >
> > shining a flashlight back into the dark ages ...
> >
> > You have found the advanced configuration directories!
> >
> > Those actually precede the gui dictionary creator and were a big part of
> > formatting with the previous cli dictionary creator.  The cli was
> versatile
> > but not simple.  The default collection of configuration files for the
> cli
> > had a lot more going on.
> >
> > I think that I made "tiny/" directory the default for the gui because it
> > didn't do as much manipulation and I wanted things to be a greater 1:1
> > match with the source.
> >
> > I obviously used something other than the simple "tiny/" configuration
> > when I made sno_rx_16ab.   I remember running repeated tests on some
> > corpora as well as manually inspecting the produced databases.
> >
> > I can't believe that I had forgotten all of this.
> >
> > You should be able to mix and match files from the different
> configuration
> > directories and just throw them into your own directory (or tiny/) then
> > point DEFAULT_.. to your directory and recompile.
> >
> >
> > Sean
> >
> > 
> > From: Peter Abramowitsch 
> > Sent: Friday, August 14, 2020 12:22 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Need a little more help on dictionaries [EXTERNAL]
> >
> > * External Email - Caution *
> >
> >
> > Hi Sean
> >
> > I think I found the answer, and I have one question.
> >
> > In dictionary creator, the hardwired dir is "tiny" that in fact has an
> > empty file for those abbreviations
> >
> > In DictionaryBuilder.java:
> >
> > *static private final String DEFAULT_DATA_DIR =
> > "org/apache/ctakes/gui/dictionary/data/tiny";*
> > *...*
> > *final UmlsTermUtil umlsTermUtil = new UmlsTermUtil( DEFAULT_DATA_DIR );*
> >
> > The command line args are not used in this application, neither are
> > sysprops or environment vars so there's no way to change it short of
> > recompiling.
> >
> > So the question is:  do you know why the empty version is the default?
> >
> > Peter
> >
> >
> >
> > On Fri, Aug 14, 2020 at 4:53 AM Finan, Sean <
> > sean.fi...@childrens.harvard.edu> wrote:
> >
> > > Hi Peter,
> > >
> > > I don't have an answer but I do have a question:
> > >
> > > In your mrconso.rrf, do you see a snomed line item for "SOB" or only
> "SOB
> > > -Shortness of breath" ?
> > >
> > > I think that the simple "SOB" and "sob" entries might be from other
> > > vocabul

HSQLDB question

2020-08-20 Thread Peter Abramowitsch
Hi All

In tailoring a new ctakes dictionary and trying to keep the changes as
compact and easy to manage as possible, I'm clumping all modifications
together at the end of the script file.  This would include both additions
and deletions.

Inserting into CUI_TERMS is no problem, but I have at least one instance
where I'd like to delete from CUI_TERMS, rather than deleting the INSERT
statement that put the term synonym into the script file in the first
place.   However, I haven't found an SQL DELETE statement that hsqldb
likes.Here are some examples I tried

DELETE FROM CUI_TERMS WHERE CUI=1414063 AND TEXT='lad'
DELETE FROM CUI_TERMS WHERE CUI=1414063 AS NUMBER AND TEXT='lad'
DELETE FROM CUI_TERMS WHERE CUI=1414063 AS BIGINT AND TEXT='lad'
DELETE FROM CUI_TERMS WHERE CUI='1414063' AS NUMBER AND TEXT='lad'

In all cases the script reader is treating the CUI value as a string and
complaining that it can't cast it to an Integer/Number etc, which is why I
also tried explicit casting, but that didn't work either.

>>>
error in script file line: 1850042 java.lang.ClassCastException:
java.lang.String cannot be cast to java.lang.Integer
>>>

Has anyone successfully put DELETE statements in the script file?

Regards
Peter


Re: HSQLDB question

2020-08-20 Thread Peter Abramowitsch
Thanks Remy,

Yes, I was originally going to do that too, and then thought:  why not try
to bunch everything at the bottom of the script file where it's easier to
find.
I'll probably go with your suggestion and just source control the script.
I was intending to source-control the diff results if I could get the
delete syntax to work.

Peter

On Thu, Aug 20, 2020 at 1:16 PM Remy Sanouillet 
wrote:

> I ended up just running a sed script before deploying the dictionary, that
> way I could source code control all the modifications to the original file.
>
> *Rémy Sanouillet*
> NLP Engineer
> re...@foreseemed.com 
>
>
> ForeSee Medical, Inc.
> 12555 High Bluff Drive, Suite 100
> San Diego, CA 92130
>
>
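For reference, Remy's sed approach can be sketched as below. The sample INSERT rows are illustrative (the real CUI_TERMS column layout varies by dictionary version), so the deletion pattern keys only on the CUI and the quoted term text:

```shell
# Scrub unwanted synonym rows from the generated HSQLDB .script file
# before deploying, instead of issuing DELETE statements that the
# script reader cannot parse.
# These two rows are made up for the demo; real CUI_TERMS columns differ.
printf "INSERT INTO CUI_TERMS VALUES(1414063,0,1,'lad','lad')\n"  > original_dict.script
printf "INSERT INTO CUI_TERMS VALUES(13404,0,1,'sob','sob')\n"   >> original_dict.script

# Drop every CUI_TERMS insert that mentions both CUI 1414063 and 'lad'
sed "/^INSERT INTO CUI_TERMS.*1414063.*'lad'/d" original_dict.script > scrubbed_dict.script

cat scrubbed_dict.script   # only the 13404 ('sob') row remains
```

The sed command file (rather than the scrubbed output) is what goes under source control, so the same modifications can be replayed against any regenerated script file.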


Question about window size in term lookup

2020-08-24 Thread Peter Abramowitsch
Hello all

Is there a mechanism, a lookup file, etc. which overrides the window size
set on the term annotator or the chunker?   Changing the window size from
the default of 3 to 2 opens the floodgate to false acronym annotations.  So
my question is whether there's a place where one can register specific two
character terms, for example BP or PT which will be found even with a
window size set to three.

A similar question about Genes.   On adding the HGNC vocabulary I notice
that there are many thousands of aliases for genes which overlap other
common acronyms and english words such as trip, spring, plan, bed, yes,
rip, prn etc.   I'm not sure if these aliases are ever used.   So I created
a sed script with 4000 regex expressions to remove the 2 and 3 letter gene
synonyms from a script file.  I will only suppress the 4 letter synonyms
manually where they cause trouble. But does anyone have a  more elegant
solution?

Peter


Re: Question about window size in term lookup [EXTERNAL]

2020-08-25 Thread Peter Abramowitsch
Thanks Sean.  A lot of good ideas.  I hadn't even been thinking of
post-filtering, but that's a very viable approach. Something like using
tweezers to remove a splinter instead of removing them from all the pieces
of wood you might encounter.   I like how you use the functor approach on
the filters.

Yesterday I tried another method too.   Join all the 2&3 character gene
terms with the 10,000 most common english words - then take the resulting
list and use it to create a deletion list in the dictionary creation step.
It reduced the number of items to remove by an order of magnitude.   ~4000
down to ~400
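That join can be sketched with standard tools. The sketch below assumes one lowercase term per line in each input file; the file names and sample terms are illustrative:

```shell
# Build a deletion list: short gene synonyms that collide with common
# English words.  Inputs hold one lowercase term per line (illustrative).
printf "trip\nspring\nbrca1\nrip\n" | sort -u > gene_terms.txt
printf "trip\nspring\nplan\nbed\nrip\n" | sort -u > common_words.txt

# comm -12 keeps only the lines present in both sorted inputs
comm -12 gene_terms.txt common_words.txt > deletion_list.txt
cat deletion_list.txt   # rip, spring, trip
```

The resulting deletion_list.txt can then drive the deletion step of dictionary creation, so only the genuinely ambiguous synonyms are removed.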

Deleting it in the dictionary is more painful up front, but more performant
than post filtering, for two obvious reasons,  but using your approach and
checking if the # of gene references is > 0, one can choose to filter only
specific notes and that would increase performance again.  Unfortunately
performance is a big factor in our project.

From your response and Kean's I'm inferring that there's no way to set the
window size to N and have an exception list of a few items that are of
length < N.  Right?  If there were, it would be in the chunker, not the
term lookup.

Thanks again for your suggestions!

Peter

On Tue, Aug 25, 2020 at 5:50 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> I think that Kean is correct.  I usually create an annotator that removes
> terms that I don't want.  It is usually fairly easy.
>
>   final Predicate<IdentifiedAnnotation> is2char
> = a -> a.getCoveredText().length() == 2;
>
>   final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();
>
>   OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
>  .stream()
>  .filter( is2char )
>  .forEach( Annotation::removeFromIndexes );
>
>
> Or, if you want to grab a few that aren't specifically "Gene" but are in
> the same semantic group (without looking it up in class SemanticGroup), and
> in the HGNC vocabulary :
>
>   final Class<? extends IdentifiedAnnotation> geneClass
> = SemanticTui.getTui( "Gene or Genome" )
>  .getGroup()
>  .getCtakesClass();
>
>   final Predicate<IdentifiedAnnotation> isHgnc
> = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey(
> "hgnc" );
>
>   JCasUtil.select( jCas, geneClass )
>   .stream()
>   .filter( is2char )
>   .filter( isHgnc )
>   .forEach( Annotation::removeFromIndexes );
>
>
> "hgnc" may need to be "HGNC" ... and will only exist if you stored the
> HGNC codes in your dictionary.
>
>
> Or you can do it focusing on what you do want.
>
>   final Collection<SemanticGroup> WANTED_GROUP = EnumSet.of(
> SemanticGroup.DRUG, SemanticGroup.LAB );
>
>   final Predicate<IdentifiedAnnotation> isTrashGroup
> = a -> SemanticGroup.getGroups( a )
> .stream()
> .noneMatch( WANTED_GROUP::contains );
>
>   JCasUtil.select( jCas, IdentifiedAnnotation.class )
>   .stream()
>   .filter( is2char )
>   .filter( isTrashGroup )
>   .forEach( Annotation::removeFromIndexes );
>
> Or if you want to cover all combinations that aren't all uppercase:
>
>   final Predicate<IdentifiedAnnotation> notCaps
> = a -> a.getCoveredText()
> .chars()
> .anyMatch( Character::isLowerCase );
>
>   JCasUtil.select( jCas, IdentifiedAnnotation.class )
>   .stream()
>   .filter( is2char )
>   .filter( notCaps )
>   .forEach( Annotation::removeFromIndexes );
>
> Or mix and modify.  For instance, ignore character length but require Tui =
> Gene and the text is not all caps.
>
> Sometimes I enjoy mocking up code ...
>
> Sean
>
> 
> From: Kean Kaufmann 
> Sent: Monday, August 24, 2020 9:35 PM
> To: dev@ctakes.apache.org
> Subject: Re: Question about window size in term lookup [EXTERNAL]
>
> * External Email - Caution *
>
>
> >
> > my question is whether there's a place where one can register specific
> two
> > character terms, for example BP or PT which will be found even with a
> > window size set to three.
>
>
> My brute-force approach is pretty brutal: Change the window size to two,
> annotate terms, then remove all two-letter annotations except the very few
> I'm interested in.
>
> On Mon, Aug 24, 2020 at 9:07 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hello all
>

Re: Question about window size in term lookup [EXTERNAL]

2020-08-25 Thread Peter Abramowitsch
>>> -- What was your english dictionary source?  I suppose that there could
be some blacklisting in a dictionary creator.

I found these two

https://raw.githubusercontent.com/first20hours/google-1-english/master/google-1-english-usa.txt
https://www.mit.edu/~ecprice/wordlist.1

But these were probably derived from internet usage.   I was surprised by
some of the words that showed up

P.

On Tue, Aug 25, 2020 at 9:09 AM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Peter,
>
> >I'm inferring that there's no way to set the
> window size to N and have an exception list of a few items that are of
> length < N.
> -- As far as I can recall there isn't any such method in the lookup.
>
> > Join all the 2&3 character gene
> terms with the 10,000 most common english words
> -- I have seen this done elsewhere, and can't remember if anybody tested
> precision gained vs. recall lost.  It would be highly related to
> note/specialty type.
> -- What was your english dictionary source?  I suppose that there could be
> some blacklisting in a dictionary creator.
>
> >It reduced the number of items to remove by an order of magnitude.   ~4000
> down to ~400
> -- Very nice.
>
> >performance is a big factor in our project.
> -- Yup.
>
>
> If only the dictionary lookup differentiated between all-caps words and
> lower or mixed case ...
>
> Thanks for sharing your ideas,
> Sean
>
>
>
> 
> From: Peter Abramowitsch 
> Sent: Tuesday, August 25, 2020 11:56 AM
> To: dev@ctakes.apache.org
> Subject: Re: Question about window size in term lookup [EXTERNAL]
>
> * External Email - Caution *
>
>
> Thanks Sean.  A lot of good ideas.  I hadn't even been thinking of
> post-filtering, but that's a very viable approach. Something like using
> tweezers to remove a splinter instead of removing them from all the pieces
> of wood you might encounter.   I like how you use the functor approach on
> the filters.
>
> Yesterday I tried another method too.   Join all the 2&3 character gene
> terms with the 10,000 most common english words - then take the resulting
> list and use it to create a deletion list in the dictionary creation step.
> It reduced the number of items to remove by an order of magnitude.   ~4000
> down to ~400
>
> Deleting it in the dictionary is more painful up front, but more performant
> than post filtering, for two obvious reasons,  but using your approach and
> checking if the # of gene references is > 0, one can choose to filter only
> specific notes and that would increase performance again.  Unfortunately
> performance is a big factor in our project.
>
> From your response and Kean's I'm inferring that there's no way to set the
> window size to N and have an exception list of a few items that are of
> length < N.  Right?  If there were, it would be in the chunker, not the
> term lookup.
>
> Thanks again for your suggestions!
>
> Peter
>
> On Tue, Aug 25, 2020 at 5:50 AM Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > I think that Kean is correct.  I usually create an annotator that removes
> > terms that I don't want.  It is usually fairly easy.
> >
> >   final Predicate<IdentifiedAnnotation> is2char
> > = a -> a.getCoveredText().length() == 2;
> >
> >   final String geneTui = SemanticTui.getTui( "Gene or Genome"
> ).name();
> >
> >   OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
> >  .stream()
> >  .filter( is2char )
> >  .forEach( Annotation::removeFromIndexes );
> >
> >
> > Or, if you want to grab a few that aren't specifically "Gene" but are in
> > the same semantic group (without looking it up in class SemanticGroup),
> and
> > in the HGNC vocabulary :
> >
> >   final Class<? extends IdentifiedAnnotation> geneClass
> > = SemanticTui.getTui( "Gene or Genome" )
> >  .getGroup()
> >  .getCtakesClass();
> >
> >   final Predicate<IdentifiedAnnotation> isHgnc
> > = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey(
> > "hgnc" );
> >
> >   JCasUtil.select( jCas, geneClass )
> >   .stream()
> >   .filter( is2char )
> >   .filter( isHgnc )
> >   .forEach( Annotation::removeFromIndexes );
> >
> >
> > "hgnc" may need to be "HGNC" ... and will only exist if you stored the
> > HGNC codes in yo

Short gene term collisions

2020-08-25 Thread Peter Abramowitsch
As a thank you for your suggestions, here's a little file that may help.

It's a command file for sed that will remove all short HGNC gene synonyms
that collide with common English words of 2, 3, or 4 characters in
length.   You will only need it if you've included HGNC in your
vocabularies and the Gene & Receptor TUIs in your dictionary.

The common words list is a bit weird, containing some contemporary acronyms
that are not, strictly speaking, words.  But feel free to improve it.

https://raw.githubusercontent.com/first20hours/google-1-english/master/google-1-english-usa.txt

sed -f deletion_short_gene_terms_script < original_dict.script > scrubbed_dict.script
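For anyone who wants to regenerate a deletion script like this for their own vocabulary, the core logic can be sketched in a few lines.  This is a hypothetical illustration (class and method names are mine), assuming each dictionary synonym appears as a quoted term inside an INSERT INTO CUI_TERMS line of the .script file:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ShortGeneTermScrubber {

   /**
    * Builds sed delete commands for every short gene synonym that collides
    * with a common English word.  Each command deletes any dictionary line
    * containing the quoted term, e.g.  /'was'/d
    */
   static List<String> buildSedCommands( final Set<String> geneSynonyms,
                                         final Set<String> commonWords ) {
      return geneSynonyms.stream()
                         // only 2-4 character synonyms are collision risks
                         .filter( s -> s.length() >= 2 && s.length() <= 4 )
                         .map( String::toLowerCase )
                         // keep only synonyms that are also common words
                         .filter( commonWords::contains )
                         .sorted()
                         // one sed delete command per colliding term
                         .map( term -> "/'" + term + "'/d" )
                         .collect( Collectors.toList() );
   }
}
```

The resulting lines are written to a file and fed to sed with -f, exactly as in the command above.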

Peter


deletion_short_gene_terms_script.gz
Description: GNU Zip compressed data


I think I found a bug.

2020-08-30 Thread Peter Abramowitsch
Hi,
I was getting a StringIndexOutOfBoundsException in
DependencyUtil.doesSubsume(annot1, annot2)  with exactly this situation:

*negex annotator*
*the text begins  "negative for "*

If the chunk *negative for xyz *is preceded by anything else, even a space,
the problem goes away.  It also goes away when you choose another style of
negation.   "no headache", for instance

I've traced the problem back to some illegal entries in the JCas.  You can
see from the image below that the ContextAnnotation's begin offset is
illegal.

Clearly there's an off-by-one error, and it triggered the exception
because in my example the Annotation is created right from the 0th char of
my note text.  But it occurred to me that in every other case, where the
annotation doesn't begin on the first character and no exception is
thrown, it might cause downstream methods like doesSubsume to give the
wrong result because the begin/end offsets are wrong.
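The exception itself is easy to reproduce in isolation.  A minimal sketch (not cTAKES code) of what getCoveredText effectively does — substring the document text between begin and end — shows why a begin of -1 blows up:

```java
public class NegativeOffsetDemo {

   // Hypothetical stand-in for Annotation.getCoveredText():
   // it substrings the document text between the begin and end offsets.
   static String coveredText( final String docText, final int begin, final int end ) {
      return docText.substring( begin, end );
   }

   public static void main( final String... args ) {
      final String text = "Negative for headache";
      try {
         // the illegal offsets observed in the JCas
         coveredText( text, -1, 13 );
      } catch ( StringIndexOutOfBoundsException e ) {
         // this is the crash seen downstream in doesSubsume
         System.out.println( "StringIndexOutOfBoundsException" );
      }
   }
}
```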

I'm not sure how to follow this up.  But if anyone wants to tackle it?

This is from HistoryAttributeClassifier beginning at line 274

[image: image.png]


Re: I think I found a bug. [EXTERNAL]

2020-08-31 Thread Peter Abramowitsch
Thanks Jeff,  I don't think the image is needed.  Here's what it showed.

With the negex annotator in the pipeline

With "Negative for headache"  as the text starting at position 0
In HistoryAttributeClassifier beginning near line 274
the first IdentifiedAnnotation in the

*List<IdentifiedAnnotation> lsmentions*

contains a ContextAnnotation where the offset range is   -1, 13.
Looking at the text, it should probably have been 0, 11.

Add any text ahead of the "Negative for" and it works brilliantly.
Probably one of those  off-by-one errors  that comes from staying up too
late.

Peter





On Mon, Aug 31, 2020 at 3:48 AM Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> Peter,
> I think the email server doesn't let images through. Can you post an
> imgur link maybe?
> Tim
>
> On Sun, 2020-08-30 at 14:35 -0700, Peter Abramowitsch wrote:
> > * External Email - Caution *
> >
> > Hi,
> > I was getting a StringIndexOutOfBoundsException in
> > DependencyUtil.doesSubsume(annot1, annot2)  with exactly this
> > situation:
> >
> > negex annotator
> > the text begins  "negative for "
> >
> > If the chunk negative for xyz is preceded by anything else, even a
> > space, the problem goes away.  It also goes away when you choose
> > another style of negation.   "no headache", for instance
> >
> > I've traced the problem back to some illegal entries in the jCAS  You
> > can see from the image below that the ContextAnnotation's begin
> > offset is illegal.
> >
> > Clearly there's an off-by-one error and this triggered the exception
> > because in my example, the Annotation is created right from the 0th
> > char of my note text.  But it occurred to me that in every other
> > case, where the annotation doesn't begin on the first character and
> > it doesn't throw an exception, it might cause  downstream methods
> > like doesSubsume to give the wrong result because the begin/end
> > offsets are wrong.
> >
> > I'm not sure how to follow this up.  But if anyone wants to tackle
> > it?
> >
> > This is from HistoryAttributeClassifier beginning at line 274
> >
> >
> >
> >
> >
>


Re: Building a new custom dictionary or Updating/Adding values to the existing dictionary in cTAKES [EXTERNAL]

2020-09-15 Thread Peter Abramowitsch
Sean, this conversation raises a question that I've had for a while.
Does the term-finding mechanism actually use a treebank to find the POS, or
does it use another, less rigorous approach?  If it were rigorous,
wouldn't it be able to tag a pure number as an NN in the role of object if
it played the corresponding role in the sentence?

I've not had the same problem as Ayyub, but I have been wondering why one
needed to disable the identification of "cm" as a gene acronym in
situations where "cm" is clearly part of a unit of measure and would show
up as an entity's modifier in a treebank.

Does the question make sense?

Peter

On Tue, Sep 15, 2020, 9:02 AM Finan, Sean 
wrote:

> I should mention that going the Paragraph route would only impact term
> lookup.
> 
> From: abad.ay...@cognizant.com 
> Sent: Tuesday, September 15, 2020 11:54 AM
> To: dev@ctakes.apache.org
> Subject: RE: Building a new custom dictionary or Updating/Adding values to
> the existing dictionary in cTAKES [EXTERNAL]
>
> * External Email - Caution *
>
>
> Thank you Sean for the response. We shall definitely try that way. I have
> one question on the "f84.1" problem, since we have now developed a lot of
> features based on the output from cTAKES, is the impact of changing the
> sentenceDetectorAnnotator going to be huge?
>
> Thanks & Regards
>
> Abad Ayyub
> Vnet: 406170 | Cell : +91-9447379028
>
>
>
> -Original Message-
> From: Finan, Sean 
> Sent: Tuesday, September 15, 2020 9:06 PM
> To: dev@ctakes.apache.org
> Subject: Re: Building a new custom dictionary or Updating/Adding values to
> the existing dictionary in cTAKES [EXTERNAL]
>
> [External]
>
>
> Hi Abad,
>
> The first thing that I would try for the "97112" problem is changing the
> parts of speech that are ignored for lookup.  Right now a pure number is
> ignored - it is not a word.  So, similar to what I said in my previous
> email, change the dictionary lookup parameter exclusionTags.  But to make
> sure that you get everything, you can first try no exclusions:
> set exclusionTags=""
>
> My guess with the F84.1 problem is that your sentence splitter is
> splitting "F84.1" but not splitting "F84 . 1".
>
> I think that the best way to start debugging is adding the
> PrettyTextWriter to the end of the piper and looking at its output (see my
> previous email).   It will print each sentence on a line and indicate the
> part of speech for each token.  If you can quickly and easily see what the
> system is doing then you might start to understand what needs to be changed
> to fit your data.
>
> Sean
> 
> From: abad.ay...@cognizant.com 
> Sent: Tuesday, September 15, 2020 11:15 AM
> To: dev@ctakes.apache.org
> Subject: RE: Building a new custom dictionary or Updating/Adding values to
> the existing dictionary in cTAKES [EXTERNAL]
>
> * External Email - Caution *
>
>
> Thank you Sean for the detailed response.  I think there was
> miscommunication from our end with the requirement. Your solution of adding
> spaces between the entries worked but it required the input  text also to
> have the spaces. If the text comes in as 'F84.1' cTAKES didn't reckon the
> token but if the text came as 'F84 . 1' then cTAKES was recognizing the
> tokens for the below INSERT scripts.
>
> INSERT INTO CUI_TERMS VALUES(4352,0,3, ‘F84 . 1’,’F84’)
>
> But we encountered a similar issue when we configured an INSERT entry as
> below for CPT codes,
>
> INSERT INTO CUI_TERMS VALUES(41154,0,1, ‘97112’,’97112’)
>
> Where 97112 is a CPT code(which usually doesn’t have decimals or '.'). We
> expected cTAKES to recognize the CPT code '97112' as a separate token but
> it didn't. Could you pls. advise us on why this issue came up.
>
> Is there something wrong in the configuration. Do we need to have
> something additional for cTAKES to recognize the code alone as a separate
> token Is there any other way in which we can try to get the respective
> ICD/CPT code of the identified annotation from cTAKES, like querying the
> CPT/ICD table using the fetched CUI? Kindly advise.
>
>
> Thanks & Regards
>
> Abad Ayyub
> Vnet: 406170 | Cell : +91-9447379028
>
>
>
> -Original Message-
> From: Finan, Sean 
> Sent: Monday, September 14, 2020 9:35 PM
> To: dev@ctakes.apache.org
> Subject: Re: Building a new custom dictionary or Updating/Adding values to
> the existing dictionary in cTAKES [EXTERNAL]
>
> [External]
>
>
> Hi Abad,
>
>
> I think that you need to make only one minor change.
>
>
> ctakes uses "tokens" for identification and not the actual text.
> Tokenization turns text such as "F84.1" into "F84 . 1"  The first token
> being F84, followed by a token encompassing '.' and another with '1'.  The
> manner in which this is indicated in the .script file is by adding a space
> between each token.  This makes the full entry:
>
>
> INSERT INTO CUI_TERMS VALUES(4352,0,3, ‘F84 . 1’,’F84’)
>
>
> Notice that the token length 
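The splitting Sean describes can be sketched with a simple regex-based tokenizer.  This is an illustration of the behavior, not the actual cTAKES tokenizer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSplitSketch {

   // Runs of letters/digits form one token; every other non-space
   // character (such as '.') becomes its own token.
   private static final Pattern TOKEN
         = Pattern.compile( "[A-Za-z0-9]+|[^A-Za-z0-9\\s]" );

   static List<String> tokenize( final String text ) {
      final List<String> tokens = new ArrayList<>();
      final Matcher matcher = TOKEN.matcher( text );
      while ( matcher.find() ) {
         tokens.add( matcher.group() );
      }
      return tokens;
   }
}
```

Under this scheme both "F84.1" and "F84 . 1" come out as the same three tokens, which is why the dictionary entry must list the spaced form 'F84 . 1'.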

Re: Building a new custom dictionary or Updating/Adding values to the existing dictionary in cTAKES [EXTERNAL]

2020-09-15 Thread Peter Abramowitsch
Thanks Tim.

I've been experimenting with the PennTreebank and see some potential for
using it as a powerful disambiguation tool.  The complex part is to find a
heuristic that minimizes the number of cases where the "big guns"   need to
be brought in -- because, yes, it would really slow things down.

Peter

On Tue, Sep 15, 2020 at 12:54 PM Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> Peter,
> The parts of speech come from the ctakes-pos-tagger module, which uses
> the OpenNLP pos tagger trained on clinical data. There is a
> constituency parser as well, which I think in theory can tag even
> better (that might be able to get you a unary branch in a tree from NN
> -> CD -> .), but is a lot slower than the pos tagger and we
> probably don't want to make it necessary to run for simple dictionary
> pipelines.
> Tim
>
> On Tue, 2020-09-15 at 12:12 -0700, Peter Abramowitsch wrote:
> > * External Email - Caution *
> >
> >
> > Sean this conversation raises for me a question that I've had for a
> > while.
> >  Does the term finding mechanism actually use a treebank to find the
> > POS or
> > does it use a another less rigorous approach.   If it were rigorous
> > wouldn't it be able to tag a pure number as an NN in the role
> > of  object if
> > it played the corresponding role in the sentence?
> >
> > I've not had the same problem as Ayyub,  but I have been wondering
> > why one
> > needed to disable the identification of cm as a genetic acronym
> > because of
> > situations where clearly cm is part of a unit of measure and would
> > show up
> > as an entity's modifier in a treebank.
> >
> > Does the question make sense?
> >
> > Peter
> >
> > On Tue, Sep 15, 2020, 9:02 AM Finan, Sean <
> > sean.fi...@childrens.harvard.edu>
> > wrote:
> >
> > > I should mention that going the Paragraph route would only impact
> > > term
> > > lookup.
> > > 
> > > From: abad.ay...@cognizant.com 
> > > Sent: Tuesday, September 15, 2020 11:54 AM
> > > To: dev@ctakes.apache.org
> > > Subject: RE: Building a new custom dictionary or Updating/Adding
> > > values to
> > > the existing dictionary in cTAKES [EXTERNAL]
> > >
> > > * External Email - Caution *
> > >
> > >
> > > Thank you Sean for the response. We shall definitely try that way.
> > > I have
> > > one question on the "f84.1" problem, since we have now developed a
> > > lot of
> > > features based on the output from cTAKES, is the impact of changing
> > > the
> > > sentenceDetectorAnnotator going to be huge?
> > >
> > > Thanks & Regards
> > >
> > > Abad Ayyub
> > > Vnet: 406170 | Cell : +91-9447379028
> > >
> > >
> > >
> > > -Original Message-
> > > From: Finan, Sean 
> > > Sent: Tuesday, September 15, 2020 9:06 PM
> > > To: dev@ctakes.apache.org
> > > Subject: Re: Building a new custom dictionary or Updating/Adding
> > > values to
> > > the existing dictionary in cTAKES [EXTERNAL]
> > >
> > > [External]
> > >
> > >
> > > Hi Abad,
> > >
> > > The first thing that I would try for the "97112" problem is
> > > changing the
> > > parts of speech that are ignored for lookup.  Right now a pure
> > > number is
> > > ignored - it is not a word.  So, similar to what I said in my
> > > previous
> > > email, change the dictionary lookup parameter exclusionTags.  But
> > > to make
> > > sure that you get everything, you can first try no exclusions:
> > > set exclusionTags=""
> > >
> > > My guess with the F84.1 problem is that your sentence splitter is
> > > splitting "F84.1" but not splitting "F84 . 1".
> > >
> > > I think that the best way to start debugging is adding the
> > > PrettyTextWriter to the end of the piper and looking at its output
> > > (see my
> > > previous email).   It will print each sentence on a line and
> > > indicate the
> > > part of speech for each token.  If you can quickly and easily see
> > > what the
> > > system is doing then you might start to understand what needs to be
> > > changed
> > > to fit your data.
> > >
> > > Sean
> > > 
>

Current thinking on new UMLS authentication

2020-09-18 Thread Peter Abramowitsch
Hi All

Probably all of you have received an email from Patrick McLaughlin at the
NLM regarding upcoming changes to the UMLS authentication they are going to
support and to retire.   This will have implications for all cTakes users
in different ways depending on how cTakes is implemented in your
community.   To me, there were some ambiguities in his email regarding
usage situations as a registered content provider that needed to be spelled
out.

I was wondering if any of you have had further conversations with him which
might clarify whether, for instance,  users within a registered content
provider installation would still need to be individually authenticated.
Or about any other authentication scenario.

I'm trying to contact him or his team at the moment to ask about our
particular architecture.

Regards,  Peter


Re: Current thinking on new UMLS authentication [EXTERNAL]

2020-09-18 Thread Peter Abramowitsch
authentication [EXTERNAL]
>
> * External Email - Caution *
>
>
> I never received the email you mentioned.
>
> I assume this will affect the API call to NLM for UMLS validation? If it
> does, why not take the NLM's model for UMLS and only require UMLS
> credentials at the time of download?
>
> Greg--
>
>
>
> On Fri, Sep 18, 2020 at 12:33 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hi All
> >
> > Probably all of you have received an email from Patrick McLaughlin at
> > the NLM regarding upcoming changes to the UMLS authentication they are
> going to
> > support and to retire.   This will have implications for all cTakes users
> > in different ways depending on how cTakes is implemented in your
> > community.   To me, there were some ambiguities in his email regarding
> > usage situations as a registered content provider that needed to be
> > spelled out.
> >
> > I was wondering if any of you have had further conversations with him
> > which might clarify whether, for instance,  users within a registered
> > content provider installation would still need to be individually
> authenticated.
> > Or on any other authentication scenario.
> >
> > I'm trying to contact him or his team at the moment to ask about our
> > particular architecture.
> >
> > Regards,  Peter
> >
>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__healthinformatics.umn.edu_research_nlpie-2Dgroup&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=SeLHlpmrGNnJ9mI2WCgf_wwQk9zL4aIrVmfBoSi-j0kfEcrO4yRGmRCJNAr-rCmP&m=OoVx9YhA_zrGwya4OQ29snp1lWjOvt3EuMAspsP5CbA&s=4lOuRwaYmEQD_5BlWeR8Q2qXY9olJvV1k3W2LLnVwvo&e=
> > Department of Surgery University of Minnesota g...@umn.edu
>


Re: Current thinking on new UMLS authentication

2020-09-18 Thread Peter Abramowitsch
The message was initially delivered to only content-providers - a
particular subgroup of cTakes users of which I'm a part, but the language
is quite unclear.  If I read it one way, it looks like consumers of a
content provider instance of cTakes don't need to individually authenticate
anymore.  But I could be wrong.   The other bit which I find disturbing is
that nothing on their UTS website has been updated.  It still has the old
API instructions, no mention of any changes or a sandbox.  Even their rss
feed has nothing new.

I've written a mechanism to act as an authentication relay which we need in
our facility, and depending on the outcome in the next weeks/months, I may
share it, once I've converted it to whatever the new authentication
mechanism will require - and if people are interested.    It allows the
current cTakes release to delegate the authentication to the relay which
then takes care of adapting to the new  UMLS requirements.

Peter



On Fri, Sep 18, 2020 at 7:21 PM Akram  wrote:

>  Does that mean cTAKES is not going to work?
> My PhD research depends on it.
> Is there anyway we can get cTAKES working without the need to authenticate
> UMLS?
> as Gres suggested, can we just download NLM model and use cTAKES offline
> compeletely?
> Thanks
>
>
> On Friday, 18 September 2020, 10:46:12 am GMT-7, Greg Silverman
>  wrote:
>
>  I never received the email you mentioned.
>
> I assume this will affect the API call to NLM for UMLS validation? If it
> does, why not take the NLM's model for UMLS and only require UMLS
> credentials at the time of download?
>
> Greg--
>
>
>
> On Fri, Sep 18, 2020 at 12:33 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hi All
> >
> > Probably all of you have received an email from Patrick McLaughlin at the
> > NLM regarding upcoming changes to the UMLS authentication they are going
> to
> > support and to retire.  This will have implications for all cTakes users
> > in different ways depending on how cTakes is implemented in your
> > community.  To me, there were some ambiguities in his email regarding
> > usage situations as a registered content provider that needed to be
> spelled
> > out.
> >
> > I was wondering if any of you have had further conversations with him
> which
> > might clarify whether, for instance,  users within a registered content
> > provider installation would still need to be individually authenticated.
> > Or on any other authentication scenario.
> >
> > I'm trying to contact him or his team at the moment to ask about our
> > particular architecture.
> >
> > Regards,  Peter
> >
>
>
> --
> Greg M. Silverman
> Senior Systems Developer
> NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
> Department of Surgery
> University of Minnesota
> g...@umn.edu
>


Authentication update

2020-09-24 Thread Peter Abramowitsch
Hi All

I managed to have a conversation with Patrick McLaughlin who's in charge of
the Authentication changes at UMLS.   I have some updates for you which I
think you will find reassuring.

He's familiar with cTAKES but hasn't used it personally, nor was very
familiar with the different ways it could be deployed and its
authentication requirements.

To boil the conversation down to a few key points:

They want to get out of the business of forgotten password and username
management, but they are still creating individual user accounts and
accounts for content providers.

There are two routes to authenticating and I think the far simpler one for
traditional cTAKES users will be that of using API keys instead of an
external Authentication service. Strangely enough, I had to ask Patrick
about it, since it is not mentioned in their recent emails to our
community.  So I'll describe that approach first:

API_KEY Method.

Our individual user accounts do and will still have an API key created for
them and you can authenticate with the API key instead of username-password
to get an access token with a specific TTL (a surrogate for yes/no
authentication).  So cTAKES will need some relatively trivial changes to
accept an API key in the usual places (EnvVar, System Property, or
Piperfile) and then make the appropriate request with that key.

I'm using the alternate authentication URL property so I can relay requests
through a proxy authenticator, but it will forward API keys just as cTakes'
new code should do.  So whoever makes these changes - please don't remove
the modifiable system property
*ctakes.umlsaddr*!!

What the UTS will want, though, is the domain name from which the
API-key-bearing token request will come, so they can whitelist it.

If you have no UMLS license yet, you will go to the UTS as usual to apply
and from there, you will be given the choice of several external sources of
identity verification.  Here's where external authentication will come into
play (using Google Auth, Facebook, Login.gov etc):  Once you've passed that
step, you'll get the user account and an API key.

If you already have a UMLS license and are using the username password
paradigm, when the cTAKES fix is out, go back to your UTS profile and find
your API key.

SECOND OPTION
(after your license is updated or newly established with whitelist &
callback info)

If you are architecting an implementation where ad-hoc users authenticate
for one-off interactions with cTakes through a website or directly with the
UMLS metathesaurus via its APIs, the implementation will involve something
more complicated.  You will need to implement a publicly visible web
service or callback-link on your site.

Your authentication request will go out, probably with an email address and
will be forwarded by UTS to your authenticator of choice (whomever you
specified when creating your license)

Asynchronously, the UTS will call back on your registered URL which your
application will be listening for to give you approval or rejection.  This
would unblock access to cTAKES or the website behind which you've deployed
it.

So you see that this method is not really suitable for situations where
cTAKES is used as a console app or a dedicated web-service, possibly
situated deep within a protected PHI safe zone.  In this setting a
Synchronous authentication approach using the API_KEY would be simpler from
every angle.

If I haven't been clear enough or you still have questions, call Patrick.
He's very open to working with us.

Peter


Re: Authentication update

2020-09-24 Thread Peter Abramowitsch
The correct answer would be no.  Not in an out-of-the-box implementation of
cTAKES.  Leaving a trail of our usage of UMLS is one of the factors
contributing to its very existence. If it seemed as if no-one were using
these validation APIs at UMLS - it would give ammunition to the budget
cutters.

You can do more-or-less whatever you want with software, so it is an
ethical question as much as a technical one.

An implementation such as ours has to go to some extra lengths to obtain a
user's credentials because the cTakes service is buried inside a PHI
sensitive zone that has no cloud-facing ports.  But still, we do it.

Peter

On Thu, Sep 24, 2020 at 2:13 PM Akram  wrote:

>  Thanks Peter,
> Is there a way we can use cTAKES completely offline?
> I mean we can download whatever we need from UMLS and use it as datasource
> locally in our computers?
>
>
>
> On Thursday, 24 September 2020, 02:02:15 pm GMT-7, Peter Abramowitsch <
> pabramowit...@gmail.com> wrote:
>
>  Hi All
>
> I managed to have a conversation with Patrick McLaughlin who's in charge of
> the Authentication changes at UMLS.  I have some updates for you which I
> think you will find reassuring.
>
> He's familiar with cTAKES but hasn't used it personally, nor was very
> familiar with the different ways it could be deployed and its
> authentication requirements.
>
> To boil the conversation down to a few key points:
>
> They want to get out of the business of forgotten password and username
> management. but they are still creating individual user accounts and
> accounts for content providers.
>
> There are two routes to authenticating and I think the far simpler one for
> traditional cTAKES users will be that of using API keys instead of an
> external Authentication service. Strangely enough, I had to ask Patrick
> about it since it is not mentioned on their recent emails to our
> community.So I'll describe that approach first:
>
> API_KEY Method.
>
> Our individual user accounts do and will still have an API key created for
> them and you can authenticate with the API key instead of username-password
> to get an access token with a specific TTL. (a surrogate for yes/no
> authentication)  So cTAKES will need some relatively trivial changes to
> accept an API key in the usual places (EnvVar, System Property, or
> Piperfile) and then make the appropriate request with that key.
>
> I'm using the alternate authentication URL property so I can relay requests
> through a proxy authenticator, but it will forward API keys just as cTakes'
> new code should do.  So whoever makes these changes - please don't remove
> the modifiable system property
> * ctakes.umlsaddr!!*
>
> What the UTS will want, though, is the domain name from which the API
> bearing Token request will come, so they can whitelist it.
>
> If you have no UMLS license yet, you will go to the UTS as usual to apply
> and from there, you will be given the choice of several external sources of
> identity verification.  Here's where external authentication will come into
> play (using Google Auth, Facebook, Login.gov etc):  Once you've passed that
> step, you'll get the user account and an API key.
>
> If you already have a UMLS license and are using the username password
> paradigm, when the cTAKES fix is out, go back to your UTS profile and find
> your API key.
>
> SECOND OPTION
> (after your license is updated or newly established with whitelist &
> callback info)
>
> If you are architecting an implementation where ad-hoc users authenticate
> for one-off interactions with cTakes through a website or directly with the
> UMLS metathesaurus via its APIs, the implementation will involve something
> more complicated.  You will need to implement a publicly visible web
> service or callback-link on your site.
>
> Your authentication request will go out, probably with an email address and
> will be forwarded by UTS to your authenticator of choice (whomever you
> specified when creating your license)
>
> Asynchronously, the UTS will call back on your registered URL which your
> application will be listening for to give you approval or rejection.  This
> would unblock access to cTAKES or the website behind which you've deployed
> it.
>
> So you see that this method is not really suitable for situations where
> cTAKES is used as a console app or a dedicated web-service, possibly
> situated deep within a protected PHI safe zone.  In this setting a
> Synchronous authentication approach using the API_KEY would be simpler from
> every angle.
>
> If I haven't been clear enough or you still have questions, call Patrick.
> He's very open to working with us
>
> Peter
>


Found code relating to a bug I reported a few weeks ago.

2020-10-12 Thread Peter Abramowitsch
Hi Sean

If you know every inch of the code, maybe I can ask you what you think of
this problem I found in the negex annotator.  It causes a crash for any
sentence in which the very first character begins the negation:

*"Absence of headache"  *
causes a crash later on in another annotator because the ContextAnnotation
it creates has a begin offset of -1.

*" Absence of headache" *
successfully annotates the phrase.

I need to fix this urgently, but I found a mysterious piece of code that is
responsible for this.
I'm working off a trunk snapshot 4.0.1 taken Dec 27  2018

NegexAnnotator.annotateNegation()   at line 846 of its class file:

*846: nec.setBegin(s.getBegin() + t.getStart() - 1);*

In the case where a sentence begins with "Absence of", both s (Sentence)
and t (negex token) begin at offset 0, so setBegin receives -1 and the
ContextAnnotation goes on its destructive way.  So what's with the -1?
Of course, it also fails with "No headache..." at the beginning of a
sentence.

If you know, or have a hunch, why the -1 at that line is there, I will
track it down further.  Otherwise I'm just tempted to leave the
calculation and clamp it with Math.max(calcOffset, 0).
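That clamp would look something like this — my own sketch of a defensive patch, not committed code:

```java
public class NegexOffsetFix {

   /**
    * Mirrors the arithmetic at NegexAnnotator line 846 but clamps at zero,
    * so a negation starting at the first character of the note can no
    * longer yield a begin offset of -1.  Whether the historical -1 should
    * simply be removed is a separate question.
    */
   static int clampedBegin( final int sentenceBegin, final int tokenStart ) {
      return Math.max( sentenceBegin + tokenStart - 1, 0 );
   }
}
```

The call site would then become nec.setBegin( clampedBegin( s.getBegin(), t.getStart() ) ).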

The crash actually occurs as control passes to the historyOf annotator
which looks at the ContextAnnotation created by Negex.
See below the signature for the stack trace

My ICLA has not been approved yet, so I can't make any alterations to the
source, nor would I without some orientation to the process.  I've never
done it before, nor do I have the keys to Jira.

Anyone else who knows the NegexAnnotator in depth, please chime in as well.

Peter


Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
range: -1
at java.lang.String.substring(String.java:1960)
at org.apache.uima.jcas.tcas.Annotation.getCoveredText(Annotation.java:128)
at
org.apache.ctakes.dependency.parser.util.DependencyUtility.doesSubsume(DependencyUtility.java:67)
at
org.apache.ctakes.dependency.parser.util.DependencyUtility.getDependencyNodes(DependencyUtility.java:104)
at
org.apache.ctakes.dependency.parser.util.DependencyUtility.getNominalHeadNode(DependencyUtility.java:113)
at
org.apache.ctakes.assertion.attributes.history.HistoryAttributeClassifier.extract(HistoryAttributeClassifier.java:2


Re: Found code relating to a bug I reported a few weeks ago. [EXTERNAL]

2020-10-15 Thread Peter Abramowitsch
Thanks Sean

Yes, this morning I got access, but I'm not in a hurry to start tampering
with the archive.  If you want to take this into another mail stream it's
fine - perhaps better.     I have a couple of general and specific
questions:

1.  About the fix we discussed above, are you suggesting we just let the
negex annotator begin at offset 0 + token-pos, and then see if anything
stops working (or improves!)  downstream, now that at least historyOf
downstream doesn't throw an exception anymore?   There are so many
downstream permutations, given all the possible annotators, that it would be
impossible to test all of them.  And unless we see another one of these -1
strangenesses in other places where context annotations are created, can we
just assume that it is idiosyncratic to Negex for historical reasons?

2.  Is the official archive in Git now or in SVN?  Apache root mentioned
SVN only.   If SVN, what is your favorite GUI, if you use one?

3.  I prefer a peer-review/pull-request type interaction if possible.  I
would hate to introduce rubbish even if entitled to do so.  Do you already
implement something like that?

4.  What about Jira?  Do permissions and links come from Apache?  I've been
on it briefly in read-only mode.

There's no hurry to respond.  I'm on my way back to Italy shortly and will
set up shop there again next week sometime.

Regards, Peter

On Wed, Oct 14, 2020 at 2:12 PM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Peter,
>
> There are untold miles of code that I have never traveled ...
>
> I don't understand the -1.  Maybe the original code used some other call
> (.start()) that wasn't zero-based?
>
> It seems like nec.begin should not be offset and that down the line
> consumers should offset to n-1 or .max(n-1,0).  Which of course means that
> any fixes need to have propagated adjustments.  Yay.
>
> I think that your ICLA has gone through (congratulations).  Just in time,
> right?
>
> Sean
>
>
> 
> From: Peter Abramowitsch 
> Sent: Monday, October 12, 2020 6:05 PM
> To: dev@ctakes.apache.org
> Subject: Found code relating to a bug I reported a few weeks ago.
> [EXTERNAL]
>
> * External Email - Caution *
>
>
> Hi Sean
>
> If you know every inch of the code maybe I can ask you what you think of
> this problem I found in the negex annotator.   It causes any sentence to
> crash when the very first character begins the negation:
>
> *"Absence of headache"  *
> causes a crash later on in another annotator because the ContextAnnotation
> it creates has a begin offset of -1.
>
> *" Absence of headache" *
> successfully annotates the phrase.
>
> I need to fix this urgently, but I found a mysterious piece of code that is
> responsible for this.
> I'm working off a trunk snapshot 4.0.1 taken Dec 27  2018
>
> NegexAnnotator.annotateNegation()   at line 846 of its class file:
>
> *846: nec.setBegin(s.getBegin() + t.getStart() - 1);*
>
> In the case where a sentence begins with "Absence of", both s
> (Sentence) and t (negex token) begin at offset 0.  Then the
> ContextAnnotation goes on its destructive way.  So what's with the -1?
> Of course it also fails with "No headache..." at the beginning of a
> sentence.
>
> If you know, or have a hunch why the  -1 at that line is there I will track
> it down further.  Otherwise I'm just tempted to leave it and
> Max(calcOffset, 0)
>
> The crash actually occurs as control passes to the historyOf annotator
> which looks at the ContextAnnotation created by Negex.
> See below the signature for the stack trace
>
> My ICLA has not been approved  yet so I can't make any alterations to
> source, nor would I without any orientation to the process.   Never done it
> before nor have the keys to Jira
>
> Anyone else who knows the NegexNegator in depth  please chime in as well
>
> Peter
> 
>
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
> range: -1
> at java.lang.String.substring(String.java:1960)
> at org.apache.uima.jcas.tcas.Annotation.getCoveredText(Annotation.java:128)
> at
>
> org.apache.ctakes.dependency.parser.util.DependencyUtility.doesSubsume(DependencyUtility.java:67)
> at
>
> org.apache.ctakes.dependency.parser.util.DependencyUtility.getDependencyNodes(DependencyUtility.java:104)
> at
>
> org.apache.ctakes.dependency.parser.util.DependencyUtility.getNominalHeadNode(DependencyUtility.java:113)
> at
>
> org.apache.ctakes.assertion.attributes.history.HistoryAttributeClassifier.extract(HistoryAttributeClassifier.java:2
>


Re: Changes to UTS Authentication for Authorized Content Distributors [EXTERNAL]

2020-11-11 Thread Peter Abramowitsch
I was planning to make the change for our implementation at UCSF in the
next couple of weeks. After discussions with Patrick McLaughlin and some
testing with the new system, I thought of offering two cTAKES configuration
options: one mimics the current attributes, and the other is self-evident.

Option 1 (for backward compatibility)
Wherever it is set, if UmlsUserValidator is called with "apikey" as the
umls_user, it assumes the umls_password value is the UMLS API key.

Option 2
If there is a SystemProperty ctakes.umls_apikey, it will use its value
with the validator.  Ditto if there is an EnvironmentVariable UMLS_APIKEY.
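The two options could be combined into a single resolution order along these lines. This is a rough sketch with hypothetical class and method names, not the code that was actually checked in:

```java
/** Hypothetical sketch of the API-key resolution order described above. */
public class UmlsKeyResolver {

   /**
    * @return the UMLS API key if one can be found, otherwise null
    *         (meaning legacy user/password validation applies).
    */
   static String resolveApiKey( final String umlsUser, final String umlsPass ) {
      // Option 1 (backward compatibility): the user "apikey" signals that
      // the password field actually carries the UMLS API key.
      if ( "apikey".equalsIgnoreCase( umlsUser ) ) {
         return umlsPass;
      }
      // Option 2: an explicit system property, then an environment variable.
      final String prop = System.getProperty( "ctakes.umls_apikey" );
      if ( prop != null ) {
         return prop;
      }
      return System.getenv( "UMLS_APIKEY" );
   }

   public static void main( final String[] args ) {
      // Backward-compatible form: key travels in the password slot.
      System.out.println( resolveApiKey( "apikey", "my-key-123" ) );
   }
}
```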

If you're interested I will post the changes.
Peter

On Wed, Nov 11, 2020 at 7:49 PM Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Greg,
>
> The simple answer is that we have no answer.  yet.
>
> This change in authentication will require some work from ctakes
> developers and we haven't yet mapped out the effort.
>
> We will endeavor to have both an implementation and documentation
> available before the current authentication is no longer supported by the
> NLM.
>
> Sean
> 
> From: Greg Silverman 
> Sent: Wednesday, November 11, 2020 1:26 PM
> To: dev@ctakes.apache.org; Reed McEwan; Raymond Finzel; Ben Knoll
> Cc: Pei Chen
> Subject: Re: Changes to UTS Authentication for Authorized Content
> Distributors [EXTERNAL]
>
> * External Email - Caution *
>
>
> For example, the user installation guide has not been updated to reflect
> the changes NLM is implementing. The impact for our workflow is pretty
> significant, so without a clear picture about what we need to do in order
> to not have any down time is - to put it mildly -  leaving us in the dark.
>
> Greg--
>
> On Tue, Nov 10, 2020 at 9:18 AM Greg Silverman  wrote:
>
> > It's still unclear what this means for me as a user of a piece of
> software
> > that uses UTS for authentication purposes. Could someone please, in plain
> > language, describe what we as normal users who use software reliant on
> this
> > authentication mechanism will have to do in order to not disrupt any
> > running workflows?
> >
> > Thanks!
> >
> > Greg--
> >
> >
> > On Mon, Nov 9, 2020 at 7:13 AM McLaughlin, Patrick (NIH/NLM) [E]
> >  wrote:
> >
> >> Hello,
> >>
> >>
> >>
> >> The UMLS Terminology Services (UTS) is moving from a username/password
> >> login to an NIH-federal identity provider system on Monday, November 9.
> >> UMLS users will begin migrating their accounts to the new system on this
> >> date with a migration deadline of January 15, 2021.
> >>
> >>
> >>
> >> You will need to update any systems that use the UMLS user validation
> API
> >> <https://uts.nlm.nih.gov/help/license/validateumlsuserhelp.html>, as
> >> described in my previous emails. We recommend you implement the new
> >> workflow as soon as possible after November 9.
> >>
> >>
> >>
> >> Attached are instructions for implementing UMLS user validation with the
> >> new system. You MUST supply NLM with the domains (e.g.,
> >> https://www.example.com in the instructions), so that we can whitelist
> >> the domains first.
> >>
> >>
> >>
> >> The UMLS user validation API
> >> <https://uts.nlm.nih.gov/help/license/validateumlsuserhelp.html> will
> >> remain functional through January 15, 2021; however, UMLS users that
> create
> >> their UTS accounts after November 9 will not have a password, and you
> will
> >> be unable to validate their accounts.
> >>
> >>
> >>
> >> Please let us know if you run into any issues or have any questions.
> >> Thank you!
> >>
> >>
> >>
> >> -Patrick
> >>
> >>
> >>
> >> *From:* McLaughlin, Patrick (NIH/NLM) [E] 
> >> *Sent:* Wednesday, September 16, 2020 5:35 PM
> >> *To:* dev@ctakes.apache.org
> >> *Cc:* Pei Chen 
> >> *Subject:* RE: Changes to UTS Authentication for Authorized Content
> >> Distributors
> >>
> >>
> >>
> >> Hello,
> >>
> >>
> >>
> >> I’m following up on my previous message about changes to the NLM UMLS
> >> Terminology Services (UTS) authentication. As an Authorized Content
> >> Distributor of UTS content, you will need to modify your implementation
> to
> >> accommodate these changes. Our testing environment is now available for
> you
> >> to test.
> >>
>

Re: Changes to UTS Authentication for Authorized Content Distributors

2020-11-11 Thread Peter Abramowitsch
Hi Greg
It's actually extremely simple for current UMLS licensees.
The new API uses an API_KEY instead of user/password.  Just log in to the
UTS site, go to your profile area, and check on your key.
I or someone else will make changes to the cTAKES validator to accept this
key in lieu of name and password.

New UMLS users will need a couple of extra steps: they will get an identity
from one of the authentication providers, such as Login.gov, as part of the
UTS registration process.  Having completed that, they will have a profile
page with the API_KEY as above.



On Wed, Nov 11, 2020 at 7:27 PM Greg Silverman  wrote:

> For example, the user installation guide has not been updated to reflect
> the changes NLM is implementing. The impact for our workflow is pretty
> significant, so without a clear picture about what we need to do in order
> to not have any down time is - to put it mildly -  leaving us in the dark.
>
> Greg--
>
> On Tue, Nov 10, 2020 at 9:18 AM Greg Silverman  wrote:
>
> > It's still unclear what this means for me as a user of a piece of
> software
> > that uses UTS for authentication purposes. Could someone please, in plain
> > language, describe what we as normal users who use software reliant on
> this
> > authentication mechanism will have to do in order to not disrupt any
> > running workflows?
> >
> > Thanks!
> >
> > Greg--
> >
> >
> > On Mon, Nov 9, 2020 at 7:13 AM McLaughlin, Patrick (NIH/NLM) [E]
> >  wrote:
> >
> >> Hello,
> >>
> >>
> >>
> >> The UMLS Terminology Services (UTS) is moving from a username/password
> >> login to an NIH-federal identity provider system on Monday, November 9.
> >> UMLS users will begin migrating their accounts to the new system on this
> >> date with a migration deadline of January 15, 2021.
> >>
> >>
> >>
> >> You will need to update any systems that use the UMLS user validation
> API
> >> , as
> >> described in my previous emails. We recommend you implement the new
> >> workflow as soon as possible after November 9.
> >>
> >>
> >>
> >> Attached are instructions for implementing UMLS user validation with the
> >> new system. You MUST supply NLM with the domains (e.g.,
> >> https://www.example.com in the instructions), so that we can whitelist
> >> the domains first.
> >>
> >>
> >>
> >> The UMLS user validation API
> >>  will
> >> remain functional through January 15, 2021; however, UMLS users that
> create
> >> their UTS accounts after November 9 will not have a password, and you
> will
> >> be unable to validate their accounts.
> >>
> >>
> >>
> >> Please let us know if you run into any issues or have any questions.
> >> Thank you!
> >>
> >>
> >>
> >> -Patrick
> >>
> >>
> >>
> >> *From:* McLaughlin, Patrick (NIH/NLM) [E] 
> >> *Sent:* Wednesday, September 16, 2020 5:35 PM
> >> *To:* dev@ctakes.apache.org
> >> *Cc:* Pei Chen 
> >> *Subject:* RE: Changes to UTS Authentication for Authorized Content
> >> Distributors
> >>
> >>
> >>
> >> Hello,
> >>
> >>
> >>
> >> I’m following up on my previous message about changes to the NLM UMLS
> >> Terminology Services (UTS) authentication. As an Authorized Content
> >> Distributor of UTS content, you will need to modify your implementation
> to
> >> accommodate these changes. Our testing environment is now available for
> you
> >> to test.
> >>
> >>
> >>
> >> *We need some information from you.*
> >>
> >>
> >>
> >> In order for you to test your implementation, we need two things:
> >>
> >>
> >>
> >>1. A domain name from which you will link your users to our
> >>authentication service - We will need to whitelist your domain name
> for use
> >>in our test system. Example: www.yourwebsite.org.
> >>2. A Google email address - We will need to configure a test account
> >>for you so that you can test user authentication.
> >>
> >>
> >>
> >> If you have questions or concerns, please respond to this email. We
> >> appreciate your patience as we make improvements to UTS.
> >>
> >>
> >>
> >> -Patrick
> >>
> >>
> >>
> >> Patrick McLaughlin
> >>
> >> Head, Terminology QA & User Services
> >>
> >> National Library of Medicine
> >>
> >> 8600 Rockville Pike, MSC 3831, Bethesda, MD  20894
> >>
> >> patrick.mclaugh...@nih.gov
> >>
> >>
> >>
> >> *From:* McLaughlin, Patrick (NIH/NLM) [E] 
> >> *Sent:* Friday, August 14, 2020 6:14 PM
> >> *To:* dev@ctakes.apache.org
> >> *Cc:* Pei Chen 
> >> *Subject:* Changes to UTS Authentication for Authorized Content
> >> Distributors
> >>
> >>
> >>
> >> Dear UMLS Licensee,
> >>
> >>
> >>
> >> I’m contacting you from the U.S. National Library of Medicine because
> you
> >> are an Authorized Content Distributor of UMLS Terminology Services (UTS)
> >> content (https://uts.nlm.nih.gov/help/license/validateumlsuserhelp.html
> ).
> >> We are contacting you because we are making changes to the way in which
> UTS
> >> users authenticate starting t

Re: Changes to UTS Authentication for Authorized Content Distributors

2020-11-15 Thread Peter Abramowitsch
Hi Greg

I've got the modifications finished for the new UMLS authentication method
using API keys.  If you're game, I'd like you to be next to test it.
Contact me at pabramowit...@gmail.com and I'll get you a new
ctakes-dictionary-lookup-fast-4.0.1.x.jar and a Readme.

If it's smooth for you, I'll talk with Sean about checking it in and what
wiki locations need to be updated.

To get your key you'll need to log into UMLS.  If you've not been there
recently, you'll need to go through their profile upgrade process, where
user details are rerouted through one of the public authentication
mechanisms.
Once in, go to your profile section and you'll find the API_KEY.

All of you will need to do this eventually.

Regards,
Peter

On Wed, Nov 11, 2020 at 10:13 PM Greg Silverman  wrote:

> Hi Peter,
> Thanks, that would be great. I like the backwards compatible method. Our
> issue is that we have custom configurations for use in Docker and
> Kubernetes with UIMA-AS, so this would be ideal.
>
> Greg--
>
>
> On Wed, Nov 11, 2020 at 3:07 PM Peter Abramowitsch <
> pabramowit...@gmail.com>
> wrote:
>
> > Hi Greg
> > It's actually extremely simple for current UMLS licensees.
> > The new API uses an API_KEY instead of user/password.Just login to
> the
> > UTS site, go to your profile area and check on your key
> > I or someone else will make changes to the cTAKES validator to accept
> this
> > key in lieu of name and password
> >
> > For new UMLS users, they will need a couple of extra steps.   They will
> get
> > an identity from one of the authentication providers like Login.gov as a
> > part of the UTS registration process.   But having completed that, they
> > will have a profile page with the API_KEY as above
> >
> >
> >
> > On Wed, Nov 11, 2020 at 7:27 PM Greg Silverman 
> > wrote:
> >
> > > For example, the user installation guide has not been updated to
> reflect
> > > the changes NLM is implementing. The impact for our workflow is pretty
> > > significant, so without a clear picture about what we need to do in
> order
> > > to not have any down time is - to put it mildly -  leaving us in the
> > dark.
> > >
> > > Greg--
> > >
> > > On Tue, Nov 10, 2020 at 9:18 AM Greg Silverman  wrote:
> > >
> > > > It's still unclear what this means for me as a user of a piece of
> > > software
> > > > that uses UTS for authentication purposes. Could someone please, in
> > plain
> > > > language, describe what we as normal users who use software reliant
> on
> > > this
> > > > authentication mechanism will have to do in order to not disrupt any
> > > > running workflows?
> > > >
> > > > Thanks!
> > > >
> > > > Greg--
> > > >
> > > >
> > > > On Mon, Nov 9, 2020 at 7:13 AM McLaughlin, Patrick (NIH/NLM) [E]
> > > >  wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >>
> > > >>
> > > >> The UMLS Terminology Services (UTS) is moving from a
> username/password
> > > >> login to an NIH-federal identity provider system on Monday, November
> > 9.
> > > >> UMLS users will begin migrating their accounts to the new system on
> > this
> > > >> date with a migration deadline of January 15, 2021.
> > > >>
> > > >>
> > > >>
> > > >> You will need to update any systems that use the UMLS user
> validation
> > > API
> > > >> <https://uts.nlm.nih.gov/help/license/validateumlsuserhelp.html>,
> as
> > > >> described in my previous emails. We recommend you implement the new
> > > >> workflow as soon as possible after November 9.
> > > >>
> > > >>
> > > >>
> > > >> Attached are instructions for implementing UMLS user validation with
> > the
> > > >> new system. You MUST supply NLM with the domains (e.g.,
> > > >> https://www.example.com in the instructions), so that we can
> > whitelist
> > > >> the domains first.
> > > >>
> > > >>
> > > >>
> > > >> The UMLS user validation API
> > > >> <https://uts.nlm.nih.gov/help/license/validateumlsuserhelp.html>
> will
> > > >> remain functional through January 15, 2021; however, UMLS users that
> > > create
> > >

Re: Changes to UTS Authentication for Authorized Content Distributors [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]

2020-11-24 Thread Peter Abramowitsch
Sean

In the meantime, I think I have the 4.0.0 source somewhere.  I can take my
copy which is early 4.0.1 and make whatever small changes are needed (if
any) to get it to build in 4.0.0.   Would that be useful?

Peter

On Tue, Nov 24, 2020 at 6:46 PM Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> On Tue, 2020-11-24 at 16:29 +, Finan, Sean wrote:
> > * External Email - Caution *
> >
> >
> > Hi Tim and all,
> >
> > Peter kindly checked this into trunk last week.
> > I tested that version and it seemed to work.
> >
> > Another question might be "how do we get this into the/a release?
> >
> > I haven't looked into whether or not Apache svn servers have a
> > locking mechanism on release branches, but if not I think that a
> > patch of 4.0 using the version that you and Greg tested should be a
> > simple checkin.
>
> I think it's worth checking -- if we're allowed to just branch off the
> 4.0.0 tag we can get a 4.0.1 distribution that just has this one
> change, and we could have it built and uploaded quickly so we're ready
> for the UMLS change. How would we find out?
>
> Tim
>
> >
> > I am sure that everybody is tired of hearing me say this, but I would
> > like to get out a version 5 asap and disclaim that it is required for
> > the new umls authentication.  That would make patching v4 a non-
> > issue.
> >
> > Regardless of repository inclusion, the documentation (also written
> > by Peter) needs to get to the ctakes wiki  - and probably the main
> > ctakes web site.  On that note, the web site needs to be redone
> > asap.
> >
> > Anyway, cheers to Peter for taking upon himself this update!
> > We do still have a few things left to do.
> > Volunteers?
> >
> > Sean
> >
> > 
> > From: Miller, Timothy 
> > Sent: Tuesday, November 24, 2020 11:07 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Changes to UTS Authentication for Authorized Content
> > Distributors [EXTERNAL] [SUSPICIOUS]
> >
> > * External Email - Caution *
> >
> >
> > Peter,
> > I was able to try your changes and get this new authentication
> > mechanism to work in the default pipeline. Peter, Sean, et al, what
> > are
> > the next steps for getting this in to trunk? If you're not
> > comfortable
> > checking in directly maybe you can share the patch for review.
> > Tim
> >
> > On Sun, 2020-11-15 at 20:54 +0100, Peter Abramowitsch wrote:
> > > * External Email - Caution *
> > >
> > >
> > > Hi Greg
> > >
> > > I've got the modifications finished for the new UMLS authentication
> > > method
> > > using API keys.  If you're game, I'd like you to be next to test
> > > it.
> > > Contact me at pabramowit...@gmail.com and I'll get you a new
> > > ctakes-dictionary-lookup-fast.4.0,1,x,jar  and Readme.
> > >
> > > If it's smooth for you, I'll talk with Sean about checking it in
> > > and
> > > what
> > > wiki locations need to be updated.
> > >
> > > To get your key you'll need to log into UMLS, If you've not
> > > been
> > > there
> > > recently you'll need to go through their profile upgrade process
> > > where user
> > > details will be rerouted through one of the  public authentication
> > > mechanisms.
> > > Once in, go to your profile section and you'll find the API_KEY.
> > >
> > > All of you will need to do this eventually.
> > >
> > > Regards
> > > Peter
> > >
> > > Regards, Peter
> > >
> > > On Wed, Nov 11, 2020 at 10:13 PM Greg Silverman <
> > > g...@umn.edu.invalid>
> > > wrote:
> > >
> > > > Hi Peter,
> > > > Thanks, that would be great. I like the backwards compatible
> > > > method. Our
> > > > issue is that we have custom configurations for use in Docker and
> > > > Kubernetes with UIMA-AS, so this would be ideal.
> > > >
> > > > Greg--
> > > >
> > > >
> > > > On Wed, Nov 11, 2020 at 3:07 PM Peter Abramowitsch <
> > > > pabramowit...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Greg
> > > > > It's actually extremely simple for current UMLS licensees.
> > > > > The new API uses an API_KEY instead of user/password. 
