RE: How to add a new dictionary database to cTAKES

2014-02-28 Thread Finan, Sean
Hi Abhishek,

You have some interesting timing ...
I can give you the xml specifications that you require if you send me the 
format of your dictionary.

Since you are new to the current dictionary module setup, I might also have a 
simpler solution for you ...

A couple of days ago I checked a new module into Sandbox called 
ctakes-dictionary-lookup2 (how novel a name).  It is a complete replacement of 
the current dictionary lookup module, but both can sit side-by-side in your 
local trunk sandbox or build.  It has an example descriptor that tells it to 
read a bar-separated value file (BSV) as a dictionary, storing it (indexed) in 
memory for fast lookup.  There is an example dictionary and xml descriptor for 
that dictionary.  It accepts 2 or 3 column files in the format CUI|Text or 
CUI|TUI|Text.  It automatically detects the number of columns, but they must be 
in that order.  It also does not need the text fields to be tokenized, allowing 
it to accept "Tumor, malignant" as well as "tumor , malignant" as it will 
perform the tokenization upon reading the file.  
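
To make the two accepted layouts concrete, here is a minimal Java sketch of how one line of such a BSV file could be split. It is not the code in ctakes-dictionary-lookup2: the class and method names (BsvLineSketch, parseLine) and the CUI/TUI values are placeholders invented for illustration, and the real module also tokenizes the text, which this sketch skips.

```java
import java.util.Arrays;

/** Illustrative only: splits one line of a CUI|Text or CUI|TUI|Text file. */
final class BsvLineSketch {
   private BsvLineSketch() {
   }

   /**
    * @param line one line of the bar-separated file
    * @return { cui, tui, text }, where tui is null for 2-column lines
    */
   static String[] parseLine( final String line ) {
      // limit -1 keeps trailing empty columns, so "A|B|C|" is rejected, not truncated
      final String[] columns = line.split( "\\|", -1 );
      if ( columns.length == 2 ) {
         return new String[]{ columns[ 0 ].trim(), null, columns[ 1 ].trim() };
      }
      if ( columns.length == 3 ) {
         return new String[]{ columns[ 0 ].trim(), columns[ 1 ].trim(), columns[ 2 ].trim() };
      }
      throw new IllegalArgumentException( "Expected 2 or 3 bar-separated columns: " + line );
   }

   public static void main( final String[] args ) {
      // placeholder identifiers, not taken from a real dictionary
      System.out.println( Arrays.toString( parseLine( "C0000001|Tumor, malignant" ) ) );
      System.out.println( Arrays.toString( parseLine( "C0000001|T191|tumor , malignant" ) ) );
   }
}
```

Detecting the column count per line, as above, is one way to support both layouts; the order of the columns still has to be fixed.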
As the dictionary will be stored in-memory it should not be huge.  If you do 
have a very large number of terms (>50k) then I recommend an hsql db.  The new 
module will take an hsql db with the fixed field names CUI, TUI, RINDEX, 
TCOUNT, TEXT, RWORD.  I will explain what those mean in some documentation that 
I plan to check into sandbox later today, but I can help you build an hsql 
dictionary db ...
Yesterday I checked into sandbox a project named "dictionarytool".  It is 
source-only, but I can give you a jar if you want one.  Out-of-the-box it will 
build various dictionaries from a UMLS download.  It can build BSV, Hsql (new 
format) and Hsql (current format) to be used by the new or current dictionary 
lookup modules.

This devlist announcement is a little premature on my part.  I will not get 
usage documentation into sandbox for a day or two, but I can send you copies as 
I go if you are in a hurry, or just give you xml snippets for the current 
module descriptors.  If you send the format of your dictionary then that can be 
done quickly.  I just wanted to let you know that there is another option wrt 
dictionary lookup.

Sean

-Original Message-
From: Abhishek De [mailto:abhishek...@alumnux.com] 
Sent: Friday, February 28, 2014 6:58 AM
To: dev@ctakes.apache.org
Subject: How to add a new dictionary database to cTAKES

 

Hi, 

How do I add a new database to the cTAKES pipeline to perform lookup from? How 
do I specify what columns to look up and how to annotate the text with the 
returned hits? I have gone through the DictionaryLookupAnnotatorDB.xml and 
LookupDesc_Db.xml files. However, I could not understand the meanings of the 
terms like "lookupField", "metaField", "maxPermutationLevel" and 
"exclusionTags". If I add a new database, I need to configure this xml file 
properly. Please guide me regarding these problems. 

Thanks and Regards, 

Abhishek De 
 


RE: getSeverity etc. for relation extractor

2014-03-20 Thread Finan, Sean
 > 1) Should we populate IdentifiedAnnotation.severity() and bodylocationof() 
 > Directly in RelationExtractorAnnotator instead of the template filler?  
  One minor issue might be the fact that multiple relations of the same type 
can (and most likely will be) created for a single Identified Annotation.  
Somehow a "best of" would need to be arrived upon for storage in the IA.
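
As a rough illustration of that "best of" step, the sketch below takes several candidate relations for one annotation and compares "last one wins" with keeping the highest-scoring candidate. The Candidate class and its confidence field are assumptions made for the example; nothing in this thread says which criterion cTAKES would actually use.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative only: picks one value out of several relations of the same type. */
final class BestOfSketch {
   private BestOfSketch() {
   }

   /** Hypothetical stand-in for a relation argument plus some ranking score. */
   static final class Candidate {
      final String value;
      final double confidence;

      Candidate( final String value, final double confidence ) {
         this.value = value;
         this.confidence = confidence;
      }
   }

   /** "Last one wins": the result if each relation simply overwrites the field. */
   static Optional<String> lastOf( final List<Candidate> candidates ) {
      return candidates.isEmpty()
             ? Optional.empty()
             : Optional.of( candidates.get( candidates.size() - 1 ).value );
   }

   /** "Best of": keep the candidate with the highest (assumed) confidence. */
   static Optional<String> bestOf( final List<Candidate> candidates ) {
      return candidates.stream()
                       .max( Comparator.comparingDouble( ( Candidate c ) -> c.confidence ) )
                       .map( c -> c.value );
   }

   public static void main( final String[] args ) {
      final List<Candidate> severities = Arrays.asList(
            new Candidate( "severe", 0.9 ), new Candidate( "slight", 0.4 ) );
      System.out.println( lastOf( severities ) );
      System.out.println( bestOf( severities ) );
   }
}
```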

> 2)Chase brought up a good point, should we add some of the commonly used 
> components to the defaultpipeline?  (DrugNER, RelationExtractor, 
> TemplateFiller)? 
  One minor issue might be speed.  The longer a new user has to wait for 
results the less they will enjoy their first cTakes experience.  If a user 
doesn't know what they are getting and how to use it then the fruits of that 
additional runtime are wasted upon them.

I dislike clutter, but until it is easier to pick and choose components perhaps 
we could have an "ExpertPipeline" with a fully fleshed-out workflow.  It would 
be great if (instead of cut and paste) it referenced the default desc and then 
added the -more advanced- items to the end.  It is just a thought.

Sean

> -Original Message-
> From: Chen, Pei
> Sent: Wednesday, March 19, 2014 5:58 PM
> To: dev@ctakes.apache.org
> Subject: RE: getSeverity etc. for relation extractor
> 
> Chase,
> I am not sure why or the reasoning behind this, but it might explain 
> why Severity is null for your DiseaseDisorderMention example:
> Line 319 in TemplateFillerAnnotator.java:
> 
> If I'm reading this logic correctly, it will only populate severity for
> SignSymptomMention.  I can't think of why not to populate it if it exists in
> the BinaryTextRelations -
> have you tried adding: ddm.setSeverity(degreeOfTextRelation); instead 
> of logging the error ???
> 
>   if (eventMention instanceof DiseaseDisorderMention) {
>     DiseaseDisorderMention ddm = (DiseaseDisorderMention) eventMention;
>     logger.error("Need to implement attr for " + relation
>         + " for DiseaseDisorderMention");
>   } else if (eventMention instanceof SignSymptomMention) {
>     SignSymptomMention ssm = (SignSymptomMention) eventMention;
>
>     ssm.setSeverity(degreeOfTextRelation);
> 
> Would you mind opening a Jira and attaching a patch/test if it works for you?
> -Pei
> 
> > -Original Message-
> > From: Chase Master [mailto:chasemast...@gmail.com]
> > Sent: Wednesday, March 19, 2014 4:09 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: getSeverity etc. for relation extractor
> >
> > Thanks,
> > I tried using the AggregateTemplateFiller.xml from the 
> > template-filler module, and I specified the relation extractor 
> > pipeline that I was using before from the relation-extractor project 
> > (there is also a different one in the template-filler project called 
> > "RelationExtractorAggregateWithoutOrangeBook").  However, I don't 
> > see a difference, the severity is still null.
> >
> > Just wondering - is there some reason that the TemplateFiller is not 
> > included by default?  It seems confusing that there are getters for 
> > properties that aren't set in general ...even when one runs the 
> > default clinical pipeline instead of the RelationExtractorAggregate, 
> > these getters are there, but there are no relations.
> >
> >
> > Thanks
> > Chase
> >
> >
> > On Wed, Mar 19, 2014 at 1:04 PM, Chen, Pei
> > wrote:
> >
> > > If I remember correctly, I think those attributes were set in 
> > > IdentifiedAnnotation via:
> > > ctakes-template-filler/desc/analysis_engine/TemplateFillerAnnotator.xml
> > > One can look at the logic in:
> > > org.apache.ctakes.template.filler.ae.TemplateFillerAnnotator [1]
> > >
> > > Have you tried adding that to the pipeline?
> > >
> > > [1]
> > > http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-template-filler/src/main/java/org/apache/ctakes/template/filler/ae/TemplateFillerAnnotator.java
> > >
> > > --Pei
> > >
> > > > -Original Message-
> > > > From: Chase Master [mailto:chasemast...@gmail.com]
> > > > Sent: Wednesday, March 19, 2014 1:56 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: getSeverity etc. for relation extractor
> > > >
> > > > Hi,
> > > >
> > > > I am trying to output the relations associated with
> > > > DiseaseDisorderMentions and other types.  But I want to start by
> > > > iterating over DiseaseDisorderMention, not BinaryTextRelations since I
> > > > want to be sure to find them all, even if they have no associated
> > > > relation.
> > > >
> > > > I always get null when using any of the getters like "getSeverity()".
> > > > I am using the example text "He had a slight fracture in the proximal
> > > > right fibula".
> > > > When I iterate over BinaryTextRelations, I see the following valid valu

RE: getSeverity etc. for relation extractor

2014-03-21 Thread Finan, Sean
> until we have a definite, well-defined need (from a user).

"Rash on arm and leg"

>  I don't follow what you mean by your item B) below

[Rash].getLocationRelation() > [Rash : Arm]
[Rash].getLocation() > [Arm]
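
The distinction between the two accessors can be sketched in plain Java as follows. The classes are hypothetical stand-ins, not cTAKES type-system classes, and the plural method names are an assumption chosen so that "Rash on arm and leg" can hold both locations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: one accessor for relations, another for the related annotations. */
final class LocationAccessSketch {
   private LocationAccessSketch() {
   }

   /** Hypothetical stand-in for a relation such as [Rash : Arm]. */
   static final class LocationRelation {
      final String entity;
      final String site;

      LocationRelation( final String entity, final String site ) {
         this.entity = entity;
         this.site = site;
      }

      @Override
      public String toString() {
         return "[" + entity + " : " + site + "]";
      }
   }

   /** Hypothetical stand-in for an identified annotation such as [Rash]. */
   static final class Mention {
      final String text;
      private final List<LocationRelation> relations = new ArrayList<>();

      Mention( final String text ) {
         this.text = text;
      }

      void addLocation( final String site ) {
         relations.add( new LocationRelation( text, site ) );
      }

      /** The relations themselves, e.g. [Rash : Arm]. */
      List<LocationRelation> getLocationRelations() {
         return Collections.unmodifiableList( relations );
      }

      /** Only the related sites, e.g. Arm. */
      List<String> getLocations() {
         final List<String> sites = new ArrayList<>();
         for ( LocationRelation relation : relations ) {
            sites.add( relation.site );
         }
         return sites;
      }
   }

   public static void main( final String[] args ) {
      final Mention rash = new Mention( "Rash" );
      rash.addLocation( "Arm" );
      rash.addLocation( "Leg" );
      System.out.println( rash.getLocationRelations() );
      System.out.println( rash.getLocations() );
   }
}
```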



-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, March 21, 2014 12:58 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Yes, if there is more than one severity or location relation for a given 
identified annotation, currently the template filler does just take the last 
severity and or last location.

I suggest not changing the type system to allow a list (FSArray), or at least 
holding off until we have a definite, well-defined need (from a user). 

I think instead, ideally, we would make the template filler smarter at picking 
which severity / which location  when there is more than one for the given 
identified annotation. Therefore I'd rather not make it a list now, when in the 
long run I think it should be a single value. And in the meantime if someone 
has a need, they can look through the relations.

Pei, I don't follow what you mean by your item B) below

-- James

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Thursday, March 20, 2014 2:03 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

Awesome!
Thanks James...

On Sean's point about many-to-one relationships.  I think the current type 
system only supports 1 degree_of and severity_of for each IdentifiedAnnotation? 
 
Does the TemplateFiller component currently just take the last one in the list?
Should we modify the type system to support this in the future- something like 
the below?
A) Support many-to-one
B) Separate out getting the relations and getting the actual identified 
annotations.

One suggestion would be:
IdentifiedAnnotation.getBodyLocations(): FSArray
IdentifiedAnnotation.getBodyLocationRelations(): FSArray
IdentifiedAnnotation.getSeverity(): FSArray
IdentifiedAnnotation.getSeverityRelations(): FSArray

What do others think?
--Pei

> -Original Message-
> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> Sent: Thursday, March 20, 2014 2:50 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: getSeverity etc. for relation extractor
> 
> I saw the jira was assigned to me and had a few minutes so I 
> implemented a fix and committed.
> It was more than just the one line.
> The name of the index in which the binary text relations are stored has 
> changed (now separate indexes instead of one for all binary text relations) 
> so I had to change which index was searched.
> 
> -Original Message-
> From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
> Sent: Thursday, March 20, 2014 9:28 AM
> To: dev@ctakes.apache.org
> Subject: RE: getSeverity etc. for relation extractor
> 
> Thanks for confirming, James.  It seems like a bug...
> Chase,
> if you can confirm that adding ddm.setSeverity(degreeOfTextRelation); works 
> for you, I can commit the changes in trunk.
> 
> Which also brings up some interesting points:
> 1) Should we populate IdentifiedAnnotation.severity() and 
> bodylocationof() Directly in RelationExtractorAnnotator instead of the 
> template filler?
> It would seem more intuitive and faster than iterating through the 
> relations afterwards again.
> 2)Chase brought up a good point, should we add some of the commonly 
> used components to the defaultpipeline?  (DrugNER, RelationExtractor, 
> TemplateFiller)?  Seems easier to get onboard I think.
> 
> --Pei

RE: getSeverity etc. for relation extractor

2014-03-21 Thread Finan, Sean
Hi James,

It is starting to resemble a row of falling dominoes ...

I ran with an incubator version of the "location of" extractor and it did seem 
to find multiple locations for a single d/d.  Functionality may have changed 
since then.

Thanks for all of your attention to this topic.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, March 21, 2014 4:34 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Running from trunk, I don't get any relations for "Rash on arm and leg" :(

If I change the text to "pain in arm and leg" I get one LocationOfTextRelation 
annotation with arg1=SignSymptomMention (pain) and arg2=AnatomicalSiteMention 
(arm)

Does the relation extractor support creating a 2nd relation involving pain - 
the one between pain and leg (is this just an unfortunate choice of example?) - 
or does the relation extractor need enhancement before it would create multiple 
location_of relations for a single SignSymptomMention or DiseaseDisorderMention?

BTW, I will have to debug the setting of bodyLocation in the code because even 
for "pain in arm", when running from trunk, the LocationOfTextRelation 
annotation is being created, but the bodyLocation within the SignSymptomMention 
is not being set because the code in TemplateFillerAnnotator expects arg1 and 
arg2 to be swapped from what they currently are. I'll take a look at what it 
was in cTAKES 3.1 and find out if this is a bug in TemplateFillerAnnotator or 
something else.

-- James
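
One way to make a consumer tolerant of the argument order described above is to look at the argument types rather than their positions. The sketch below uses hypothetical stand-in types (Kind, Arg) instead of the real cTAKES classes, and it does not settle whether the fix belongs in TemplateFillerAnnotator or in the relation extractor.

```java
/** Illustrative only: reads a location relation without assuming argument order. */
final class ArgumentOrderSketch {
   private ArgumentOrderSketch() {
   }

   /** Hypothetical stand-in for the annotation type of a relation argument. */
   enum Kind {
      SIGN_SYMPTOM, DISEASE_DISORDER, ANATOMICAL_SITE
   }

   /** Hypothetical stand-in for a relation argument. */
   static final class Arg {
      final Kind kind;
      final String text;

      Arg( final Kind kind, final String text ) {
         this.kind = kind;
         this.text = text;
      }
   }

   /**
    * @return the text of the anatomical site, or null when neither or both arguments are sites
    */
   static String findSite( final Arg arg1, final Arg arg2 ) {
      final boolean firstIsSite = arg1.kind == Kind.ANATOMICAL_SITE;
      final boolean secondIsSite = arg2.kind == Kind.ANATOMICAL_SITE;
      if ( firstIsSite == secondIsSite ) {
         // ambiguous or unrelated: leave the body location unset
         return null;
      }
      return firstIsSite ? arg1.text : arg2.text;
   }

   public static void main( final String[] args ) {
      final Arg pain = new Arg( Kind.SIGN_SYMPTOM, "pain" );
      final Arg arm = new Arg( Kind.ANATOMICAL_SITE, "arm" );
      System.out.println( findSite( pain, arm ) );
      System.out.println( findSite( arm, pain ) );
   }
}
```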


RE: getSeverity etc. for relation extractor

2014-03-24 Thread Finan, Sean
Hi James, I don't have an exact phrase to use.  We used the location_of with a 
brain aneurysm project, but the corpus is elsewhere now.  However, it would tag 
things such as [aneurysm] : [middle cerebral artery] and [aneurysm] : [cerebral 
artery] - which is different from arm/leg, but an example of 2 locations for 
one entity.  

From: Masanz, James J. [masanz.ja...@mayo.edu]
Sent: Monday, March 24, 2014 11:05 AM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

I ran  3.1  against "pain in arm and leg" and I get just one location_of 
relation.
And again no location_of relations for "rash on arm and leg"

Sean, what was the exact phrase you used with the  incubator version? (or was 
that a while ago and lost)

-----Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, March 21, 2014 3:59 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

Hi James,

It is starting to resemble a row of falling dominoes ...

I ran with an incubator version of the "location of" extractor and it did seem 
to find multiple locations for a single d/d.  Functionality may have changed 
since then.

Thanks for all of your attention to this topic.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, March 21, 2014 4:34 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Running from trunk, I don't get any relations for "Rash on arm and leg" :(

If I change the text to "pain in arm and leg" I get one LocationOfTextRelation 
annotation with arg1=SignSymptomMention (pain) and arg2=AnatomicalSiteMention 
(arm)

Does the relation extractor support creating a 2nd relation involving pain - 
the one between pain and leg (is this just an unfortunate choice of example) or 
does the relation extractor need enhancement before it would create mutiple 
location_of for a single SignSymptomMention or DiseaseDisorderMention

BTW, I will have to debug the setting of bodyLocation in the code because even 
for "pain in arm", when running from trunk, the LocationOfTextRelation 
annotation is being created, but the bodyLocation within the SignSymptomMention 
is not being set because the code in TemplateFillerAnnotator expects arg1 and 
arg2 to be swapped from what they currently are. I'll take a look at what it 
was in cTAKES 3.1 and find out if this is a bug in TemplateFillerAnnotator or 
something else.

-- James

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, March 21, 2014 12:30 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

> until we have a definite, well-defined need (from a user).

"Rash on arm and leg"

>  I don't follow what you mean by your item B) below

[Rash].getLocationRelation() > [Rash : Arm]
[Rash].getLocation() > [Arm]



-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, March 21, 2014 12:58 PM
To: 'dev@ctakes.apache.org'
Subject: RE: getSeverity etc. for relation extractor

Yes, if there is more than one severity or location relation for a given 
identified annotation, currently the template filler does just take the last 
severity and or last location.

I suggest not changing the type system to allow a list (FSArray), or at least 
holding off until we have a definite, well-defined need (from a user).

I think instead, ideally, we would make the template filler smarter at picking 
which severity / which location  when there is more than one for the given 
identified annotation. Therefore I'd rather not make it a list now, when in the 
long run I think it should be a single value. And in the meantime if someone 
has a need, they can look through the relations.

Pei, I don't follow what you mean by your item B) below

-- James

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Thursday, March 20, 2014 2:03 PM
To: dev@ctakes.apache.org
Subject: RE: getSeverity etc. for relation extractor

Awesome!
Thanks James...

On Sean's point about many-to-one relationships.  I think the current type 
system only supports 1 degree_of and severity_of for each IdentifiedAnnotation?
Does the TemplateFiller component currently just take the last one in the list 
currently?
Should we modify the type system to support this in the future- something like 
the below?
A) Support many-to-one
B) Separate out getting the relations and getting the actual identified 
annotations.

One suggestion would be:
IdentifiedAnnotation.getBodyLocations(): FSArray
IdentifiedAnnotation.getBodyLocationRelations(): FSArray
IdentifiedAnnotation.getSeverity(): FSArray
IdentifiedAnnotation.getSeverityRelations(): FSArray

What do others thi

RE: "Temporal Information Extraction" package has compile time error

2014-03-27 Thread Finan, Sean
Hi Manu,

Speaking for the developers of that module, we are excited that you and others 
in the community are starting to show so much interest in temporal information 
extraction - enough to attempt builds and trial runs.

The Temporal module is still in an "academic" experimental phase, and some 
models and custom third-party library extensions that are required to build it 
have not been, or cannot be, checked into the cTakes repository.  We hope to 
have Temporal ready for full build and use in the upcoming cTakes release, but 
until that time it will remain relatively unusable by the wider cTakes 
community.  I apologize if its placement in trunk caused confusion.

All of that having been written, if you have particular ideas on 
implementation, usage or anything else, please let us know.

Sean

-Original Message-
From: Manu Sikka [mailto:manusi...@hotmail.com] 
Sent: Wednesday, March 26, 2014 11:15 PM
To: dev@ctakes.apache.org
Subject: "Temporal Information Extraction" package has compile time error







"Temporal Information Extraction" package has compile time error
Please look into it 
  


RE: suggestion for default pipelines

2014-04-15 Thread Finan, Sean
+1 I think that a factory is a great idea.

I (personally) dislike the descriptor schema, but I think that deprecation is 
the way to go until a replacement comes along.  



-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Tuesday, April 15, 2014 9:54 AM
To: dev@ctakes.apache.org
Subject: suggestion for default pipelines

The discussion in the other thread with Abraham Tom gave me an idea I
wanted to float to the list. We have been using some UIMAFit pipeline
builders in the temporal project that maybe could be moved into
clinical-pipeline. For example, look to this file:

http://svn.apache.org/viewvc/ctakes/trunk/ctakes-temporal/src/main/java/org/apache/ctakes/temporal/pipelines/TemporalExtractionPipeline_ImplBase.java?view=markup

with the static methods getPreprocessorAggregateBuilder() and
getLightweightPreprocessorAggregateBuilder()   [no umls].

So my idea would be to create a class in clinical-pipeline
(CTakesPipelines) with static methods for some standard pipelines (to
return AnalysisEngineDescriptions instead of AggregateBuilders?):

getStandardUMLSPipeline()  -- builds pipeline currently in
AggregatePlaintextUMLSProcessor.xml
getFullPipeline() -- same as above but with SRL, constituency parsing,
etc., every component in ctakes

We could then potentially merge our entry points -- I think Abraham's
experience points out that this is currently confusing, as well as
probably not implemented optimally. For example, either
ClinicalPipelineWithUmls or BagOfCUIsGenerator would use that static
method to run a uimafit-style pipeline. Maybe we can slowly deprecate
our xml descriptors too unless people feel strongly about keeping those
around.

Another benefit is that the cTAKES API is then trivial -- if you import
ctakes into your pom file getting a UIMA pipeline is one UimaFit call:

builder.add(CTAKESPipelines.getStandardUMLSPipeline());


I think this would actually be pretty easy to implement, but hoping to
get some feedback on whether this is a good direction.

Tim
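
The shape of such a factory can be sketched without UIMA on the classpath. In the sketch below, lists of strings stand in for the AnalysisEngineDescriptions a real implementation would return, and the component names "component-a" and "component-b" are placeholders, since the actual contents of AggregatePlaintextUMLSProcessor.xml are not listed in this message. The point shown is only that the larger pipeline is built from the smaller one rather than copied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: static factory methods where the full pipeline reuses the standard one. */
final class PipelineFactorySketch {
   private PipelineFactorySketch() {
   }

   /** Placeholder component names; a real version would build engine descriptions. */
   static List<String> getStandardUMLSPipeline() {
      return Collections.unmodifiableList(
            Arrays.asList( "component-a", "component-b", "dictionary lookup" ) );
   }

   /** Same as the standard pipeline, with the extra components appended at the end. */
   static List<String> getFullPipeline() {
      final List<String> full = new ArrayList<>( getStandardUMLSPipeline() );
      full.add( "SRL" );
      full.add( "constituency parsing" );
      return Collections.unmodifiableList( full );
   }

   public static void main( final String[] args ) {
      System.out.println( getStandardUMLSPipeline() );
      System.out.println( getFullPipeline() );
   }
}
```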





RE: errors when run BagOfCUIsGenerator.java

2014-04-16 Thread Finan, Sean
Try to open  https://uts-ws.nlm.nih.gov 
If that works then try 
https://uts-ws.nlm.nih.gov/restful/isValidctakes.umlsuser and see if you get a 
message like
"This XML file does not appear to have any style information associated with 
it. The document tree is shown below."


If that works and you are comfortable with the code, try with
umlsaddr : https://uts-ws.nlm.nih.gov/restful/isValidctakes.umlsuser
vendor : NLM-6515182895


   /**
    * @param umlsaddr -
    * @param vendor   -
    * @param username -
    * @param password -
    * @return true if the server at umlsaddr approves of the vendor, user, password combination
    */
   public static boolean isValidUMLSUser( final String umlsaddr, final String vendor,
                                          final String username, final String password ) {
      String data;
      try {
         data = URLEncoder.encode( "licenseCode", "UTF-8" ) + "=" + URLEncoder.encode( vendor, "UTF-8" );
         data += "&" + URLEncoder.encode( "user", "UTF-8" ) + "=" + URLEncoder.encode( username, "UTF-8" );
         data += "&" + URLEncoder.encode( "password", "UTF-8" ) + "=" + URLEncoder.encode( password, "UTF-8" );
      } catch ( UnsupportedEncodingException unseE ) {
         LOGGER.error( "Could not encode URL for " + username + " with vendor license " + vendor );
         return false;
      }
      try {
         final URL url = new URL( umlsaddr );
         final URLConnection connection = url.openConnection();
         connection.setDoOutput( true );
         final OutputStreamWriter writer = new OutputStreamWriter( connection.getOutputStream() );
         writer.write( data );
         writer.flush();
         boolean result = false;
         final BufferedReader reader = new BufferedReader( new InputStreamReader( connection.getInputStream() ) );
         String line;
         while ( (line = reader.readLine()) != null ) {
            final String trimline = line.trim();
            if ( trimline.isEmpty() ) {
               break;
            }
            result = trimline.equalsIgnoreCase( "true" );
         }
         writer.close();
         reader.close();
         return result;
      } catch ( IOException ioE ) {
         LOGGER.error( ioE.getMessage() );
         return false;
      }
   }
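
If you want to check the response handling without touching the network (for example, to see what happens when a proxy returns a login page instead of "true"), the reading loop can be mirrored in a small standalone method. This is a sketch, not part of cTAKES: UmlsResponseSketch and isApproved are names invented here, and the method works on a response body already held in a String.

```java
/** Illustrative only: mirrors the response-reading loop of isValidUMLSUser, offline. */
final class UmlsResponseSketch {
   private UmlsResponseSketch() {
   }

   /**
    * @param responseBody text returned by the server
    * @return true if the last non-empty line before the first blank line is "true"
    */
   static boolean isApproved( final String responseBody ) {
      boolean result = false;
      for ( String line : responseBody.split( "\\r?\\n" ) ) {
         final String trimmed = line.trim();
         if ( trimmed.isEmpty() ) {
            // same stop condition as the loop in isValidUMLSUser
            break;
         }
         result = trimmed.equalsIgnoreCase( "true" );
      }
      return result;
   }

   public static void main( final String[] args ) {
      System.out.println( isApproved( "true" ) );
      System.out.println( isApproved( "<html>\nProxy authentication required" ) );
   }
}
```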



-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Wednesday, April 16, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: RE: errors when run BagOfCUIsGenerator.java

Ying,
Are you behind a proxy or firewall?
If you're trying to use the umls resources, it attempts to make a call to their 
umls service to validate your credentials.
--Pei

> -Original Message-
> From: Liu, Ying [mailto:l...@advisory.com]
> Sent: Wednesday, April 16, 2014 1:13 PM
> To: dev@ctakes.apache.org
> Subject: errors when run BagOfCUIsGenerator.java
> 
> BagOfCUIsGenerator.java failed when I ran it. The error output 
> follows. Thanks for your help.
> Ying
> 
> 
> 
> Exception in thread "main"
> org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator" failed.  (Descriptor: file:/C:/Users/Ying/workspacectakes/ctakes/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml)
>     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
>     at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:156)
>     at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
>     at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
>     at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
>     at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:387)
>     at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:254)
>     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:431)
>     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:375)
>     at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:185)
>     at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
>     at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
>     at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:269)
>     at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:354)
>     at org.uimafit.

RE: lvg entries

2014-04-17 Thread Finan, Sean
Those variants are not used by the dictionary lookup.  I did look at them to 
see if it was worthwhile for the new dictionary, but they are all over the 
place so I passed.  

From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely that any improvement in recall would be worth
the speed tradeoff.

Finally, at least in eclipse doing a search on references to the method
to retrieve the lemma entries turns up nothing.

Tim
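
The reverse mapping mentioned above (from a variant back to its normal form) amounts to inverting a base-to-variants table. The sketch below assumes such a table is already available as a Map; how it would be obtained from LVG is not covered, and the class name VariantToStemSketch is made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: inverts a base-form-to-variants table into a variant-to-base lookup. */
final class VariantToStemSketch {
   private VariantToStemSketch() {
   }

   static Map<String, String> invert( final Map<String, List<String>> variantsByBase ) {
      final Map<String, String> baseByVariant = new HashMap<>();
      for ( Map.Entry<String, List<String>> entry : variantsByBase.entrySet() ) {
         for ( String variant : entry.getValue() ) {
            // putIfAbsent keeps the first base form seen when a variant is ambiguous
            baseByVariant.putIfAbsent( variant.toLowerCase( Locale.ROOT ), entry.getKey() );
         }
      }
      return baseByVariant;
   }

   public static void main( final String[] args ) {
      // variants taken from the "symptomatic" example earlier in this thread
      final Map<String, List<String>> table = Collections.singletonMap(
            "symptomatic", Arrays.asList( "Symptomatics", "Symptomaticed", "Symptomatic" ) );
      System.out.println( invert( table ).get( "symptomatics" ) );
   }
}
```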


On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don’t know of any applications within cTAKES that make use of this… The 
> reverse (mapping from these “variants” to the normal form) may be useful 
> though.
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:50, Miller, Timothy 
>  wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few 
>>> examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>>
>>>
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy 
>>>  wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim

>



RE: lvg entries

2014-04-18 Thread Finan, Sean
+1 false
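
For illustration only, here is a minimal sketch of what a false default would mean in practice, assuming the normalized form is always filled in and the lexical variants are attached only when the switch is on. PostLemmasSketch and TokenResult are made-up stand-ins, not cTAKES classes; only the PostLemmas parameter name comes from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a PostLemmas-style switch that defaults to false. */
final class PostLemmasSketch {

    /** Stand-in for what gets stored per token; not a cTAKES type. */
    static final class TokenResult {
        final String normalizedForm;
        final List<String> lemmas = new ArrayList<>();

        TokenResult(String normalizedForm) {
            this.normalizedForm = normalizedForm;
        }
    }

    private final boolean postLemmas;

    /** Default configuration: lemma posting is off. */
    PostLemmasSketch() {
        this(false);
    }

    PostLemmasSketch(boolean postLemmas) {
        this.postLemmas = postLemmas;
    }

    /** Always keeps the normalized form; keeps the variants only when the switch is on. */
    TokenResult process(String normalizedForm, List<String> variants) {
        TokenResult result = new TokenResult(normalizedForm);
        if (postLemmas) {
            result.lemmas.addAll(variants);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> variants = Arrays.asList("symptomatic", "symptomatics", "symptomaticed");
        System.out.println(new PostLemmasSketch().process("symptomatic", variants).lemmas.size());
        System.out.println(new PostLemmasSketch(true).process("symptomatic", variants).lemmas.size());
    }
}
```

With the default constructor nothing but the normalized form is stored, which is where the space saving would come from.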

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks for tracking that down Andy.

I am making a pass at UimaFit-izing the configuration parameters for all the 
annotators in the default pipeline, before I create the static factory methods 
like we recently discussed. Should I go ahead and change this to make default 
behavior be false?

Tim


On 04/18/2014 12:47 AM, andy mcmurry wrote:
> There is a lot of config handling; maybe PostLemmas is being set to
> true, or configInit() is not setting up the NLM wrapper correctly.
>
> ctakes-lvg *README*
> Note: as distributed, PostLemmas is set to false.  This is done to 
> reduce the size of the CAS.
> Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma
> annotations added to the CAS.
>
> *LvgAnnotator.xml *
> PostLemmas = True
>
> *LvgAnnotator.java*
> if (postLemmas) {
>  lvgResource.getLvgLex()
> }
>
>
>
>
>
>
>
> On Thu, Apr 17, 2014 at 3:23 PM, Masanz, James J. 
> wrote:
>
>> The normalizedForm field is filled in. It is used by dictionary lookup.
>>
>> So, for example, if the dictionary would contain "lymph node" but not 
>> "lymph nodes", a document with text of "lymph nodes" would match the 
>> dictionary entry "lymph node" because "node", being the normalized 
>> form of "nodes", would be used when searching dictionary entries (in 
>> addition to searching dictionary entries for "nodes")
>>
>> -Original Message-
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>> Sent: Thursday, April 17, 2014 4:33 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: lvg entries
>>
>> Quick follow-up since I was interested. The current dependency parser 
>> does have the option to use ctakes lemmas or do its own lemmatizing, 
>> but that doesn't use the lemma field, it uses the normalizedForm 
>> field. I'm not sure if that field is actually ever filled in -- on my 
>> example data it is always null.
>>
>> Tim
>>
>> On 04/17/2014 01:57 PM, Masanz, James J. wrote:
>>> Offhand I recall at least one of the dependency parsers used the 
>>> Lemma
>> annotations at one point.
>>> Not sure if still does.
>>>
>>> There is an option for turning off the posting of the lemmas to the cas.
>>>
>>> Hope that helps
>>>
>>> -Original Message-
>>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>>> Sent: Thursday, April 17, 2014 11:27 AM
>>> To: dev@ctakes.apache.org
>>> Subject: lvg entries
>>>
>>> The LVG annotator creates an enormous number of "lemmas" for every 
>>> WordToken in the CAS, and I'm wondering what the original purpose 
>>> was? I think this is probably a minor bottleneck for speed but 
>>> mostly a pretty big space hog (at least 50% of the space of xmi files in my 
>>> tests).
>>>
>>> As of right now I'm not sure if any downstream components are using 
>>> these lemmas, and on a manual inspection the precision seems to be 
>>> pretty abysmal (meaning most of them are nonsensical as lexical 
>>> variants), so as I said, just wondering if we can revisit why cTAKES 
>>> generates so many and whether that component can be optimized.
>>>
>>> Thanks
>>> Tim
>>>
>>>
>>



RE: new dictionary lookup {was RE: lvg entries]

2014-04-22 Thread Finan, Sean
Hi James,

>> Will the new dictionary lookup use the canonicalForm?

It does use WordToken.getCanonicalForm();
Usually this seems to be empty, but as long as it is present it will be used.
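
A minimal sketch of that behavior, assuming a single-word dictionary held as a set of strings: the token text is always tried, and the canonical form is tried as well only when it is non-empty. CanonicalFormLookupSketch, lookupKeys and matches are hypothetical names for illustration, not the module's API; the only real call referred to here is WordToken.getCanonicalForm().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: look a token up by its text and, when present, its canonical form. */
final class CanonicalFormLookupSketch {

    /** Collects the strings to try for one token; canonicalForm may be null or empty. */
    static Set<String> lookupKeys(String tokenText, String canonicalForm) {
        Set<String> keys = new LinkedHashSet<>();
        keys.add(tokenText.toLowerCase());
        if (canonicalForm != null && !canonicalForm.isEmpty()) {
            keys.add(canonicalForm.toLowerCase());
        }
        return keys;
    }

    /** True if any key for the token appears in the dictionary. */
    static boolean matches(Set<String> dictionary, String tokenText, String canonicalForm) {
        for (String key : lookupKeys(tokenText, canonicalForm)) {
            if (dictionary.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("node", "tumor"));
        // The canonical form "node" supplies the match for the surface form "nodes".
        System.out.println(matches(dictionary, "nodes", "node"));
        // Without a canonical form only the surface form is tried.
        System.out.println(matches(dictionary, "nodes", null));
    }
}
```

This mirrors James's "lymph node" / "lymph nodes" example from earlier in the thread, reduced to one token.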


-Original Message-
From: andy mcmurry [mailto:mcmurry.a...@gmail.com] 
Sent: Tuesday, April 22, 2014 4:23 AM
To: dev@ctakes.apache.org
Subject: Re: new dictionary lookup {was RE: lvg entries]

Highly Relevant

*DNorm: disease name normalization*
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/

"Disease names are often created by combining roots and affixes from Greek or 
Latin (e.g. ‘hemochromatosis’)" 






On Mon, Apr 21, 2014 at 8:57 AM, Masanz, James J. wrote:

> Sean,
>
> Will the new dictionary lookup use the canonicalForm? If not, perhaps 
> you can remove LVG from at least some of the pipelines (drug-ner does 
> not include the dependency parser)
>
> -----Original Message-
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Thursday, April 17, 2014 12:52 PM
> To: dev@ctakes.apache.org
> Subject: RE: lvg entries
>
> Those variants are not used by the dictionary lookup.  I did look at 
> them to see if it was worthwhile for the new dictionary, but they are 
> all over the place so I passed.
> 
> From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
> Sent: Thursday, April 17, 2014 1:25 PM
> To: dev@ctakes.apache.org
> Subject: Re: lvg entries
>
> Pei and I had a similar discussion in person -- mapping from lexical 
> variants to a stem might be useful. Pei also mentioned that one 
> intended use might have been searching the dictionary with lexical 
> variants, but I don't think that is done. Looking at the precision of 
> the variants, I think it's highly unlikely the speed tradeoff would be 
> worth any improvements in recall.
>
> Finally, at least in eclipse doing a search on references to the 
> method to retrieve the lemma entries turns up nothing.
>
> Tim
>
>
> On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> > I don't know of any applications within cTAKES that make use of this...
> The reverse (mapping from these "variants" to the normal form) may be 
> useful though.
> >
> > Dima
> >
> >
> >
> >
> > On Apr 17, 2014, at 11:50, Miller, Timothy <
> timothy.mil...@childrens.harvard.edu> wrote:
> >
> >> Sure, just as an example, I gave it a note with about 1000 words. 
> >> It generates 11500 NonEmptyFSList elements (each is basically one 
> >> lexical variant).
> >>
> >> For the word "symptomatic", these are the first 10 of 20 lexical
> variants:
> >> Symptomaticer/JJ
> >> Symptomaticer/RB
> >> Symptomaticed/VB
> >> Symptomaticcing/VB
> >> Symptomatics/VB
> >> Symptomatics/NN
> >> Symptomaticked/VB
> >> Symptomatic/VB
> >> Symptomatic/JJ
> >> Symptomatic/RB
> >>
> >> Tim
> >>
> >>
> >> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
> >>> Tim, this is a very interesting observation. Could you please send 
> >>> a
> few examples of what LVG generates? Both sensical and non :)
> >>>
> >>> Dima
> >>>
> >>>
> >>>
> >>>
> >>> On Apr 17, 2014, at 11:28, Miller, Timothy <
> timothy.mil...@childrens.harvard.edu> wrote:
> >>>
> >>>> The LVG annotator creates an enormous number of "lemmas" for 
> >>>> every WordToken in the CAS, and I'm wondering what the original 
> >>>> purpose
> was? I
> >>>> think this is probably a minor bottleneck for speed but mostly a
> pretty
> >>>> big space hog (at least 50% of the space of xmi files in my tests).
> >>>>
> >>>> As of right now I'm not sure if any downstream components are 
> >>>> using these lemmas, and on a manual inspection the precision 
> >>>> seems to be pretty abysmal (meaning most of them are nonsensical 
> >>>> as lexical variants), so as I said, just wondering if we can 
> >>>> revisit why cTAKES generates so many and whether that component can be 
> >>>> optimized.
> >>>>
> >>>> Thanks
> >>>> Tim
> >>>>
> >
>
>


RE: ytex merged into trunk

2014-04-28 Thread Finan, Sean
Hi Vijay,

I did a checkout this morning and I'm getting compile errors from Maven.

If I just run mvn compile then I get an error while building ytex claiming that 
the package has not been created.  Is there a reversed dependency?

If I run mvn compile package then ytex seems to run through, but there is an 
error in the test of ytex-uima (see below).

Any ideas?

Thanks,
Sean


Running org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest
...
2014-04-28 10:50:43,074 INFO  org.hibernate.dialect.Dialect  - HHH000400: Using dialect: org.hibernate.dialect.HSQLDialect
2014-04-28 10:50:43,112 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - SQL Error: -22, SQLState: S0002
2014-04-28 10:50:43,112 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - Table not found in statement [select uimatype0_.uima_type_id as uima_typ1_21_, uimatype0_.uima_type_name as uima_typ2_21_, uimatype0_.table_name as table_na3_21_ from PUBLIC.ref_uima_type uimatype0_]
...
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.277 sec <<< FAILURE!

Results :

Tests in error:
  test(org.apache.ctakes.ytex.uima.annotators.DBCollectionReaderTest): Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement
  org.apache.ctakes.ytex.uima.annotators.DBConsumerTest: Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement
  org.apache.ctakes.ytex.uima.annotators.DBConsumerTest
  testDictionaryLookupIntegrated(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest): Initialization of annotator class "org.apache.ctakes.ytex.uima.annotators.SegmentRegexAnnotator" failed.  (Descriptor: file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-uima/desc/analysis_engine/SegmentRegexAnnotator.xml)
  testDictionaryLookupSimple(org.apache.ctakes.ytex.uima.annotators.DictionaryLookupAnnotatorTest)
  testDisambiguate(org.apache.ctakes.ytex.uima.annotators.SenseDisambiguatorAnnotatorTest): Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement
  org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest: Unable to initialize group definition. Group resource n

RE: ytex merged into trunk

2014-04-28 Thread Finan, Sean
Completely new error.  I have taken this offline until we figure out what is 
going on.

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Monday, April 28, 2014 1:47 PM
To: dev@ctakes.apache.org
Subject: Re: ytex merged into trunk

Hello All,

I can't reproduce this build error.  It appears that maven does not want to run 
copy-dependencies in the compile phase.  However, I have tried building this 
with maven 3.2.1 and maven 3.1.0 and it works fine for both.

@Sean - can you send me the output of mvn -X clean install -pl ctakes-ytex 
(executed from ctakes root dir)?

This is the plugin that maven is complaining about:

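<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>compile</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <outputDirectory>${basedir}/target/lib</outputDirectory>
        <overWriteReleases>false</overWriteReleases>
        <overWriteSnapshots>false</overWriteSnapshots>
        <overWriteIfNewer>true</overWriteIfNewer>
      </configuration>
    </execution>
  </executions>
</plugin>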





On Mon, Apr 28, 2014 at 1:26 PM, vijay garla  wrote:

> sorry about that.  I will investigate.
>
> -vj
>
>
> On Mon, Apr 28, 2014 at 11:00 AM, Finan, Sean < 
> sean.fi...@childrens.harvard.edu> wrote:
>
>> Hi Vijay,
>>
>> I did a checkout this morning and I'm getting compile errors from Maven.
>>
>> If I just run mvn compile then I get an error while building ytex 
>> claiming that the package has not been created.  Is there a reversed 
>> dependency?
>>
>> If I run mvn compile package then ytex seems to run through, but 
>> there is an error in the test of ytex-uima (see below).
>>
>> Any ideas?
>>
>> Thanks,
>> Sean
>>
>>
>> Running org.apache.ctakes.ytex.uima.annotators.SparseDataExporterTest
>> ...
>> 2014-04-28 10:50:43,074 INFO  org.hibernate.dialect.Dialect  - HHH000400: Using dialect: org.hibernate.dialect.HSQLDialect
>> 2014-04-28 10:50:43,112 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - SQL Error: -22, SQLState: S0002
>> 2014-04-28 10:50:43,112 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - Table not found in statement [select uimatype0_.uima_type_id as uima_typ1_21_, uimatype0_.uima_type_name as uima_typ2_21_, uimatype0_.table_name as table_na3_21_ from PUBLIC.ref_uima_type uimatype0_]
>> ...
>> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.277 sec <<< FAILURE!
>>
>> Results :
>>
>> Tests in error:
>>   test(org.apache.ctakes.ytex.uima.annotators.DBCollectionReaderTest): Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare statement
>>   org.apache.ctakes.ytex.uima.annotators.DBConsumerTest: Unable to initialize group definition. Group resource name [classpath*:org/apache/ctakes/ytex/uima/beanRefContext.xml], factory key [ytexApplicationContext]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ytexApplicationContext' defined in URL [file:/C:/Spiffy/Dev/ApacheCtakesTrunk/ctakes-ytex-res/src/main/resources/org/apache/ctakes/ytex/uima/beanRefContext.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.springframework.context.support.ClassPathXmlApplicationContext]: Constructor threw exception; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'documentMapperService' defined in class path resource [org/apache/ctakes/ytex/uima/beans-uima-mapper.xml]: Invocation of init method failed; nested exception is org.hibernate.exception.SQLGrammarException: could not prepare sta

RE: Explict version numbers instead of ranges in pom.xml

2014-05-02 Thread Finan, Sean
+1
> so I was planning to update 

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Friday, May 02, 2014 12:27 PM
To: dev@ctakes.apache.org
Subject: Explict version numbers instead of ranges in pom.xml

Hi,
Is there any opposition to using explicit dependency version numbers?
Occasionally, I get errors like the one below:
Failed to collect dependencies at junit:junit:jar:[4.10,4.10] Caused by: 
org.eclipse.aether.resolution.VersionRangeResolutionException: No versions 
available for junit:junit:jar:[4.10,4.10] within specified range

In principle, I think it's nice to have explicit behavior; otherwise, it can be 
hard to reproduce errors if we allow version ranges.
There were only 3 places in the pom.xml that allowed ranges, so I was planning 
to update those unless anyone has strong objections to it...

--Pei


RE: Preparing for an Apache cTAKES 3.2 Release?

2014-06-11 Thread Finan, Sean
>. The newer NER should have in its name the Behavior...

I agree, but the *2 module is a complete replacement for the current lookup.  
It does not (really) have any different behavior, just a different 
implementation and performance.  We plan to swap out the old with the new in 
the next release and get rid of the *2 suffix.  So, any name provided now is 
just temporary - unless people don't like the name "dictionary-lookup" at all.

In my original sandbox it was named "RareWordLookup", a nod to its 
implementation.  However, this doesn't help any users.
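
For anyone curious about the name, a rough sketch of the idea follows, assuming "rare word" means indexing each dictionary term under its least frequent word so that only tokens which are some term's rare word trigger a full comparison. RareWordIndexSketch, Term and find are illustrative names only, not the module's classes, and the sample terms are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of indexing dictionary terms by their rarest word. */
final class RareWordIndexSketch {

    /** A term's tokens plus the position of its rarest token. */
    static final class Term {
        final String[] tokens;
        final int rareIndex;

        Term(String[] tokens, int rareIndex) {
            this.tokens = tokens;
            this.rareIndex = rareIndex;
        }
    }

    private final Map<String, List<Term>> byRareWord = new HashMap<>();

    RareWordIndexSketch(List<String> terms) {
        // Count word occurrences across all terms.
        Map<String, Integer> counts = new HashMap<>();
        List<String[]> tokenized = new ArrayList<>();
        for (String term : terms) {
            String[] tokens = term.toLowerCase().split("\\s+");
            tokenized.add(tokens);
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // File each term under its least frequent word (first one wins on ties).
        for (String[] tokens : tokenized) {
            int rare = 0;
            for (int i = 1; i < tokens.length; i++) {
                if (counts.get(tokens[i]) < counts.get(tokens[rare])) {
                    rare = i;
                }
            }
            byRareWord.computeIfAbsent(tokens[rare], k -> new ArrayList<>())
                      .add(new Term(tokens, rare));
        }
    }

    /** Returns the dictionary terms found in the token sequence. */
    List<String> find(String[] text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length; i++) {
            List<Term> candidates = byRareWord.get(text[i].toLowerCase());
            if (candidates == null) {
                continue; // not any term's rare word, so nothing to compare
            }
            for (Term term : candidates) {
                int start = i - term.rareIndex;
                if (start < 0 || start + term.tokens.length > text.length) {
                    continue; // the term would not fit around this position
                }
                boolean match = true;
                for (int j = 0; j < term.tokens.length && match; j++) {
                    match = term.tokens[j].equals(text[start + j].toLowerCase());
                }
                if (match) {
                    hits.add(String.join(" ", term.tokens));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        RareWordIndexSketch index = new RareWordIndexSketch(
                Arrays.asList("lymph node", "lymph vessel", "malignant tumor"));
        System.out.println(index.find("enlarged lymph node noted".split(" ")));
    }
}
```

In this sketch the common word "lymph" never triggers a comparison on its own; only "node", "vessel" and "malignant" do, which is the kind of saving the name hints at.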

Sean

-Original Message-
From: andy mcmurry [mailto:mcmurry.a...@gmail.com] 
Sent: Wednesday, June 11, 2014 3:09 AM
To: dev@ctakes.apache.org
Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

"2" doesn't mean much. The newer NER should have in its name the Behavior...

Perhaps something like MetaMap Usage
 "--allow_overmatches" or  
"--allow_concept_gaps" or .other?

Since yTex already provides a pluggable *DictionaryLookup, *that seems like the 
best place to define the differing Behavior /  Usage.

https://cwiki.apache.org/confluence/display/CTAKES/User's+Guide
https://code.google.com/p/ytex/wiki/DictionaryLookup_V05


AndyMC

On Tue, Jun 10, 2014 at 9:55 AM, britt fitch  wrote:

> I don’t have an issue with the *-2 name. I also don’t have any 
> objections to renaming it.
>
> It might be nice to keep the old dictionary code around for a 
> release-worth of time but after that I would vote for purging it.
> If someone needs it after that it’ll be accessible in the archived 
> releases.
>
>
>
> On Jun 10, 2014, at 12:48 PM, Chen, Pei 
> 
> wrote:
>
> > I think James has a fair point here.
> > It may be worthwhile biting the bullet here and pushing forward.
> >
> > Since this essentially will be a full replacement of the
> ctakes-dictionary-lookup module, a good option may be to just replace 
> the entire module now and rename the existing module to * _deprecated.
> > How do folks feel about that?  In a nutshell, 
> > ctakes-dictionary-lookup-2
> is a faster algorithm with a simpler code base- and comparable results 
> (Sean has a full comparison in the documentation for those who are curious).
> >
> > --Pei
> >
> >> -Original Message-
> >> From: britt fitch [mailto:britt.fi...@gmail.com]
> >> Sent: Monday, June 09, 2014 5:42 PM
> >> To: dev@ctakes.apache.org
> >> Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
> >>
> >> There is some documentation in the dictionary2 module under 
> >> /doc/DictionaryLookupHelp.{txt | docx} that gives some details 
> >> of
> the
> >> different lookup implementation options within that module that I 
> >> found helpful.
> >>
> >>
> >> On Jun 9, 2014, at 5:17 PM, Masanz, James J. 
> >> 
> >> wrote:
> >>
> >>>
> >>> Will ctakes-dictionary-lookup2 remain the name for the new 
> >>> dictionary
> >> lookup or will it have a name that reflects the algorithm?
> >>>
> >>> Is there a description of it that will help users to decide when 
> >>> to
> use one
> >> dictionary lookup component vs. the other.
> >>>
> >>> -- James
> >>>
> >>> -Original Message-
> >>> From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
> >>> Sent: Friday, June 06, 2014 12:34 PM
> >>> To: dev@ctakes.apache.org
> >>> Subject: Preparing for an Apache cTAKES 3.2 Release?
> >>>
> >>> Hi,
> >>> The 3.2 release was slated to be released at the end of this month (Jun 21).
> >>> Since I volunteered to be the RM for this release, just like the 
> >>> past
> >> releases, I was planning to create a branch/tag next week from 
> >> trunk and dev can continue.
> >>> Feel free to take a look at any outstanding Jira issues [1] that 
> >>> you
> may want
> >> to be included in this release.
> >>>
> >>> Major changes include:
> >>> CTAKES-197  Upgrade cTAKES to Java 7
> >>> CTAKES-292  Integrate YTEX with cTAKES
> >>> CTAKES-82   Add ctakes-temporal module (Time and Event Annotator +
> >>> DocTimeRel Property only?)
> >>>
> >>> [1]
> >>> https://issues.apache.org/jira/browse/CTAKES-
> >> 298?jql=fixVersion%20%3D%
> >>> 203.2.0%20AND%20project%20%3D%20CTAKES
> >>>
> >>>> -Original Message-
> >>>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> >>>> Sent: Wednesday, March 26, 2014 9:34 PM
> >>>> To: 'dev@ctakes.apache.org'
> >>>> Subject: RE: Apache cTAKES 3.2 Release?
> >>>>
> >>>> +1 to naming it 3.2
> >>>>
> >>>> I'll review my JIRA items this week.
> >>>>
> >>>> -- James
> >>>>
> >>>> -Original Message-
> >>>> From: Pei Chen [mailto:chen...@apache.org]
> >>>> Sent: Wednesday, March 26, 2014 10:14 AM
> >>>> To: dev@ctakes.apache.org
> >>>> Subject: Apache cTAKES 3.2 Release?
> >>>>
> >>>> Hi,
> >>>>
> >>>> I think there are a lot of items slated for the next release, I
> >>>> suggest we make it 3.2 instead of another patch release.
> >>>>
> >>>> I can volunteer to be the RM unless someone would like to take
> >>>> that up...
> >>>>
> >>>

RE: Preparing for an Apache cTAKES 3.2 Release?

2014-06-11 Thread Finan, Sean
> it would be incredibly helpful to have thorough documentation

I agree.  There is some documentation in the module's doc/ directory, but it is 
very brief.  There are also some example descriptors in the example/ directory. 
 The -resource also has some example xmls and dictionaries.

It isn't much, but I have a small plate heaped with large portions of many 
courses and very little time to document.  If there are questions please write 
me and I'll update the documentation as necessary.  Anybody else that feels 
inclined can also add to the docs.  Eventually the documentation should be 
moved to reside with the rest of the cTakes docs.

Sean

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Wednesday, June 11, 2014 9:33 AM
To: dev@ctakes.apache.org
Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

regardless of the name, I think it would be incredibly helpful to have thorough 
documentation on the dictionary lookup, how to configure it, and how to create 
new dictionaries.  I would venture to say that this is the most important 
component in cTAKES, and probably the one that has generated the most questions 
on the newsgroup.



On Wed, Jun 11, 2014 at 9:21 AM, Finan, Sean < 
sean.fi...@childrens.harvard.edu> wrote:

> >. The newer NER should have in its name the Behavior...
>
> I agree, but the *2 module is a complete replacement for the current 
> lookup.  It does not (really) have any different behavior, just a 
> different implementation and performance.  We plan to swap out the old 
> with the new in the next release and get rid of the *2 suffix.  So, 
> any name provided now is just temporary - unless people don't like the 
> name "dictionary-lookup" at all.
>
> In my original sandbox it was named "RareWordLookup", a nod to its 
> implementation.  However, this doesn't help any users.
>
> Sean
>
> -Original Message-
> From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
> Sent: Wednesday, June 11, 2014 3:09 AM
> To: dev@ctakes.apache.org
> Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
>
> "2" doesn't mean much. The newer NER should have in its name the 
> Behavior...
>
> Perhaps something like MetaMap Usage
> <http://metamap.nlm.nih.gov/Docs/MM09_Usage.shtml> "--allow_overmatches"
> or  "--allow_concept_gaps" or .other?
>
> Since yTex already provides a pluggable *DictionaryLookup, *that seems 
> like the best place to define the differing Behavior /  Usage.
>
> https://cwiki.apache.org/confluence/display/CTAKES/User's+Guide
> https://code.google.com/p/ytex/wiki/DictionaryLookup_V05
>
>
> AndyMC
>
> On Tue, Jun 10, 2014 at 9:55 AM, britt fitch 
> wrote:
>
> > I don’t have an issue with the *-2 name. I also don’t have any 
> > objections to renaming it.
> >
> > It might be nice to keep the old dictionary code around for a 
> > release-worth of time but after that I would vote for purging it.
> > If someone needs it after that it’ll be accessible in the archived 
> > releases.
> >
> >
> >
> > On Jun 10, 2014, at 12:48 PM, Chen, Pei 
> > 
> > wrote:
> >
> > > I think James has a fair point here.
> > > It may be worthwhile biting the bullet here and pushing forward.
> > >
> > > Since this essentially will be a full replacement of the
> > ctakes-dictionary-lookup module, a good option may be to just replace 
> > the entire module now and rename the existing module to * _deprecated.
> > > How do folks feel about that?  In a nutshell,
> > > ctakes-dictionary-lookup-2
> > is a faster algorithm with a simpler code base- and comparable 
> > results (Sean has a full comparison in the documentation for those 
> > who are
> curious).
> > >
> > > --Pei
> > >
> > >> -Original Message-
> > >> From: britt fitch [mailto:britt.fi...@gmail.com]
> > >> Sent: Monday, June 09, 2014 5:42 PM
> > >> To: dev@ctakes.apache.org
> > >> Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
> > >>
> > >> There is some documentation in the dictionary2 module under 
> > >> /doc/DictionaryLookupHelp.{txt | docx} that gives some 
> > >> details of
> > the
> > >> different lookup implementation options within that module that I 
> > >> found helpful.
> > >>
> > >>
> > >> On Jun 9, 2014, at 5:17 PM, Masanz, James J.
> > >> 
> > >> wrote:
> > >>
> > >>>
> > >>> Will ctakes-dictionary-lookup2 remain the name for the new 

RE: Preparing for an Apache cTAKES 3.2 Release?

2014-06-16 Thread Finan, Sean
I guess that I've got one question at this point:

Is the name being given to the -new- dictionary lookup module temporary or 
permanent?  

I was under the assumption that it was temporary and that with the switch to it 
being default (and eventually only) the module would simply be named 
"dictionary-lookup".



-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Monday, June 16, 2014 11:24 AM
To: 'dev@ctakes.apache.org'
Subject: RE: Preparing for an Apache cTAKES 3.2 Release?

I'd rather something else than "dictionary-lookup-fast". If we come up with 
something even faster than this one, having an older one called "fast" could be 
confusing.

-Original Message-
From: Dligach, Dmitriy [mailto:dmitriy.dlig...@childrens.harvard.edu]
Sent: Monday, June 16, 2014 9:55 AM
To: cTAKES Developer list
Subject: Re: Preparing for an Apache cTAKES 3.2 Release?

+1

Dima




On Jun 16, 2014, at 9:42, Miller, Timothy 
 wrote:

> Sorry to weigh in so late on this -- just returned from vacation. If 
> we want to have a one release delay before making dictionary2 default 
> for testing/documentation/configuration purposes, and there isn't an 
> obvious function-related name, and the main difference is speed, maybe 
> we could call it dictionary-lookup-fast? Besides being accurate and 
> more descriptive than "2", it might lure people into trying it and 
> give us some feedback.
> 
> Tim
> 
> 
> On 06/16/2014 10:34 AM, Chen, Pei wrote:
>> I'm making some significant updates to trunk that may cause some instability 
>> for this release.
>> It should be mostly transparent, but let me know if you encounter any issues 
>> with trunk.
>> 
>> Also, regarding the dictionary-lookup2.  If there are no strong objections, 
>> we can leave the default as-is (old behavior).  Folks who wish to give the 
>> new one a try are welcome to do so and we can change the default behavior in 
>> a future release.
>> 
>> [ducks for cover now]
>> --Pei
>> 
>>> -Original Message-
>>> From: ksa...@gmail.com [mailto:ksa...@gmail.com] On Behalf Of 
>>> Karthik Sarma
>>> Sent: Wednesday, June 11, 2014 9:58 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
>>> 
>>> Agreed
>>> 
>>> On Wednesday, June 11, 2014, vijay garla  wrote:
>>> 
>>>> regardless of the name, I think it would be incredibly helpful to 
>>>> have thorough documentation on the dictionary lookup, how to 
>>>> configure it, and how to create new dictionaries.  I would venture 
>>>> to say that this is the most important component in cTAKES, and 
>>>> probably the one that has generated the most questions on the newsgroup.
>>>> 
>>>> 
>>>> 
>>>> On Wed, Jun 11, 2014 at 9:21 AM, Finan, Sean < 
>>>> sean.fi...@childrens.harvard.edu> wrote:
>>>> 
>>>>>> . The newer NER should have in its name the Behavior...
>>>>> I agree, but the *2 module is a complete replacement for the 
>>>>> current lookup.  It does not (really) have any different behavior, 
>>>>> just a
>>>> different
>>>>> implementation and performance.  We plan to swap out the old with 
>>>>> the new in the next release and get rid of the *2 suffix.  So, any 
>>>>> name provided now is just temporary - unless people don't like the 
>>>>> name "dictionary-lookup" at all.
>>>>> 
>>>>> In my original sandbox it was named "RareWordLookup", a nod to its 
>>>>> implementation.  However, this doesn't help any users.
>>>>> 
>>>>> Sean
>>>>> 
>>>>> -Original Message-
>>>>> From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
>>>>> Sent: Wednesday, June 11, 2014 3:09 AM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: Preparing for an Apache cTAKES 3.2 Release?
>>>>> 
>>>>> "2" doesn't mean much. The newer NER should have in its name the 
>>>>> Behavior...
>>>>> 
>>>>> Perhaps something like MetaMap Usage 
>>>>> <http://metamap.nlm.nih.gov/Docs/MM09_Usage.shtml> "--allow_overmatches"
>>>>> or "--allow_concept_gaps" or other?
>>>>> 
>>>>> Since yTex already provides a pluggable *DictionaryLookup, *that 
>>

RE: DeepPheno: guidance on CTakes

2014-06-27 Thread Finan, Sean
Hi Pei,

Nice examples.  The pipeline builder could be simpler (divvied), but they 
shouldn't leave anybody confused.

+1 for the uimafit annotations!

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Friday, June 27, 2014 11:11 AM
To: Hochheiser, Harry Stewart; dev@ctakes.apache.org
Subject: RE: DeepPheno: guidance on CTakes

+dev
Harry,
I've just checked in two example Java classes [1] that should make life a 
lot easier for developers to create and add new cTAKES Annotators.
They will initially shield users from all of the complexities of UIMA, XML 
Descriptors, cTAKES, etc.

Just check out the latest: 
svn co http://svn.apache.org/repos/asf/ctakes/trunk
mvn clean compile

--Pei
[1] 
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-examples/src/main/java/org/apache/ctakes/examples/

> -Original Message-
> From: Hochheiser, Harry Stewart [mailto:har...@pitt.edu]
> Sent: Thursday, June 26, 2014 5:31 PM
> To: Chen, Pei
> Subject: DeepPheno: guidance on CTakes
> 
> Pei:
> 
> As I'm now digging into cTAKES as part of our DeepPheno project (and 
> some other related efforts), I'm hoping you can help with a quick 
> question. Is there any guide/documentation on the process for adding 
> new annotators to cTAKES?  I've dug into the apache site and mailing 
> list archives, but haven't had much luck.
> 
> Thanks!
> 
> -harry
> 
> 
> Harry Hochheiser
> University of Pittsburgh
> Department of Biomedical Informatics
> har...@pitt.edu   412 648 9300
> 
> 
> 



RE: Bacterium Dictionary

2014-06-30 Thread Finan, Sean
Hi Nick,
There are ~26,000 T007 Bacterium (falls under Living Being) entries in UMLS 
2013aa.  They aren't in the cTakes dictionary, but you can build a separate 
bacteria dictionary using the dictionary creator tool in cTakes sandbox.  It 
can create dictionaries formatted for use with both available 
cTakes-dictionary-lookup modules.  I have a full living beings dictionary; if 
you can somehow confirm your UMLS license then I could pull out the 
bacteria for you.
Sean
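
For readers who already hold a licensed 3-column BSV dictionary (CUI|TUI|Text, 
the format described earlier in this archive for the lookup2 module), pulling 
out a single semantic type such as T007 can be sketched roughly as below.  This 
is a minimal sketch, not the dictionary creator tool itself: the function name 
is hypothetical, and the sample rows use placeholder CUIs rather than real UMLS 
identifiers.

```python
def filter_bsv_by_tui(lines, wanted_tui="T007"):
    """Keep only CUI|TUI|Text rows whose TUI column equals wanted_tui."""
    kept = []
    for line in lines:
        row = line.rstrip("\n")
        if not row.strip():
            continue  # skip blank lines
        parts = row.split("|", 2)  # at most three columns: CUI, TUI, Text
        if len(parts) != 3:
            continue  # 2-column rows carry no TUI, so they cannot be filtered
        cui, tui, text = (part.strip() for part in parts)
        if tui == wanted_tui:
            kept.append(f"{cui}|{tui}|{text}")
    return kept


# Placeholder rows for illustration only; the CUIs are NOT real UMLS values.
sample_rows = [
    "C0000001|T007|Enterococcus faecium\n",
    "C0000002|T047|example disease term\n",
    "C0000003|T007|Pseudomonas aeruginosa\n",
]
bacteria_rows = filter_bsv_by_tui(sample_rows)
```

The filtered rows could then be written to a separate BSV file for the lookup 
module to read; that file name and location would be up to the user.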

> -Original Message-
> From: Pei Chen [mailto:chen...@apache.org]
> Sent: Monday, June 30, 2014 12:50 PM
> To: dev@ctakes.apache.org
> Subject: Re: Bacterium Dictionary
> 
> Nick,
> I am not sure how complete it is, but I believe the UMLS has the semantic type
> of Bacterium [T007]
>   It's most likely not included in the default cTAKES dictionaries though...
> 
> Thanks,
> Pei
> 
> 
> On Mon, Jun 30, 2014 at 10:31 AM, Nick Nikandish <
> snika...@emerginghealthit.com> wrote:
> 
> >  Hi there,
> >
> >
> >
> > I was wondering if Ctakes has any Bacterium Dictionary? I need to
> > extract information for bacteria like “Enterococcus Faecium”,
> > “Pseudomonas Aeruginosa “ , etc  and I was wondering if I can do it by
> > using Ctakes annotators?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > *Nick Nikandish*
> >
> > *Product Development Software Engineer*
> >
> > Clinical Research Informatics
> >
> >
> >
> > *Emerging Health*
> >
> > *Montefiore Information Technology*
> >
> > 6 Executive Blvd. Suite 290, Yonkers, NY 10701
> >
> > 914-457-6792 Office
> >
> > snika...@montefiore.org
> >
> > www.emerginghealthit.com
> >
> > www.montefiore.org
> >
> >
> >
> >
> >
> >


RE: Bacterium Dictionary

2014-06-30 Thread Finan, Sean
> check
> those texts against a comprehensive library like UMLS. I have a UMLS account,
> but I was wondering how to utilize Ctakes to use that library. It would be
> great if there were some documents on building a separate dictionary using the
> dictionary creator.
> 
> 
> Thanks again,
> Nick
> 
> -Original Message-
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Monday, June 30, 2014 3:37 PM
> To: dev@ctakes.apache.org
> Subject: RE: Bacterium Dictionary
> 
> Hi Nick,
> There are ~26,000 T007 Bacterium (falls under Living Being) entries in UMLS
> 2013aa.  They aren't in the cTakes dictionary, but you can build a separate
> bacteria dictionary using the dictionary creator tool in cTakes sandbox.  It 
> can
> create dictionaries formatted for use with both available cTakes-dictionary-
> lookup modules.  I have a full living beings dictionary, if you want to 
> somehow
> confirm your umls license then I could pull out the bacteria for you.
> Sean
> 
> > -Original Message-
> > From: Pei Chen [mailto:chen...@apache.org]
> > Sent: Monday, June 30, 2014 12:50 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Bacterium Dictionary
> >
> > Nick,
> > I am not sure how complete it is, but I believe the UMLS has the
> > semantic type of
> >
> > Bacterium
> > <https://uts.nlm.nih.gov//semanticnetwork.html#Bacterium;0;0;2014AA>
> >
> >  [T007]
> >   It's most likely not included in the default cTAKES dictionaries though...
> >
> > Thanks,
> > Pei
> >
> >
> > On Mon, Jun 30, 2014 at 10:31 AM, Nick Nikandish <
> > snika...@emerginghealthit.com> wrote:
> >
> > >  Hi there,
> > >
> > >
> > >
> > > I was wondering if Ctakes has any Bacterium Dictionary? I need to
> > > extract information for bacteria like “Enterococcus Faecium”,
> > > “Pseudomonas Aeruginosa “ , etc  and I was wondering if I can do it
> > > by using Ctakes annotators?
> > >
> > >
> > >
> > > Thanks,
> > >
> > >
> > >
> > > *Nick Nikandish*
> > >
> > > *Product Development Software Engineer*
> > >
> > > Clinical Research Informatics
> > >
> > >
> > >
> > > *Emerging Health*
> > >
> > > *Montefiore Information Technology*
> > >
> > > 6 Executive Blvd. Suite 290, Yonkers, NY 10701
> > >
> > > 914-457-6792 Office
> > >
> > > snika...@montefiore.org
> > >
> > > www.emerginghealthit.com
> > >
> > > www.montefiore.org
> > >
> > >
> > >
> > >
> > >
> > >


RE: [VOTE] Release Apache cTAKES 3.2.0

2014-07-02 Thread Finan, Sean
+1

Pulled fresh candidate, built, and ran Clinical using CPE without problem.  
Other than that, no testing.  SVN gave me a problem initially (checked out as 
anonymous) asking for a password then flunking the checkout, but an update 
completed it.  I blame the heat.

From: Masanz, James J. [masanz.ja...@mayo.edu]
Sent: Monday, June 30, 2014 10:24 PM
To: dev@ctakes.apache.org
Subject: RE: [VOTE] Release Apache cTAKES 3.2.0

This is pretty obvious, but since this is a record of what was voted upon,
note that some of the URLs contain an extra

ctakes-3.2.0/

For example
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

should be just
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz

-- James


From: Pei Chen [chen...@apache.org]
Sent: Friday, June 27, 2014 5:15 PM
To: dev@ctakes.apache.org
Subject: [VOTE] Release Apache cTAKES 3.2.0

Hi all,

This is a call for a vote on releasing the following candidate (rc1) as
Apache cTAKES 3.2.0.
The major changes include:
- New optional YTEX component(s) (Yale Extensions to cTAKES)
- New optional improved/faster dictionary lookup (dictionary-lookup-fast)
- New optional Temporal component (Time + Event extraction.  Relations will
be included in a future release.)
- Other bug fixes/enhancements from Jira

[TODO: Online documentation still needs to be updated on wiki for the above]

For more detailed information on the changes/release notes, please visit:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066

The release was made using the cTAKES release process documented here:
http://ctakes.apache.org/ctakes-release-guide.html

The candidate is available at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz
/.zip

The tag to be voted on:
http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc1/

The MD5 checksum of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.md5
/.zip.md5

The signature of the tarball can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-src.tar.gz.asc
/.zip.asc

Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
https://dist.apache.org/repos/dist/release/ctakes/KEYS
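
Checking a downloaded candidate against its published .md5 file can be sketched 
as below.  This is only an illustration under assumptions: the helper names are 
hypothetical, the .md5 file is assumed to begin with the hex digest (optionally 
followed by a file name), and an in-memory stream stands in for the real 
tarball.  An MD5 match only shows that the download is not corrupted; the .asc 
signature and KEYS file are what establish who produced it.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(stream, md5_file_text):
    """Compare a stream's digest with the first token of a .md5 file's text."""
    tokens = md5_file_text.split()
    if not tokens:
        return False  # empty .md5 file: nothing to compare against
    return md5_of_stream(stream) == tokens[0].lower()


# Stand-in for open("apache-ctakes-3.2.0-src.tar.gz", "rb"); the bytes are made up.
payload = b"not the real tarball"
published = hashlib.md5(payload).hexdigest() + "  apache-ctakes-3.2.0-src.tar.gz\n"
print(matches_published_md5(io.BytesIO(payload), published))
```

With a real download, the stream would be the opened tarball and the text would 
be the contents of the matching .md5 file.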

Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote is
open for at least the next 72 hours.
Only votes from the cTAKES PMC are binding, but folks are welcome to check
the release candidate and voice their approval or disapproval.
The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache cTAKES 3.2.0
[ ] -1 Do not release the packages because...

Also, the convenience binary can be found at:
http://people.apache.org/~chenpei/RCs/ctakes-3.2.0/ctakes-3.2.0/apache-ctakes-3.2.0-bin.tar.gz
/.zip
Note: It's temporarily on people.a.o because the artifacts were too large
for https://dist.apache.org/repos/dist/dev/ctakes (Working with infra on
increasing the limit).


Thanks!


RE: [VOTE] Release Apache cTAKES 3.2.0 (rc2)

2014-07-10 Thread Finan, Sean
+1 for the ytex method of handling a umls login before download of the umls 
resources.  While this also doesn't truly prevent people from sharing files 
(data) without a umls account, it is a little bit of a nicer mechanism.

Aside ...  Does anybody out there have experience with izpack?  (izpack.org)  
Creation of an "InstallAnywhere" style module is under consideration ...

> -Original Message-
> From: vijay garla [mailto:vnga...@gmail.com]
> Sent: Wednesday, July 09, 2014 10:30 AM
> To: dev@ctakes.apache.org
> Subject: Re: [VOTE] Release Apache cTAKES 3.2.0 (rc2)
> 
> ctakes-ytex-lib-3.1.2-SNAPSHOT.zip - this contains non-asf compliant ytex
> libs.  I would like to add it to the sourceforge site / or add it to the
> ctakes resources directly (that way users simply have to unzip a single zip
> file)
> 
> ctakes-ytex-resources-3.1.2-SNAPSHOT.zip - this contains data derived from
> the UMLS - concept graphs and dictionary lookup tables.  downloading this
> requires a UTS login.  It is conceptually no different from the ctakes
> resources, so I believe it would be OK to add it to that zip file, but I'm
> not a lawyer.
> 
> On another note: I think forcing users to specify the UTS username/password
> and contacting NIH every time you run cTAKES is problematic, and doesn't
> prevent users who don't have a valid UTS login from viewing the data contained
> in the lucene index dictionary.  I personally believe requiring a UTS login to
> download would be the best way to make resources derived from the UMLS
> available to users (this is what I'm doing for ytex-resources).
> 
> to summarize: for now, I would like to add the ytex libs to the ctakes 
> resources
> zip.
> 
> -vj
> 
> 
> 
> 
> On Wed, Jul 9, 2014 at 4:04 PM, Chen, Pei 
> wrote:
> 
> > The maven artifacts are also available in the staging area:
> > https://repository.apache.org/content/repositories/orgapachectakes-1001
> > VJ: Just curious- how did you envision ytex users downloading the
> > jars/war? From the distro bin.zip or from maven central?
> >
> > --Pei
> >
> > > -Original Message-
> > > From: Pei Chen [mailto:chen...@apache.org]
> > > Sent: Tuesday, July 08, 2014 6:11 PM
> > > To: dev@ctakes.apache.org
> > > Subject: [VOTE] Release Apache cTAKES 3.2.0 (rc2)
> > >
> > > Hi all,
> > >
> > > > The main difference between rc1 and rc2 is that we removed the lvg-res
> > > > and assertion-res.jar from the distro.  They still need to be unpacked.
> > > >
> > > > This is a call for a vote on releasing the following candidate (rc2) as
> > > > Apache cTAKES 3.2.0.
> > > > The major changes include:
> > > > - New optional YTEX component(s) (Yale Extensions to cTAKES)
> > > > - New optional improved/faster dictionary lookup (dictionary-lookup-fast)
> > > > - New optional Temporal component (Time + Event extraction.  Relations
> > > > will be included in a future release.)
> > > > - Other bug fixes/enhancements from Jira
> > > >
> > > > [TODO: Online documentation still needs to be updated on wiki]
> > > >
> > > > For more detailed information on the changes/release notes, please visit:
> > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313621&version=12324066
> > > >
> > > > The release was made using the cTAKES release process documented here:
> > > > http://ctakes.apache.org/ctakes-release-guide.html
> > > >
> > > > The candidate is available at:
> > > > http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz
> > > > /.zip
> > > >
> > > > The tag to be voted on:
> > > > http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.2.0-rc2
> > > >
> > > > The MD5 checksum of the tarball can be found at:
> > > > http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz.md5
> > > > /.zip.md5
> > > >
> > > > The signature of the tarball can be found at:
> > > > http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-src.tar.gz.asc
> > > > /.zip.asc
> > > >
> > > > Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:
> > > > https://dist.apache.org/repos/dist/release/ctakes/KEYS
> > > >
> > > > Please vote on releasing these packages as Apache cTAKES 3.2.0. The vote is
> > > > open for at least the next 72 hours.
> > > > Only votes from the cTAKES PMC are binding, but folks are welcome to check
> > > > the release candidate and voice their approval or disapproval.
> > > > The vote passes if at least three binding +1 votes are cast.
> > > >
> > > > [ ] +1 Release the packages as Apache cTAKES 3.2.0
> > > > [ ] -1 Do not release the packages because...
> > > >
> > > > Also, the convenience binary can be found at:
> > > > http://people.apache.org/~chenpei/RCs/ctakes-3.2.0-rc2/apache-ctakes-3.2.0-bin.tar.gz
> > > 

RE: Lucene for UMLS2014

2014-07-21 Thread Finan, Sean
Hi Harpreet,

If you are willing to use cTakes 3.2, try the dictionary-lookup-fast module as 
a replacement of the default dictionary-lookup.  That module has a new 
dictionary resource (hsql, not lucene) and slightly different methods for 
lookup and matching.  In time trials it has been faster than the default module 
(hence the name).  Accuracy depends upon the parameter settings, but in the 
tests performed so far the results are comparable or better.  The new 
dictionary is much leaner than the current default dictionary, small enough to 
port from the hsql cached version to a hsql in-memory version.  Using the 
in-memory version makes dictionary lookup practically instantaneous (hundredths 
of a second).  Limited documentation is available in the module's doc/ 
directory.

I will be on vacation for a week, but please don't hesitate to write if you 
have any questions.

Sean
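
As background for why a leaner lookup can be fast: earlier in this archive the 
module's original sandbox name is given as "RareWordLookup", a nod to its 
implementation.  The general rare-word idea can be sketched as below.  This is 
an illustration under assumptions, not the module's actual code: the function 
names, the whitespace tokenization, and the placeholder CUIs are all 
hypothetical.  Each term is indexed under its least frequent word, so a 
document token only triggers a full comparison for the few terms filed under 
that word.

```python
from collections import Counter, defaultdict


def build_rare_word_index(terms):
    """Index each (cui, text) term under its least frequent word."""
    tokenized = [(cui, text.lower().split()) for cui, text in terms]
    counts = Counter(word for _, words in tokenized for word in words)
    index = defaultdict(list)
    for cui, words in tokenized:
        if not words:
            continue  # ignore empty terms
        rare = min(words, key=lambda word: counts[word])
        index[rare].append((cui, words))
    return index


def lookup(tokens, index):
    """Return (cui, start, end) token spans for every indexed term found."""
    tokens = [token.lower() for token in tokens]
    hits = []
    for i, token in enumerate(tokens):
        for cui, words in index.get(token, ()):
            start = i - words.index(token)  # align the term on its rare word
            if start >= 0 and tokens[start:start + len(words)] == words:
                hits.append((cui, start, start + len(words)))
    return hits


# Placeholder identifiers, not real CUIs.
index = build_rare_word_index([("CUI-A", "malar"), ("CUI-B", "malar rash")])
hits = lookup("Patient came in with a malar rash".split(), index)
```

Here "malar rash" is filed under "rash" (the rarer word in this tiny 
dictionary), so it is only compared against the text when "rash" is seen.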

From: Harpreet Khanduja [hsk5...@rit.edu]
Sent: Thursday, July 17, 2014 5:07 PM
To: dev@ctakes.apache.org
Subject: Lucene for UMLS2014

Hello,
I would be grateful if someone could help.

I created a lucene index for umls2014 but only for snomed vocabulary.
I did this because I thought this would reduce the dictionary look up
time.
But it is still almost the same. Is there any other way to improve the
dictionary look up time?

Thank you,
Harpreet


RE: question about sentence segmentation

2014-08-02 Thread Finan, Sean
Hi Tim,

> It would be preferable to me to put sentence breaks in between the sections,
> so the first two sentences would be:
> 
> 1) PE: Lymphnodes...
> 2) Lungs: normal...

The punctuation is (always) after the logical break, being "Term: " for a 
Term:Definition list.  I think that the first three sentences should be
1) PE:
2) Lymphnodes: neck and ...
3) Lungs: normal and ...
Where the first line is an overarching Term: sentence (tree root), because each 
Term:Definition line that follows is within the physical exam.

Just an fyi.  Does that make sense?  Haven't had my coffee ...
Sean
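
A rough way to experiment with breaking at "Term:" headings is sketched below.  
It is a minimal sketch under assumptions, not the cTAKES sentence detector: the 
function name is hypothetical and a heading is assumed to be a single 
capitalized alphabetic word followed by a colon, which will misfire on ordinary 
prose such as "Note: ..." in the middle of a sentence.

```python
import re

# Assumption: a heading is one capitalized alphabetic word followed by ':'.
HEADING = re.compile(r"\b[A-Z][A-Za-z]*:")


def split_on_headings(text):
    """Split text so that every 'Term:' heading starts a new segment."""
    starts = [match.start() for match in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [piece for piece in pieces if piece]


line = ("PE: Lymphnodes: neck and axilla without adenopathy "
        "Lungs: normal and clear to auscultation "
        "CV: regular rate and rhythm")
segments = split_on_headings(line)
```

On the shortened example line this yields "PE:" as its own segment, followed by 
one segment per Term:Definition pair.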

> -Original Message-
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Saturday, August 02, 2014 7:44 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
> 
> I'm annotating some oncology notes from SHARP right now, and they are
> basically a nightmare for our current sentence segmentation model. Mainly
> because they eschew explicit markers between sentences. I thought I'd ping the
> list with some interesting examples just in case it stimulates ideas. But it
> seems to me that at some point we'll have to augment the opennlp module
> (preferable) or roll our own to handle cases like these.
> 
> In this example a bunch of background is on one line with no punctuation
> between logical breaks:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to
> auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2
> normal, no murmur, click, rub or gal*, chest is clear without rales or
> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
> findings right/left breast with mild swelling, warmth, mild erythema, slightly
> tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
> 
> It would be preferable to me to put sentence breaks in between the sections,
> so the first two sentences would be:
> 
> 1) PE: Lymphnodes...
> 2) Lungs: normal...
> 
> but without any candidate characters to split the sentence I don't think it is
> possible.
> 
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 with
> RN chemo teach  3. S U parent study
> 
> Our model will break on the period after the number, so we'd probably get:
> 1.
> Baseline labwork including tumor markers 2.
> Start DD 3.
> S U parent study
> 
> So the number is going in exactly the wrong place. Here it would be preferable
> to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study
> 
> Anyways, just something to think about! The problem is much more complex in
> clinical data than in edited text, but I'm sure we all knew that already :)
> 
> Tim
> 
> 
> 
> From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
> Sent: Monday, July 28, 2014 2:38 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
> 
> Yes, you're right about that Britt. I've been doing some annotations side by
> side with a treebank viewer and think I have a pretty good handle on the
> actual rules.
> 
> Basically, if a header or list identifier is followed by a period or a
> newline it is considered a sentence break and otherwise it is part of the
> sentence.
> 
> e.g.
> 
> 1. 20 mg flomax
> 
> is two sentences, while:
> 
> 1 - 20 mg flomax
> 
> is one sentence.
> 
> For headings:
> 
> Allergies: Pt is allergic to aspirin.
> 
> is one sentence, while:
> 
> Allergies:
> Pt is allergic to aspirin.
> 
> is two sentences.
> 
> I'm planning to follow these guidelines.
> 
> Tim
> 
> On 07/28/2014 01:53 PM, britt fitch wrote:
> 
> Thanks for the document, Tim. It seems to not be explicit about how to handle
> sentences occurring in lists.
> 
> Are you still considering having the list number as outside of the sentence?
> 
> Thanks
> 
> Britt
> 
> On Jul 25, 2014, at 7:09 AM, Miller, Timothy wrote:
> 
> 
> 
> Checking with Guergana and other colleagues here the advice is to have the
> sentence segmenter follow the treebank guidelines for sentence segmentation:
> http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
> 
> They are a bit light on detail but fortunately we have some treebanked data
> so I will use that for the training data and hopefully that will illuminate
> the tricky cases.
> 
> Tim
> 
> 
> From: Masanz, James J.
> [masanz.ja...@mayo.edu]
> Sent: Tuesday, July 15, 2014 4:39 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: question about sentence segmentation
> 
> Sorry, I don't know if there was a reason.
> 
> If you haven't checked with Guergana, you might want to ask her if she had a
> reason or if it was just the way it had been since that corpus was created.
> 
> -Original Messa

RE: code value for vocabulary in dic-lookup-fast

2014-08-06 Thread Finan, Sean
Hi Harpreet,

I don't know if this has yet been answered (I'm still finding vacation-time 
emails), but the Snomed-ct, Rx-norm, etc. codes were removed from the -fast 
dictionary for speed.  Basically, any single UMLS Cui can have multiple 
different snomed-ct codes (for instance), and adding extra rows per-code leads 
to a lot of waste.  A post-Cui-assignment step could be performed to assign 
non-unique snomed-ct codes (for instance) to discovered unique Cuis.  I am 
actually (slowly) conceptualizing an annotator that does just that - mapping 
Cuis to other source codes.  It would be an optional annotator, lean and fast.  
No promise on a date for startup code in sandbox.

Sean
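
The post-assignment step described above can be pictured as a plain one-to-many 
table lookup that runs after Cuis have been discovered.  The sketch below is 
only an illustration under assumptions: it is not the planned annotator, the 
function name is hypothetical, and the Cuis and codes are placeholders rather 
than real UMLS or snomed-ct values.

```python
def attach_source_codes(found_cuis, cui_to_codes):
    """Map each discovered CUI to the (possibly several) source codes known for it."""
    return {cui: sorted(cui_to_codes.get(cui, ())) for cui in found_cuis}


# Placeholder table: one CUI may carry several codes, another may carry none.
cui_to_codes = {
    "CUI-A": {"code-2", "code-1"},
    "CUI-B": {"code-3"},
}
codes_by_cui = attach_source_codes(["CUI-A", "CUI-B", "CUI-C"], cui_to_codes)
```

Keeping this table separate from the lookup dictionary means the dictionary 
needs only one row per term, while the codes are fetched once per discovered 
Cui.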

> -Original Message-
> From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
> Sent: Friday, July 25, 2014 2:33 PM
> To: dev@ctakes.apache.org
> Subject: code value for vocabulary in dic-lookup-fast
> 
> Hello,
> 
> I am using ctakes-dictionary-lookup-fast for annotation purposes.
> But there is no value for the "code" attribute like there was when I used
> ctakes-dictionary-lookup.
> 
> Is there any way I can find out the code attribute value using
> ctakes-dictionary-lookup-fast?
> 
> 
> Thank you so much for the help,
> 
> Harpreet


RE: v_snomed_fword_lookup view

2014-08-08 Thread Finan, Sean
Hi Clayton,

I don't know how the ytex dictionary lookup works, so I'm afraid that I can't 
help you with an answer.  Maybe Vijay is the best person to do this.  If you 
aren't tied to ytex you could try the new cTakes dictionary-lookup-fast.  I 
tested "Patient came in with a malar rash" and it found "malar" and "malar 
rash".

Vijay,

At some point the lookup-fast module will be the default for the cTakes 
clinical pipeline.  In order to synchronize the ytex lookup with cTakes, would 
you like to eventually work together on reusing the same code for ytex?  I have 
no idea what ytex does, but I know the ins and outs of the cdl-fast module.

Sean

> -Original Message-
> From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of
> Clayton Turner
> Sent: Friday, August 08, 2014 2:08 PM
> To: dev@ctakes.apache.org
> Subject: v_snomed_fword_lookup view
> 
> Hi Everyone:
> 
> I have a question about how the v_snomed_fword_lookup view works when
> running the CPE.
> 
> So my understanding of the view is that it is a view comprised of the
> ytex.umls_aui_fword table, the umls.mrconso table and bits/pieces from
> other umls tables.
> 
> I feel like this is not completely correct or my idea of how the join to
> create the view works is off. For example, let's say I want the CPE to find
> "malar " (e.g. malar rash) as a concept in the annotations. It never
> happens after running my CPE descriptor and I cannot find it in my
> v_snomed_fword_lookup view.
> 
> select count(*) from umls_aui_fword where fword='malar'; yields 34 results
> 
> select count(*) from umls.mrconso where str='malar'; yields 3 results.
> 
> So clearly these two tables know what the cui and context(s) are for malar.
> Yet, whenever I run a gold standard set of notes through the CPE,
> malar is constantly flagged as just a word token and the concept is never
> grabbed. This is recurrent for lots of other concepts as well; I just
> wanted to use an example to illustrate my issue.
> 
> Some troubleshooting I already went through:
> 1) Reinstalled ytex and umls database objects
> 2) Reinstalled a second time after redownloading umls through
> metamorphosys, ensuring that snomed vocabularies were included (also
> checked file sizes and noticed a big difference so I know those
> vocabularies ARE included)
> 
> Anyone got any ideas as to what the issue could be?
> 
> Thank you,
> Clayton Turner


RE: v_snomed_fword_lookup view

2014-08-11 Thread Finan, Sean
Hi Clay,
It has been a hectic day and I apologize for the late reply.

> Now, I don't even have a "lookup2" folder and, subsequently the Tui folder and
> cTakesSnomed.xml file. This seems to be the problem, but I'm not sure where
> these files are supposed to be grabbed from.

I think that all directories "lookup2" were renamed "lookup-fast", but it looks 
like the links in the xml descriptors were not updated to match.

In the SnomedLookupAnnotator.xml (or Ov equivalent), find the 
DictionaryDescriptorFile entry whose file url is
file:org/apache/ctakes/dictionary/lookup2/Snomed2011ab_ctakesTui/cTakesSnomed.xml

and adjust the file url accordingly.  I haven't checked out trunk since the 
code was moved (I've been busy with other things), but the equivalent file 
should not be difficult to find.  Unless you are on Windows the entire path to 
the dictionary database must be lowercase for hsql to work.  I think that was 
checked in, but if it hasn't been then you'll need to rename the directory (and 
hsql files) to be all lower-case.  Also, edit the corresponding line in 
cTakesSnomed.xml to match the rename.

If you get this working could you please check in the changes?  At some point I 
can do a full checkout and do it myself, but I won't get to it for a while.

Sean
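
A quick way to spot which parts of a resource path would need renaming for the 
lowercase requirement is sketched below.  It is a minimal sketch: the function 
name is hypothetical, it only inspects the path string (it does not touch the 
file system or rename anything), and the old lookup2 descriptor path from the 
error message is used purely as a sample input.

```python
from pathlib import PurePosixPath


def parts_with_uppercase(resource_path):
    """List the path components that contain at least one uppercase character."""
    normalized = resource_path.replace("\\", "/")  # accept Windows-style separators
    return [part for part in PurePosixPath(normalized).parts if part != part.lower()]


sample_path = "org/apache/ctakes/dictionary/lookup2/Snomed2011ab_ctakesTui/cTakesSnomed.xml"
offending = parts_with_uppercase(sample_path)
```

An empty result means every component is already lowercase; anything listed is 
a candidate for renaming, along with the matching references in the 
descriptors.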

> -Original Message-
> From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of
> Clayton Turner
> Sent: Monday, August 11, 2014 11:53 AM
> To: dev@ctakes.apache.org
> Subject: Re: v_snomed_fword_lookup view
> 
> When navigating to ctakes-dictionary-lookup-fast\desc\analysis_engine there
> are 2 files, assumedly analysis engines.
> 
> SnomedLookupAnnotator.xml and SnomedOvLookupAnnotator.xml
> 
> If I pick either, I put in my UMLS information but receive an error when
> trying to run the CPE:
> 
> Initialization of CAS Processor with name "SnomedOvLookupAnnotator" failed.
> CausedBy: org.apache.uima.resource.ResourceConfigurationException:
> Initialization of CAS processor with name "SnomedOvLookupAnnotator" failed.
> CausedBy: org.apache.uima.resource.ResourceInitializationException: Error
> initializing "org.apache.uima.resource.impl.DataResource_impl" from descriptor
> file:..SnomedLookupAnnotator.xml
> CausedBy: org.apache.uima.resource.ResourceInitializationException: Could not
> access the resource data at
> file:org\apache\ctakes\dictionary\lookup2\Snomed2011ab_ctakesTui\cTakesSnomed.xml
> 
> Now, I don't even have a "lookup2" folder and, subsequently the Tui folder and
> cTakesSnomed.xml file. This seems to be the problem, but I'm not sure where
> these files are supposed to be grabbed from.
> 
> 
> On Mon, Aug 11, 2014 at 11:47 AM, Clayton Turner 
> wrote:
> 
> > Hi again:
> >
> > How exactly do you switch to using the cTakes dictionary-lookup-fast?
> > Do I need to go in and alter xml files or is it as simple as adding a
> > certain item to the list of analysis engines?
> >
> >
> > On Fri, Aug 8, 2014 at 3:48 PM, Finan, Sean <
> > sean.fi...@childrens.harvard.edu> wrote:
> >
> >> Hi Clayton,
> >>
> >> I don't know how the ytex dictionary lookup works, so I'm afraid that
> >> I can't help you with an answer.  Maybe Vijay is the best person to do
> >> this.  If you aren't tied to ytex you could try the new cTakes
> >> dictionary-lookup-fast.  I tested "Patient came in with a malar rash"
> >> and it found "malar" and "malar rash".
> >>
> >> Vijay,
> >>
> >> At some point the lookup-fast module will be the default for the
> >> cTakes clinical pipeline.  In order to synchronize the ytex lookup
> >> with cTakes, would you like to eventually work together on reusing
> >> the same code for ytex?  I have no idea what ytex does, but I know
> >> the ins and outs of the cdl-fast module.
> >>
> >> Sean
> >>
> >> > -Original Message-
> >> > From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On
> >> > Behalf Of Clayton Turner
> >> > Sent: Friday, August 08, 2014 2:08 PM
> >> > To: dev@ctakes.apache.org
> >> > Subject: v_snomed_fword_lookup view
> >> >
> >> > Hi Everyone:
> >> >
> >> > I have a question about how the v_snomed_fword_lookup view works
> >> > when running the CPE.
> >> >
> >> > So my understanding of the view is that it is a view comprised of
> >> > the ytex.umls_aui_fwor

RE: v_snomed_fword_lookup view

2014-08-11 Thread Finan, Sean
Thanks Harpreet,
That is definitely necessary to build!

Those lines should already be in the pom, but commented out.  I think that some 
version/branching issues may have arisen at some point wrt this module ...

If somebody beats me to it then cheers, otherwise I will try to check out 
tonight and get all the bits in place.

Sean

> -Original Message-
> From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
> Sent: Monday, August 11, 2014 1:12 PM
> To: dev@ctakes.apache.org
> Subject: Re: v_snomed_fword_lookup view
> 
> Hello Clayton,
>   I do not know about ytex, but I did switch from dictionary-lookup to
> dictionary-lookup-fast.
>   I updated my ctakes-dictionary-lookup-fast project using maven.
>   I think I used Team - Update and switched to the latest revision available,
> and then I downloaded the new 3.2 resources for umls. Then I added these
> resources to my ctakes-dictionary-lookup-fast resources folder and also to the
> classpath in ctakes-clinical-pipeline.
> 
>  Then I changed the pom.xml file which belongs to the whole ctakes project and
> added these two dependencies to the file:
> 
> <dependency>
>   <groupId>org.apache.ctakes</groupId>
>   <artifactId>ctakes-dictionary-lookup-res</artifactId>
>   <version>${ctakes.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.ctakes</groupId>
>   <artifactId>ctakes-dictionary-lookup-fast</artifactId>
>   <version>${ctakes.version}</version>
> </dependency>
> 
> After this, I also added the dependency
> 
> <dependency>
>   <groupId>org.apache.ctakes</groupId>
>   <artifactId>ctakes-dictionary-lookup-fast</artifactId>
> </dependency>
> 
> to the pom.xml of ctakes-clinical-pipeline.
> 
> And then I added the resources folder in ctakes-clinical-pipeline using build
> path configuration under the "add class" option.
> 
> After this it should work.
> 
> 
> Regards,
> Harpreet
> 
> 
> 
> 
> 
> 
> On Mon, Aug 11, 2014 at 12:44 PM, Clayton Turner 
> wrote:
> 
> > I still get the same error with the ctakes3.2 branch. Any suggestions?
> >
> >
> > On Mon, Aug 11, 2014 at 12:06 PM, Clayton Turner
> > 
> > wrote:
> >
> > > I'm going to do a clean install through the repo rather than the
> > > binaries and see if that fixes my issue because I think I just read
> > > a past post saying the lookup2 folders exist there.
> > >
> > >
> > > On Mon, Aug 11, 2014 at 11:52 AM, Clayton Turner
> > > 
> > > wrote:
> > >
> > >> When navigating to
> > >> ctakes-dictionary-lookup-fast\desc\analysis_engine
> > >> there are 2 files, assumedly analysis engines.
> > >>
> > >> SnomedLookupAnnotator.xml and SnomedOvLookupAnnotator.xml
> > >>
> > >> If I pick either, I put in my UMLS information but receive an error
> > >> when trying to run the CPE:
> > >>
> > >> Initialization of CAS Processor with name "SnomedOvLookupAnnotator"
> > >> failed.
> > >> CausedBy: org.apache.uima.resource.ResourceConfigurationException:
> > >> Initialization of CAS processor with name "SnomedOvLookupAnnotator"
> > >> failed.
> > >> CausedBy: org.apache.uima.resource.ResourceInitializationException:
> > Error
> > >> initializing "org.apache.uima.resource.impl.DataResource_impl" from
> > >> descriptor file:..SnomedLookupAnnotator.xml
> > >> CausedBy: org.apache.uima.resource.ResourceInitializationException:
> > Could
> > >> not
> > >> access the resource data at
> > >>
> > >>
> > > >> file:org\apache\ctakes\dictionary\lookup2\Snomed2011ab_ctakesTui\cTakesSnomed.xml
> > >>
> > >> Now, I don't even have a "lookup2" folder and, subsequently the Tui
> > >> folder and cTakesSnomed.xml file. This seems to be the problem, but
> > >> I'm
> > not
> > >> sure where these files are supposed to be grabbed from.
> > >>
> > >>
> > >> On Mon, Aug 11, 2014 at 11:47 AM, Clayton Turner
> > >> 
> > >> wrote:
> > >>
> > >>> Hi again:
> > >>>
> > >>> How exactly do you switch to using the cTakes dictionary-lookup-fast?
> > >>> Do I need to go in and alter xml files or is it as simple as adding a
> > >>> certain item to the list of analysis engines?
> > >>>
> > >>>
> > >>> On Fri, Aug 8, 2014 at 3:48 PM, Finan, Sean <
> > >>> sean.fi...@childrens.harvard.edu> wrote:
> > >>>
> > >>>> Hi Clayton,
> > >>>>
> > >>>> I don

Youtube Channel "Apache cTakes"

2014-08-12 Thread Finan, Sean
cTakes now has a YouTube channel named "Apache cTakes".  It is empty, but if 
you have ever made a training video, a presentation on a component (descriptors, 
type system, etc.), or a demo of integration with another system (UimaFit, 
Uima-AS, etc.) then please feel free to post it on that channel.  Once there is 
content, the Apache pages can link to the channel.

Sean



RE: v_snomed_fword_lookup view

2014-08-13 Thread Finan, Sean
Hi Clayton,

I'm glad that you got it working.  Though I stated that I would, I haven't yet 
checked the fidelity of trunk.  Urgent data request one day, "must have" 
writing the next ... and I still live with the delusion that I left academia to 
have free time ...

I have never used ytex or weka, so I'm unfamiliar with all things .arff.  
Could it be that the ytex .arff exporter needs to be changed to consume the 
annotation classes that newer cTakes versions (>3.1) produce?

I have a custom CasConsumer that saves text spans and Cuis to file in a simple 
list, and that is what I used for the performance analysis of the lookup 
module.  For our other projects here in Beantown we have other various outputs 
that fit the job at hand: text flat files, xml files, sql database tables, 
knot-encoded lace doilies, etc.

I'm sure that none of the above helps you, but I felt obliged to provide some 
kind of answer to your question.

Sean

> -Original Message-
> From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of
> Clayton Turner
> Sent: Wednesday, August 13, 2014 12:25 PM
> To: dev@ctakes.apache.org
> Subject: Re: v_snomed_fword_lookup view
> 
> Okay, I believe I have ctakes dictionary fast working now. Something I'm
> curious about, though, is how you extract the data in order to conduct
> analysis.
> 
> I've, in the past, been using the SparseDataExporterImpl from ytex in order to
> create a .arff file for use in weka, but the ctakes pipeline I'm using doesn't
> seem to be compatible with this ytex exporting as I'm not getting any cuis in
> my arff file.
> 
> I'm using the aggregate plain text umls processor analysis engine from ctakes
> and then using the dbconsumer analysis engine from ytex (for storing into the
> database with regard to analysis batch).
> 
> Any tips for exporting or some simple issue I'm missing?
> 
> Thanks,
> Clayton
> 
> 
> On Mon, Aug 11, 2014 at 2:09 PM, Harpreet Khanduja 
> wrote:
> 
> > Yes, absolutely and
> > no problem at all.
> >
> > Regards,
> > Harpreet
> >
> >
> > On Mon, Aug 11, 2014 at 1:16 PM, Finan, Sean <
> > sean.fi...@childrens.harvard.edu> wrote:
> >
> > > Thanks Harpreet,
> > > That is definitely necessary to build!
> > >
> > > Those lines should already be in the pom, but commented out.  I
> > > think that some version/branching issues may have arisen at some
> > > point wrt this module ...
> > >
> > > If somebody beats me to it then cheers, otherwise I will try to
> > > check out tonight and get all the bits in place.
> > >
> > > Sean
> > >
> > > > -Original Message-
> > > > From: Harpreet Khanduja [mailto:hsk5...@rit.edu]
> > > > Sent: Monday, August 11, 2014 1:12 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Re: v_snomed_fword_lookup view
> > > >
> > > > Hello Clayton,
> > > >   I do not know about ytex, but I did switch from
> > > > dictionary-lookup to dictionary-lookup-fast.
> > > >   I update my ctakes-dictionary-lookup-fast project using maven.
> > > >   I think I used Team- Update and switched to the latest revision
> > > > available and then
> > > >   I downloaded new 3.2 resources from the for umls. and then I
> > > > added these resources to my
> > > >   ctakes-dictionary-lookup-fast resources folder and also the
> > > > classpath in ctakes-clinical-pipeline.
> > > >
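> > > >  Then I changed the pom.xml file which belongs to the whole ctakes
> > > > project and added
> > > > <dependency>
> > > >   <groupId>org.apache.ctakes</groupId>
> > > >   <artifactId>ctakes-dictionary-lookup-res</artifactId>
> > > >   <version>${ctakes.version}</version>
> > > > </dependency>
> > > > <dependency>
> > > >   <groupId>org.apache.ctakes</groupId>
> > > >   <artifactId>ctakes-dictionary-lookup-fast</artifactId>
> > > >   <version>${ctakes.version}</version>
> > > > </dependency>
> > > >
> > > >  these two dependencies to the file.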
> > > >
> > > >
> > > > After this, I also added the dependency
> > > > <dependency>
> > > >   <groupId>org.apache.ctakes</groupId>
> > > >   <artifactId>ctakes-dictionary-lookup-fast</artifactId>
> > > > </dependency>
> > > >
> > > > to the pom.xml of ctakes-clinical-pipeline.
> > > >
> > > > And then add the resources folder in ctakes-clinical-pipeline
> > > > using build path
>

RE: v_snomed_fword_lookup view

2014-08-13 Thread Finan, Sean
>is the purpose of a CasConsumer to essentially save your data

Correct, though it is a generic (and archaic) term indicating any end-user of 
the cas.  

> -Original Message-
> From: clayclay...@gmail.com [mailto:clayclay...@gmail.com] On Behalf Of
> Clayton Turner
> Sent: Wednesday, August 13, 2014 2:10 PM
> To: dev@ctakes.apache.org
> Subject: Re: v_snomed_fword_lookup view
> 
> Oh okay, so is the purpose of a CasConsumer to essentially save your data in a
> representation that you can do some kind of data mining or classification on 
> it?
> If so, then I think I need to look into making/using one of those.

RE: v_snomed_fword_lookup view

2014-08-13 Thread Finan, Sean
You can find example Cas Consumers in cTakes-core ..[dirPath]../cc/
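
The accumulate-then-write pattern that Tim describes in the quoted message below can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from cTakes or ytex: CuiArffCollector is a made-up, UIMA-free stand-in, its process() takes a document id and a list of CUIs instead of a CAS, and its collectionProcessComplete() returns the ARFF text as a String instead of writing a file. In a real component both methods would be the UIMA callbacks of the same names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a UIMA consumer; not an actual cTakes class. */
class CuiArffCollector {

    // Plain (non-UIMA) state that survives across documents: document id -> CUIs seen in it.
    private final Map<String, Set<String>> cuisByDocument = new LinkedHashMap<>();

    /** Called once per document. A real annotator would pull the CUIs out of the CAS here. */
    void process(String documentId, List<String> cuis) {
        cuisByDocument.computeIfAbsent(documentId, id -> new TreeSet<>()).addAll(cuis);
    }

    /** Called once after the last document; only now is the full attribute list known. */
    String collectionProcessComplete() {
        // Union of every CUI in the corpus, sorted so the attribute order is stable.
        Set<String> allCuis = new TreeSet<>();
        for (Set<String> cuis : cuisByDocument.values()) {
            allCuis.addAll(cuis);
        }
        StringBuilder arff = new StringBuilder("@RELATION cuis\n");
        for (String cui : allCuis) {
            arff.append("@ATTRIBUTE ").append(cui).append(" {0,1}\n");
        }
        arff.append("@DATA\n");
        // One row per document, one 0/1 column per CUI in the corpus.
        for (Set<String> cuis : cuisByDocument.values()) {
            List<String> row = new ArrayList<>();
            for (String cui : allCuis) {
                row.add(cuis.contains(cui) ? "1" : "0");
            }
            arff.append(String.join(",", row)).append('\n');
        }
        return arff.toString();
    }
}
```

The point of the split is that no row can be written until every document has been seen, since a CUI that first appears in the last document still needs a column in the first row.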

> -Original Message-
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Wednesday, August 13, 2014 2:20 PM
> To: dev@ctakes.apache.org
> Subject: Re: v_snomed_fword_lookup view
> 
> There's nothing conceptually special about the consumer model vs. regular
> annotators (Analysis Engines). You can write an output format from any
> analysis engine as long as it is after the annotations you need in the
> pipeline. If you have global constraints (like in an ARFF file I think you
> need to know all the CUIs in your corpus to write the attribute list?), then
> it is important to use the process() method [called once per document] to
> store CUIs in a non-UIMA class variable (for example, a map from file id to a
> list/set/multiset of CUIs), and then use the collectionProcessComplete()
> [called once after all documents have been processed] method to do the actual
> writing of the file.
> 
> Hope that is useful, sorry I couldn't tie it in to your previous YTEX
> exporter but I'm not familiar with that process.
> 
> Tim
> 
> 
> On 08/13/2014 02:11 PM, Clayton Turner wrote:
> > Oh okay, so is the purpose of a CasConsumer to essentially save your
> > data in a representation that you can do some kind of data mining or
> > classification on it?  If so, then I think I need to look into
> > making/using one of those.
> >

RE: Web server

2014-08-21 Thread Finan, Sean
Hi John,
Have you (or another) thought about modifying the Uima Simple Server to run a 
cTakes pipeline?
http://uima.apache.org/sandbox.html#simple-server


> -Original Message-
> From: John Green [mailto:john.travis.gr...@gmail.com]
> Sent: Thursday, August 21, 2014 3:06 PM
> To: dev@ctakes.apache.org
> Subject: Web server
> 
> I'm trying to deploy the cTakes web-server code someone already wrote (who
> wrote it btw?). I'm running into deployment issues in eclipse with tomcat 7
> on mac... I can get into details but for now: is it in a working state? I'm
> learning as I go and it looks in order and the code is solid...
> 
> Also, Pei: did they check in an LVG version that is thread safe now?
> 
> I'm really set on getting cTakes into a fluid RESTful interface.
> 
> JG


RE: Web server

2014-08-21 Thread Finan, Sean
> Do you have experience with uima simple server?
A few months ago I set it up and ran it just for kicks.  It is simple, but I 
pondered that as such it could serve as a nice foundation.  Well, maybe a 
cornerstone.


> -Original Message-
> From: John Green [mailto:john.travis.gr...@gmail.com]
> Sent: Thursday, August 21, 2014 3:43 PM
> To: dev@ctakes.apache.org
> Cc: dev@ctakes.apache.org
> Subject: RE: Web server
> 
> I have. I read the docs, it mentions more information but the tutorial was
> very short.
> 
> It seems there are simple get requests with the xml ae for output built into
> the existing sandbox code, so I just wanted to hash that first before starting
> on a new thread.
> 
> Do you have experience with uima simple server?
> 
> JG
> —
> Sent from Mailbox for iPhone


RE: Permutations

2014-09-05 Thread Finan, Sean
Hi Kim, Pei,

I don't think that I changed anything to which Kim is referring, just a couple 
of other things that happen to be in the same segment.  From the attached it 
looks like Kim's change is to copy a list and sort the copy, while mine were 
moving the sort from an inner to outer loop.  At any rate, whatever I did I did 
a year and a half ago and I'm concentrating on the new lookup these days.
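
For anyone following along, the difference between sorting the cached list and sorting a copy of it, as I read Kim's diff, can be sketched like this. The class and method names are invented for the illustration, and the copy is made with the ArrayList copy constructor rather than the clone() cast used in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not the FirstTokenPermutationImpl code. */
class PermutationSortSketch {

    /** Sorts the cached list itself, so every later reader sees the reordered permutation. */
    static List<Integer> sortInPlace(List<Integer> cachedPermutation) {
        Collections.sort(cachedPermutation);
        return cachedPermutation;
    }

    /** Sorts a private copy, so the cached permutation keeps its original order. */
    static List<Integer> sortCopy(List<Integer> cachedPermutation) {
        List<Integer> sorted = new ArrayList<>(cachedPermutation);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> cached = new ArrayList<>(Arrays.asList(2, 0, 1));

        List<Integer> sorted = sortCopy(cached);
        System.out.println(sorted + " " + cached); // prints "[0, 1, 2] [2, 0, 1]"

        sortInPlace(cached);
        System.out.println(cached);                // prints "[0, 1, 2]"
    }
}
```

With the in-place sort, a cached permutation such as 2, 0, 1 is no longer available in that order to whatever builds the lookup string afterwards, which would be consistent with Kim finding additional concepts once the copy is sorted instead.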

Sean

From: Pei Chen [chen...@apache.org]
Sent: Friday, September 05, 2014 12:17 PM
To: dev@ctakes.apache.org
Subject: Re: Permutations

Hi Kim,
Thanks for pointing that out.
https://issues.apache.org/jira/browse/CTAKES-310 has been opened for
this.
If you commit the changes, we can see if we can include in the 3.2.1
patch release.
I was looking at the changelist for this file, and it may look like
some of these optimizations may have been intentional by Sean so he
may have some more insight in this bit of the logic.

On Thu, Sep 4, 2014 at 6:22 PM, Kim Ebert
 wrote:
> Hi All,
>
> I was reviewing the use of permutations, and I noticed that we sorted
> the permutation list before creating the string to do the concept lookup
> with. It also appears that we were sorting the object that was stored in
> the parent list.
>
> I've made a few changes, and now it appears I can discover some
> additional concepts based upon the permutations.
>
> Let me know what you think of the following changes.
>
> Thanks,
>
> Kim
>
> === modified file
> 'ctakes-dictionary-lookup/src/main/java/org/apache/ctakes/dictionary/lookup/algorithms/FirstTokenPermutationImpl.java'
> ---
> ctakes-dictionary-lookup/src/main/java/org/apache/ctakes/dictionary/lookup/algorithms/FirstTokenPermutationImpl.java
> 2014-07-31 22:00:48 +
> +++
> ctakes-dictionary-lookup/src/main/java/org/apache/ctakes/dictionary/lookup/algorithms/FirstTokenPermutationImpl.java
> 2014-09-04 18:39:59 +
> @@ -210,11 +210,12 @@
>final List<List<Integer>> permutationList = iv_permCacheMap.get(
> permutationIndex );
>for ( List<Integer> permutations : permutationList ) {
>   // Moved sort and offset calculation from inner (per
> MetaDataHit) iteration 2-21-2013 spf
> - Collections.sort( permutations );
> + List<Integer> permutationsSorted = (List<Integer>)
> ((ArrayList)permutations).clone();
> + Collections.sort( permutationsSorted );
>   int startOffset = firstWordStartOffset;
>   int endOffset = firstWordEndOffset;
> - if ( !permutations.isEmpty() ) {
> -int firstIdx = permutations.get( 0 );
> + if ( !permutationsSorted.isEmpty() ) {
> +int firstIdx = permutationsSorted.get( 0 );
>  if ( firstIdx <= firstTokenIndex ) {
> firstIdx--;
>  }
> @@ -222,7 +223,7 @@
>  if ( firstToken.getStartOffset() < firstWordStartOffset ) {
> startOffset = firstToken.getStartOffset();
>  }
> -int lastIdx = permutations.get( permutations.size() - 1 );
> +int lastIdx = permutationsSorted.get(
> permutationsSorted.size() - 1 );
>  if ( lastIdx <= firstTokenIndex ) {
> lastIdx--;
>  }
>
>
> --
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>


RE: Permutations

2014-09-05 Thread Finan, Sean
> We also had issues in cTAKES 2.5.
> Here is the patch for 2.5. Before I got the patch to 3.0 Sean made his 
> changes.

Thanks for verifying,

Sean

From: Kim Ebert [kim.eb...@perfectsearchcorp.com]
Sent: Friday, September 05, 2014 12:28 PM
To: dev@ctakes.apache.org; Finan, Sean
Subject: Re: Permutations

Hi Pei and Sean,

Sean, any thoughts about this would be helpful.

We also had issues in cTAKES 2.5.

Here is the patch for 2.5. Before I got the patch to 3.0 Sean made his
changes.

=== modified file
'src/edu/mayo/bmi/lookup/algorithms/FirstTokenPermutationImpl.java'
--- src/edu/mayo/bmi/lookup/algorithms/FirstTokenPermutationImpl.java
2012-11-28 01:56:50 +
+++ src/edu/mayo/bmi/lookup/algorithms/FirstTokenPermutationImpl.java
2013-02-06 16:39:37 +
@@ -294,14 +294,16 @@
 Iterator mdhIterator = mdhSet.iterator();
 while (mdhIterator.hasNext())
 {
-MetaDataHit mdh = (MetaDataHit)
mdhIterator.next();
+MetaDataHit mdh = (MetaDataHit)
mdhIterator.next();
+
+List permutationSorted = (List)
((ArrayList)permutation).clone();
 // figure out start and end offsets
-Collections.sort(permutation);
+Collections.sort(permutationSorted);

 int startOffset;
-if (permutation.size() > 0)
+if (permutationSorted.size() > 0)
 {
-int firstIdx = ((Integer)
permutation.get(0)).intValue();
+int firstIdx = ((Integer)
permutationSorted.get(0)).intValue();
 if (firstIdx <= firstTokenIndex.intValue())
 {
 firstIdx--;
@@ -322,9 +324,9 @@
 }

 int endOffset;
-if (permutation.size() > 0)
+if (permutationSorted.size() > 0)
 {
-int lastIdx = ((Integer)
permutation.get(permutation.size() - 1)).intValue();
+int lastIdx = ((Integer)
permutationSorted.get(permutationSorted.size() - 1)).intValue();
 if (lastIdx <= firstTokenIndex.intValue())
 {
 lastIdx--;


Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean
Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just 
sent you a secure email/ftp link.  It contains a build of the new 
dictionary-lookup-fast module.  Should you choose to try it, let me know how 
things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 ContextDependentTokenizerAnnotator
 DependencyParser
 AssertionAnnotator

You might be able to get rid of the LvgAnnotator and still get decent results 
since variations of word form should not affect medication names. I would try 
with it and without it on a smaller set of files and see if you see a 
difference.

I believe the others are needed by the default configs for medication lookup. 
For example, POS is used to get phrase type. Phrases are used to remove verb 
phrases from the lookup and also therefore to keep the lookup windows from 
getting too big.  I'm more familiar with the other types of named entities 
(diseases, symptoms, etc) than with medications.

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestion about running cTakes with minimum annotators that 
can return Medications in DictionaryLookupAnnotator?
Thanks,
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting 
processed, and that is why it appears so fast. At least some of the annotators 
loop through the list of sections/segments, which is why there is a simple 
segment annotator - so that there is at least one section/segment identified. 
Are you getting any annotations at all?
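
A rough sketch of why that happens, assuming downstream steps only look inside segments. Nothing below is cTakes code: SegmentLoopSketch, Segment, simpleSegments and tokensInSegments are made-up names, and the whitespace tokenizer is just a placeholder for a real annotator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of segment-driven processing; not a cTakes class. */
class SegmentLoopSketch {

    /** A document section: an id plus the character span it covers. */
    static final class Segment {
        final String id;
        final int begin;
        final int end;

        Segment(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    /** What a simple segment annotator amounts to: one segment spanning the whole text. */
    static List<Segment> simpleSegments(String text) {
        return Collections.singletonList(new Segment("SIMPLE_SEGMENT", 0, text.length()));
    }

    /** A downstream step that only looks inside segments; no segments means no output. */
    static List<String> tokensInSegments(String text, List<Segment> segments) {
        List<String> tokens = new ArrayList<>();
        for (Segment segment : segments) {
            String covered = text.substring(segment.begin, segment.end).trim();
            if (!covered.isEmpty()) {
                Collections.addAll(tokens, covered.split("\\s+"));
            }
        }
        return tokens;
    }
}
```

With an empty segment list the loop body never runs, so the step finishes immediately and produces nothing, which looks like a big speed-up.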

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for the application that I wrote, which uses 
ctakes. So I cache the medications in DictionaryLookupAnnotator (in 
performLookup()) and use them in my program, but when I have 
SimpleSegmentAnnotator it just takes forever. After taking 
SimpleSegmentAnnotator out, no medication name is returned in 
DictionaryLookupAnnotator in the code. So I was wondering if there was a way 
that I could eliminate SimpleSegmentAnnotator but still be able to get the 
medication names in that class?

Nick

-Original Message-
From: Pei Chen [mailto:chen...@apache.org]
Sent: Tuesday, September 09, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Nick,
When you say no medication is being annotated, I presume you mean the 
medication attributes (i.e. dosage, frequency, etc.) are not being annotated?  
I think the DrugNER needs a list of section names in the config; I think it 
includes SIMPLE_SEGMENT.  I am very surprised that SimpleSegmentAnnotator is 
the bottleneck though; all it does is assume the entire document is a single 
section called SIMPLE_SEGMENT.
Have you tried commenting out the DependencyParser if you're not using those 
features?

--Pei


On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish  
wrote:
>
> Hi there,
>
> I am using Ctakes to process 5000K free text  records  where each record has 
> several medications.
> This is the fixed flow that it goes through:
>
>
> SimpleSegmentAnnotator
> 
> SentenceDetectorAnnotator
> 
> TokenizerAnnotator
> 
> LvgAnnotator
> 
> ContextDependentTokenizerAnnotator
> 
> POSTagger
> 
> Chunker
> 
> LookupWindowAnnotator
> 
> DictionaryLookupAnnotatorDB
> 
> DependencyParser
> 
> AssertionAnnotator
>
> ExtractionPre

RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean
Just use it with cTakes.  Instead of removing other modules from the pipeline, 
replace the dictionary-lookup with dictionary-lookup-fast.

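For the 
desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
, you would modify:

  <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>

To be:

  <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
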


That should be it.  You can then leave the rest of the module specifications 
alone.

Sean


From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Sean,

Many thanks, I will try it tomorrow. Do you have any special instructions to run 
that script, or do I have to use it with cTakes?

Thanks,
Nick


RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean
There is a tool to generate a dictionary in the new format using the UMLS MR*** 
files.  

The module can also read directly from a file with bar-separated values:  
CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
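
As a rough idea of what reading such a line involves, here is a sketch that detects the 2- or 3-column layout by counting bar-separated fields. It is not the module's reader: BsvLineParser and Entry are hypothetical names, and the CUI and TUI values in the usage note are placeholders rather than real codes.

```java
/** Hypothetical sketch of parsing one bar-separated dictionary line; not the module's code. */
class BsvLineParser {

    /** One dictionary row; tui is null when the line has only CUI|Text. */
    static final class Entry {
        final String cui;
        final String tui;
        final String text;

        Entry(String cui, String tui, String text) {
            this.cui = cui;
            this.tui = tui;
            this.text = text;
        }
    }

    /** Detects the column count from the line itself: CUI|Text or CUI|TUI|Text. */
    static Entry parse(String line) {
        // A negative limit keeps trailing empty fields so a malformed line is not silently shortened.
        String[] columns = line.split("\\|", -1);
        if (columns.length == 2) {
            return new Entry(columns[0].trim(), null, columns[1].trim());
        }
        if (columns.length == 3) {
            return new Entry(columns[0].trim(), columns[1].trim(), columns[2].trim());
        }
        throw new IllegalArgumentException("Expected 2 or 3 bar-separated columns: " + line);
    }
}
```

For example, BsvLineParser.parse("C0000001|T000|example term") yields an entry with all three fields set, while BsvLineParser.parse("C0000001|example term") leaves tui as null.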

I can send a copy of the dictionary creator jar and instructions tomorrow.

Sean

From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
Sent: Tuesday, September 09, 2014 5:17 PM
To: dev@ctakes.apache.org
Subject: Re: Ctakes to process 5000K recoreds

Sean,

If that is a script for generating a dictionary for use with
dictionary-lookup-fast, I would also be very interested in checking it out.

Thanks,

Bruce


IMAT Solutions <http://imatsolutions.com>
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
snika...@emerginghealthit.com> wrote:

> Great. I will do that. Thanks again.
>
> Nick
>
> -Original Message-
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Tuesday, September 09, 2014 4:39 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Just use it with cTakes.  Instead of removing other modules from the
> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>
> For the
> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
> , you would modify:
>
> 
> <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
> 
>
> To be:
>
> 
> <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
> 
>
>
> That should be it.  You can then leave the rest of the module
> specifications alone.
>
> Sean
>
> 
> From: Nick Nikandish [snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 4:32 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Hi Sean,
>
> Many thanks, I will try it tomorrow. Do you have any special instructions
> for running that script, or do I have to use it with cTakes?
>
> Thanks,
> Nick
>
> -Original Message-
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Tuesday, September 09, 2014 4:24 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Hi Nick,
>
> I think that the bottleneck is probably the lookup module itself.  So, I
> just sent you a secure email/ftp link.  It contains a build of the new
> dictionary-lookup-fast module.  Should you choose to try it, let me know
> how things turn out.
>
> Sean
> 
> From: Nick Nikandish [snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 4:10 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> Thanks, let me try it.
> Nick
>
> -Original Message-
> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> Sent: Tuesday, September 09, 2014 4:08 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Ctakes to process 5000K recoreds
>
> If you just need the medication names, you can remove these:
>  ContextDependentTokenizerAnnotator
>  DependencyParser
>  AssertionAnnotator
>
> You might be able to get rid of the LvgAnnotator and still get decent
> results since variations of word form should not affect medication names. I
> would try with it and without it on a smaller set of files and see if you
> see a difference.
>
> I believe the others are needed by the default configs for medication
> lookup. For example, POS is used to get phrase type. Phrases are used to
> remove verb phrases from the lookup and also therefore to keep the lookup
> windows from getting too big.  I'm more familiar with the other types of
> named entities (diseases, symptoms, etc) than with medications.
>
> -Original Message-
> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
> Sent: Tuesday, September 09, 2014 3:01 PM
> To: dev@ctakes.apache.org
> Subject: RE: Ctakes to process 5000K recoreds
>
> James,
>
> Do you have any suggestions for running cTakes with the minimum set of
> annotators that can still return medications in DictionaryLookupAnnotator?
> Thanks,
> Nick
>
> -Original Message-
> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> Sent: Tuesday, September 09, 2014 3:05 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Ctakes to process 5000K recoreds
>
> I suspect that when you take out the simple segment annotator, nothing is
> getting proces

RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean
Yes, the code is in the sandbox.  

From: Chen, Pei [pei.c...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 5:26 PM
To: 
Subject: Re: Ctakes to process 5000K recoreds

Sean-
Aren't the scripts to generate the DB already available in the sandbox area?

Sent from my iPhone

> On Sep 9, 2014, at 5:24 PM, "Finan, Sean"  
> wrote:
>
> There is a tool to generate a dictionary in the new format using the UMLS 
> MR*** files.
>
> The module can also read directly from a file with bar-separated values:  
> CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
>
> I can send a copy of the dictionary creator jar and instructions tomorrow.
>
> Sean
> 
> From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
> Sent: Tuesday, September 09, 2014 5:17 PM
> To: dev@ctakes.apache.org
> Subject: Re: Ctakes to process 5000K recoreds
>
> Sean,
>
> If that is a script for generating a dictionary for use with
> dictionary-lookup-fast, I would also be very interested in checking it out.
>
> Thanks,
>
> Bruce
>
>
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
> snika...@emerginghealthit.com> wrote:
>
>> Great. I will do that. Thanks again.
>>
>> Nick
>>
>> -Original Message-
>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:39 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Just use it with cTakes.  Instead of removing other modules from the
>> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>>
>> For the
>> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
>> , you would modify:
>>
>>
>> <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>>
>>
>> To be:
>>
>>
>> <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>>
>>
>>
>> That should be it.  You can then leave the rest of the module
>> specifications alone.
>>
>> Sean
>>
>> 
>> From: Nick Nikandish [snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:32 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Hi Sean,
>>
>> Many thanks, I will try it tomorrow. Do you have any special instructions
>> for running that script, or do I have to use it with cTakes?
>>
>> Thanks,
>> Nick
>>
>> -Original Message-
>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>> Sent: Tuesday, September 09, 2014 4:24 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Hi Nick,
>>
>> I think that the bottleneck is probably the lookup module itself.  So, I
>> just sent you a secure email/ftp link.  It contains a build of the new
>> dictionary-lookup-fast module.  Should you choose to try it, let me know
>> how things turn out.
>>
>> Sean
>> 
>> From: Nick Nikandish [snika...@emerginghealthit.com]
>> Sent: Tuesday, September 09, 2014 4:10 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> Thanks, let me try it.
>> Nick
>>
>> -Original Message-
>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>> Sent: Tuesday, September 09, 2014 4:08 PM
>> To: 'dev@ctakes.apache.org'
>> Subject: RE: Ctakes to process 5000K recoreds
>>
>> If you just need the medication names, you can remove these:
>> ContextDependentTokenizerAnnotator
>> DependencyParser
>> AssertionAnnotator
>>
>> You might be able to get rid of the LvgAnnotator and still get decent
>> results since variations of word form should not affect medication names. I
>> would try with it and without it on a smaller set of files and see if you
>> see a difference.
>>
>> I believe the others are needed by the default configs for medication
>> lookup. For example, POS is used to get phrase type. Phrases are used to
>> remove verb phrases from the lookup and also therefore to

RE: Ctakes to process 5000K recoreds

2014-09-09 Thread Finan, Sean
>(Trying to avoid passing individual jars via email)

Understood.  I sent the latest (Saturday) build of the dictionary module that I 
haven't yet checked in.  Its dictionary format is incompatible with the format 
produced by the creator in sandbox.  I will check in all of the code changes 
once I've had a chance to clean them up and write some documentation on the 
preferred terms, icd9s, etc.

Sean

From: Chen, Pei [pei.c...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 5:29 PM
To: 
Subject: Re: Ctakes to process 5000K recoreds

(Trying to avoid passing individual jars via email)

Sent from my iPhone

> On Sep 9, 2014, at 5:26 PM, "Chen, Pei"  
> wrote:
>
> Sean-
> Aren't the scripts to generate the DB already available in the sandbox area?
>
> Sent from my iPhone
>
>> On Sep 9, 2014, at 5:24 PM, "Finan, Sean"  
>> wrote:
>>
>> There is a tool to generate a dictionary in the new format using the UMLS 
>> MR*** files.
>>
>> The module can also read directly from a file with bar-separated values:  
>> CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
>>
>> I can send a copy of the dictionary creator jar and instructions tomorrow.
>>
>> Sean
>> 
>> From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
>> Sent: Tuesday, September 09, 2014 5:17 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Ctakes to process 5000K recoreds
>>
>> Sean,
>>
>> If that is a script for generating a dictionary for use with
>> dictionary-lookup-fast, I would also be very interested in checking it out.
>>
>> Thanks,
>>
>> Bruce
>>
>>
>> IMAT Solutions <http://imatsolutions.com>
>> Bruce Tietjen
>> Senior Software Engineer
>> Mobile: 801.634.1547
>> bruce.tiet...@imatsolutions.com
>>
>> On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
>> snika...@emerginghealthit.com> wrote:
>>
>>> Great. I will do that. Thanks again.
>>>
>>> Nick
>>>
>>> -Original Message-
>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>> Sent: Tuesday, September 09, 2014 4:39 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Just use it with cTakes.  Instead of removing other modules from the
>>> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>>>
>>> For the
>>> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
>>> , you would modify:
>>>
>>>   
>>> <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>>>   
>>>
>>> To be:
>>>
>>>   
>>> <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>>>   
>>>
>>>
>>> That should be it.  You can then leave the rest of the module
>>> specifications alone.
>>>
>>> Sean
>>>
>>> 
>>> From: Nick Nikandish [snika...@emerginghealthit.com]
>>> Sent: Tuesday, September 09, 2014 4:32 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Hi Sean,
>>>
>>> Many thanks, I will try it tomorrow. Do you have any special instructions
>>> for running that script, or do I have to use it with cTakes?
>>>
>>> Thanks,
>>> Nick
>>>
>>> -Original Message-
>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>> Sent: Tuesday, September 09, 2014 4:24 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Hi Nick,
>>>
>>> I think that the bottleneck is probably the lookup module itself.  So, I
>>> just sent you a secure email/ftp link.  It contains a build of the new
>>> dictionary-lookup-fast module.  Should you choose to try it, let me know
>>> how things turn out.
>>>
>>> Sean
>>> 
>>> From: Nick Nikandish [snika...@emerginghealthit.com]
>>> Sent: Tuesday, September 09, 2014 4:10 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Thanks, let me try it.

RE: Ctakes to process 5000K records

2014-09-10 Thread Finan, Sean
Hi Nick, 

>file:org/apache/ctakes/dictionary/fast/cTakesHsql.xml

Does that file not exist under resources?  cTakes shouldn't need anything under 
that directory to be added to the classpath.

I checked the source into trunk this morning, but the zip that you downloaded 
had everything included.  As long as you unzipped it in the cTakes root, the 
resources, desc and lib directories should have been properly placed.
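
One quick way to confirm where the file ended up is a small check like the Java 
sketch below.  It assumes the zip placed a resources directory directly under the 
cTakes root, as described above; the class name ResourceCheck is hypothetical, and 
the check only tests for the file on disk and says nothing about how UIMA itself 
resolves the path.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of cTAKES or UIMA.
final class ResourceCheck {
    // Relative path taken from the error message in this thread.
    static final String DESCRIPTOR = "org/apache/ctakes/dictionary/fast/cTakesHsql.xml";

    // Assumption: resources sits directly under the cTakes root directory.
    static Path expectedLocation(Path ctakesRoot) {
        return ctakesRoot.resolve("resources").resolve(DESCRIPTOR);
    }

    // True only if the descriptor exists as a regular file at that location.
    static boolean descriptorPresent(Path ctakesRoot) {
        return Files.isRegularFile(expectedLocation(ctakesRoot));
    }
}
```

Calling ResourceCheck.descriptorPresent with the directory you unzipped into would 
tell you whether the descriptor landed in the expected place.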

Sean





From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Wednesday, September 10, 2014 3:06 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K records

Hi Sean,

I am getting this error:
org.apache.uima.resource.ResourceInitializationException: Could not access the 
resource data at file:org/apache/ctakes/dictionary/fast/cTakesHsql.xml.

Where should I add it to the classpath?

Thanks,
Nick

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:39 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Just use it with cTakes.  Instead of removing other modules from the pipeline, 
replace the dictionary-lookup with dictionary-lookup-fast.

For the 
desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
 , you would modify:


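  <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>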


To be:


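  <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>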



That should be it.  You can then leave the rest of the module specifications 
alone.

Sean


From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Sean,

Many thanks, I will try it tomorrow. Do you have any special instructions for 
running that script, or do I have to use it with cTakes?

Thanks,
Nick

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:24 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just 
sent you a secure email/ftp link.  It contains a build of the new 
dictionary-lookup-fast module.  Should you choose to try it, let me know how 
things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 ContextDependentTokenizerAnnotator
 DependencyParser
 AssertionAnnotator

You might be able to get rid of the LvgAnnotator and still get decent results 
since variations of word form should not affect medication names. I would try 
with it and without it on a smaller set of files and see if you see a 
difference.

I believe the others are needed by the default configs for medication lookup. 
For example, POS is used to get phrase type. Phrases are used to remove verb 
phrases from the lookup and also therefore to keep the lookup windows from 
getting too big.  I'm more familiar with the other types of named entities 
(diseases, symptoms, etc) than with medications.

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestions for running cTakes with the minimum set of annotators 
that can still return medications in DictionaryLookupAnnotator?
Thanks,
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting 
processed, and that is why it appears so fast. At least some of the annotators 
loop through the list of sections/segments, which is why there is a simple 
segment annotator: so that there is at least one section/segment identified. 
Are you getting any annotations at all?
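
The effect can be sketched in plain Java.  The classes below are hypothetical 
stand-ins rather than the cTAKES or UIMA types, and the lookup is a naive substring 
match, but they show why a pipeline with no segments finishes quickly and finds 
nothing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a section/segment annotation; not a cTAKES class.
final class Segment {
    final String id;
    final String text;

    Segment(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

final class SegmentLoopSketch {
    // Mimics a simple segment annotator: the whole document becomes one segment.
    static List<Segment> simpleSegment(String documentText) {
        return Collections.singletonList(new Segment("SIMPLE_SEGMENT", documentText));
    }

    // Mimics a downstream annotator that only ever looks inside segments.
    static List<String> findTerms(List<Segment> segments, List<String> dictionary) {
        List<String> found = new ArrayList<>();
        for (Segment segment : segments) {
            String lower = segment.text.toLowerCase();
            for (String term : dictionary) {
                if (lower.contains(term.toLowerCase())) {
                    found.add(term);
                }
            }
        }
        return found;
    }
}
```

With an empty segment list the outer loop never runs, so findTerms returns an empty 
list no matter what the document contains.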

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for the application that I wrote, which uses 
cTAKES, so I cache the medications in DictionaryLookupAnnotator (in 
performLookup()) and use them in my program, but when I have 
SimpleSegmentAnnotator it just takes forever. After taking 
SimpleSegmentAnnotator out, no medication name in DictionaryLookupAnnotator is 
returned in the code. So I was wonderin

RE: Ctakes to process 5000K records

2014-09-10 Thread Finan, Sean
Excellent!

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Wednesday, September 10, 2014 3:48 PM
To: dev@ctakes.apache.org
Subject: FW: Ctakes to process 5000K records

Please disregard this question, I figured it out.

Thanks,
Nick

-Original Message-
From: Nick Nikandish
Sent: Wednesday, September 10, 2014 3:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K records

Hi Sean,

I am getting this error:
org.apache.uima.resource.ResourceInitializationException: Could not access the 
resource data at file:org/apache/ctakes/dictionary/fast/cTakesHsql.xml.

Where should I add it to the classpath?

Thanks,
Nick

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:39 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Just use it with cTakes.  Instead of removing other modules from the pipeline, 
replace the dictionary-lookup with dictionary-lookup-fast.

For the 
desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
 , you would modify:


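  <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>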


To be:


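  <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>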



That should be it.  You can then leave the rest of the module specifications 
alone.

Sean


From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:32 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Sean,

Many thanks, I will try it tomorrow. Do you have any special instructions for 
running that script, or do I have to use it with cTakes?

Thanks,
Nick

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 4:24 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Hi Nick,

I think that the bottleneck is probably the lookup module itself.  So, I just 
sent you a secure email/ftp link.  It contains a build of the new 
dictionary-lookup-fast module.  Should you choose to try it, let me know how 
things turn out.

Sean

From: Nick Nikandish [snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 4:10 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Thanks, let me try it.
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 4:08 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

If you just need the medication names, you can remove these:
 ContextDependentTokenizerAnnotator
 DependencyParser
 AssertionAnnotator

You might be able to get rid of the LvgAnnotator and still get decent results 
since variations of word form should not affect medication names. I would try 
with it and without it on a smaller set of files and see if you see a 
difference.

I believe the others are needed by the default configs for medication lookup. 
For example, POS is used to get phrase type. Phrases are used to remove verb 
phrases from the lookup and also therefore to keep the lookup windows from 
getting too big.  I'm more familiar with the other types of named entities 
(diseases, symptoms, etc) than with medications.

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 3:01 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

James,

Do you have any suggestions for running cTakes with the minimum set of annotators 
that can still return medications in DictionaryLookupAnnotator?
Thanks,
Nick

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Tuesday, September 09, 2014 3:05 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Ctakes to process 5000K recoreds

I suspect that when you take out the simple segment annotator, nothing is getting 
processed, and that is why it appears so fast. At least some of the annotators 
loop through the list of sections/segments, which is why there is a simple 
segment annotator: so that there is at least one section/segment identified. 
Are you getting any annotations at all?

-Original Message-
From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
Sent: Tuesday, September 09, 2014 2:02 PM
To: dev@ctakes.apache.org
Subject: RE: Ctakes to process 5000K recoreds

Pei,
I need the names of the medications for the application that I wrote, which uses 
cTAKES, so I cache the medications in DictionaryLookupAnnotator (in 
performLookup()) and use them in my program. But when I include 
SimpleSegmentAnnotator it just takes forever. After taking 
SimpleSegmentAnnotator out, no medication name is returned by 
DictionaryLookupAnnotator in the code. So I was wondering if there is a way to 
eliminate SimpleSegmentAnnotator but still be able to get the medication 
names in that class?

Nick

-Original Message-
From: 

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Steve Bethard wrote:
> I spent some time writing a script for diff-ing CASes

I urge anyone interested in comparing cTakes CASes / output to use this type of 
approach.  Comparison of program output is a post-process task and, unless 
absolutely necessary, code to juggle data and metadata belongs there.  Attempting 
to force every module past, present and future to abide by fixed orderings, 
enumerations, etc. is not as simple a task as one might initially think, 
especially if third-party libraries are involved.  I won't get into problems 
associated with why one is comparing output (swapped module?) and IDs, orders 
etc. being different because of a possibly intentional difference.

In addition to or instead of creating a post-processing script, one could write 
a new "cas-consumer" that writes output in a desired format - but this should 
not require changes to engines.
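
As a rough sketch of what such a post-process comparison could look like, the Java 
below compares two runs while ignoring IDs and ordering.  FlatAnnotation and 
OutputDiff are hypothetical names, and the flattened fields are an assumption about 
what one would export from the CAS; this is not the UIMA API and not the 
CompareFeatureStructures code mentioned in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical flattened view of one annotation; not a UIMA type.
final class FlatAnnotation {
    final int id;        // run-specific identifier, ignored on purpose
    final String type;
    final int begin;
    final int end;
    final String value;

    FlatAnnotation(int id, String type, int begin, int end, String value) {
        this.id = id;
        this.type = type;
        this.begin = begin;
        this.end = end;
        this.value = value;
    }

    // Key that leaves out the id, so only content is compared.
    String contentKey() {
        return type + "\t" + begin + "\t" + end + "\t" + value;
    }
}

final class OutputDiff {
    // Sorting the keys removes any dependence on output order.
    static List<String> normalize(List<FlatAnnotation> annotations) {
        List<String> keys = new ArrayList<>();
        for (FlatAnnotation annotation : annotations) {
            keys.add(annotation.contentKey());
        }
        Collections.sort(keys);
        return keys;
    }

    static boolean sameContent(List<FlatAnnotation> runA, List<FlatAnnotation> runB) {
        return normalize(runA).equals(normalize(runB));
    }
}
```

Because the normalization happens entirely after the pipeline has run, no engine 
needs to change for two runs to be comparable.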

"If it ain't broke, don't fix it"

Sean


-Original Message-
From: Steven Bethard [mailto:steven.beth...@gmail.com] 
Sent: Monday, October 06, 2014 11:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
 wrote:
> Since I started working with cTakes some time ago, I have found it
> difficult to compare the output between subsequent runs on the same files
> because annotations are often assigned different IDs, are listed in
> different order, etc.

At one point, I spent some time writing a script for diff-ing CASes
that was intended to address some of these kinds of issues. It's still
here in cTAKES:

ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

You might see if you could use or adapt that to your needs.

Steve


RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

One might want to compare the sentence detector that uses end-of-line characters 
as sentence splitters with one that does not.  Such a change in sentence 
splitting would not only affect the sentence type discoveries but also 
practically every type that follows.

Another might want to compare a note with "skin cancer" vs. one in which you 
replace "skin cancer" with "melanoma" just to see what the CUI differences 
might be.  There are changes in two words vs. one, 11 characters vs. 8, a 
removed adjective(?), and of course changes in CUIs.

Of course, if you are just running notes on a new moon and then again on a full 
moon ...

Sean

-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 10:41 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Sean,

"...being different because of a possibly intentional difference."

I would like you to elaborate a bit on what would be intentionally 
different between the processing of the same document multiple times. It would 
help my understanding of cTakes.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 07:30 AM, Finan, Sean wrote:
> Steve Bethard wrote:
>> I spent some time writing a script for diff-ing CASes
> I urge anyone interested in comparing cTakes CASes / output to use this type 
> of approach.  Comparison of program output is a post-process task and, unless 
> absolutely necessary, code to juggle data and metadata belongs there.  
> Attempting to force every module past, present and future to abide by fixed 
> orderings, enumerations, etc. is not as simple a task as one might initially 
> think, especially if third-party libraries are involved.  I won't get into 
> problems associated with why one is comparing output (swapped module?) and 
> IDs, orders etc. being different because of a possibly intentional difference.
>
> In addition to or instead of creating a post-processing script, one could 
> write a new "cas-consumer" that writes output in a desired format - but this 
> should not require changes to engines.
>
> "If it ain't broke, don't fix it"
>
> Sean
>
>
> -Original Message-
> From: Steven Bethard [mailto:steven.beth...@gmail.com]
> Sent: Monday, October 06, 2014 11:23 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>  wrote:
>> Since I started working with cTakes some time ago, I have found it 
>> difficult to compare the output between subsequent runs on the same 
>> files because annotations are often assigned different IDs, are 
>> listed in different order, etc.
> At one point, I spent some time writing a script for diff-ing CASes 
> that was intended to address some of these kinds of issues. It's still 
> here in cTAKES:
>
> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>
> You might see if you could use or adapt that to your needs.
>
> Steve



RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

>In our testing we've found several cases where running with the same 
>configuration outputs different data under different moons

This is a known behavior and I understand that this may be the case that 
started the thread.  

>>> I spent some time writing a script for diff-ing CASes
>> I urge anyone interested in comparing cTakes CASes / output to use this type 
>> of approach.  

I still stand by my original email.

> Having output that is in a predictable order makes checking to see if there 
> are differences much cheaper when you are dealing with larger data sets.

Britt said:
> The option Sean mentioned of writing your own custom consumer (without the 
> UIMA id that is causing your issues) should meet these needs I believe.

I agree.

Sean

-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 11:30 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Hi Sean,

Well, of course that makes plenty of sense. When testing different cTakes 
configurations you would expect different output. In our testing we've found 
several cases where running with the same configuration outputs different data 
under different moons. Having consistent results helps us know if we've made 
improvements to our quality or not.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 08:50 AM, Finan, Sean wrote:
> Hi Kim,
>
> One might want to compare the sentence detector that uses end-of-line characters 
> as sentence splitters with one that does not.  Such a change in sentence 
> splitting would not only affect the sentence type discoveries but also 
> practically every type that follows.
>
> Another might want to compare a note with "skin cancer" vs. one in which you 
> replace "skin cancer" with "melanoma" just to see what the CUI differences 
> might be.  There are changes in two words vs. one, 11 characters vs. 8, a 
> removed adjective(?), and of course changes in CUIs.
>
> Of course, if you are just running notes on a new moon and then again on a 
> full moon ...
>
> Sean
>
> -Original Message-
> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
> Sent: Tuesday, October 07, 2014 10:41 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> Sean,
>
> "...being different because of a possibly intentional difference."
>
> I would like you to elaborate a bit on what would be intentionally 
> different between the processing of the same document multiple times. It 
> would help my understanding of cTakes.
>
> Thanks,
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>> Steve Bethard wrote:
>>> I spent some time writing a script for diff-ing CASes
>> I urge anyone interested in comparing cTakes CASes / output to use this type 
>> of approach.  Comparison of program output is a post-process task and, 
>> unless absolutely necessary, code to juggle data and metadata belongs there.  
>> Attempting to force every module past, present and future to abide by fixed 
>> orderings, enumerations, etc. is not as simple a task as one might initially 
>> think, especially if third-party libraries are involved.  I won't get into 
>> problems associated with why one is comparing output (swapped module?) and 
>> IDs, orders etc. being different because of a possibly intentional 
>> difference.
>>
>> In addition to or instead of creating a post-processing script, one could 
>> write a new "cas-consumer" that writes output in a desired format - but this 
>> should not require changes to engines.
>>
>> "If it ain't broke, don't fix it"
>>
>> Sean
>>
>>
>> -Original Message-
>> From: Steven Bethard [mailto:steven.beth...@gmail.com]
>> Sent: Monday, October 06, 2014 11:23 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: cTakes output predictability
>>
>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>  wrote:
>>> Since I started working with cTakes some time ago, I have found it 
>>> difficult to compare the output between subsequent runs on the same 
>>> files because annotations are often assigned different IDs, are 
>>> listed in different order, etc.
>> At one point, I spent some time writing a script for diff-ing CASes 
>> that was intended to address some of these kinds of issues. It's still 
>> here in cTAKES:
>>
>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>
>> You might see if you could use or adapt that to your needs.
>>
>> Steve



RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

> It concerns me a bit that making the code return consistent results would be so 
> concerning. 
Could you please clarify what you mean by "consistent results"?  Do you mean 
ordering and IDs or are you talking about actual type values not matching?

>This should be the default mode of operation.
Depending upon what you meant above, I may agree or disagree.

> Since it doesn't appear that there are any consequences with moving forward 
> with changing the code
Why do you say this?  

I think that there may be more required changes than you realize.  Every 
insertion into the CAS must be of ordered data.  This means that, for instance, 
named entities discovered by dictionary will need to be inserted in some 
predictable order, such as by alphabetized cui per every alphabetized tui (and 
other code) per ordered text span.  You will need to check and recheck every 
point at which the CAS is modified by every module.  Right now there are at 
least three or four places in two cTakes dictionary modules where a change 
would be required - and that doesn't include YTEX lookup.
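
A minimal sketch of one such predictable order (text span first, then TUI, then CUI) is given here. The `Concept` class and its field names are hypothetical stand-ins for a dictionary hit and are not cTakes types; only the ordering idea comes from the paragraph above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a dictionary hit; not a cTakes class.
final class Concept {
    final int begin;
    final int end;
    final String tui;
    final String cui;

    Concept(int begin, int end, String tui, String cui) {
        this.begin = begin;
        this.end = end;
        this.tui = tui;
        this.cui = cui;
    }

    @Override
    public String toString() {
        return begin + "-" + end + "|" + tui + "|" + cui;
    }
}

final class ConceptOrdering {
    // Order by text span, then alphabetized TUI, then alphabetized CUI.
    static final Comparator<Concept> SPAN_TUI_CUI =
            Comparator.<Concept>comparingInt(c -> c.begin)
                    .thenComparingInt(c -> c.end)
                    .thenComparing(c -> c.tui)
                    .thenComparing(c -> c.cui);

    // Returns a sorted copy so that callers can insert hits in a fixed order.
    static List<Concept> inInsertionOrder(List<Concept> hits) {
        List<Concept> sorted = new ArrayList<>(hits);
        sorted.sort(SPAN_TUI_CUI);
        return sorted;
    }
}
```

Every module that adds to the CAS would need an equivalent step, which is why the change touches more places than it first appears.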

If you really feel strongly about this and are going to change cTakes code, 
then I suggest (at the risk of sounding like a complete jerk) that you also 
consider the following:
1.  Don't check anything into trunk until all is well with your changes and 
tests
Just in case you abandon the effort
2.  Write unit tests for every change   
True, Map to LinkedMap shouldn't break anything, but they are good to have, and 
may prevent others in the future from switching back to a non-linked map or any 
unordered collection (set not list, etc.).  It also makes a better place for 
explanation in Javadoc than inlines above the code.
3.  Run memory requirement tests before all of your changes and then again 
after your changes
I'm actually curious about how much memory might be eaten with linkages 
everywhere
4.  Run performance (speed) tests before and after
On a large corpus to ensure that garbage collection is involved
5.  Do the above with every combination possible in current workflows: every 
combination of available sentence detector, pos tagger, smoking status 
detector, dictionary lookup, cas consumer, etc.
As soon as somebody says "all output is consistently ordered between runs" it 
had better be so for every possible workflow
6.  Write system tests to ensure ordered/predicted outputs with each combination
Otherwise somebody may break it
7.  Document the what, how, and why for future development
Otherwise somebody won't know to stick to the new rules
8.  Assist anybody as needed that in the future breaks one of these unit or 
system tests with a fix or new feature
By mandating such a rule you are assuming responsibility for it
9.  Assist anybody as needed that in the future adds a new module or workflow 
to cTakes to abide by the ordering requirement
By mandating such a rule you are assuming responsibility for it
10.  Assist anybody as needed that in the future adds a new module or workflow 
to add system tests to ensure maintenance of the ordering requirement
By mandating such a rule you are assuming responsibility for it
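
As an illustration of item 2, the following sketch shows the kind of check a unit test could make to guard an ordered collection. The class and method names are hypothetical and the token-counting method is only a placeholder for whatever a real module returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InsertionOrderCheck {
    // Placeholder for a module method; LinkedHashMap keeps keys in insertion order.
    static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // The assertion a unit test could make: keys come back in first-seen order.
    static boolean keepsFirstSeenOrder(List<String> tokens) {
        List<String> firstSeen = new ArrayList<>();
        for (String token : tokens) {
            if (!firstSeen.contains(token)) {
                firstSeen.add(token);
            }
        }
        return new ArrayList<>(countTokens(tokens).keySet()).equals(firstSeen);
    }
}
```

If somebody later swaps the LinkedHashMap for a plain HashMap, a test built on this check may start failing, which documents the ordering rule better than an inline comment.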


-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 11:57 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I think we may really prefer the first method. Since it doesn't appear that 
there are any consequences with moving forward with changing the code, we would 
really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
> The option Sean mentioned of writing your own custom consumer (without 
> the UIMA id that is causing your issues) should meet these needs I 
> believe.
>
>
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> britt.fi...@wiredinformatics.com
>
> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>  <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>
>> Hi Sean,
>>
>> Well of course that makes plenty of sense. Testing different cTakes 
>> configurations you would expect different output. In our testing 
>> we've found several cases where running with the same configuration 
>> outputs different data under different moons. Having consistent 
>> results helps us know if we've made improvements to our quality or 
>> not. Having output that is in a predictable order makes checking to 
>> see if there are differences much cheaper when you are dealing with larger 
>> data sets.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 08

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
I'm just about sapped on this topic.  What comes below is my final writing.

Kim wrote:
>Yes, I mean actual type values not matching.

Ok, this is a very serious problem and should have nothing to do with ordering 
and/or IDs.  I repeat: this should have nothing to do with ordering or ids.  
Reordering or changing ID assignment, while possibly producing repeatable 
output, will not necessarily fix the actual bug.  Please write a Jira for each 
item, and (imo) we should think about withholding any non-bug-fix release until 
they have been dealt with.

Bruce wrote:
> I did not intend to step on anyone's toes.
No worries - I don't think that any toes have been stepped upon. It is good 
that questions and concerns are shared with the group.  

> Note that in the first instance, there were two MedicationMentions, but in 
> the second, there is only one.
Assuming that the second drug mention doesn't appear elsewhere in output2 then 
this needs to be addressed.  Please log a tar.  Relating this to the order/id 
issue, which number of mentions is correct (2)?  If you reorder will that 
consistently output two medications instead of one or one medication instead of 
two?  This is most likely a bug in the identification and/or storage and/or 
retrieval code and needs to be fixed there.

>Yes, everyone could write their own custom compare code, but wouldn't it be 
>more valuable to the community to make that task easier?

I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations 
could be started and people could add to it as needed.  I would also hope that 
a reusable post-process comparison utility could be started and 
improved/maintained.
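
A rough sketch of the "sort and re-ID" idea, written without any UIMA dependency so that it stands alone. The `Ann` record and `AnnotationNormalizer` are hypothetical; a real Cas-Consumer would read the annotations from the CAS instead of a list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified annotation record.
final class Ann {
    final int id;       // run-specific ID, deliberately ignored when normalizing
    final String type;
    final int begin;
    final int end;

    Ann(int id, String type, int begin, int end) {
        this.id = id;
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class AnnotationNormalizer {
    // Sorts by span, then type, and replaces run-specific IDs with sequential ones.
    static List<String> normalize(List<Ann> annotations) {
        List<Ann> sorted = new ArrayList<>(annotations);
        sorted.sort(Comparator.<Ann>comparingInt(a -> a.begin)
                .thenComparingInt(a -> a.end)
                .thenComparing(a -> a.type));
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Ann a = sorted.get(i);
            lines.add((i + 1) + "\t" + a.type + "\t" + a.begin + "\t" + a.end);
        }
        return lines;
    }
}
```

Two runs that differ only in ordering and IDs then produce identical lines, so any remaining difference in a plain text diff points at a real discrepancy such as a missing mention.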

Sean


-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 1:21 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I did not intend to step on anyone's toes.

One of the reasons I proposed the changes was to try to make it extremely 
obvious when there are significant differences in output from the cTakes 
pipeline when running the same document again, and once identified, make it 
easier to identify the source of the difference.

Because of the huge number of differences between the outputs written using 
FileWriterCasConsumer.xml, first detecting that there are significant 
differences and then identifying them for a large set of documents is a 
daunting task.

The following is an example of some significant differences that I have 
detected between two subsequent runs on the same document using the current 
release of cTakes. (There are actually quite a few documents that exhibit this 
kind of behavior. This is only one example.)


Snippet from first run:







Snippet from subsequent run:






Note that in the first instance, there were two MedicationMentions, but in the 
second, there is only one.

Yes, everyone could write their own custom compare code, but wouldn't it be 
more valuable to the community to make that task easier?

Thanks,

Bruce Tietjen



Bruce Tietjen
Senior Software Engineer
IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert 
wrote:

> Hi Sean,
>
> No, you're not a jerk. These are things worth considering, and I 
> understand your concerns with touching various points of the codebase.
>
> I'll talk with our group over here and see where we want to go. We are 
> really interested in cTakes behaving well, so we are usually pretty 
> careful in testing our changes before committing anything.
>
> Thanks,
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 10:46 AM, Finan, Sean wrote:
> > Hi Kim,
> >
> >> It concerns me a bit by making the code return consistent results 
> >> would
> be so concerning.
> > Could you please clarify what you mean by "consistent results"?  Do 
> > you
> mean ordering and IDs or are you talking about actual type values not 
> matching?
> >
> >> This should be the default mode of operation.
> > Depending upon what you meant above, I may agree or disagree.
> >
> >> Since it doesn't appear that there are any consequences with moving
> forward with changing the code
> > Why do you say this?
> >
> > I think that there may be more required changes than you realize.  
> > Every
> insertion into the CAS must be of ordered data.  This means that, for 
> instance, named entities discovered by dictionary will need to be 
> inserted in some predictable order, such as by alphabetized cui per 
> every alphabetized tui (and other code) per ordered text span.  You 
> will need to check and recheck 

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

Great Catch!

I think that by now this thread may be discarded by most as spam.  So, I'm back 
(apologies - I know that you are tired of me by now).

I checked the code that you pointed to ...  I really dislike looking at older 
cTakes code because I'm filled with an overwhelming urge to refactor.

If I understand the code correctly (it could use some doc), it runs negation 
engines and then if any negation exists it creates a single hit signifying 
negation.  Like a heavyweight Boolean.   Unfortunately, as you know, because 
Collection "s"  is a Set and it throws in the first token to come along ...  

An isolated change here would probably be better than going through the entire 
code base and switching to LinkedHashMaps, Lists, etc. - plus it would fix your 
problem.

You could (for reuse by others, assuming that one doesn't already exist) create 
a singleton BaseTokenComparator implements Comparator<BaseToken> with 
something like:
   public int compare( final BaseToken textSpan1, final BaseToken textSpan2 ) {
      if ( textSpan1.getStartOffset() != textSpan2.getStartOffset() ) {
         return textSpan1.getStartOffset() - textSpan2.getStartOffset();
      }
      return textSpan1.getEndOffset() - textSpan2.getEndOffset();
   }

And in NegationContextAnalyzer line ~48
   // copy the Set into a List so that it can be sorted by text offsets
   final List<NegationIndicator> negatorsList
         = new ArrayList<NegationIndicator>( _negIndicatorFSM.execute( fsmTokenList ) );
   if ( !negatorsList.isEmpty() ) {
      Collections.sort( negatorsList, BaseTokenComparator.getInstance() );
      return new ContextHit( negatorsList.get( 0 ).getStartOffset(),
                             negatorsList.get( 0 ).getEndOffset() );
   }

Or you could write a (faster) method to use in place of the List and Sort like:
   // returns the token with the smallest start offset; ties go to the smaller end offset
   BaseToken getFirstTextSpan( final Iterable<? extends BaseToken> tokens ) {
      BaseToken firstToken = null;
      for ( BaseToken token : tokens ) {
         if ( firstToken == null || token.getStartOffset() < firstToken.getStartOffset() ) {
            firstToken = token;
            continue;
         }
         if ( token.getStartOffset() == firstToken.getStartOffset()
              && token.getEndOffset() < firstToken.getEndOffset() ) {
            firstToken = token;
         }
      }
      return firstToken;
   }


Of course, a perfectly reasonable question to pose to the community is 
something like "Is the best stored negation context the first or largest or 
???"  Perhaps the first negator span isn't the most wanted for later use - 
perhaps it is the most-encompassing span so that multiple words can be reused.  
You could throw that out under a new thread title and perhaps the original 
authors or current users would speak up as to what might be best.  Personally I 
have no idea.
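
To make the "first or largest" question concrete, here is a small sketch of both policies side by side. `Span` and `NegatorChoice` are hypothetical names and the offsets in any example are invented; which policy suits cTakes is exactly the open question raised above.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;

// Hypothetical span type standing in for a negation indicator.
final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }

    int length() {
        return end - start;
    }
}

final class NegatorChoice {
    // Earliest start first; ties go to the smaller end offset.
    static final Comparator<Span> BY_POSITION =
            Comparator.<Span>comparingInt(s -> s.start).thenComparingInt(s -> s.end);

    // Policy 1: the first span in the text.
    static Span first(Collection<Span> spans) {
        return spans.isEmpty() ? null : Collections.min(spans, BY_POSITION);
    }

    // Policy 2: the longest span; ties go to the earliest position so the pick stays deterministic.
    static Span widest(Collection<Span> spans) {
        return spans.isEmpty() ? null : Collections.max(spans,
                Comparator.<Span>comparingInt(Span::length).thenComparing(BY_POSITION.reversed()));
    }
}
```

Either policy returns the same span no matter how the input collection happens to iterate, which is the property the thread is after.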

Anyway, great catch!

Sean


-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 3:11 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Hi all,

I'm not sure these should be classified as bugs. They look like design 
decisions made at some point, but they do have an impact on the consistency of 
the results. Whether they are right or not might be something to debate later 
down the road, but it would be nice to be consistent in the output.

For example, I have the following text.

"I do not see any"

Can result in the following ContextAnnotations:



or



or



Well, after doing some digging it turns out that 
org.apache.ctakes.necontexts.negation.NegationContextAnalyzer is to blame.

The code looks like the following:

public ContextHit analyzeContext(List contextTokens, 
int scopeOrientation)
throws AnalysisEngineProcessException {
List fsmTokenList = wrapAsFsmTokens(contextTokens);

try {
Set s =
_negIndicatorFSM.execute(fsmTokenList);

*if (s.size() > 0) {*
NegationIndicator neg = s.iterator().next();
   *return new ContextHit(neg.getStartOffset(),
neg.getEndOffset());*
} else {
return null;
}
} catch (Exception e) {
throw new AnalysisEngineProcessException(e);
}
}

This will return at most one item from the Set. Since the set is an unordered 
hash, any one of the three options may be returned.
Is this a bug, or a design decision? Which one is right? Which one is wrong? It 
may be that this is a design decision, but it would be nice if we were 
consistently right, or consistently wrong. Many other instances of this result 
in similar issues.
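
A small sketch of why the pick is unstable, assuming the element type relies on the default `Object.hashCode()` (an assumption; the thread does not say how NegationIndicator hashes). `Indicator` and `SetOrderDemo` are hypothetical names.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical element type that does not override hashCode() or equals().
final class Indicator {
    final int start;
    final int end;

    Indicator(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class SetOrderDemo {
    // Iteration order of a HashSet follows the hash codes, so with identity hashes
    // the "first" element is not guaranteed to be the same from run to run.
    static Indicator arbitraryPick(List<Indicator> found) {
        Set<Indicator> unordered = new HashSet<>(found);
        return unordered.iterator().next();
    }

    // A sorted set makes the pick depend only on the offsets.
    static Indicator stablePick(List<Indicator> found) {
        TreeSet<Indicator> ordered = new TreeSet<>(
                Comparator.<Indicator>comparingInt(i -> i.start).thenComparingInt(i -> i.end));
        ordered.addAll(found);
        return ordered.first();
    }
}
```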

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 12:43 PM, Finan, Sean wrote:
> I'm just about sapped on this topic.  What comes below is my final writing.
>
> Kim wrote:
>> Yes, I mean actual type values not matching.
> Ok, this is a very serious problem and should have nothing to do

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-08 Thread Finan, Sean
Hi Bruce,
I would venture to say that this is neither expected nor desired.

Before you fix it (or in addition to a fix), try to run with the new dictionary 
lookup.   It will have a different behavior, and it will be the default 
dictionary lookup in future releases of cTakes – making fixes to the current 
module slightly less urgent.

Sean

From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Wednesday, October 08, 2014 11:38 AM
To: dev@ctakes.apache.org
Subject: Differences in MedicationMention annotations on subsequent processing 
runs


I have encountered a situation in which the cTakes clinical pipeline output 
differs between multiple runs on the same text with the same configuration.
The following snippets from a single document are sufficient to demonstrate the 
issue:

 a gentle curve going into. irrigated with Bacitracin.

The source of the difference is that the DictionaryLookupAnnotator uses a map 
to filter out duplicate annotations for a single document location:
// used to prevent duplicate hits
// key = hit begin,end key (java.lang.String)
// val = Set of MetaDataHit objects
private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap<>();

This map is shared between both the umls_ms_2011ab lookup and the 
umls_ms_2011an_rxnorm lookup.

If both dictionaries contain the same term, the order of dictionary lookup 
execution determines the output. If the rxnorm lookup runs first, then a 
MedicationMention annotation for Bacitracin appears in the final output. If the 
standard umls lookup runs first, then there is no MedicationMention annotation 
for Bacitracin.
I will attach the output from the subsequent runs. (Hopefully the attachment 
will make it through the system)
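
The effect described above can be reproduced with a simplified model. This is not the DictionaryLookupAnnotator code: plain strings stand in for MetaDataHit objects, and `SharedDupFilterDemo` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SharedDupFilterDemo {
    // Each dictionary, in run order, reports the same term at the same span.
    // Returns the names of the dictionaries whose hit survived the shared filter.
    static List<String> survivors(List<String> dictionaryRunOrder, String spanKey, String term) {
        Map<String, Set<String>> dupMap = new HashMap<>(); // one map shared by all dictionaries
        List<String> kept = new ArrayList<>();
        for (String dictionary : dictionaryRunOrder) {
            Set<String> seen = dupMap.computeIfAbsent(spanKey, k -> new HashSet<>());
            if (seen.add(term)) { // only the first dictionary to report the term gets through
                kept.add(dictionary);
            }
        }
        return kept;
    }
}
```

With the run order rxnorm then umls only the rxnorm hit survives, and with the order reversed only the umls hit survives, which matches the presence or absence of the MedicationMention.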

Is this expected behavior? If not, what would be the expected behavior?

Bruce Tietjen
Senior Software Engineer
IMAT Solutions
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com


RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-08 Thread Finan, Sean
Good point ...
I tried to check in to sourceforge but had problems.  I will try again right 
now ...

Building a custom dictionary is possible with the DictionaryTool in cTakes 
sandbox, but that is a different rabbit hole.

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Wednesday, October 08, 2014 11:52 AM
To: dev@ctakes.apache.org
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

If I understand correctly, I would need new dictionary resources to run the 
rare word lookup method.

Where can I find the necessary dictionary(ies) or how do I build them?


 [image: IMAT Solutions] <http://imatsolutions.com>  Bruce Tietjen Senior 
Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean < sean.fi...@childrens.harvard.edu> 
wrote:

>  Hi Bruce,
>
> I would venture to say that this is neither expected nor desired.
>
>
>
> Before you fix it (or in addition to a fix), try to run with the new
> dictionary lookup.   It will have a different behavior, and it will be the
> default dictionary lookup in future releases of cTakes – making fixes 
> to the current module slightly less urgent.
>
>
>
> Sean
>
>
>
> *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
> *Sent:* Wednesday, October 08, 2014 11:38 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Differences in MedicationMention annotations on subsequent 
> processing runs
>
>
>
>
>
> I have encountered a situation in which the cTakes clinical pipeline 
> output differs between multiple runs on the same text with the same 
> configuration.
>
> The following snippets from a single document are sufficient to 
> demonstrate the issue:
>
>  a gentle curve going into. irrigated with Bacitracin.
>
>
>
> The source of the difference is that the DictionaryLookupAnnotator 
> uses a map to filter out duplicate annotations for a single document location:
>
> // used to prevent duplicate hits
> // key = hit begin,end key (java.lang.String)
> // val = Set of MetaDataHit objects
> private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap<>();
>
>  This map is shared between both the umls_ms_2011ab lookup and the 
> umls_ms_2011an_rxnorm lookup,
>
>
>
> If both dictionaries contain the same term, the order of dictionary 
> lookup execution determines the output.If the rxnorm lookup runs 
> first, then a MedicationMention annotation for Bacitracin appears in 
> the final output. If the standard umls lookup runs first, then there 
> is no MedicationMention annotation for Bacitracin.
>
> I will attach the output from the subsequent runs. (Hopefully the 
> attachment will make it through the system)
>
>
>
> Is this expected behavior? If not, what would be the expected behavior?
>
>
>
> [image: Image removed by sender. IMAT Solutions] 
> <http://imatsolutions.com>
>
> *Bruce Tietjen*
> Senior Software Engineer
> [image: Image removed by sender. Mobile:]801.634.1547 
> bruce.tiet...@imatsolutions.com
>


RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-08 Thread Finan, Sean
Hi Bruce,

With Pei's help I just updated the sourceforge repo with the cTakes 
dictionaries.  Checkout artifact ctakes-resources-snomed-rword-hsqldb-2011ab

Sean

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Wednesday, October 08, 2014 11:52 AM
To: dev@ctakes.apache.org
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

If I understand correctly, I would need new dictionary resources to run the
rare word lookup method.

Where can I find the necessary dictionary(ies) or how do I build them?


 [image: IMAT Solutions] <http://imatsolutions.com>
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

>  Hi Bruce,
>
> I would venture to say that this is neither expected nor desired.
>
>
>
> Before you fix it (or in addition to a fix), try to run with the new
> dictionary lookup.   It will have a different behavior, and it will be the
> default dictionary lookup in future releases of cTakes – making fixes to
> the current module slightly less urgent.
>
>
>
> Sean
>
>
>
> *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
> *Sent:* Wednesday, October 08, 2014 11:38 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Differences in MedicationMention annotations on subsequent
> processing runs
>
>
>
>
>
> I have encountered a situation in which the cTakes clinical pipeline
> output differs between multiple runs on the same text with the same
> configuration.
>
> The following snippets from a single document are sufficient to
> demonstrate the issue:
>
>  a gentle curve going into. irrigated with Bacitracin.
>
>
>
> The source of the difference is that the DictionaryLookupAnnotator uses a
> map to filter out duplicate annotations for a single document location:
>
> // used to prevent duplicate hits
> // key = hit begin,end key (java.lang.String)
> // val = Set of MetaDataHit objects
> private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap<>();
>
>  This map is shared between both the umls_ms_2011ab lookup and the
> umls_ms_2011an_rxnorm lookup,
>
>
>
> If both dictionaries contain the same term, the order of dictionary lookup
> execution determines the output.If the rxnorm lookup runs first, then a
> MedicationMention annotation for Bacitracin appears in the final output. If
> the standard umls lookup runs first, then there is no MedicationMention
> annotation for Bacitracin.
>
> I will attach the output from the subsequent runs. (Hopefully the
> attachment will make it through the system)
>
>
>
> Is this expected behavior? If not, what would be the expected behavior?
>
>
>
> [image: Image removed by sender. IMAT Solutions]
> <http://imatsolutions.com>
>
> *Bruce Tietjen*
> Senior Software Engineer
> [image: Image removed by sender. Mobile:]801.634.1547
> bruce.tiet...@imatsolutions.com
>


RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-09 Thread Finan, Sean
> DictionaryLookupAnnotator which is a container for the dictionaries and it 
> iterates through the list of lookup dictionaries

I am confused.  The new dictionary-lookup-fast has neither this class nor 
multiple dictionaries.  The umls and rxnorm are in the same database table and 
lookup is performed in one swoop.  Could you please send a copy of your 
pipeline xmls to me directly (instead of bombing the group) with something 
other than an .xml extension (they get blocked)?



From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
Sent: Thursday, October 09, 2014 11:41 AM
To: dev@ctakes.apache.org
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

I tried the Dictionary-lookup-fast module and the behavior is the same. I did 
have to run it a number of times before the timing was right to reproduce the 
issue. With the older lookup, chances were about 50/50 between which dictionary 
ran first. Using the dictionary-fast, it seems more like 70/30 with the 
standard umls lookup being more likely to run first than not. Which means that 
most of the time, there is no MedicationMention annotation for Bacitracin.  
(See Attached)

The code with the issue is the DictionaryLookupAnnotator which is a container 
for the dictionaries and it iterates through the list of lookup dictionaries so 
that part of the code path does not seem to have changed.

In the past, the rxNorm dictionary was a Lucene search and so I'm guessing it 
behaved a little differently than it does now with both being JDBC.

The fact that the filter is at this location seems to indicate that it may have 
been intended to apply across all dictionaries. On the other hand, it 
appears to mask out the lookups for the different dictionaries, resulting in 
some annotations not being made.

So, the real question is how should the filter work -- should the annotation 
filtering be per lookup dictionary, or be across all dictionaries? Or is there 
something wrong elsewhere that causes

I lean towards having the filter function per dictionary. This may risk having 
duplicate annotations, but that would probably be better than missing the 
annotation altogether.
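
A per-dictionary filter could be sketched as follows. This is only an illustration of the idea; `PerDictionaryDupFilter`, its method, and the use of strings for span keys and terms are all hypothetical and not taken from the cTakes source.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class PerDictionaryDupFilter {
    // dictionary name -> span key -> terms already seen from that dictionary
    private final Map<String, Map<String, Set<String>>> seenByDictionary = new HashMap<>();

    // Returns true if this hit is new for the given dictionary and should be annotated.
    boolean accept(String dictionary, String spanKey, String term) {
        return seenByDictionary
                .computeIfAbsent(dictionary, d -> new HashMap<>())
                .computeIfAbsent(spanKey, k -> new HashSet<>())
                .add(term);
    }
}
```

Duplicates within one dictionary are still suppressed, while a hit from a second dictionary at the same span is kept regardless of which lookup runs first.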







[IMAT Solutions]<http://imatsolutions.com>
Bruce Tietjen
Senior Software Engineer
[Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com<mailto:bruce.tiet...@imatsolutions.com>

On Wed, Oct 8, 2014 at 10:02 AM, Finan, Sean 
mailto:sean.fi...@childrens.harvard.edu>> 
wrote:
Hi Bruce,

With Pei's help I just updated the sourceforge repo with the cTakes 
dictionaries.  Checkout artifact ctakes-resources-snomed-rword-hsqldb-2011ab

Sean

-Original Message-
From: Bruce Tietjen 
[mailto:bruce.tiet...@perfectsearchcorp.com<mailto:bruce.tiet...@perfectsearchcorp.com>]
Sent: Wednesday, October 08, 2014 11:52 AM
To: dev@ctakes.apache.org<mailto:dev@ctakes.apache.org>
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

If I understand correctly, I would need new dictionary resources to run the
rare word lookup method.

Where can I find the necessary dictionary(ies) or how do I build them?


 [image: IMAT Solutions] <http://imatsolutions.com>
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com<mailto:bruce.tiet...@imatsolutions.com>

On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu<mailto:sean.fi...@childrens.harvard.edu>> 
wrote:

>  Hi Bruce,
>
> I would venture to say that this is neither expected nor desired.
>
>
>
> Before you fix it (or in addition to a fix), try to run with the new
> dictionary lookup.   It will have a different behavior, and it will be the
> default dictionary lookup in future releases of cTakes – making fixes to
> the current module slightly less urgent.
>
>
>
> Sean
>
>
>
> *From:* Bruce Tietjen 
> [mailto:bruce.tiet...@perfectsearchcorp.com<mailto:bruce.tiet...@perfectsearchcorp.com>]
> *Sent:* Wednesday, October 08, 2014 11:38 AM
> *To:* dev@ctakes.apache.org<mailto:dev@ctakes.apache.org>
> *Subject:* Differences in MedicationMention annotations on subsequent
> processing runs
>
>
>
>
>
> I have encountered a situation in which the cTakes clinical pipeline
> output differs between multiple runs on the same text with the same
> configuration.
>
> The following snippets from a single document are sufficient to
> demonstrate the issue:
>
>  a gentle curve going into. irrigated with Bacitracin.
>
>
>
> The source of the difference is that the DictionaryLookupAnnotator uses a
> map to filter out duplicate annotations for a single document location:
>
> // used to prevent duplicate hits
> // key = hit begin,end key (java.lang.String)
> // val = Set of

RE: Differences in MedicationMention annotations on subsequent processing runs

2014-10-09 Thread Finan, Sean
I just ran the –fast with an example containing  bacitracin in four sentences, 
once being the first word and once being the last.  In ten of ten runs all four 
bacitracin mentions were discovered.
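
A repeated-run check of this kind can be automated with a few lines. The sketch below is generic and hypothetical: the supplier stands in for "run the pipeline on the document and summarize the output", which is not shown here.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Supplier;

final class RepeatRunCheck {
    // Runs the supplier the given number of times and returns the distinct results seen.
    static <T> Set<T> distinctResults(Supplier<T> run, int times) {
        Set<T> results = new LinkedHashSet<>();
        for (int i = 0; i < times; i++) {
            results.add(run.get());
        }
        return results;
    }
}
```

A single distinct result over all runs suggests stable output; more than one shows the variation directly. Differences that only appear between separate JVM launches would need the supplier to start a fresh process each time.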

You completely replaced the dictionary lookup with ?

  



From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Thursday, October 09, 2014 11:42 AM
To: dev@ctakes.apache.org
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

I tried the Dictionary-lookup-fast module and the behavior is the same. I did 
have to run it a number of times before the timing was right to reproduce the 
issue. With the older lookup, chances were about 50/50 between which dictionary 
ran first. Using the dictionary-fast, it seems more like 70/30 with the 
standard umls lookup being more likely to run first than not. Which means that 
most of the time, there is no MedicationMention annotation for Bacitracin.  
(See Attached)
The code with the issue is the DictionaryLookupAnnotator which is a container 
for the dictionaries and it iterates through the list of lookup dictionaries so 
that part of the code path does not seem to have changed.
In the past, the rxNorm dictionary was a Lucene search and so I'm guessing it 
behaved a little differently than it does now with both being JDBC.
The fact that the filter is at this location seems to indicate that it may have 
been intended to apply across all dictionaries. On the other hand, it 
appears to mask out the lookups for the different dictionaries, resulting in 
some annotations not being made.

So, the real question is how should the filter work -- should the annotation 
filtering be per lookup dictionary, or be across all dictionaries? Or is there 
something wrong elsewhere that causes
I lean towards having the filter function per dictionary. This may risk having 
duplicate annotations, but that would probably be better than missing the 
annotation altogether.





[IMAT Solutions]<http://imatsolutions.com>
Bruce Tietjen
Senior Software Engineer
[Mobile:]801.634.1547
bruce.tiet...@imatsolutions.com<mailto:bruce.tiet...@imatsolutions.com>

On Wed, Oct 8, 2014 at 10:02 AM, Finan, Sean 
mailto:sean.fi...@childrens.harvard.edu>> 
wrote:
Hi Bruce,

With Pei's help I just updated the sourceforge repo with the cTakes 
dictionaries.  Checkout artifact ctakes-resources-snomed-rword-hsqldb-2011ab

Sean

-Original Message-
From: Bruce Tietjen 
[mailto:bruce.tiet...@perfectsearchcorp.com<mailto:bruce.tiet...@perfectsearchcorp.com>]
Sent: Wednesday, October 08, 2014 11:52 AM
To: dev@ctakes.apache.org<mailto:dev@ctakes.apache.org>
Subject: Re: Differences in MedicationMention annotations on subsequent 
processing runs

If I understand correctly, I would need new dictionary resources to run the
rare word lookup method.

Where can I find the necessary dictionary(ies) or how do I build them?


 [image: IMAT Solutions] <http://imatsolutions.com>
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com<mailto:bruce.tiet...@imatsolutions.com>

On Wed, Oct 8, 2014 at 9:46 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu<mailto:sean.fi...@childrens.harvard.edu>> 
wrote:

>  Hi Bruce,
>
> I would venture to say that this is neither expected nor desired.
>
>
>
> Before you fix it (or in addition to a fix), try to run with the new
> dictionary lookup.   It will have a different behavior, and it will be the
> default dictionary lookup in future releases of cTakes – making fixes to
> the current module slightly less urgent.
>
>
>
> Sean
>
>
>
> *From:* Bruce Tietjen 
> [mailto:bruce.tiet...@perfectsearchcorp.com<mailto:bruce.tiet...@perfectsearchcorp.com>]
> *Sent:* Wednesday, October 08, 2014 11:38 AM
> *To:* dev@ctakes.apache.org<mailto:dev@ctakes.apache.org>
> *Subject:* Differences in MedicationMention annotations on subsequent
> processing runs
>
>
>
>
>
> I have encountered a situation in which the cTakes clinical pipeline
> output differs between multiple runs on the same text with the same
> configuration.
>
> The following snippets from a single document are sufficient to
> demonstrate the issue:
>
>  a gentle curve going into. irrigated with Bacitracin.
>
>
>
> The source of the difference is that the DictionaryLookupAnnotator uses a
> map to filter out duplicate annotations for a single document location:
>
> // used to prevent duplicate hits
> // key = hit begin,end key (java.lang.String)
> // val = Set of MetaDataHit objects
> private Map<String, Set<MetaDataHit>> iv_dupMap = new HashMap<>();
>
>  This map is shared between both the umls_ms_2011ab lookup and the
> umls_ms_2011an_rxnorm lookup,
>
>
>
> If both dictionaries contain the same 

RE: Need information regarding cTakes changes

2014-10-20 Thread Finan, Sean
Hi Chandu,
For your note #2:
> 2)Any new features that can be added to current version of cTakes 
> project to make it more useful.
You can always check (or add to) the Jira "future enhancement" page at:
https://issues.apache.org/jira/browse/CTAKES/fixforversion/12323040/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Sean

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Monday, October 20, 2014 2:40 PM
To: dev@ctakes.apache.org
Subject: Re: Need information regarding cTakes changes


On 10/17/2014 05:23 AM, sarath chandra Reddy wrote:
> Hi,
>
> I am not proposing any changes, as I do not have much knowledge about 
> the cTakes project code. I am asking the people who are currently 
> working on the development of the next cTakes version. I need their help in 
> answering the questions mentioned in my previous mail.
>
> 1) Any possible improvements that can be made to the current cTakes 
> version to improve its efficiency? For example, code-level and design-level changes.
Well, the new "fast" dictionary module should solve one of the biggest issues, 
the bottleneck of the dictionary lookup. Beyond that, it would be nice to 
decrease the memory footprint of the dependency parser.

> 2)Any new features that can be added to current version of cTakes 
> project to make it more useful.
Using UIMA-AS allows for scale-out, which in combination with the fast dictionary can 
allow very fast processing. Maybe it's not a feature per se, and maybe it will 
come from an outside project, but I think infrastructure that makes it easy to 
get a highly parallel and very fast version of ctakes up and running would be a 
nice addition.
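
To make the "highly parallel" idea a little more concrete, here is a minimal sketch using only a plain Java thread pool. It is not UIMA-AS and not an existing cTAKES class. The pipelineFactory parameter is a hypothetical stand-in for "build one pipeline instance", and the sketch assumes a pipeline instance should not be shared between threads, so each worker thread builds its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Supplier;

final class ParallelNotes {
    // pipelineFactory: hypothetical stand-in that builds one text-in, text-out pipeline.
    // Each worker thread lazily creates and then reuses its own instance.
    static List<String> processAll(List<String> notes,
                                   Supplier<Function<String, String>> pipelineFactory,
                                   int threads) {
        ThreadLocal<Function<String, String>> perThread = ThreadLocal.withInitial(pipelineFactory);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String note : notes) {
                Callable<String> task = () -> perThread.get().apply(note);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so results line up with the input notes.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Building one instance per thread trades memory for speed, which ties back to the point above about the memory footprint of the dependency parser.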

(By the way, that's just one interesting example that came to mind, not 
necessarily the most important or highest priority!)

Tim


> I humbly request the developers to provide me information regarding these.
>
> Regards,
> Chandu
>
> On Thu, Oct 16, 2014 at 8:31 PM, Chen, Pei 
> 
> wrote:
>
>> Chanda,
>> Could you describe what types of changes you are proposing.
>>
>> We'll welcome any contributions.
>>
>> Sent from my iPhone
>>
>>> On Oct 16, 2014, at 5:21 PM, sarath chandra Reddy 
>>> 
>> wrote:
>>> Hi,
>>>
>>> I am doing a research work on cTakes. I request the developers working on
>>> the development of cTakes project to answer the following questions.
>>> Connect me with the right persons.
>>>
>>> -->I need three major possible improvements to the cTakes current design
>>> -->Also three new features that can be added to the current cTakes project
>>> I am waiting for your responses. Thank you in advance.
>>>
>>> Regards,
>>> Chandu



RE: ctakes-dictionary-lookup-fast

2014-11-07 Thread Finan, Sean
By Pei:
> As much as I hate maintaining more desc xml's, I think it's prudent to 
> temporarily create a separate one for ctakes-dictionary-lookup-fast in a 
> patch release, so users do not get blindsided by the change in 
> output.

By Sean:
Excellent idea


-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, November 07, 2014 11:13 AM
To: dev@ctakes.apache.org
Subject: RE: ctakes-dictionary-lookup-fast


sounds good to me. 

thanks for attaching the images to the JIRA. 

-- James

From: Chen, Pei [pei.c...@childrens.harvard.edu]
Sent: Friday, November 07, 2014 10:00 AM
To: dev@ctakes.apache.org
Subject: RE: ctakes-dictionary-lookup-fast

Attached screenshots of CVD output to the Jira[1].
As much as I hate maintaining more desc xml's, I think it's prudent to 
temporarily create a separate one for ctakes-dictionary-lookup-fast in a 
patch release, so users do not get blindsided by the change in 
output.
So users can still choose the existing behavior: 
AggregatePlaintextUMLSProcessor.xml
Or the new dictionary lookup: AggregatePlaintextFastUMLSProcessor.xml

[1] https://issues.apache.org/jira/browse/CTAKES-325

We can replace the xml's in the next major/minor release...
--Pei

> -Original Message-
> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> Sent: Thursday, November 06, 2014 10:17 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: ctakes-dictionary-lookup-fast
>
> The image  didn't come through for me. Can you post the image 
> somewhere and send the url? Thanks.
>
>
> From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
> Sent: Thursday, November 06, 2014 2:55 PM
> To: dev@ctakes.apache.org
> Subject: ctakes-dictionary-lookup-fast
>
> Hi,
> The original plan was to update AggregatePlaintextUMLSProcessor.xml to 
> use the new ultrafast dictionary lookup in the upcoming 3.2.1 release.
> However, the output is slightly different from the old cTAKES dictionary 
> lookup: it no longer has a SNOMED/RXNORM consumer (it returns CUIs only 
> and doesn't post-process to map back to the SNOMED/RXNORM codes).  This 
> can certainly be done again, but I am not sure how many people are 
> dependent on the AggregatePlaintextUMLSProcessor.xml to consider this 
> a patch release.
> Some Options/Ideas:
>
> 1)  Create a AggreatePlaintextUMLSFastProcessor.xml which defaults to
> dictionary-lookup-fast. But doesn't return the codes for now.  We 
> replace the default pipeline when SNOMED/RXNORM codes are returned again.
>
> 2)  Push forward with defaulting to the new dictionary-lookup-fast in
> AggregatePlaintextUMLSProcessor.xml
>
> Example output of dictionary-lookup-fast:
>
> [cid:image001.png@01CFF9D9.E5D2CA50]


RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download

2014-11-14 Thread Finan, Sean
Hi Andy,

Great stuff!  I think that I understand the method, but I have a question about 
the statement:

>the content is publicly available per the NCBI policy and license for MedGen 
>sources

Does this mean that I, Joe Anybody, could download the content, place some of 
the content in a database structured in my own fashion, package the -new- 
database, and include it in a cTakes distribution?
Or, does it mean that content downloaded by script is usable as-is and only 
as-is?  The whole "if I'd known you were going to do that I wouldn't have 
given it to you ..."

Thanks,
Sean


From: andy mcmurry [mcmurry.a...@gmail.com]
Sent: Thursday, November 13, 2014 6:59 PM
To: dev@ctakes.apache.org
Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open 
access download

Pei: Yes, specifically:

The source code was released by Invitae under Apache ASL 2.0 per my request
and with full blessing from our legal counsel and software team. I also
reviewed in principle the idea with John Wilbanks of Sage Bionetworks (and
formerly creative commons). This is legit, or I wouldn't have spent tons of
hours doing it.

The raw content is a set of scripts which wget a list of URLS from the NCBI
public FTP repositories. This code DOES NOT redistribute any content
whatsoever, just a list of URLs to download, unzip, and insert into a local
mysql database. To repeat: I am NOT circulating any content, just URL links
-- you must download the content yourself. And that is the beauty -- all
content is downloaded BY THE USER and the content is publicly available per
the NCBI policy and license for MedGen sources.


On Thu, Nov 13, 2014 at 11:18 AM, Chen, Pei 
wrote:

> John- I believe that was the thinking.
> Andy- Just to confirm- Is the raw content of this dataset released under
> ASL2.0?  i.e. can you contribute it as a CSV or similar so that cTAKES may
> re-tokenize it using the same PTB rules, format it for cTAKES' dictionary
> lookup, etc., and then redistribute it under the same License.
>
> > -Original Message-
> > From: John Green [mailto:john.travis.gr...@gmail.com]
> > Sent: Thursday, November 13, 2014 1:55 PM
> > To: dev@ctakes.apache.org
> > Cc: dev@ctakes.apache.org
> > Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available
> > as open access download
> >
> > The old licensed setup would be kept as a packaged option? Much as it is
> > now With the unlicensed going out in place of the current "free"
> > dictionary? Am I understanding that right?
> >
> >
> > JG
> > —
> > Sent from Mailbox
> >
> > On Thu, Nov 13, 2014 at 1:40 PM, andy mcmurry
> > 
> > wrote:
> >
> > > I'll crunch the numbers -- in the meantime I can tell you that
> > > phenotypes vary by semantic type. clinical attributes  from SNOMED are
> > > abundant, many concepts in mesh that are mapped to diseases. Tons of
> > > "pharmacological substances"
> > > On Nov 12, 2014 6:19 AM, "Dligach, Dmitriy" <
> > > dmitriy.dlig...@childrens.harvard.edu> wrote:
> > >> Andy, thank you for this resource!
> > >>
> > >> Do you have an estimate of what percentage of UMLS concepts were left
> > out?
> > >>
> > >> Dima
> > >>
> > >>
> > >>
> > >>
> > >> On Nov 11, 2014, at 16:02, andy mcmurry 
> > wrote:
> > >>
> > >> > Hello!
> > >> >
> > >> > https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2)
> > >> >
> > >> > We just released a new library containing a huge chunk of UMLS
> > >> > concepts which are available without registering
> > accounts/username/passwords.
> > >> > LEGALLY. Yes, really!
> > >> >
> > >> > The subset is from NCBI and it contains *thousands of concepts from
> > >> SNOMED
> > >> > and other vocabularies*.
> > >> >
> > >> > The code is essentially
> > >> > 1. a list of WGET targets to various NCBI FTP site mirrors
> > >> > 2. Makefile for building the databases of interest
> > >> >
> > >> > Our legal team has approved distribution for Open Access work, ASL2
> > >> > LICENSE.
> > >> >
> > >> > I recommend we use this opportunity to make this the default
> > >> > distribution for CTAKES UMLS connections, because it obviates the
> > >> > need for so much painful credentialing and back and forth
> > >> > agreements with the US National Library of Medicine.
> > >> >
> > >> > Cheers!
> > >> > --Andy
> > >> >
> > >> >
> > >> > On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. <
> > >> masanz.ja...@mayo.edu>
> > >> > wrote:
> > >> >
> > >> >>
> > >> >> I would love to see the install be as simple as apt-get install to end up
> > >> >> with some working dictionary that have more than a handful of
> > >> >> entries to get them started.
> > >> >>
> > >> >> Regards,
> > >> >> James Masanz
> > >> >>
> > >> >> -Original Message-
> > >> >> From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
> > >> >> Sent: Tuesday, September 09, 2014 4:32 PM
> > >> >> To: ctakes-...@incubator.apache.org
> > >> >> Subject: Recommendation for ctakes default (UMLS) dictio

RE: Asking help for always unsuccessful AE load

2014-12-04 Thread Finan, Sean
Hi Jun,

Do AE pipelines that do not use the Smoking Status module work?

I think that Smoking Status configuration (via binary install) might be broken 
in the last several versions.  I thought that I had submitted a Jira long, long 
ago, but right now I can't find it so maybe my memory is playing games.  I have 
gotten the module to work, but it took hours to find and fix the problems.  If 
you can get other AEs to run then let me know and I'll try to find my working 
setup and diff it with the cTakes install tomorrow.  If I remember correctly I 
had to move (unpack) some things from lib/ jars to resources/ and change a path 
or two in the desc/ xmls.
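
For background, "URI is not hierarchical" is the message java.io.File gives when it is handed a jar: URI, which is what a classpath resource resolves to when it sits inside a jar. Whether that is exactly what goes wrong in the Smoking Status descriptors is an assumption on my part, though it would fit with unpacking things from lib/ jars into resources/. A minimal sketch, with made-up paths:

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;

final class JarResourceDemo {
    // Returns the exception message from new File(uri), or null if a File can be built.
    static String fileFailure(URI uri) {
        try {
            new File(uri);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    // Stream-based access works for loose files and for entries packed in a jar alike.
    // Returns null when the resource is not on the classpath.
    static InputStream openResource(Class<?> anchor, String resourcePath) {
        return anchor.getResourceAsStream(resourcePath);
    }

    public static void main(String[] args) {
        // Both paths are hypothetical and do not need to exist for this check.
        URI insideJar = URI.create("jar:file:/opt/ctakes/lib/example.jar!/data/model.bin");
        URI unpacked = URI.create("file:/opt/ctakes/resources/data/model.bin");
        System.out.println(fileFailure(insideJar));
        System.out.println(fileFailure(unpacked));
    }
}
```

The first call reports "URI is not hierarchical" and the second reports no failure, which is why code that insists on a File works once the data is unpacked to a directory.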

Sean


From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Wednesday, December 03, 2014 10:52 AM
To: dev@ctakes.apache.org
Subject: Re: Asking help for always unsuccessful AE load

Hi Jun,

I know this has been a problem in some versions. What version are you using? 
Could you try this out on the latest release candidate to see if it is still an 
issue?

Thanks,

[IMAT Solutions]
Kim Ebert
Software Engineer
[Office:]801.669.7342
kim.eb...@imatsolutions.com
On 12/02/2014 08:28 PM, Ying, Jun wrote:

Dear Sir/Madam,

When I load some AEs in cTakes, such as "SimulatedProdSmokingTAE.xml", it always 
throws the exception "java.lang.IllegalArgumentException: URI is not 
hierarchical". Why does this happen, and how can I fix it?

Thanks.








The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.





RE: Scaling cTakes

2014-12-05 Thread Finan, Sean
Hi Brandon,

It sounds like you've got  a decent pipeline set up.  To increase the speed you 
could try swapping out use of ctakes-dictionary-lookup with 
ctakes-dictionary-lookup-fast in the AE.  Check 
ctakes-clinical-pipeline/desc/[ae]/AggregatePlaintextFastUMLSProcessor.xml for 
an example.  As for the CASPool, I don't think that it will make any difference 
for cTakes.  

Sean

From: Geise, Brandon D. [bdge...@geisinger.edu]
Sent: Friday, December 05, 2014 12:40 PM
To: dev@ctakes.apache.org
Subject: Scaling cTakes

Hi,

I'm new to cTakes and the UIMA framework.  I've read most of the UIMA 
documentation and was able to take the BagofCUIGenerator example and modify it to 
read notes from a DB, process them using the UMLS AE in the clinical-pipeline with 
a local DB version of UMLS, and output the CUIs to a DB.  However, the problem 
I'm having is that it's extremely slow: ~3.5-4 notes a minute.  I was hoping I could 
get some hints or advice on speeding the process up.  I read there's a patch 
for LVG, but wasn't quite sure how to implement it.  Also, from testing using the 
CPE GUI, I don't notice any difference in processing time when adjusting the 
CASPool setting.  Some advice on the CASPool would also be appreciated.

Thanks,
Brandon


IMPORTANT WARNING: The information in this message (and the documents attached 
to it, if any) is confidential and may be legally privileged. It is intended 
solely for the addressee. Access to this message by anyone else is 
unauthorized. If you are not the intended recipient, any disclosure, copying, 
distribution or any action taken, or omitted to be taken, in reliance on it is 
prohibited and may be unlawful. If you have received this message in error, 
please delete all electronic copies of this message (and the documents attached 
to it, if any), destroy any hard copies you may have created and notify me 
immediately by replying to this email. Thank you.

Geisinger Health System utilizes an encryption process to safeguard Protected 
Health Information and other confidential data contained in external e-mail 
messages. If email is encrypted, the recipient will receive an e-mail 
instructing them to sign on to the Geisinger Health System Secure E-mail 
Message Center to retrieve the encrypted e-mail.


RE: Scaling cTakes

2014-12-09 Thread Finan, Sean
Hi Brandon,

You are welcome.  I was hoping that you'd get the note processing time down to 
under a second with the different lookup, but I guess not.  I think that any 
optimization from here really depends upon what information you want to extract 
from the notes.

Sean

From: Geise, Brandon D. [bdge...@geisinger.edu]
Sent: Tuesday, December 09, 2014 9:13 AM
To: dev@ctakes.apache.org
Subject: RE: Scaling cTakes

Thanks again Sean for the advice.  Just by changing the pipeline to use the 
fast dictionary led to quadrupling the processing speed.  Any other suggestions 
on performance tuning would be great!

Thanks,
Brandon

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, December 05, 2014 1:14 PM
To: dev@ctakes.apache.org
Subject: RE: Scaling cTakes

Hi Brandon,

It sounds like you've got  a decent pipeline set up.  To increase the speed you 
could try swapping out use of ctakes-dictionary-lookup with 
ctakes-dictionary-lookup-fast in the AE.  Check 
ctakes-clinical-pipeline/desc/[ae]/AggregatePlaintextFastUMLSProcessor.xml for 
an example.  As for the CASPool, I don't think that it will make any difference 
for cTakes.

Sean

From: Geise, Brandon D. [bdge...@geisinger.edu]
Sent: Friday, December 05, 2014 12:40 PM
To: dev@ctakes.apache.org
Subject: Scaling cTakes

Hi,

I'm new to cTakes and the UIMA framework.  I've read most of the UIMA 
documentation and was able to take the BagofCUIGenerator example and modify to 
read notes from a DB, process using the UMLS AE in the clinical-pipeline using 
a local DB version of UMLS, and output the CUIs to a DB.  However, the problem 
I'm having is it's extremely slow; ~3.5-4 notes a minute.  I was hoping I could 
get some hints or advice on speeding the process up.  I read there's a patch 
for LVG, but wasn't quite sure how to implement.  Also from testing using the 
CPE GUI, I don't notice any different in processing time by adjusting the 
CASPool setting.  Some advice on the CASPool would be appreciated also.

Thanks,
Brandon





RE: Links Not Working

2014-12-12 Thread Finan, Sean
Hi Kasie,

cTakes is a community effort, so you've contacted the right people.  Assuming 
that the "Bug Tracker" link in the navigation bar on the left works, please 
submit a report and list all of the orphan links.  A kindly volunteer will fix 
them as soon as possible.

Thanks,
Sean

-Original Message-
From: kasie.allen [mailto:kasie.al...@world.edu] 
Sent: Friday, December 12, 2014 11:39 AM
To: dev@ctakes.apache.org
Subject: Links Not Working

Hi!

I came across a few links that aren't working on your website. Do you mind 
telling me who I should contact about them?

Thanks! :)
Kasie

-- 

Kasie Allen


RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-15 Thread Finan, Sean
Hmmm, I can't find it in a search.  However, here is a direct link:

https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ

Maybe it needs a few videos to register in the search engine ?

Sean

-Original Message-
From: Pei Chen [mailto:chen...@apache.org] 
Sent: Monday, December 15, 2014 11:32 AM
To: dev@ctakes.apache.org
Subject: Re: intro video and ctakes youtube

John,
I presume you mean this thread:
http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3c393252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E

Strange, I couldn't find it anymore either... The place holder could have been 
auto deleted because it was empty?  I think it's worth it if you're willing to 
create and add to it again...

---Pei

On Fri, Dec 12, 2014 at 11:46 PM, John Green 
wrote:
>
> I was going to post some basic how to videos that help with the 
> learning curve I've walked over the last year and a half. I went 
> looking for ctakes youtube channel mentioned awhile back and I did not find 
> it...
>
> Anyone know where it went?
>
> Best,
> JG
>


RE: revamping the Apache cTAKES website

2014-12-15 Thread Finan, Sean
Wow, I've just spent the last 2 hours doing the exact same thing.  That is what 
I get for missing a meeting.  Mine is extremely similar, though slightly 
different language (and without the "improved performance" bar chart - which 
may not belong).  I also put the "Examples" in a big green button right above 
"Download".  Anyway, same general idea - focus on a couple primary things, 
leave lengthy text like the current page for elsewhere.

-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu] 
Sent: Monday, December 15, 2014 4:33 PM
To: dev@ctakes.apache.org
Subject: RE: revamping the Apache cTAKES website

Check out a mockup of a new website proposal:
http://svn.apache.org/repos/asf/ctakes/site/new/index.html
Based off bootstrap (Idea borrowed from the Spark folks..).

Couple of key pieces of info:
- 10% of visitors are on mobile/tablets
- The most currently visited pages are: downloads.cgi, gettingstarted.html.  I 
suggest we focus our attention on those 2 items.  (Putting a Downloads link 
right on the front page, etc.)

svn co http://svn.apache.org/repos/asf/ctakes/site/new if you want to checkout 
the code of the site.

--Pei

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com]
Sent: Friday, December 05, 2014 6:34 PM
To: dev@ctakes.apache.org
Cc: dev@ctakes.apache.org
Subject: RE: revamping the Apache cTAKES website

I would like to second the bootstrap recommendation, with the additional 
recommendation of django for the backend. It is an amazing platform for rapid 
development and easy updating.


JG
—
Sent from Mailbox

On Fri, Dec 5, 2014 at 12:15 PM, Savova, Guergana 
 wrote:

> There are now 4 volunteers:
> Michelle Chen
> Pei Chen
> Sean Finan
> Guergana Savova
> --Guergana
> -Original Message-
> From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu]
> Sent: Friday, December 05, 2014 11:56 AM
> To: dev@ctakes.apache.org
> Subject: RE: revamping the Apache cTAKES website
>
> Wonderful, thank you, Michelle! There will be a flurry of emails the week of 
> Dec 15 followed by actual work, so book your calendar if possible...
> --Guergana
> -Original Message-
> From: Michelle Chen [mailto:michelle1919c...@gmail.com]
> Sent: Friday, December 05, 2014 11:48 AM
> To: dev@ctakes.apache.org
> Subject: Re: revamping the Apache cTAKES website
>
> Hello Guergana,
> I don't know that much about cTakes, but would be interested in contributing to 
> the effort.
> I'm not sure if there is an interest in matching the website design of other 
> Apache projects, but it seems that the two main designs that are being used 
> from my arbitrary search on http://projects.apache.org/indexes/alpha.html is 
> 1. the current design that cTakes is using and 2. a Bootstrap approach.
> I've done a little bit of work on Bootstrap and would be interested in 
> helping with that. Let me know how I can be helpful.
> Sincerely,
> Michelle Chen :)
> "Be strong and of good courage; do not be afraid, nor be dismayed, for 
> the Lord your God is with you wherever you go." ~Joshua 1:9
>
> On Fri, Dec 5, 2014 at 11:21 AM, Savova, Guergana <guergana.sav...@childrens.harvard.edu> wrote:
>> cTAKES-ers,
>>
>> we would like to start working on updating the Apache cTAKES website
>> - some of the information there is already stale and needs refreshing.
>> Do you have ideas on website design, content, etc.? Would you like to 
>> contribute to the effort? We are planning to start working on the 
>> website the week of Dec 15.
>>
>> Cheers,
>> --Guergana
>>
>>


RE: revamping the Apache cTAKES website

2014-12-15 Thread Finan, Sean
Anyway, a pretty amazing fresh start, thanks Pei


RE: Problem running cTakes-clinical pipeline --> AggregatePlaintextFastUMLSProcessor.xml

2014-12-15 Thread Finan, Sean
Hi Yu,

> Also, do you know whether there is any command line I can run to annotate, say, a 
> thousand files automatically rather than copying and pasting?

You could try the CPE gui : bin/runctakesCPE.sh
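
If you would rather script it than use the GUI, the general shape of an unattended batch run is sketched below. This is only an outline under assumptions: the annotate function is a hypothetical stand-in for whatever runs your pipeline on one note, and the ".out" suffix and directory layout are made up. It is not a cTAKES command-line tool.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BatchAnnotate {
    // annotate: hypothetical stand-in for "run the pipeline on one note's text".
    // Writes one "<name>.out" file per input file and returns how many notes were processed.
    static int annotateDirectory(Path inputDir, Path outputDir, Function<String, String> annotate) {
        try {
            Files.createDirectories(outputDir);
            List<Path> notes;
            try (Stream<Path> files = Files.list(inputDir)) {
                notes = files.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
            }
            for (Path note : notes) {
                String text = new String(Files.readAllBytes(note), StandardCharsets.UTF_8);
                String result = annotate.apply(text);
                Path out = outputDir.resolve(note.getFileName().toString() + ".out");
                Files.write(out, result.getBytes(StandardCharsets.UTF_8));
            }
            return notes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keep the output directory outside the input directory so that results are not picked up as new notes on a later run.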

Sean

From: Liang, Yu [mailto:yu.li...@nyumc.org]
Sent: Monday, December 15, 2014 4:51 PM
To: dev@ctakes.apache.org
Subject: Problem running cTakes-clinical pipeline --> 
AggregatePlaintextFastUMLSProcessor.xml



Hi Yu,
I think this is a current limitation in cTAKES.  I think it has to do with 
negation not detecting if the line breaks are separating the sentences.

Would you mind forwarding the example to 
dev@ctakes.apache.org?
I think Tim and others may be working on this issue.

--Pei

On Mon, Dec 15, 2014 at 3:54 PM, Liang, Yu <yu.li...@nyumc.org> wrote:
On Dec 15, 2014, at 2:58 PM, Liang, Yu <yu.li...@nyumc.org> wrote:

Hi Pei Chen,

Could you please look at the following example I ran? I think the result is not 
accurate. The polarity of illness is -1, but for fever, vomiting, diarrhea, and 
pain the polarities are all +1.

Also, do you know whether there is any command line I can run to annotate, say, a 
thousand files automatically rather than copying and pasting?

Yu Liang


[cid:DF19883E-B993-4CD0-90BD-F285A3C1A5A3@wireless.nyumc.org]
Yu Liang

CHIBI







RE: UMLS Integration

2014-12-15 Thread Finan, Sean
Hi Praveen,

I think that this question might be better aimed at the nlm umls community.  
The standard cTakes installation does not follow this workflow.
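
That said, an "invalid LOC header" from java.util.zip often points to a zip or jar archive that is damaged or incomplete, for example after an interrupted download. Which archive MetamorphoSys was reading is not shown in the trace, so treat this as a guess. A small sketch for checking a suspect archive by reading every entry to the end (the class and method names are made up):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipCheck {
    // Returns null if every entry can be read to the end,
    // otherwise a short description of the first failure found.
    static String firstProblem(Path archive) {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            byte[] buffer = new byte[8192];
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Reading to the end forces decompression and the CRC check.
                    while (in.read(buffer) != -1) {
                        // discard the data; only readability matters here
                    }
                } catch (IOException e) {
                    return entry.getName() + ": " + e.getMessage();
                }
            }
            return null;
        } catch (IOException e) {
            // Covers archives that cannot be opened at all.
            return archive + ": " + e.getMessage();
        }
    }
}
```

If the check reports a problem, downloading the file again is usually the first thing to try.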

Sean

-Original Message-
From: Jay_Ram [mailto:pandupraveen...@gmail.com] 
Sent: Tuesday, December 16, 2014 12:10 AM
To: dev@ctakes.apache.org
Subject: UMLS Integration

Hi All,

I downloaded the UMLS resources to use them offline by loading them into MySQL. I followed 
the steps that are mentioned for loading the data into MySQL, but I am unable to do it; it shows 
this error:

Loading MetamorphoSys ...
[Please be patient and wait for MetamorphoSys to begin]

java.util.zip.ZipException: invalid LOC header (bad signature)
at java.util.zip.ZipFile.read(Native Method)
at java.util.zip.ZipFile.access$1400(Unknown Source)
at java.util.zip.ZipFile$ZipFileInputStream.read(Unknown Source)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.CheckedInputStream.read(Unknown Source)
at java.util.zip.GZIPInputStream.readUByte(Unknown Source)
at java.util.zip.GZIPInputStream.readUShort(Unknown Source)
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at gov.nih.nlm.umls.meta.io.RRFMetadataInputStream.openSourceFile(RRFMetadataInputStream.java:390)
at gov.nih.nlm.umls.meta.io.RRFConceptInputStream.open(RRFConceptInputStream.java:175)
at gov.nih.nlm.umls.meta.io.RRFMetathesaurusInputStream.open(RRFMetathesaurusInputStream.java:125)
at gov.nih.nlm.umls.mmsys.io.RRFMetamorphoSysInputStream.open(RRFMetamorphoSysInputStream.java:629)
at gov.nih.nlm.umls.mmsys.subset.gui.MetamorphoSysGUI.validateGUIConfigurables(MetamorphoSysGUI.java:1097)
at gov.nih.nlm.umls.mmsys.subset.gui.BeginSubsetAction.actionPerformed(BeginSubsetAction.java:110)
at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
at javax.swing.AbstractButton.doClick(Unknown Source)
at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
at java.awt.AWTEventMulticaster.mouseReleased(Unknown Source)
at java.awt.Component.processMouseEvent(Unknown Source)
at javax.swing.JComponent.processMouseEvent(Unknown Source)
at java.awt.Component.processEvent(Unknown Source)
at java.awt.Container.processEvent(Unknown Source)
at java.awt.Component.dispatchEventImpl(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Window.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
at java.awt.EventQueue.access$200(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue.dispatchEvent(Unknown Source)
at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.run(Unknown Source)

Please help to solve this issue. Thanks in advance.

Regards,
Praveen.


RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-16 Thread Finan, Sean
Hi John,

Look for an "Upload" button in the upper-left corner next to a blue "Sign in" 
button.

Sean

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com] 
Sent: Tuesday, December 16, 2014 11:12 AM
To: dev@ctakes.apache.org
Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel 
Direct Link

That is, how do we upload videos *to the channel. *

On Tue, Dec 16, 2014 at 11:09 AM, John Green wrote:
>
> How do we upload videos we wish to contribute? I don't have any 
> experience with youtube other than as a watcher.
>
> JG
>
> On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean < 
> sean.fi...@childrens.harvard.edu> wrote:
>>
>> Hmmm, I can't find it in a search.  However, here is a direct link:
>>
>> https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ
>>
>> Maybe it needs a few videos to register in the search engine ?
>>
>> Sean
>>
>> -Original Message-
>> From: Pei Chen [mailto:chen...@apache.org]
>> Sent: Monday, December 15, 2014 11:32 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: intro video and ctakes youtube
>>
>> John,
>> I presume you mean this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C393252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E
>>
>> Strange, I couldn't find it anymore either... The place holder could 
>> have been auto deleted because it was empty?  I think it's worth it 
>> if you're willing to create and add to it again...
>>
>> ---Pei
>>
>> On Fri, Dec 12, 2014 at 11:46 PM, John Green wrote:
>> >
>> > I was going to post some basic how-to videos that help with the 
>> > learning curve I've walked over the last year and a half. I went 
>> > looking for ctakes youtube channel mentioned awhile back and I did 
>> > not find it...
>> >
>> > Anyone know where it went?
>> >
>> > Best,
>> > JG
>> >
>>
>


RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-17 Thread Finan, Sean
Hmmm, well this is a ticker:

http://www.ampercent.com/upload-videos-youtube-channel-without-knowing-username-password/9374/



-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com] 
Sent: Wednesday, December 17, 2014 2:08 PM
To: dev@ctakes.apache.org
Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel 
Direct Link

Isn't this for uploading to my own account? What about uploading to the channel?

On Tue, Dec 16, 2014 at 12:16 PM, Finan, Sean < 
sean.fi...@childrens.harvard.edu> wrote:
>
> Hi John,
>
> Look for an "Upload" button in the upper-left corner next to a blue 
> "Sign in" button.
>
> Sean
>
> -Original Message-
> From: John Green [mailto:john.travis.gr...@gmail.com]
> Sent: Tuesday, December 16, 2014 11:12 AM
> To: dev@ctakes.apache.org
> Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes 
> Channel Direct Link
>
> That is, how do we upload videos *to the channel. *
>
> On Tue, Dec 16, 2014 at 11:09 AM, John Green wrote:
> >
> > How do we upload videos we wish to contribute? I don't have any 
> > experience with youtube other than as a watcher.
> >
> > JG
> >
> > On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean < 
> > sean.fi...@childrens.harvard.edu> wrote:
> >>
> >> Hmmm, I can't find it in a search.  However, here is a direct link:
> >>
> >> https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ
> >>
> >> Maybe it needs a few videos to register in the search engine ?
> >>
> >> Sean
> >>
> >> -Original Message-
> >> From: Pei Chen [mailto:chen...@apache.org]
> >> Sent: Monday, December 15, 2014 11:32 AM
> >> To: dev@ctakes.apache.org
> >> Subject: Re: intro video and ctakes youtube
> >>
> >> John,
> >> I presume you mean this thread:
> >>
> >> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C393252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E
> >>
> >> Strange, I couldn't find it anymore either... The place holder 
> >> could have been auto deleted because it was empty?  I think it's 
> >> worth it if you're willing to create and add to it again...
> >>
> >> ---Pei
> >>
> >> On Fri, Dec 12, 2014 at 11:46 PM, John Green wrote:
> >> >
> >> > I was going to post some basic how-to videos that help with the 
> >> > learning curve I've walked over the last year and a half. I went 
> >> > looking for ctakes youtube channel mentioned awhile back and I 
> >> > did not find it...
> >> >
> >> > Anyone know where it went?
> >> >
> >> > Best,
> >> > JG
> >> >
> >>
> >
>


RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Well, I guess that it is time for me to speak up …

I must say that I’m happy that people are showing interest in the fast lookup.  
I am also happy (sort of) that some concerns are being raised – and that there 
is now community participation in my little toy.  I  have some concerns about 
what people are reporting.  This does not coincide with what I have seen at 
all.  Yesterday I started (without knowing this thread existed) testing a 
bare-minimum pipeline for CUI extraction.  It is just the stripped-down 
Aggregate with only: segment, tokens, sentences, POS, and the fast lookup.  The 
people at Children’s wanted to know how fast we could get.  1,196 notes in 
under 90 seconds on my laptop with over 210,000 annotations, which is 175/note. 
 After reading the thread I decided to run the fast lookup with several 
configurations.  I also ran the default for 10.5 hours.  I am comparing the 
annotations from each system against the human annotations that we have, and I 
will let everybody know what I find – for better or worse.

The fast lookup does not (out-of-box) do the exact same thing as the default.  
Some things can be configured to make it more closely approximate the default 
dictionary.

1.   Set the minimum annotation span length to 2 (default is 3).  This is 
in desc/[ae]/UmlsLookupAnnotator.xml : line #78.  The annotator should then 
pick up text like “CT” and improve recall, but it will hurt precision.

2.   Set the Lookup Window to LookupWindowAnnotation.  This is in 
desc/[ae]/UmlsLookupAnnotator.xml : lines #65 & #93.  The LookupWindowAnnotator 
will need to be added to the aggregate pipeline 
AggregatePlaintextFastUMLSProcessor.xml : lines #50 & #172.  This will narrow 
the lookup window and may increase precision, but (in my experience) reduces 
recall.

3.   Allow the rough identification of overlapping spans.  The default 
dictionary will often identify text like “metastatic colorectal carcinoma” when 
that text actually does not exist anywhere in UMLS.  It basically ignores 
“colorectal” and gives the whole span the CUI for “metastatic carcinoma”.  In 
this case it is arguably a good thing.  In many others it is arguably not so 
much.  There is a class ... lookup2.ae.OverlapJCasTermAnnotator.java that will 
do the same thing.  You can create a new desc/[ae]/*Annotator.xml or just 
change the annotator implementation class named in 
desc/[ae]/UmlsLookupAnnotator.xml : line #25.  I will check in a new desc xml 
(sorry; thought I had) because there are 2 parameters unique to 
OverlapJCasTermAnnotator.

4.   You can play with the OverlapJCasTermAnnotator parameters 
“consecutiveSkips” and “totalTokenSkips”.  These control just how lenient you 
want the overlap tagging to be.

5.   Create a new dictionary database.  There is a (bit messy) 
DictionaryTool in sandbox that will let you dump whatever you do or do not want 
from UMLS into a database.  It will also help you clean up or select stored 
entries as well.  There is a lot of garbage in the default dictionary database: 
repeated terms with caps/no caps (“Cancer”, “cancer”), text with metadata 
(“cancer [finding]”) and text that just clutters (“PhenX: entry for cancer”, 
“1”, “2”).  The fast lookup database should have most of the Snomed and RxNorm 
terms (and synonyms) of interest, but you could always make a new database that 
is much more inclusive.
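
To make the overlap idea in items 3 and 4 concrete, here is a minimal Python sketch of lenient matching with two skip limits. It is not the OverlapJCasTermAnnotator code: the function name and the exact meaning given to the two limits (maximum skipped text tokens in a row, and maximum skipped text tokens overall) are assumptions made for illustration only.

```python
def overlap_match(text_tokens, term_tokens, start, consecutive_skips, total_skips):
    """Try to match term_tokens in order, starting at text_tokens[start].

    Text tokens that are not part of the term may be skipped, subject to two
    hypothetical limits: at most `consecutive_skips` skipped tokens in a row
    and at most `total_skips` skipped tokens overall.
    Returns the inclusive (first, last) token indices of the span, or None.
    """
    if not term_tokens or start >= len(text_tokens):
        return None
    if text_tokens[start] != term_tokens[0]:
        return None
    pos = start + 1            # next text token to examine
    skipped_total = 0
    for term_token in term_tokens[1:]:
        skipped_run = 0
        while pos < len(text_tokens) and text_tokens[pos] != term_token:
            skipped_run += 1
            skipped_total += 1
            if skipped_run > consecutive_skips or skipped_total > total_skips:
                return None
            pos += 1
        if pos >= len(text_tokens):
            return None        # ran out of text before the term was complete
        pos += 1               # consume the matched token
    return (start, pos - 1)


text = ["metastatic", "colorectal", "carcinoma"]
term = ["metastatic", "carcinoma"]

lenient = overlap_match(text, term, 0, consecutive_skips=2, total_skips=4)
strict = overlap_match(text, term, 0, consecutive_skips=0, total_skips=0)
```

With the lenient limits the whole three-token span is tagged with the term for "metastatic carcinoma"; with both limits at zero the same text yields no match, which corresponds to exact-only lookup.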

The main key to the speed of the fast dictionary lookup is actually … the key.  
It is the way that the database is indexed and the lookup by “rare” word 
instead of “first” word.  Everything else can be changed around it and it 
should still be a faster version.
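
A rough Python sketch of what indexing by "rare" word can look like follows. It is an illustration under stated assumptions, not the actual fast lookup implementation: the function names, the tie-breaking rule, and the placeholder CUIs (`CUI_A`, `CUI_B`, `CUI_C`) are all hypothetical.

```python
from collections import Counter, defaultdict


def build_rare_word_index(terms):
    """terms maps term text -> CUI.

    Each term is stored under its rarest token (the token that occurs in the
    fewest dictionary entries), together with that token's position in the
    term. Ties go to the earliest token.
    """
    counts = Counter(tok for text in terms for tok in text.lower().split())
    index = defaultdict(list)
    for text, cui in terms.items():
        tokens = text.lower().split()
        rare_index = min(range(len(tokens)), key=lambda i: (counts[tokens[i]], i))
        index[tokens[rare_index]].append((tokens, rare_index, cui))
    return index


def lookup(text_tokens, index):
    """Return (first_token, last_token, cui) for every exact term match."""
    hits = []
    for i, tok in enumerate(text_tokens):
        # Only terms whose rare word equals this token are ever compared.
        for tokens, rare_index, cui in index.get(tok, ()):
            begin = i - rare_index
            if begin >= 0 and text_tokens[begin:begin + len(tokens)] == tokens:
                hits.append((begin, begin + len(tokens) - 1, cui))
    return hits


dictionary = {"colon cancer": "CUI_A", "lung cancer": "CUI_B", "cancer": "CUI_C"}
rare_index = build_rare_word_index(dictionary)
hits = lookup("patient has lung cancer".split(), rare_index)
```

In this toy dictionary "cancer" is the most common token, so only the single-word entry is filed under it; the two-word entries are filed under "colon" and "lung". A first-word index would behave the same here, but for terms that start with a very common word the rare-word key leaves far fewer candidates to compare per token.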

As for the false positives like “Today”, that will always be a problem until we 
have disambiguation.  The lookup is basically a glorified grep.

Sean

From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Friday, December 19, 2014 10:43 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Also check out stats that Sean ran before releasing the new component on:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
From the evaluation and experience, the new lookup algorithm should be a huge 
improvement in terms of both speed and accuracy.
This is very different than what Bruce mentioned…  I’m sure Sean will chime 
in here.
(The old dictionary lookup is essentially obsolete now- plagued with 
bugs/issues as you mentioned.)
--Pei

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 10:25 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Guergana,

I'm curious as to the number of records that are in your gold standard sets, or if 
your gold standard set was run through a long running cTAKES process. I know at 
some point we fixed a bug in the old dictionary lookup that caused the 
permutations to become corrupted over time. Typically this isn't seen in the 
fi

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
One quick mention:

The cTakes dictionaries are built with UMLS 2011AB.  If the human annotations 
were not done using the same UMLS version then there WILL be differences in CUI 
and semantic group.  I don't have time to go into details, examples, etc.; 
just be aware that every 6 months CUIs are added, removed, deprecated, and 
moved from one TUI to another.

Sean

-Original Message-
From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] 
Sent: Friday, December 19, 2014 1:28 PM
To: dev@ctakes.apache.org
Subject: RE: cTakes Annotation Comparison

Several thoughts:
1. The ShARe corpus annotates only mentions of type Diseases/Disorders and only 
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES 
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, 
Medications and Anatomical Sites. Therefore you will get MANY more annotations 
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.

2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 
of cTAKES on the ShARe corpus if you output only mentions of type 
Disease/Disorder. 

3. Could you send us the list of files you use from ShARe to test? We have the 
corpus and would like to run against it as well.
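
As a sketch of point 2, the following Python snippet filters both annotation sets to one semantic type before computing precision, recall and F1. The tuple layout `(begin, end, cui, type)`, the type label strings and the placeholder CUIs are assumptions for illustration; they are not the ShARe or cTAKES data formats.

```python
def only_type(annotations, wanted_type):
    """Keep (begin, end, cui) for annotations of one semantic type."""
    return {(b, e, cui) for (b, e, cui, sem_type) in annotations
            if sem_type == wanted_type}


def precision_recall_f1(system, gold):
    """Exact-match precision, recall and F1 over two sets of annotations."""
    system, gold = set(system), set(gold)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


system_annotations = [
    (0, 5, "CUI_A", "Disease_Disorder"),
    (10, 15, "CUI_B", "Disease_Disorder"),
    (20, 25, "CUI_C", "Disease_Disorder"),
    (30, 35, "CUI_D", "Procedure"),   # ignored: the gold set has no such type
]
gold_annotations = [
    (0, 5, "CUI_A", "Disease_Disorder"),
    (40, 45, "CUI_E", "Disease_Disorder"),
]

p, r, f1 = precision_recall_f1(
    only_type(system_annotations, "Disease_Disorder"),
    only_type(gold_annotations, "Disease_Disorder"),
)
```

Without the type filter, every Procedure, Medication or Sign/Symptom mention from the system would count as a false positive against a gold standard that never annotated those types.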

Hope this makes sense...
--Guergana

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 1:16 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Our analysis against the human adjudicated gold standard from this SHARE corpus 
is using a simple check to see if the cTakes output included the annotation 
specified by the gold standard. The initial results I reported were for exact 
matches of CUI and text span.  Only exact matches were counted.

It looks like if we also count as matches cTakes annotations with a matching 
CUI and a text span that overlaps the gold standard text span then the matches 
increase to 224 matching annotations for the FastUMLS pipeline and 2319 for the 
old pipeline.
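
The two counting rules described above (exact CUI and span, versus matching CUI with an overlapping span) can be sketched as follows. This is not the script used for the reported numbers; the tuple layout `(begin, end, cui)`, the end-exclusive offsets and the placeholder CUIs are assumptions for illustration.

```python
def spans_overlap(a_begin, a_end, b_begin, b_end):
    """True if two character spans share at least one position (end exclusive)."""
    return a_begin < b_end and b_begin < a_end


def count_matches(gold, system, allow_overlap=False):
    """Count gold annotations that have a matching system annotation.

    A match always requires the same CUI. The span must be identical, or,
    when allow_overlap is True, merely overlap the gold span.
    """
    matched = 0
    for g_begin, g_end, g_cui in gold:
        for s_begin, s_end, s_cui in system:
            if s_cui != g_cui:
                continue
            exact = (s_begin, s_end) == (g_begin, g_end)
            if exact or (allow_overlap and spans_overlap(g_begin, g_end, s_begin, s_end)):
                matched += 1
                break          # count each gold annotation at most once
    return matched


gold = [(0, 10, "CUI_A"), (20, 30, "CUI_B")]
system = [(0, 10, "CUI_A"), (22, 30, "CUI_B"), (40, 45, "CUI_C")]

exact_count = count_matches(gold, system)
overlap_count = count_matches(gold, system, allow_overlap=True)
```

Here the second system annotation has the right CUI but starts two characters late, so it counts only under the overlap rule. The third has no gold counterpart and does not affect either count.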

The question was also asked about annotations in the cTakes output that were 
not in the human adjudicated gold standard. The answer is yes, there were a lot 
of additional annotations made by cTakes that don't appear to be in the gold 
standard. We haven't analyzed that yet, but it looks like the gold standard we 
are using may only have Disease_Disorder annotations.



IMAT Solutions
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy < 
timothy.mil...@childrens.harvard.edu> wrote:
>
> Thanks Kim,
> This sounds interesting though I don't totally understand it. Are you 
> saying that extraction performance for a given note depends on which 
> order the note was in the processing queue? If so that's pretty bad! 
> If you (or anyone else who understands this issue) has a concrete 
> example I think that might help me understand what the problem is/was.
>
> Even though, as Pei mentioned, we are going to try moving the 
> community to the faster dictionary, I would like to understand better 
> just to help myself avoid issues of this type going forward (and 
> verify the new dictionary doesn't use similar logic).
>
> Also, when we finish annotating the sample notes, might we use that as 
> a point of comparison for the two dictionaries? That would get around 
> the issue that not everyone has access to the datasets we used for 
> validation and others are likely not able to share theirs either. And 
> maybe we can replicate the notes if we want to simulate the scenario 
> Kim is talking about with thousands or more notes.
>
> Tim
>
>
> On 12/19/2014 10:24 AM, Kim Ebert wrote:
> Guergana,
>
> I'm curious to the number of records that are in your gold standard 
> sets, or if your gold standard set was run through a long running cTAKES 
> process.
> I know at some point we fixed a bug in the old dictionary lookup that 
> caused the permutations to become corrupted over time. Typically this 
> isn't seen in the first few records, but over time as patterns are 
> used the permutations would become corrupted. This caused documents 
> that were fed through cTAKES more than once to have fewer codes 
> returned than the first time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation 
> would be corrupted to be 1,2,3,4. It would no longer be possible to 
> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
> fix in after the cTAKES 3.2.0 release. 
> https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine 
> eventually only having a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or 
> so documents.
>
> We discovered this issue when we made cTAKES have consistent output of 
> cod

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
I’m bringing it up in case the Human Annotations were done using a different 
version.

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:40 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I don't think that would be an issue since both the rare word lookup and the 
first word lookup are using UMLS 2011AB. Or is the rare word lookup using a 
different dictionary?

I would expect roughly similar results between the two when it comes to 
differences between UMLS versions.

IMAT Solutions <http://imatsolutions.com>
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Our human annotators on Share used 2012AB.  I mention it because when I have 
done manual spot-checks between human and system annotations I had 
head-scratchers that ended up being differences in the UMLS version.  I first 
noticed these discrepancies before I had started working on the fast lookup 
(that is to say: when working with the default lookup).

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:40 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I don't think that would be an issue since both the rare word lookup and the 
first word lookup are using UMLS 2011AB. Or is the rare word lookup using a 
different dictionary?

I would expect roughly similar results between the two when it comes to 
differences between UMLS versions.

IMAT Solutions <http://imatsolutions.com>
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Hi Bruce,

I'm not sure how there would be fewer matches with the overlap processor.  
There should be all of the matches from the non-overlap processor plus those 
from the overlap.  Decreasing from 215 to 211 is strange.  Have you done any 
manual spot checks on this?  It is really bizarre that you'd only have two 
matches per document (100 docs?).  

Thanks,
Sean

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 3:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I tried the configuration changes you mentioned in your earlier email.

The results are as follows:

Total Annotations found: 12,161 (default configuration found 8,284)

If counting exact span matches, this run only matched 211 (default 
configuration matched 215).

If counting overlapping spans, this run only matched 220 (default configuration 
matched 224)

Bruce



IMAT Solutions
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei wrote:
>
>  Kim,
>
> Maintenance, not bugs/issues, is the factor in forging ahead.
>
> They are 2 components that do the same thing with the same goal.  (As 
> Sean mentioned, one should be able to configure the new code base to 
> replicate the old algorithm if required; it’s just a simpler and 
> cleaner code base.  If this is not the case or if there are issues, we 
> should fix it and move forward.)
>
> We can keep the old component around for as long as needed, but it’s 
> likely going to have limited support…
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
> *Sent:* Friday, December 19, 2014 1:47 PM
> *To:* Chen, Pei; dev@ctakes.apache.org
>
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Pei,
>
> I don't think bugs/issues should be part of determining if one 
> algorithm vs the other is superior. Obviously, it is worth mentioning 
> the bugs, but if the fast lookup method has worse precision and recall 
> but better performance, vs the slower but more accurate first word 
> lookup algorithm, then time should be invested in fixing those bugs 
> and resolving those weird issues.
>
> Now I'm not saying which one is superior in this case, as the data 
> will end up speaking for itself one way or the other; but as of right 
> now, I'm not convinced that the old dictionary lookup is obsolete, 
> and I'm not sure the community is convinced either.
>
>
>
> IMAT Solutions
>
> *Kim Ebert*
> Software Engineer
> Office: 801.669.7342
> kim.eb...@imatsolutions.com
>
> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>
> Also check out stats that Sean ran before releasing the new component on:
>
>
> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>
> From the evaluation and experience, the new lookup algorithm should be 
> a huge improvement in terms of both speed and accuracy.
>
> This is very different than what Bruce mentioned…  I’m sure Sean will 
> chime in here.
>
> (The old dictionary lookup is essentially obsolete now- plagued with 
> bugs/issues as you mentioned.)
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
> *Sent:* Friday, December 19, 2014 10:25 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Guergana,
>
> I'm curious to the number of records that are in your gold standard 
> sets, or if your gold standard set was run through a long running cTAKES 
> process.
> I know at some point we fixed a bug in the old dictionary lookup that 
> caused the permutations to become corrupted over time. Typically this 
> isn't seen in the first few records, but over time as patterns are 
> used the permutations would become corrupted. This caused documents 
> that were fed through cTAKES more than once to have fewer codes 
> returned than the first time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation 
> would be corrupted to be 1,2,3,4. It would no longer be possible to 
> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
> fix in after the cTAKES 3.2.0 release. 
> https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine 
> eventually only having a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or 
> so documents.
>
> We discovered this issue when we made cTAKES have consistent output of 
> codes in our system.
>
>
>
> IMAT Solutions
>
> *Kim Ebert*
> Software Engineer
> Office: 801.669.7342
> kim.eb...@imatsolutions.com
>
> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>
> We are doing a similar kind of evaluation and will report the results.
>
>
>
> Before we released the Fast lookup, 

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Hi Bruce,
> Correction -- So far, I did steps 1 and 2 of Sean's email.

No problem.  Aside from recreating the database, those two steps have the 
greatest impact.  But before you change anything else, please do some manual 
spot checks.  I have never seen a case where the lookup would be so horribly 
inaccurate.

Thanks

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 3:29 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Correction -- So far, I did steps 1 and 2 of Sean's email.


IMAT Solutions
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen < 
bruce.tiet...@perfectsearchcorp.com> wrote:
>
> Sean,
>
> I tried the configuration changes you mentioned in your earlier email.
>
> The results are as follows:
>
> Total Annotations found: 12,161 (default configuration found 8,284)
>
> If counting exact span matches, this run only matched 211 (default 
> configuration matched 215).
>
> If counting overlapping spans, this run only matched 220 (default 
> configuration matched 224)
>
> Bruce
>
>
>
> IMAT Solutions
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei < 
> pei.c...@childrens.harvard.edu> wrote:
>>
>>  Kim,
>>
>> Maintenance, not bugs/issues, is the factor in forging ahead.
>>
>> They are 2 components that do the same thing with the same goal.  (As 
>> Sean mentioned, one should be able to configure the new code base to 
>> replicate the old algorithm if required; it’s just a simpler and 
>> cleaner code base.  If this is not the case or if there are issues, 
>> we should fix it and move forward.)
>>
>> We can keep the old component around for as long as needed, but it’s 
>> likely going to have limited support…
>>
>> --Pei
>>
>>
>>
>> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
>> *Sent:* Friday, December 19, 2014 1:47 PM
>> *To:* Chen, Pei; dev@ctakes.apache.org
>>
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> Pei,
>>
>> I don't think bugs/issues should be part of determining if one 
>> algorithm vs the other is superior. Obviously, it is worth mentioning 
>> the bugs, but if the fast lookup method has worse precision and 
>> recall but better performance, vs the slower but more accurate first 
>> word lookup algorithm, then time should be invested in fixing those 
>> bugs and resolving those weird issues.
>>
>> Now I'm not saying which one is superior in this case, as the data 
>> will end up speaking for itself one way or the other; but as of right 
>> now, I'm not convinced that the old dictionary lookup is obsolete, 
>> and I'm not sure the community is convinced either.
>>
>>
>>
>> IMAT Solutions
>>
>> *Kim Ebert*
>> Software Engineer
>> Office: 801.669.7342
>> kim.eb...@imatsolutions.com
>>
>> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>>
>> Also check out stats that Sean ran before releasing the new component on:
>>
>>
>> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>>
>> From the evaluation and experience, the new lookup algorithm should 
>> be a huge improvement in terms of both speed and accuracy.
>>
>> This is very different than what Bruce mentioned…  I’m sure Sean will 
>> chime in here.
>>
>> (The old dictionary lookup is essentially obsolete now- plagued with 
>> bugs/issues as you mentioned.)
>>
>> --Pei
>>
>>
>>
>> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>> *Sent:* Friday, December 19, 2014 10:25 AM
>> *To:* dev@ctakes.apache.org
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> Guergana,
>>
>> I'm curious about the number of records that are in your gold standard 
>> sets, or whether your gold standard set was run through a long running 
>> cTAKES process.
>> I know at some point we fixed a bug in the old dictionary lookup that 
>> caused the permutations to become corrupted over time. Typically this 
>> isn't seen in the first few records, but over time as patterns are 
>> used the permutations would become corrupted. This caused documents 
>> that were fed through cTAKES more than once to have fewer codes 
>> returned than the first time.
>>
>> For example, if a permutation of 4,2,3,1 was found, the permutation 
>> would be corrupted to be 1,2,3,4. It would no longer be possible to 
>> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
>> fix in after the cTAKES 3.2.0 release.
>> https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the 
>> corpus size, I could see the permutation engine eventually having only 
>> a single permutation of 1,2,3,4.
>>
>> Typically though, this isn't very easily detected in the first 100 or 
>> 

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Sorry, I meant “Do some spot checks on the validity”.  In other words, when 
your script reports that a cui and/or span is missing, manually look at the 
data and see if it really is.  Just open up one .xmi in the CVD and see what it 
looks like.

Thanks,
Sean

From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 3:37 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

My original results were using a newly downloaded cTakes 3.2.1 with the 
separately downloaded resources copied in. There were no changes to any of the 
configuration files.
As far as this last run, I modified the UMLSLookupAnnotator.xml and 
AggregatePlaintextFastUMLSProcessor.xml.  I've attached the modified ones I 
used (but they may not get through the mailing list).



Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:27 PM, Finan, Sean 
<sean.fi...@childrens.harvard.edu> wrote:
Hi Bruce,

I'm not sure how there would be fewer matches with the overlap processor.  
There should be all of the matches from the non-overlap processor plus those 
from the overlap.  Decreasing from 215 to 211 is strange.  Have you done any 
manual spot checks on this?  It is really bizarre that you'd only have two 
matches per document (100 docs?).

Thanks,
Sean

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 3:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I tried the configuration changes you mentioned in your earlier email.

The results are as follows:

Total Annotations found: 12,161 (default configuration found 8,284)

If counting exact span matches, this run only matched 211 (default 
configuration matched 215).

If counting overlapping spans, this run only matched 220 (default configuration 
matched 224)

Bruce



Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei 
<pei.c...@childrens.harvard.edu> wrote:
>
>  Kim,
>
> Maintenance, not bugs/issues, is the factor in forging ahead.
>
> They are 2 components that do the same thing with the same goal (As
> Sean mentioned, one should be able to configure the new code base to
> replicate the old algorithm if required- it’s just a simpler and
> cleaner code base.  If this is not the case or if there are issues, we
> should fix it and move forward.).
>
> We can keep the old component around for as long as needed, but it’s
> likely going to have limited support…
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
> *Sent:* Friday, December 19, 2014 1:47 PM
> *To:* Chen, Pei; dev@ctakes.apache.org
>
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Pei,
>
> I don't think bugs/issues should be part of determining if one
> algorithm vs the other is superior. Obviously, it is worth mentioning
> the bugs, but if the fast lookup method has worse precision and recall
> but better performance, vs the slower but more accurate first word
> lookup algorithm, then time should be invested in fixing those bugs
> and resolving those weird issues.
>
> Now I'm not saying which one is superior in this case, as the data
> will end up speaking for itself one way or the other; but as of right
> now, I'm not convinced that the old dictionary lookup is obsolete
> yet, and I'm not sure the community is convinced yet either.
>
>
>
> [image: IMAT Solutions] <http://imatsolutions.com>
>
> *Kim Ebert*
> Software Engineer
> [image: Office:]801.669.7342
> kim.eb...@imatsolutions.com
>
> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>
> Also check out stats that Sean ran before releasing the new component on:
>
>
> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>
> From the evaluation and experience, the new lookup algorithm should be
> a huge improvement in terms of both speed and accuracy.
>
> This is very different than what Bruce mentioned…  I’m sure Sean will
> chime in here.
>
> (The old dictionary lookup is essentially obsolete now- plagued with
> bugs/iss

RE: cTakes Annotation Comparison --- (^:

2014-12-19 Thread Finan, Sean
Apologies accepted.  I'm really glad that you found the problem.

So what you are saying is (just to be very very clear to everybody reading this 
thread):

>FastUMLSProcessor found 2795 matches (2,842 including overlaps)
While
> UMLSProcessor found 2632 matches (2,735 including overlaps)

--- So recall is BETTER in the fast lookup

And...
>FastUMLSProcessor found 30,716 annotations
While
>UMLSProcessor found 31,598 annotations

--- So precision is also looking BETTER in the fast lookup
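
For anyone who wants to check those two conclusions, the ratios can be worked out from Bruce's counts.  The sketch below is only rough arithmetic: it assumes every annotation a processor produced counts against precision (the gold set may not cover every annotation type), and the class and method names are made up for illustration.

```java
// Rough precision/recall arithmetic from the counts reported in this thread.
// Assumption: every annotation a processor found counts against precision.
final class LookupScoreSketch {

   static double recall( final int matches, final int goldTotal ) {
      return (double) matches / goldTotal;
   }

   static double precision( final int matches, final int found ) {
      return (double) matches / found;
   }

   static double f1( final double precision, final double recall ) {
      return 2 * precision * recall / ( precision + recall );
   }

   static void report( final String name, final int matches, final int found, final int gold ) {
      final double p = precision( matches, found );
      final double r = recall( matches, gold );
      System.out.printf( "%-18s precision=%.4f recall=%.4f f1=%.4f%n", name, p, r, f1( p, r ) );
   }

   public static void main( final String[] args ) {
      final int gold = 4591;   // gold standard annotations
      // exact-span matches and total annotations found, per processor
      report( "UMLSProcessor", 2632, 31598, gold );
      report( "FastUMLSProcessor", 2795, 30716, gold );
   }
}
```

With exact matches that works out to a recall of about 0.57 for the default lookup versus about 0.61 for the fast lookup, and a precision of about 0.083 versus about 0.091.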

Now maybe there will be a little more buy-in for the fast lookup.

Cheers,
Sean


-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 5:05 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

My apologies to Sean and everyone,

I am happy to report that I found a bug in our analysis tools that was missing 
the last FSArray entry for any FSArray list.

With the bug fixed, the results look MUCH better.

UMLSProcessor found 31,598 annotations
FastUMLSProcessor found 30,716 annotations

There were 23,522 annotations that were exact matches between the two.

When comparing with the gold standard annotations (4591 annotations):

UMLSProcessor found 2632 matches (2,735 including overlaps)
FastUMLSProcessor found 2795 matches (2,842 including overlaps)






Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:49 PM, Bruce Tietjen < 
bruce.tiet...@perfectsearchcorp.com> wrote:
>
> I'll do that -- there is always a possibility of bugs in the analysis 
> tool.
>
>
>  [image: IMAT Solutions] <http://imatsolutions.com>  Bruce Tietjen 
> Senior Software Engineer
> [image: Mobile:] 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Fri, Dec 19, 2014 at 1:39 PM, Finan, Sean < 
> sean.fi...@childrens.harvard.edu> wrote:
>>
>>  Sorry, I meant “Do some spot checks on the validity”.  In other 
>> words, when your script reports that a cui and/or span is missing, 
>> manually look at the data and see if it really is.  Just open up one 
>> .xmi in the CVD and see what it looks like.
>>
>>
>>
>> Thanks,
>>
>> Sean
>>
>>
>>
>> *From:* Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
>> *Sent:* Friday, December 19, 2014 3:37 PM
>> *To:* dev@ctakes.apache.org
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> My original results were using a newly downloaded cTakes 3.2.1 with 
>> the separately downloaded resources copied in. There were no changes 
>> to any of the configuration files.
>>
>> As far as this last run, I modified the UMLSLookupAnnotator.xml and 
>> AggregatePlaintextFastUMLSProcessor.xml.  I've attached the modified 
>> ones I used (but they may not get through the mailing list).
>>
>>
>>
>>
>>
>>
>> [image: Image removed by sender. IMAT Solutions] 
>> <http://imatsolutions.com>
>>
>> *Bruce Tietjen*
>> Senior Software Engineer
>> [image: Image removed by sender. Mobile:]801.634.1547 
>> bruce.tiet...@imatsolutions.com
>>
>>
>>
>> On Fri, Dec 19, 2014 at 1:27 PM, Finan, Sean < 
>> sean.fi...@childrens.harvard.edu> wrote:
>>
>> Hi Bruce,
>>
>> I'm not sure how there would be fewer matches with the overlap 
>> processor.  There should be all of the matches from the non-overlap 
>> processor plus those from the overlap.  Decreasing from 215 to 211 is 
>> strange.  Have you done any manual spot checks on this?  It is really 
>> bizarre that you'd only have two matches per document (100 docs?).
>>
>> Thanks,
>> Sean
>>
>> -Original Message-
>> From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
>> Sent: Friday, December 19, 2014 3:23 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: cTakes Annotation Comparison
>>
>> Sean,
>>
>> I tried the configuration changes you mentioned in your earlier email.
>>
>> The results are as follows:
>>
>> Total Annotations found: 12,161 (default configuration found 8,284)
>>
>> If counting exact span matches, this run only matched 211 (default 
>> configuration matched 215).
>>
>> If counting overlapping spans, this run only matched 220 (default 
>> configuration matched 224)
>>
>> Bruce
>>
>>
>>
>>  [image: IMAT Solutions] <http://imatsolutions.com>  Bruce Tietjen 
>> Senior Software Engineer
>> [image: Mobile:] 801.634.1547

RE: Using cTakes programmatically

2014-12-29 Thread Finan, Sean
Hi Maite Meseure,

Check the cTakes User guide on UMLS setup:

https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide#cTAKES3.2UserInstallGuide-(Recommended)AddUMLSaccessrights

which (in part) points you towards obtaining a license to use the NIH UMLS 
dictionary:

https://uts.nlm.nih.gov/license.html



Sean


From: Maite Meseure Hugues [meseure.ma...@gmail.com]
Sent: Monday, December 29, 2014 4:17 PM
To: dev@ctakes.apache.org
Subject: Using cTakes programmatically

Dear all,

I am writing to ask how I can simply add
cTAKES packages to my java code to get the same output as the XML output
from the CPE (using clinical-pipeline/ test_plaintext.xml as descriptor).
I've explored and tested the cTakes example ( using
ClinicalPipelineFactory.getDefaultPipeline() ) but I've got this error
message:

[...] https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: maitemeseure

Exception in thread "main"
org.apache.uima.resource.ResourceInitializationException: Initialization of
annotator class
"org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator"
failed.  (Descriptor: )
Thanks a lot for your time.

Best regards

--
--
 Maïté Meseure Hugues


RE: Question about CPE/ descriptor and xml file.

2015-01-05 Thread Finan, Sean
Go through the error that you got, and look for a message like:

Failed to initilize.  Invalid UMLS License

and

Error: Invalid UMLS License.  A UMLS License is required to use the UMLS 
dictionary lookup. 
Error: You may request one at: https://uts.nlm.nih.gov/license.html 
Please verify your UMLS license settings in the 
DictionaryLookupAnnotatorUMLS.xml configuration.

If you see that message, you see a possible solution.  If you have a umls 
username and password, make sure that they are set correctly for the cTakes run.
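
When running from your own Java code, one way to check this is to look for the credentials before the pipeline is created.  The sketch below assumes they are passed as the JVM system properties ctakes.umlsuser and ctakes.umlspw (the names I believe the install guide uses with -D; please verify against your cTakes version).  The class and method names are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class UmlsCredentialCheck {

   // Assumed property names; confirm against the install guide for your cTAKES version.
   static final String USER_KEY = "ctakes.umlsuser";
   static final String PASS_KEY = "ctakes.umlspw";

   /** Returns the credential property names that are unset or blank. */
   static List<String> missingKeys( final Properties properties ) {
      final List<String> missing = new ArrayList<>();
      for ( final String key : new String[]{ USER_KEY, PASS_KEY } ) {
         final String value = properties.getProperty( key );
         if ( value == null || value.trim().isEmpty() ) {
            missing.add( key );
         }
      }
      return missing;
   }

   public static void main( final String[] args ) {
      final List<String> missing = missingKeys( System.getProperties() );
      if ( missing.isEmpty() ) {
         System.out.println( "UMLS credentials are set; create the pipeline next." );
      } else {
         System.err.println( "Set these before creating the pipeline: " + missing );
      }
   }
}
```

This only tells you whether the values are present, not whether the UMLS server accepts them.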

If you don't see that message, check 
resources/org/apache/ctakes/dictionary/lookup/umls2011ab/umls and see if it 
contains a rather large .data file.  If not, then go through the process 
detailed at http://ctakes.apache.org/downloads.cgi in the section entitled 
"Resources".

If you have the .data file, then let us know and we'll try to push forward.

Sean


-Original Message-
From: Maite Meseure Hugues [mailto:mmhug...@medmergent.com] 
Sent: Monday, January 05, 2015 9:33 AM
To: dev@ctakes.apache.org
Subject: Question about CPE/ descriptor and xml file.

Hello everyone,


I am a new user of cTakes and I would like to integrate it in my code to run it 
programmatically.

I followed the example in the cTakes package but I have an error message 
regarding the descriptor:


[...] 03 Jan 2015 13:39:33  INFO UmlsDictionaryLookupAnnotator - Using 
ctakes.umlsaddr: https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: 
maitemeseure

Exception in thread "main" 
org.apache.uima.resource.ResourceInitializationException: Initialization of 
annotator class 
"org.apache.ctakes.dictionary.lookup.ae.UmlsDictionaryLookupAnnotator" failed.  
(Descriptor: )

Do you know how I can fix that? My goal is to get as output the same XML file 
as the CPE produces.
Thanks a lot for your time.

Best regards,

Maite Meseure



RE: Negex

2015-01-05 Thread Finan, Sean
I don't know.  I'm comparing what I think is the 2009 negex trigger set 
https://code.google.com/p/negex/source/browse/trunk/GeneralNegEx.Java.v.1.2.05092009/negex_triggers.txt

with the cTakes trigger set in 
org.apache.ctakes.core.fsm.machine.NegationFSM.java and it looks like the 
cTakes set is missing some 2009 negex trigger words, such as "exhibit".

Anyway, you can read 
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+NE+Contexts for 
info on adding triggers to the cTakes version.

Sean

-Original Message-
From: John Green [mailto:john.travis.gr...@gmail.com] 
Sent: Monday, January 05, 2015 2:03 PM
To: dev@ctakes.apache.org
Cc: dev@ctakes.apache.org
Subject: Re: Negex

Thanks Ma'am for the input!


So to clarify: ctakes added additional trigger words to the list published 
originally? (This is an unrelated question to the negex vs ml thread last 
month).




Best,

John
—
Sent from Mailbox

On Mon, Jan 5, 2015 at 12:58 PM, Green, John  wrote:

> Hi all - Does anyone know off the top of their head if the negex 
> trigger rules included in the original 2009 python script were added 
> to when it was implemented in ctakes?
> Thanks,
> John


RE: Negex

2015-01-05 Thread Finan, Sean
> Adding triggers requires modifying a text file - much simpler than changing 
> code and compiling.

+1

Thanks Vijay!

-Original Message-
From: vijay garla [mailto:vnga...@gmail.com] 
Sent: Monday, January 05, 2015 3:13 PM
To: dev@ctakes.apache.org
Subject: Re: Negex

I think the original ctakes negation AE is in the spirit of Negex, but it is 
not Negex.  AFAICT the ctakes negation AE
* requires that triggers are single tokens
* does not support conjunctions (e.g. however, nevertheless) or post-negation 
triggers (e.g. free, was ruled out)
* is based on a FSM, which is different from the way Negex rules work
* requires modifying and recompiling code to add negation triggers

We ported the Negex algorithm 1-1 to UIMA to address these issues.  See the 
ctakes-ytex-uima\desc\analysis_engine\NegexAnnotator.xml and the corresponding 
trigger file
(ctakes-ytex-res\src\main\resources\org\apache\ctakes\ytex\negex\negex_triggers.txt)

Adding triggers requires modifying a text file - much simpler than changing 
code and compiling.

-vj

On Mon, Jan 5, 2015 at 8:30 PM, Finan, Sean < sean.fi...@childrens.harvard.edu> 
wrote:

> I don't know.  I'm comparing what I think is the 2009 negex trigger 
> set 
> https://code.google.com/p/negex/source/browse/trunk/GeneralNegEx.Java.v.1.2.05092009/negex_triggers.txt
>
> with the cTakes trigger set in
> org.apache.ctakes.core.fsm.machine.NegationFSM.java and it looks like 
> the cTakes set is missing some 2009 negex trigger words, such as "exhibit".
>
> Anyway, you can read
> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+NE+Contexts
> for info on adding triggers to the cTakes version.
>
> Sean
>
> -Original Message-
> From: John Green [mailto:john.travis.gr...@gmail.com]
> Sent: Monday, January 05, 2015 2:03 PM
> To: dev@ctakes.apache.org
> Cc: dev@ctakes.apache.org
> Subject: Re: Negex
>
> Thanks Ma'am for the input!
>
>
> So to clarify: ctakes added additional trigger words to the list 
> published originally? (This is an unrelated question to the negex vs 
> ml thread last month).
>
>
>
>
> Best,
>
> John
> —
> Sent from Mailbox
>
> On Mon, Jan 5, 2015 at 12:58 PM, Green, John  wrote:
>
> > Hi all - Does anyone know off the top of their head if the negex 
> > trigger rules included in the original 2009 python script were added 
> > to when it was implemented in ctakes?
> > Thanks,
> > John
>


RE: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison

2015-01-09 Thread Finan, Sean
Hi James,
Great question.  In truth, you may need to run a few times to find out.  Doing 
that with a full pipeline would be tedious, but there is a descriptor in 
clinical-pipeline named CuisOnlyPlaintextUMLSProcessor.xml that will only 
obtain Umls cuis.  It runs ~50,000 notes per hour on my laptop as-is, so I 
suggest that you test with that ae.  It has lvg commented out by default (for 
speed).  Adding lvg will increase the runtime, but it also will (as you know) 
find a few additional terms.   You can try a few configurations without it and 
then the best option with it.  If you want to test the default dictionary 
lookup then you can certainly swap the referenced lookup xmls.

Changes to the fast dictionary configuration are made in two places:
1.  The main descriptor ...-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
2.  The resource (dictionary) configuration file 
resources/.../fast/cTakesHsql..xml

A few suggestions, in order of impact:
1.  I am guessing that the annotations in clef are human annotated with 
longest-length spans only.  In other words, "colon cancer" instead of  "colon 
cancer" and "cancer".  To best approximate this style of annotation, edit the 
cTakesHsql.xml in the section  and change the selected 
implementation.  By default it is DefaultTermConsumer (go figure), but you will 
want to use the commented-out PrecisionTermConsumer.  As the above cTakesHsql 
comment indicates " DefaultTermConsumer will persist all spans.
   PrecisionTermConsumer will only persist only the longest overlapping span of 
any semantic group."  Doing this should increase precision, and depending upon 
how "good" the annotations are it should not greatly change recall.

2. Just for kicks, try using SemanticCleanupTermConsumer.  It may slightly 
increase precision, but it also may decrease recall.  Hopefully it doesn't do 
much at all (PrecisionTermConsumer and proper semantic typing in the dictionary 
should suffice without this term consumer).

3. Especially for task 2 (acronyms & abbreviations), you should try a run with 
minimumSpan in UmlsLookupAnnotator.xml set to 2.   This changes 
the minimum allowable span of a term.  The default is 3 to increase precision 
on acronyms & abbreviations, but decreasing to 2 may improve recall on the 
same.   The dictionary is not built with anything below 2 characters.
4.  On that note (character length), if task 1 does not include acronyms & 
abbreviations, then you can try increasing the minimum span length above 3 and 
see if there is a good increase in precision without a significant decrease in 
recall.

5.  Try a few runs with overlapping spans in addition to exact matches.  To do 
this use the OverlapJCasTermAnnotator instead of the DefaultJCasTermAnnotator 
annotator implementation.  DefaultJCasTermAnnotator is specified in 
UmlsLookupAnnotator.xml  but I will check in a descriptor for overlap matching. 
 There are additional parameters for that option, but I'll email  them after I 
checkin.

6.  By default the new lookup uses Sentence as the lookup window.  I did this 
for three reasons: 1. Not all terms are within Noun Phrases, 2. Some Noun Phrases 
overlapped, causing repeated lookups (in my 3.0 candidate trials), and 3. Not 
all cTakes Noun Phrases are accurate.  Because the lookup is fast, using a full 
Sentence for lookup doesn't seem to hurt much.  However, you can always switch 
it back to see if precision is increased enough to warrant the decrease in 
recall.  This is changed in UmlsLookupAnnotator.xml
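
To make suggestion 1 a little more concrete, here is a small sketch of the "longest span only" idea.  It is not the PrecisionTermConsumer code; it is a simplified illustration that drops a hit only when a longer hit of the same semantic group fully contains it, and the Hit class, the group names and the character offsets are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class LongestSpanSketch {

   /** Hypothetical stand-in for a dictionary hit: character offsets plus a semantic group. */
   static final class Hit {
      final int begin;
      final int end;
      final String group;

      Hit( final int begin, final int end, final String group ) {
         this.begin = begin;
         this.end = end;
         this.group = group;
      }

      int length() {
         return end - begin;
      }

      boolean contains( final Hit other ) {
         return begin <= other.begin && other.end <= end;
      }
   }

   /** Keeps a hit unless a strictly longer hit of the same group contains it. */
   static List<Hit> keepLongest( final List<Hit> hits ) {
      final List<Hit> kept = new ArrayList<>();
      for ( final Hit candidate : hits ) {
         boolean covered = false;
         for ( final Hit other : hits ) {
            if ( other != candidate && other.group.equals( candidate.group )
                 && other.contains( candidate ) && other.length() > candidate.length() ) {
               covered = true;
               break;
            }
         }
         if ( !covered ) {
            kept.add( candidate );
         }
      }
      return kept;
   }

   public static void main( final String[] args ) {
      final List<Hit> hits = new ArrayList<>();
      hits.add( new Hit( 0, 12, "DISORDER" ) );   // "colon cancer"
      hits.add( new Hit( 6, 12, "DISORDER" ) );   // "cancer", inside the longer disorder span
      hits.add( new Hit( 0, 5, "ANATOMY" ) );     // "colon", a different semantic group
      for ( final Hit hit : keepLongest( hits ) ) {
         System.out.println( hit.group + " " + hit.begin + "-" + hit.end );
      }
   }
}
```

In this example "cancer" is dropped because "colon cancer" covers it within the same group, while "colon" survives because it belongs to a different group.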

I have run my own tests with the various setups, but I don't want to adversely 
influence what you run just in case the trends with the share/clef annotations 
differ.

Sean

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, January 09, 2015 3:57 PM
To: 'dev@ctakes.apache.org'
Subject: dictionary lookup config for best F1 measure [was RE: cTakes 
Annotation Comparison

Sean (or others), 

Of the various configuration options described below, which values/choices 
would you recommend for best F1 measure for something like the shared clef 2013 
task?
https://sites.google.com/site/shareclefehealth/

I'm looking for something that doesn't have to be the best speed-wise, but that 
is the recommended for optimizing F1 measure.

Regards,
James 

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Friday, December 19, 2014 11:55 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Well, I guess that it is time for me to speak up …

I must say that I’m happy that people are showing interest in the fast lookup.  
I am also happy (sort of) that some concerns are being raised – and that there 
is now community participation in my little toy.  I  have some concerns about 
what people are reporting.  This does not coincide wit

RE: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison : Span Overlap addendum

2015-01-09 Thread Finan, Sean
Hi James,

I've checked in a descriptor for the UmlsOverlapLookupAnnotator in fast/desc/ . 
 I also checked in a modification for the CuisOnlyPlaintextUMLSProcessor.xml 
with the Overlap annotator commented out as an option:

  
 
 
 
 
  

As an example of its difference from the Default, I ran the example colon 
cancer document from thyme and it finds the following:
"blood with stool" > C1321898: blood in stool
"polyps, all adenomatous" > C0206677: adenomatous polyps
"lesions in his liver" > C0577053: lesion of liver
"PAST MEDICAL/SURGICAL HISTORY" > C0262926: medical history , C0455458: past 
medical history
"MEDICAL/SURGICAL HISTORY" > C0262926: medical history
"tonsils and adenoids" > C0580788: tonsil and adenoid structure ; this is also 
found without overlap, but overlap finds it a second time
"torn left Achilles tendon" > C0263970: rupture of Achilles tendon
"ankle scar on left" > C0230448: structure of left ankle *
"prostate, no masses palpable" > C0577252: prostate palpable
"cancer of the cecum" > C0153437: malignant neoplasm of cecum  ; this is also 
found without overlap, but overlap finds it a second time
"complications of anesthesia" > C0392008: complication of anesthesia  ; this is 
also found without overlap, but overlap finds it a second time

* One important item is that the overlap annotator understands discontiguous 
spans.  There is, in fact, a ...lookup2.textspan.MultiTextSpan class.  So, for 
items such as "ankle scar on left" the annotator is actually annotating only 
"ankle ... left" but it has to be stored in the cas as one big happy albeit 
underspecified span.

I think that I mentioned in the previous email that the Overlap annotator has a 
couple of extra parameters.  They are called "totalTokenSkips" and 
"consecutiveTokenSkips".  The names are pretty self-explanatory; the algorithm 
will allow up to totalTokenSkips tokens to be skipped in all, consecutive or 
not, as long as no single run of consecutive skipped tokens is longer than 
consecutiveTokenSkips.  For instance, total=4 and consecutive=2 (the defaults) will 
match "this kinda sorta should maybe hopefully match" with "this should match". 
 This is pretty lenient, but seems to work in my tests.  "this kinda-sorta 
should ..." will not match ... though maybe '-' should be a special case.  Let 
me know what you think.
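
For illustration, the two skip limits can be sketched as below.  This is not the code in the overlap annotator, just a minimal stand-alone version of the rule as described above; it assumes the text is already tokenized, that tokens must appear in dictionary order, and that a hyphen becomes its own token.  The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TokenSkipMatchSketch {

   /** True if the term tokens occur in order in the text, within both skip limits. */
   static boolean matches( final List<String> text, final List<String> term,
                           final int totalTokenSkips, final int consecutiveTokenSkips ) {
      if ( term.isEmpty() ) {
         return false;
      }
      // The first term token anchors the match; tokens before it are not skips.
      for ( int start = 0; start < text.size(); start++ ) {
         if ( text.get( start ).equals( term.get( 0 ) )
              && matchRest( text, start + 1, term, 1, totalTokenSkips, consecutiveTokenSkips ) ) {
            return true;
         }
      }
      return false;
   }

   // Try to place term[termIndex] at textIndex, or after skipping a few text tokens.
   private static boolean matchRest( final List<String> text, final int textIndex,
                                     final List<String> term, final int termIndex,
                                     final int skipsLeft, final int consecutiveTokenSkips ) {
      if ( termIndex == term.size() ) {
         return true;
      }
      final int maxSkip = Math.min( skipsLeft, consecutiveTokenSkips );
      for ( int skip = 0; skip <= maxSkip && textIndex + skip < text.size(); skip++ ) {
         if ( text.get( textIndex + skip ).equals( term.get( termIndex ) )
              && matchRest( text, textIndex + skip + 1, term, termIndex + 1,
                            skipsLeft - skip, consecutiveTokenSkips ) ) {
            return true;
         }
      }
      return false;
   }

   public static void main( final String[] args ) {
      final List<String> term = Arrays.asList( "this", "should", "match" );
      final List<String> loose
            = Arrays.asList( "this kinda sorta should maybe hopefully match".split( " " ) );
      // Assumes the tokenizer splits "kinda-sorta" into three tokens.
      final List<String> hyphen = Arrays.asList( "this", "kinda", "-", "sorta", "should", "match" );
      System.out.println( matches( loose, term, 4, 2 ) );
      System.out.println( matches( hyphen, term, 4, 2 ) );
   }
}
```

With total=4 and consecutive=2 the first call prints true and the second prints false, because the hyphenated version needs three consecutive skips.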

Enjoy,
Sean


-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Friday, January 09, 2015 3:57 PM
To: 'dev@ctakes.apache.org'
Subject: dictionary lookup config for best F1 measure [was RE: cTakes 
Annotation Comparison

Sean (or others), 

Of the various configuration options described below, which values/choices 
would you recommend for best F1 measure for something like the shared clef 2013 
task?
https://sites.google.com/site/shareclefehealth/

I'm looking for something that doesn't have to be the best speed-wise, but that 
is the recommended for optimizing F1 measure.

Regards,
James 

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Friday, December 19, 2014 11:55 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Well, I guess that it is time for me to speak up …

I must say that I’m happy that people are showing interest in the fast lookup.  
I am also happy (sort of) that some concerns are being raised – and that there 
is now community participation in my little toy.  I  have some concerns about 
what people are reporting.  This does not coincide with what I have seen at 
all.  Yesterday I started (without knowing this thread existed) testing a 
bare-minimum pipeline for CUI extraction.  It is just the stripped-down 
Aggregate with only: segment, tokens, sentences, POS, and the fast lookup.  The 
people at Children’s wanted to know how fast we could get.  1,196 notes in 
under 90 seconds on my laptop with over 210,000 annotations, which is 175/note. 
 After reading the thread I decided to run the fast lookup with several 
configurations.  I also ran the default for 10.5 hours.  I am comparing the 
annotations from each system against the human annotations that we have, and I 
will let everybody know what I find – for better or worse.

The fast lookup does not (out-of-box) do the exact same thing as the default.  
Some things can be configured to make it more closely approximate the default 
dictionary.

1.Set the minimum annotation span length to 2 (default is 3).  This is 
in desc/[ae]/UmlsLookupAnnotator.xml : line #78.  The annotator should then 
pick up text like “CT” and improve recall, but it will hurt precision.

2.   Set the Lookup Window to LookupWindowAnnotation.  This 

RE: Question about fast pipeline

2015-01-12 Thread Finan, Sean
Hi Michelle,

Did your error have only "Could not find . as absolute" or did it also have 
"or in ... or in ..."?  If you see " ... or in ... " then this is a new issue.  
If you don't, then you should update your source.  If you need to run the 
release binary then let me know and I can work out sending you a patch.

Sean

-Original Message-
From: michelle1919c...@gmail.com [mailto:michelle1919c...@gmail.com] On Behalf 
Of Michelle Chen
Sent: Monday, January 12, 2015 4:30 PM
To: dev@ctakes.apache.org
Subject: Question about fast pipeline

I'm fairly new to using cTAKES and was trying to figure out how to use the fast 
pipeline in my Java code.

I was able to run the code in Clinical Pipeline Factory with both the default 
Pipeline and the fast Pipeline. However, when I tried incorporating 
getDefaultPipeline, I get these errors:

"ERROR JdbcConnectionFactory - Could not find 
resources/org/apache/ctakes/dictionary/lookup/fast/ctakessnorx/ctakessnorx.script
as absolute.

ERROR JdbcRareWordDictionary - Could not Connect to Dictionary UmlsHsqlRareWord"

Has anyone else encountered this before? Is there something that I should be 
linking that I forgot to reference? Or do I just need to update the resources 
folder again?

Thank you.

---
Michelle Chen


RE: Question about the pipeline

2015-02-02 Thread Finan, Sean
Hi Tol (and Maite),

I'm not entirely certain that I understand the question, but here is an attempt 
to help.  If I'm oversimplifying then I apologize.

I think that ExampleAggregatePipeline is intended to represent a very simple 
single-note pipeline and that custom code could be produced by using it as an 
example.

If you want to process texts in a directory, you can find with a web search 
plenty of ways to list files in a directory and read text from files.  
org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader might be what you 
used in the CPE, and you can certainly peruse the code and take what you need.  
Or, if you decide to write a simple diy,  here is one possibility:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collection;

// The class name is only a placeholder; use whatever fits your project.
public class DirectoryNoteRunner {

   // Collect every readable regular file in the given directory.
   static public Collection<File> getFilesInDir( final File directory ) {
      final Collection<File> fileList = new ArrayList<>();
      final File[] files = directory.listFiles();
      if ( files == null ) {
         System.err.println( "please check the directory " + directory.getAbsolutePath() );
         System.exit( 1 );
      }
      for ( final File file : files ) {
         // skip subdirectories; reading one as text would fail
         if ( file.isFile() && file.canRead() ) {
            fileList.add( file );
         }
      }
      return fileList;
   }

   // Read the whole file as text (platform default charset); or handle the IOException herein.
   static public String getTextInFile( final File file ) throws IOException {
      return new String( Files.readAllBytes( file.toPath() ) );
   }

   static public void main( final String... args ) throws IOException {
      if ( args.length == 0 || args[0].isEmpty() ) {
         System.out.println( "Enter a directory path" );
         System.exit( 0 );
      }
      final Collection<File> files = getFilesInDir( new File( args[0] ) );
      for ( final File file : files ) {
         final String note = getTextInFile( file );
         // ---  Insert here code a' la ExampleAggregatePipeline  ---
         // ---  swap out the writer in ExampleAggregatePipeline with the CasIOUtil method (below)  ---
      }
   }
}

I must admit that I have never directly used it, but there is an xmi file 
writing method in org.apache.uima.fit.util.CasIOUtil named writeXmi( JCas jCas, 
File file ).  You could give this a try and see if it produces the type of 
output that you want.  The same utility class has a writeXCas(..) method.


If the above has absolutely nothing to do with your needs then please send me a 
bulleted list of items, example workflow, etc. and I'll see if I can be of 
service.

Oh, and I wrote the above code freehand and haven't run it against a real 
cTakes pipeline, so there might still be a typo or missed exception or 
something.  Or it may not work (in which case I'll throw in a little more 
effort).

Sean


-Original Message-
From: Tol O. [mailto:tol...@gmail.com] 
Sent: Monday, February 02, 2015 6:56 PM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Maite Meseure Hugues  writes:

> 
> Hello all,
> 
> Thank you for your preceding answers.
> I have a few questions regarding the pipeline example to run cTakes 
> programmatically.
> I am running ExampleAggregatePipeline.java with 
> ExampleHelloWorldAnnotator but I would like to know how I can change 
> it to run my data, as the CPE where we can choose the directory of our data.
> My second question is about the xml output generated with the CPE, can 
> I get the same xml output in using the example pipeline? and How?
> Thanks for your time.


I would like to ask the same question. After successfully setting up CTAKES 
following the Developers Guide I would also like to use a modified 
ExampleAggregatePipeline to output a CAS file identical to the output obtained 
by the CPE or the CVD when following the Users Guide.

This would be a great help for developers as a starting class to be able to 
programmatically obtain an annotated file based on a plaintext or XML input, 
same as through the two GUIs.

Right now I am reading through the Component Use Guide to replicate the CPE or 
the CVD tutorial with the test input, but it is a bit overwhelming.

Any pointers or suggestions would be really appreciated.

Tol O.



RE: Question about the pipeline

2015-02-03 Thread Finan, Sean
Hi Maite,

RunCPE is a good find, and if it fits your bill then you should use it.  But it 
(if you mean the yTex class) doesn't take input and output directories from the 
command line.  It does take the path to a CPE.xml file.  There is a cTakes 
(non-yTex) equivalent named CmdLineCpeRunner.  Either one of them should print 
a usage if you run it without arguments.  As the CmdLineCpeRunner indicates, 
you can create a cpe .xml file with the cpe gui.  Basically, start the cpe gui, 
select your input (reader), output (writer) and pipeline (ae) in the gui and 
then save the cpe descriptor (via the menubar).  You can exit the gui and run 
either one of the cmd line utilities with the path to that cpe .xml descriptor 
as the argument.  Please note: sometimes you have to explicitly type ".xml" in 
the filename when saving with the cpe gui.  If you run with the cpe gui and 
then exit it should automatically ask you if you want to save the cpe .xml 
descriptor.  Anyway, once you have the .xml file you can always edit the input 
and output paths in that file to change your run parameters.  
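
Editing the paths by hand works for a single run; for repeated runs the same 
edit can be scripted.  The following is only a sketch: it treats the saved 
descriptor as plain text and swaps one directory string for another.  The class 
name, method name and arguments are made up for illustration and are not part 
of cTakes.  It does not parse the xml, so the old path must appear in the file 
exactly as given, the new path must not contain characters that need xml 
escaping, and UTF-8 is an assumption about the file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of cTakes.
public class CpeDescriptorEditor {

   // Copies a saved CPE descriptor, replacing every occurrence of oldDir with newDir.
   // Returns the number of replacements so the caller can detect a path that was not found.
   public static int copyWithNewPath( final Path template, final Path target,
                                      final String oldDir, final String newDir ) throws IOException {
      if ( oldDir.isEmpty() ) {
         throw new IllegalArgumentException( "oldDir must not be empty" );
      }
      final String xml = new String( Files.readAllBytes( template ), StandardCharsets.UTF_8 );
      // Count the occurrences first, purely for reporting
      int count = 0;
      int index = xml.indexOf( oldDir );
      while ( index >= 0 ) {
         count++;
         index = xml.indexOf( oldDir, index + oldDir.length() );
      }
      Files.write( target, xml.replace( oldDir, newDir ).getBytes( StandardCharsets.UTF_8 ) );
      return count;
   }

   public static void main( final String... args ) throws IOException {
      if ( args.length != 4 ) {
         System.out.println( "Usage: CpeDescriptorEditor template.xml target.xml oldDir newDir" );
         return;
      }
      final int replaced = copyWithNewPath( Paths.get( args[ 0 ] ), Paths.get( args[ 1 ] ),
                                            args[ 2 ], args[ 3 ] );
      System.out.println( "Replaced " + replaced + " occurrence(s)" );
   }
}
```

A return value of 0 means the old path was not found, in which case the copy is 
identical to the template.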

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Tuesday, February 03, 2015 9:01 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Thanks a lot Sean for your detailed reply. I've also found RunCPE.java, which 
lets you pass the input and output directories as arguments in the environment 
and does the same job as the CPE GUI, at least in Eclipse. I haven't managed to 
run it via the command line yet.

On Mon, Feb 2, 2015 at 7:12 PM, Finan, Sean < sean.fi...@childrens.harvard.edu> 
wrote:

> Hi Tol (and Maite),
>
> I'm not entirely certain that I understand the question, but here is 
> an attempt to help.  If I'm oversimplifying then I apologize.
>
> I think that ExampleAggregatePipeline is intended to represent a very 
> simple single-note pipeline and that custom code could be produced by 
> using it as an example.
>
> If you want to process texts in a directory, you can find with a web 
> search plenty of ways to list files in a directory and read text from 
> files.  org.apache.ctakes.core.cr.FilesInDirectoryCollectionReader 
> might be what you used in the CPE, and you can certainly peruse the 
> code and take what you need.  Or, if you decide to write a simple diy,  
> here is one
> possibility:
>
> Static public Collection getFilesInDir( final File directory ) {
>final Collection fileList = new ArrayList<>();
>final File[] fileList = directory.listFiles();
>if ( fileList == null ) {
>   System.err.println( "please check the directory " +
> directory.getAbsolutePath() );
>   System.exit( 1 );
>}
> for ( final File file : directory.listFiles() ) {
> if ( file.canRead() ) {
> fileList.add( file );
> }
> }
> }
>
> Static public String getTextInFile( final File file ) throws IOException
> {   -- or handle ioE herein
>final Path nioPath = file.toPath();
>return new String( Files.readAllBytes( nioPath ) ); }
>
> Static public void main( String ... args ) {
>If ( args[0].isEmpty() ) {
>   System.out.println( "Enter a directory path" );
>   System.exit( 0 );
>}
>Final Collection files = getFilesInDir( new File( args[0] );
>For ( File file : files ) {
>   Final String note = getTextInFile( file );
>   ---  Insert here code a' la ExampleAggregatePipeline  ---
>   ---  swap out the writer in ExampleAggregatePipeline with 
> CasIOUtil method (below)  ---
>}
> }
>
> I must admit that I have never directly used it, but there is an xmi 
> file writing method in org.apache.uima.fit.util.CasIOUtil named 
> writeXmi( JCas jCas, File file ).  You could give this a try and see 
> if it produces the type of output that you want.  The same utility 
> class has a writeXCas(..) method.
>
>
> If the above has absolutely nothing to do with your needs then please 
> send me a bulleted list of items, example workflow, etc. and I'll see 
> if I can be of service.
>
> Oh, and I wrote the above code freehand, so MS Outlook is adding 
> capital letters, etc.  If you cut and paste you'll need to change that 
> - plus I haven't run/compiled, so there might be a typo or missed 
> exception or something.  Or it may not work (in which case I'll throw 
> in a little more effort).
>
> Sean
>
>

RE: git mirrors out of sync?

2015-02-03 Thread Finan, Sean
Hi Steve,

You are right (confirming your finding) - it looks like the first is a no-show 
and the second is somebody's personal upload to github (not git.apache.org) 
from 3 years ago.  The jira claims that the item was closed (fixed), but if you 
go to http://git.apache.org/ cTakes is not listed.  Was it there previous to 6 
days ago but removed? 

If nobody responds with a "here's yer problem" by end of week then I ( or you, 
if you like) will ping infra.  I know that at least one contributor (not me) 
prefers to use git.

Sean

-Original Message-
From: Steven Bethard [mailto:steven.beth...@gmail.com] 
Sent: Tuesday, February 03, 2015 3:38 PM
To: dev@ctakes.apache.org
Subject: git mirrors out of sync?

The git mirrors for cTAKES seem to be either broken 
(http://git.apache.org/ctakes.git) or embarrassingly out of sync 
(https://github.com/apache/ctakes). Is this a known issue? I looked at the 
INFRA ticket [1], but didn't see anything that suggested that there should be 
a problem.

Steve

[1] https://issues.apache.org/jira/browse/INFRA-8553


RE: Question about the pipeline

2015-02-03 Thread Finan, Sean
Hi Tol,

> Essentially, I want to know how to set up the cTAKES objects correctly into a 
> pipeline in a Java program, so that medical texts are annotated, like the 
> GUI is doing. I would really appreciate any hints on how to accomplish this.

Looking at your embedded code I think that you've got the general idea of how 
to do everything.  Perhaps you are wondering how to create custom pipelines by 
programmatically adding chosen processors?

Tim Miller made a great addition (imo) to the cTakes code with the 
org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory class.  Perhaps you 
can take a look at that and see if it helps?

Sean

-Original Message-
From: Tol O. [mailto:tol...@gmail.com] 
Sent: Tuesday, February 03, 2015 7:35 PM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

>

Sean,

Thank you for the detailed reply.

As you mentioned, I had to revert the capital letters from your Outlook, and 
also, if somebody else wants to use the code and cannot get it to run: the 
getFilesInDir method needs to return the populated Collection fileList, 
the variable final File[] fileList and its usage should be renamed to something 
else (as the variable name already exists) and the main method needs to throw 
an IOException.

I think these were all the changes I made so that the txt files from a folder 
are added to the collection, many thanks again.
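
For anyone following along, the file-reading helpers with those changes 
applied might look like the sketch below.  It assumes Java 7 or later.  The 
class name DirectoryNoteReader is made up for illustration, UTF-8 is an 
assumption about the note files, and two details differ from the original on 
purpose: directories are skipped, and an unreadable directory throws an 
IOException instead of calling System.exit.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical class name, used only for this sketch.
public class DirectoryNoteReader {

   // Returns the readable regular files directly inside the given directory.
   public static Collection<File> getFilesInDir( final File directory ) throws IOException {
      final File[] files = directory.listFiles();
      if ( files == null ) {
         // listFiles() returns null when the path is not a directory or cannot be read
         throw new IOException( "please check the directory " + directory.getAbsolutePath() );
      }
      final Collection<File> fileList = new ArrayList<>();
      for ( final File file : files ) {
         if ( file.isFile() && file.canRead() ) {
            fileList.add( file );
         }
      }
      return fileList;
   }

   // Reads the whole file as text.
   public static String getTextInFile( final File file ) throws IOException {
      return new String( Files.readAllBytes( file.toPath() ), StandardCharsets.UTF_8 );
   }

   public static void main( final String... args ) throws IOException {
      if ( args.length == 0 || args[ 0 ].isEmpty() ) {
         System.out.println( "Enter a directory path" );
         return;
      }
      for ( final File file : getFilesInDir( new File( args[ 0 ] ) ) ) {
         final String note = getTextInFile( file );
         // Pipeline code a la ExampleAggregatePipeline would go here.
         System.out.println( file.getName() + ": " + note.length() + " characters" );
      }
   }
}
```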

What I am looking to do is also what the description in 
"ExampleAggregatePipeline" says, "running a pipeline programatically w/o uima 
xml descriptor xml files". As I understand it, this is accomplished by the 
uimaFIT classes, so that AEs can be defined in Java, added to a pipeline and 
run directly.

The uimaFIT page gives a nice Java snippet that uses uimaFIT in a similar way 
as the cTAKES example, I pasted the few Java lines below at [1]. 
http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.introduction

I would like to use cTAKES in my own Java programs such that, just like the 
ExampleAggregatePipeline, uimaFIT can be used to create and run a cTAKES 
pipeline to annotate medical texts. Then I could also output the result in CAS 
files, just like the CVD GUI is doing. This would let me add or modify my own 
AnalysisEngines directly.

Essentially, I want to know how to set up the cTAKES objects correctly into a 
pipeline in a Java program, so that medical texts are annotated, like the GUI 
is doing. I would really appreciate any hints on how to accomplish this. 

Following your code example to read the files the outlined idea is:

for ( File file : files ) {
  final String note = getTextInFile( file );
  JCas jCas = JCasFactory.createJCas();
  jCas.setDocumentText(note);

  // 1. create the AnalysisEngines for tokenizer, tagger and other cTAKES 
  //    components etc. to annotate medical texts
  // 2. runPipeline(jCas, ...);
}

[1]
The code snippet from uimaFIT:

JCas jCas = JCasFactory.createJCas();

jCas.setDocumentText("some text");

AnalysisEngine tokenizer = createEngine(MyTokenizer.class);

AnalysisEngine tagger = createEngine(MyTagger.class);

runPipeline(jCas, tokenizer, tagger);

for(Token token : iterate(jCas, Token.class)){
System.out.println(token.getTag());
}

Tol O.



RE: Question about the pipeline

2015-02-05 Thread Finan, Sean
Hi Maite,

Without more information I can't venture a guess as to a cause of the error.  
If RunCPE works then why not use that?  They are practically identical.

Sean

From: Maite Meseure Hugues [meseure.ma...@gmail.com]
Sent: Thursday, February 05, 2015 8:51 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

I see. In my case, I am using the CPE descriptor saved from the GUI for
CmdLineCpeRunner, as Sean said. I've selected
AggregatePlaintextProcessor.xml as the AE, but I get this error:

"Couldn't initialize processing engine.

  Initialization of CAS Processor with name "AggregatePlaintextProcessor"
failed. "

Meanwhile, RunCPE.java works properly with the same descriptor in Eclipse.
Does anyone have an idea?

On Wed, Feb 4, 2015 at 12:56 PM, Lingren, Todd 
wrote:

> Hi Maite,
> For each patient in my list, I create a new FilesToFiles CPE xml using
> some sed commands on the template original.
>
> Specifically, here's the command line argument (I'm on linux).
>
> CTAKES_HOME=...
> java -cp $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/
> -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms512M -Xmx2048M
> CmdLineCpeRunner FilesToFiles_patient_cui.xml > outputfile.txt
>
> I don't think it matters, but I'm using the cTAKES 3.1.0 version.
>
>
> Todd Lingren
> Biomedical Informatics
> Cincinnati Children’s Hospital
> todd.ling...@cchmc.org
> 513-803-9032
>
>
> -Original Message-
> From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com]
> Sent: Wednesday, February 04, 2015 12:59 PM
> To: dev@ctakes.apache.org
> Subject: Re: Question about the pipeline
>
> Interesting, thank you Todd. How do you use CmdLineCpeRunner, basically?
> Because I tested on the command line with:
>
> java org.apache.ctakes.core.cpe.CmdLineCpeRunner [path-to-my-cpe.xml]
>
> but here is what I got:
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/uima/util/InvalidXMLException
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
>
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
>
> at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
>
> at java.lang.Class.getMethod0(Class.java:3010)
>
> at java.lang.Class.getMethod(Class.java:1776)
>
> at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
>
> at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
>
> ...
>
> On Wed, Feb 4, 2015 at 8:32 AM, Lingren, Todd 
> wrote:
>
> > Sean and Maite,
> > FWIW, I use CmdLineCpeRunner frequently. I employ it with a bash
> > script to automatically create a new xml file based on the subfolder
> > names contained in the target directory. So in our HPC, it spawns a
> > new job for each subfolder (which may have between 5 and 2500 notes).
> >
> > Todd Lingren
> > Biomedical Informatics
> > Cincinnati Children’s Hospital
> > todd.ling...@cchmc.org
> > 513-803-9032
> >
> >

RE: Question about the pipeline

2015-02-05 Thread Finan, Sean
Hi Maite,

If you can run the cpe gui using the script in bin/ , try specifying the 
descriptor for that:

runctakesCPE -desc pathToXml

If that runs then try copying the runctakesCPE to something like runctakesCLI 
and change the last line of the file to call CmdLineCpeRunner instead of 
CpmFrame.

Sean

p.s. check the last line of runctakesCPE script that you are using and make 
sure that it passes arguments: %* for Windows or $@ for unix/linux

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Thursday, February 05, 2015 9:42 AM
To: dev@ctakes.apache.org
Subject: Re: Question about the pipeline

Yes, it does, but only in Eclipse, not on the command line, even though I am in 
the right directory. I probably have to look at the classpath in more detail.
Thanks for your replies.
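
One way to tell a classpath problem from a descriptor problem is to ask the JVM 
whether it can see the classes involved.  The sketch below does only that.  The 
class name ClasspathCheck is made up for illustration; the two default class 
names are the ones that appear earlier in this thread.  It should be launched 
with the same -cp value as the failing command.

```java
// Hypothetical diagnostic, not part of cTakes or UIMA.
public class ClasspathCheck {

   // True if the named class can be located by this class's class loader.
   public static boolean isOnClasspath( final String className ) {
      try {
         // initialize = false: only locate the class, do not run its static initializers
         Class.forName( className, false, ClasspathCheck.class.getClassLoader() );
         return true;
      } catch ( ClassNotFoundException | LinkageError e ) {
         return false;
      }
   }

   public static void main( final String... args ) {
      // With no arguments, check the two classes mentioned in the thread
      final String[] names = args.length > 0 ? args
            : new String[] { "org.apache.uima.util.InvalidXMLException",
                             "org.apache.ctakes.core.cpe.CmdLineCpeRunner" };
      System.out.println( "java.class.path = " + System.getProperty( "java.class.path" ) );
      for ( final String name : names ) {
         System.out.println( name + " : " + ( isOnClasspath( name ) ? "found" : "MISSING" ) );
      }
   }
}
```

If the UIMA class is reported missing, the jars in the cTakes lib directory are 
probably not on the classpath used for the command-line run.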


RE: IntelliJ experience or instructions

2015-02-12 Thread Finan, Sean
Hi Taposh,

Try the process outlined below.  I have screenshots for each step if you want 
them.  If this works (you are the first tester) then we can put it in the web 
documentation.

Sean

Fresh checkout from SVN
===
1. Start IntelliJ IDEA.
2. In the "Quick Start" menu, select "Check out from Version Control" (VCS 
icon).  
This will display a drop-down box.
3. In the drop-down box, select "Subversion".  
This will open a "Checkout from Subversion" dialog.

4. In the "Checkout from Subversion" dialog, click the "+" button in the top 
left to add a new Repository.  
This will open a "New Repository Location" dialog.
5. In the "New Repository Location" dialog, enter the svn checkout location of 
cTakes.
6. Click "Ok".  
This will inspect the repository.
7. Click the "Expand" triangle.  
This should display the directory listing of trunk.
8. Click "Checkout".  
This will open a "Destination Directory" dialog.

9. Enter a local directory in which to keep trunk (your sandbox).
10. Click "Ok".  This will open a "Checkout Options" dialog.  
The default options ("Head", etc.) are fine for most users.
11. Click "Ok".  
This will open a "Working Copy Format" dialog.
12. Select a (version) format and click "Ok".  I use version 1.8, but any 
should be fine.  
This will start the actual checkout and display a progress dialog.  The 
checkout may take a little while.

13. After the checkout has completed, a new dialog will ask you if you'd 
like to open the project.  Click "No".


Import Project from Maven
===
A local repository is needed.  If you do not yet have one, follow any 
instructions on this wiki for checking out cTakes.
1. Start IntelliJ IDEA.
2. In the "Quick Start" menu, select "Import Project".
This will open a "Select File or Directory to Import" dialog.
3. Browse to your local cTakes repository root directory and select the pom.xml 
file.
4. Click "Ok".
This will open an "Import Project from Maven" dialog.
5. Make sure the "Search for projects recursively" box is selected, just in 
case any cTakes modules are not in the pom.
6. Make sure that "Create IntelliJ IDEA modules for aggregator projects" is not 
selected.
If you plan to add a new module, 'disable' a present cTakes module or make other 
changes to the main pom.xml, check the "Import Maven projects automatically" 
box.
See also: http://www.jetbrains.com/idea/webhelp/maven-importing.html
7. Make sure that "Create module groups for multi-module Maven projects" is not 
selected.
8. Make sure that "Keep source and test folders on reimport" is selected.
9. Make sure that "Exclude build directory (%PROJECT_ROOT%/target)" is selected.
10. Make sure that "Use Maven output directories" is selected.
11. Make sure that the "Generated sources folders" option "Detect 
automatically" is selected.
12. For the "Phase to be used for folders update" the default option 
"process-resources" should be fine.
13. For the "Automatically download" options, you may select what you like, but 
be wary that if broken code has been checked in you may need to revert manually.
14. The default "Dependency types" are fine.
15. Click "Environment settings...".
This will open a dialog that can be used to set options about the Maven 
environment.
16. The default maven environment settings should be fine.  If $M2_HOME is not 
set in your environment you may select a "Maven home directory", but it is 
better to set $M2_HOME in your environment.
17. Click "Next".
This will inspect the cTakes Maven settings and search for profiles.  It should 
display a dialog with the possible cTakes profiles.
If you plan to run the UIMA CVD or CPE then select the appropriate profile.  
Neither is necessary.
18. Click "Next".
This will open an "Import" dialog with the current version of cTakes displayed.
19. Click "Next".
This will open a dialog allowing you to select a Java SDK version.
20. Click the "+" button in the top left.
This will display a drop-down box with options for an SDK.
21. Select "JDK".
This will open a dialog to select a Java JDK directory.
22. Navigate to a directory with a JDK and click "Ok".  
This will display a listing of the file paths associated with the selected JDK.
23. Click "Next".

24. Click "Ok/Next???"  Didn't see dialog.
The project will load.  This may take a while.
25. Important: If you are asked about adding any .iml files to svn, click "No".

26. You should now see the full cTakes project structure in IntelliJ.


Compile with Maven
===
1. Open the "Maven Projects" Tool Window using the button on the left side of 
the window.  
If you do not see it, Use the main Menu > View > Tool Windows > Maven Projects.
If you would like to permanently add the button to the UI, use Menu > View > 
Tool Buttons.
See also: http://www.jetbrains.com/idea/webhelp/maven-projects-tool-window.html
2. Select the "Expand" triangle next to "Apache cTAKES" and then its child node 
"Lifecycle".
This will display standard maven goals.

RE: BagOfCuisGenerator.java, same idea for getConceptText()

2015-02-12 Thread Finan, Sean
Try something like the following for output:

   private int extractFeatures( final IdentifiedAnnotation annotation ) {
      // Collect one info string per UMLS concept attached to the annotation
      final Collection<String> umlsInfos = getUmlsInfos( annotation );
      if ( umlsInfos.isEmpty() ) {
         return 0;
      }
      final int begin = annotation.getBegin();
      final int end = annotation.getEnd();
      final String annotationText = annotation.getCoveredText();
      final int polarity = annotation.getPolarity();
      int count = 0;
      for ( String umlsInfo : umlsInfos ) {
         saveAnnotation( annotationText, umlsInfo, polarity, begin, end );
         count++;
      }
      return count;
   }

   // Builds a string of the form CUI_TUI = "preferred text" for each UmlsConcept
   static private Collection<String> getUmlsInfos( final IdentifiedAnnotation identifiedAnnotation ) {
      final FSArray fsArray = identifiedAnnotation.getOntologyConceptArr();
      if ( fsArray == null ) {
         return Collections.emptySet();
      }
      final FeatureStructure[] featureStructures = fsArray.toArray();
      final Set<String> umlsInfos = new HashSet<>( featureStructures.length );
      for ( FeatureStructure featureStructure : featureStructures ) {
         if ( featureStructure instanceof UmlsConcept ) {
            final UmlsConcept umlsConcept = (UmlsConcept) featureStructure;
            String info = umlsConcept.getCui();
            final String tui = umlsConcept.getTui();
            if ( tui != null && !tui.isEmpty() ) {
               info += "_" + tui;
            }
            final String preferredText = umlsConcept.getPreferredText();
            if ( preferredText != null && !preferredText.isEmpty() ) {
               info += " = \"" + preferredText + "\"";
            }
            umlsInfos.add( info );
         }
      }
      return umlsInfos;
   }

   // Writes one line per concept: begin,end then "-" if negated, the umls info and the covered text.
   // _writer and logger are fields of the enclosing class.
   public void saveAnnotation( final String spannedText, final String umlsInfo, final int polarity,
                               final int begin, final int end ) {
      final String text = begin + "," + end + " " + (polarity < 0 ? "-" : " ")
                          + umlsInfo + " " + spannedText;
      if ( _writer == null ) {
         System.out.println( text );
         return;
      }
      try {
         _writer.write( text );
         _writer.newLine();
      } catch ( IOException ioE ) {
         logger.error( ioE.getMessage() );
      }
   }

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Thursday, February 12, 2015 2:46 PM
To: dev@ctakes.apache.org
Subject: BagOfCuisGenerator.java, same idea for getConceptText()

Hi everyone,

I am currently working on BagOfCuisGenerator, and I would like to add the 
concept text to the output.
I 've seen some discussions about getting the original text and UMLS preferred 
text in addition to the cui. Can someone give me pointers to do that?
Thanks in advance for your time.

Maite

--
--
 Maïté Meseure Hugues


RE: BagOfCuisGenerator.java, same idea for getConceptText()

2015-02-12 Thread Finan, Sean
Oh yeah - use the -fast dictionary to get preferred text.  The fastest way to 
get cuis only is with CuisOnlyPlaintextUMLSProcessor.  If you want polarity 
make sure you uncomment the section with PolarityCleartkAnalysisEngine.

Sean


RE: BagOfCuisGenerator.java, same idea for getConceptText()

2015-02-17 Thread Finan, Sean
Hi Maite,

I just checked the log and it looks like you'll need to use a copy of cTakes 
built after 12/08/2014 to get Snomed codes.

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Monday, February 16, 2015 12:19 PM
To: dev@ctakes.apache.org
Subject: Re: BagOfCuisGenerator.java, same idea for getConceptText()

Sean, I have a question: is it because I am using the fast dictionary that I 
don't get snomed-oid or snomed-code? Instead, I get " snomed_oid: null#CTAKES".
Thank you.

Maite

On Fri, Feb 13, 2015 at 1:32 PM, Maite Meseure Hugues < 
meseure.ma...@gmail.com> wrote:

> Thank you for your replies, It's helpful. I was working on 3.2.0 
> version, so it looks like 3.2.1 allows to get the UMLS preferred text.
>
> Maite
>


--
--
 Maïté Meseure Hugues


RE: CTAKES mirroring on github.

2015-02-17 Thread Finan, Sean
Our request is for a read-only mirror.  However, if it ever becomes i/o, I 
don't know if this will have what you want, but see http://git.apache.org/, 
which links to documentation (mostly server setup) at 
http://www.apache.org/dev/git.html and a wiki (check toward the middle and 
bottom for committer info) at https://wiki.apache.org/general/GitAtApache



-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Tuesday, February 17, 2015 12:31 PM
To: dev@ctakes.apache.org
Subject: Re: CTAKES mirroring on github.

Is there any existing resource to help people who want to use git understand 
the right workflow to contribute to ctakes? (i.e. how this interacts with svn 
repos).
Tim


On 02/17/2015 12:23 PM, jay vyas wrote:
> Hi CTakes.  Looks like infra finally got onto the JIRA I made for 
> this a while back.  They are currently working on fixing a couple of 
> minor glitches w/ the mirroring (not showing all commits)... but there 
> is now a mirror for CTakes on github.
>
>
> https://github.com/apache/ctakes
>



RE: Hello cTAKES Mailing List

2015-02-22 Thread Finan, Sean
Hi Raymond,

If you use the dictionary-fast module, there is an entry "feeling bad" with 
cui 557911 and cui 231218.  There are also "feel bad" and "feeling bad 
emotionally".

You will find "horrible present pain" but no other entry with "horrible".   You 
will not find any terms with "awful" and probably many other desired words.  If 
you are really interested in slang "crappy", "lousy", etc. then they are 
definitely not present.

What you can do is create a second dictionary.  There are example custom 
dictionaries in 
-dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dictionary/lookup/fast/example/bsv/
You should look at custom_cui_bsv.bsv if you want to specify term unique id 
codes and term text alone.  If you want to add tui/group codes then look at 
custom_cui_tui_bsv.bsv  - you will probably want to model your dictionary after 
this so that you can tag your terms with tuis for "symptoms".
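
A minimal sketch of producing such a three-column file, assuming the CUI|TUI|Text column order. The CUIs below are placeholders invented for illustration, not real UMLS identifiers, and the helper names are hypothetical; T184 is the UMLS semantic type "Sign or Symptom".

```python
# Hypothetical slang entries: (CUI, TUI, term text).
# The CUIs are made-up placeholders; T184 is the UMLS type "Sign or Symptom".
SLANG_SYMPTOMS = [
    ("C9000001", "T184", "feeling crappy"),
    ("C9000001", "T184", "feeling lousy"),
    ("C9000002", "T184", "feeling awful"),
]


def format_bsv_line(cui, tui, text):
    """Join one entry into a bar-separated line: CUI|TUI|Text."""
    for field in (cui, tui, text):
        if "|" in field:
            # A bar inside a field would be read as an extra column.
            raise ValueError("field contains the separator: %r" % field)
    return "|".join((cui.strip(), tui.strip(), text.strip()))


def to_bsv(entries):
    """Render all entries as the full text of a .bsv file."""
    return "".join(format_bsv_line(*entry) + "\n" for entry in entries)


if __name__ == "__main__":
    # Print the file content; redirect it to your own .bsv file.
    print(to_bsv(SLANG_SYMPTOMS), end="")
```

The printed text can be saved under any file name and referenced from your copy of the dictionary descriptor.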

You will want to imitate sections from the corresponding .xml file in that 
directory.   Make a copy of cTakesHsql.xml (two dirs up) and add lines: 
  
 CustomCuiRareWord
 
org.apache.ctakes.dictionary.lookup2.BsvRareWordDictionary
 

 
  

And

  
 CustomCuiConcept
 
org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory
 

 
  

And

  
 CustomPair
 CustomCuiRareWord
 CustomCuiConcept
  

Then make sure that you point to your custom cTakesHsql.xml in 
dictionary-fast/desc/analysis_engine/UmlsLookupAnnotator.xml (or Overlap 
depending upon your use):

DictionaryDescriptorFile


   
file:org/apache/ctakes/dictionary/lookup/fast/cTakesHsqlYourCopy.xml


You can also skip the UMLS dictionary altogether and just use your custom 
dictionary.

If you do give this a try then let me know  how it goes.  If you need 
additional assistance let me know and I will help the best I can.

Sean


-Original Message-
From: Raymond Li [mailto:ray...@bu.edu] 
Sent: Saturday, February 21, 2015 1:26 PM
To: dev@ctakes.apache.org
Subject: Hello cTAKES Mailing List

Hello, my name is Raymond Li and I am currently working on a team project 
involving cTAKES. The goal of our project is to use cTAKES to analyze 
posts on social media (such as tweets, forum posts, and other publicly available 
data) in order to catch adverse effects of prescribed drugs in real time and 
perform a public service by protecting people from harmful drugs.

Aside from this introduction, I have only one question to ask before proceeding 
with this project: is cTAKES capable of understanding slang words as symptoms? 
For example, if I were to say "I took Crestor and feeling bad", 
is there a way for cTAKES to recognize that Crestor had a negative effect?
My team has not been able to isolate 'bad' as a negative effect as it is not a 
defined medical symptom, but it would be nice to figure out whether such a 
solution exists, or whether we would need to develop our own solution and how 
we could go about doing it.

My team and I would appreciate any comments or assistance regarding our project 
and this current issue. Thank you and have a nice day!

--
Sincerely,

Raymond Li


RE: Hello cTAKES Mailing List

2015-02-23 Thread Finan, Sean
The CHV is a good resource for some things, but before going through the 
motions of porting it to a ctakes format, take a look inside.  

 
-Original Message-
From: Pei Chen [mailto:chen...@apache.org] 
Sent: Monday, February 23, 2015 1:52 PM
To: dev@ctakes.apache.org
Subject: Re: Hello cTAKES Mailing List

Raymond,
Probably a combination of UMLS Consumer Health Vocabulary + Custom Dictionary 
(as Sean described) may work for the use case: "OAC CHV connects informal, 
common words and phrases about health to technical terms used by health care 
professionals. It includes jargon, slang, ambiguous, and misspelled words as 
used by consumers and health care professionals. Due to its nature, OAC CHV 
includes concepts that are not represented by other source vocabularies within 
the Metathesaurus."

[1] http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CHV/
 

On Sun, Feb 22, 2015 at 10:37 AM, Finan, Sean < 
sean.fi...@childrens.harvard.edu> wrote:

> Hi Raymond,
>
> If you use the dictionary-fast module there exists an entry "feeling bad"
> with cui 557911 and cui 231218.  There is also "feel bad" and "feeling 
> bad emotionally"
>
> You will find "horrible present pain" but no other entry with "horrible".
>  You will not find any terms with "awful" and probably many other 
> desired words.  If you are really interested in slang "crappy", 
> "lousy", etc. then they are definitely not present.
>
> What you can do is create a second dictionary.  There are example 
> custom dictionaries in 
> -dictionary-lookup-fast-res/src/main/resources/org/apache/ctakes/dicti
> onary/lookup/fast/example/bsv/ You should look at custom_cui_bsv.bsv 
> if you want to specify term unique id codes and term text alone.  If 
> you want to add tui/group codes then look at custom_cui_tui_bsv.bsv  - 
> you will probably want to model your dictionary after this so that you 
> can tag your terms with tuis for "symptoms".
>
> You will want to imitate sections from the corresponding .xml file in that
> directory.   Make a copy of cTakesHsql.xml (two dirs up) and add lines:
>   
>  CustomCuiRareWord
>
>  
> org.apache.ctakes.dictionary.lookup2.BsvRareWordDictionary
>  
>  value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
>  
>   
>
> And
>
>   
>  CustomCuiConcept
>
>  
> org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory
>  
>  value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
>  
>   
>
> And
>
>   
>  CustomPair
>  CustomCuiRareWord
>  CustomCuiConcept
>   
>
> Then make sure that you point to your custom cTakesHsql.xml in 
> dictionary-fast/desc/analysis_engine/UmlsLookupAnnotator.xml (or 
> Overlap depending upon your use):
>
> DictionaryDescriptorFile
> 
> 
>
>  
> file:org/apache/ctakes/dictionary/lookup/fast/cTakesHsqlYourCopy.xml
> 
>
> You can also skip the UMLS dictionary altogether and just use your 
> custom dictionary.
>
> If you do give this a try then let me know  how it goes.  If you need 
> additional assistance let me know and I will help the best I can.
>
> Sean
>
>
> -Original Message-
> From: Raymond Li [mailto:ray...@bu.edu]
> Sent: Saturday, February 21, 2015 1:26 PM
> To: dev@ctakes.apache.org
> Subject: Hello cTAKES Mailing List
>
> Hello, my name is Raymond Li and I am currently working on a team 
> project involving cTAKES. The goal of our project is to use 
> cTAKES to analyze posts on social media (such as tweets, forum posts, 
> and other publicly available data) in order to catch adverse effects 
> of prescribed drugs in real time and perform a public service by 
> protecting people from harmful drugs.
>
> Aside from this introduction, I have only one question to ask before 
> proceeding with this project: is cTAKES capable of understanding slang 
> words as symptoms? For example, if I were to say "I took Crestor and 
> feeling bad", is there a way for cTAKES to recognize that Crestor had a 
> negative effect?
> My team has not been able to isolate 'bad' as a negative effect as it 
> is not a defined medical symptom, but it would be nice to figure out 
> whether such a solution exists, or whether we would need to develop our 
> own solution and how we could go about doing it.
>
> My team and I would appreciate any comments or assistance regarding 
> our project and this current issue. Thank you and have a nice day!
>
> --
> Sincerely,
>
> Raymond Li
>


URGENT! RE: New Website

2015-02-25 Thread Finan, Sean
Hi all,

It looks like a few people (myself included) are interested in having 
information on people, projects, papers, and applications that use cTAKES on 
the web page.  I have created a form on google that might help us collect this 
and other information.  Please visit 
https://docs.google.com/forms/d/10ryw42aqkIf2ygjNTa_To1OgGDZzDqHizVg__Jxyuws/viewform?usp=send_form
Most of the form is multiple choice, so it only takes a minute or two to 
complete it.  The more information we have the better we can develop and 
promote cTAKES, so this is very important.

Thank you,
Sean

-Original Message-
From: Mohammad Alodadi [mailto:mso1...@gmail.com] 
Sent: Wednesday, February 25, 2015 2:09 AM
To: dev@ctakes.apache.org
Subject: Re: New Website

I like the look of the new website. 
I was thinking that if someone could collect references to all the research 
papers that use cTakes in their methodology on one page, and link to it from the 
use cases page, it would be a great way to see the different uses of cTakes.

Sincerely,

Mohammad Alodadi


> On Feb 24, 2015, at 8:46 PM, taposh.d@kp.org wrote:
> 
> Hi Michelle -
> 
> The site looks nice. 
> Would it be possible to add a link to the source via svn or github? 
> Also, case studies would help potential users. 
> 
> Regards,
> 
> Taposh D. Roy
> Health Data Lead
> Decision Support Team
> 
> Kaiser Permanente
> Program Office
> 1950 Franklin Street, 17th Floor
> Oakland, California 94588
> 510-987-4121 (Office)
> 510-206-1633 (cell)
> 
> 
> 
> 
> 
> 
> 
> 
> From: Michelle Chen 
> To: dev@ctakes.apache.org
> Date: 02/24/2015 04:30 PM
> Subject: New Website
> 
> 
> 
> Hello everyone,
> 
> We are planning on publishing the new website on March 2, 2015. Here 
> is the link to the proposed site:
> http://svn.apache.org/repos/asf/ctakes/site/new/index.html
> (Note: Not all of the pages are fully functional yet, but we figured 
> that this new look is exciting news and wanted feedback.)
> 
> Some ideas for feedback:
> 1. Succinct quotations from users and devs about "How has cTakes 
> helped you?" so that we can populate the "Why cTAKES?" page (with 
> permission to use your name, position, employer, and/or 
> product/project).
> 2. Use cases (with potential screenshots) of cTAKES to populate the 
> "Examples" page of GUI or other use cases. The "Examples" page is in 
> the process of being revamped.
> 3. Mobile feedback: This has not been tested on devices, but what 
> would be needed/useful?
> 4. What is missing from the web page? E.g. FAQs, useful tips. Where 
> are there broken links?
> 5. Anything!
> 
> We welcome any suggestions or code contributions directly to the website 
> itself. We look forward to hearing from everyone. Have a great day.
> 
> 
> Sincerely,
> Michelle Chen
> 



RE: Running cTakes in parallel

2015-02-25 Thread Finan, Sean
Hi Michelle,

When it comes to 
> multiple instances of cTakes in parallel
You can certainly start as many pipelines as you want as separate JVM 
processes, just make sure that you divide your notes among separate batches, 
one batch per process.  Also keep in mind that you don't want to clobber your 
disk or reserve more ram than you've actually got available.
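
One way to divide the notes, sketched under the assumption that each note is a separate file and that every batch will be handed to its own pipeline process. The function name and the round-robin scheme are illustrative choices, not part of cTAKES.

```python
def split_into_batches(note_paths, n_batches):
    """Distribute note paths round-robin into n_batches disjoint lists."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    batches = [[] for _ in range(n_batches)]
    # Sorting first makes the assignment reproducible between runs.
    for index, path in enumerate(sorted(note_paths)):
        batches[index % n_batches].append(path)
    return batches


if __name__ == "__main__":
    # Hypothetical note file names, for illustration only.
    notes = ["note_%03d.txt" % i for i in range(1, 8)]
    for number, batch in enumerate(split_into_batches(notes, 3), start=1):
        print("batch", number, batch)
```

Each resulting list can be copied or linked into its own input directory, one directory per JVM process, so that no note is processed twice.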

We have a UIMA-AS ( http://uima.apache.org/doc-uimaas-what.html ) solution 
that we have used with cTAKES.  It is a small collection of bash scripts that 
set up an environment and control processes.  There are also a couple of .xml 
descriptors and configuration files that UIMA-AS needs.  At some point it should 
be cleaned up a bit and checked in to sandbox, but it has been a very low 
(subgrade) priority for me.  Since you are local you could check out our setup 
in person if you like.

Sean

-Original Message-
From: michelle1919c...@gmail.com [mailto:michelle1919c...@gmail.com] On Behalf 
Of Michelle Chen
Sent: Wednesday, February 25, 2015 2:07 PM
To: dev@ctakes.apache.org
Subject: Running cTakes in parallel

Quick question, is it possible to run multiple instances of cTakes in parallel?

I'm currently using ctakes-clinical-pipeline's fastPipeline() and wanted to run 
it over multiple documents. Would it be okay to create multiple AnalysisEngine 
instances? How have other people done this?

I wasn't sure if this would work because of this Issue 
.
---
Michelle Chen

Massachusetts Institute of Technology
Electrical Engineering and Computer Science B.S. '14, M.Eng. '15


RE: Is it necessary to put UMLS login into files when passing them with -D to the JVM?

2015-03-06 Thread Finan, Sean
Hi Tom,

> I am passing my UMLS login and password on startup as arguments ... 
> "-Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword"
That is fine.  If I understand correctly you are already running this way 
without problem.  The comments in the .xml files should probably be extended to 
include mention of the cmd parameters.
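
For reference, a small sketch of assembling such a launch command. The two -D property names and the SimpleRunCPE class come from the question; the environment variable names, the classpath, and the descriptor path are placeholders that will differ per installation.

```python
import os


def build_cpe_command(user, password, classpath, cpe_descriptor):
    """Return the argument list for launching SimpleRunCPE with UMLS credentials."""
    return [
        "java",
        "-Dctakes.umlsuser=" + user,
        "-Dctakes.umlspw=" + password,
        "-cp", classpath,
        "org.apache.uima.examples.cpe.SimpleRunCPE",
        cpe_descriptor,
    ]


if __name__ == "__main__":
    command = build_cpe_command(
        # Hypothetical environment variables; they keep the password out of files.
        os.environ.get("UMLS_USER", "myusername"),
        os.environ.get("UMLS_PW", "mypassword"),
        "lib/*:desc:resources",        # placeholder classpath
        "desc/my_cpe_descriptor.xml",  # placeholder CPE descriptor
    )
    print(" ".join(command))
```

Passing the list to subprocess.run would start the JVM; printing it first is a cheap way to check the arguments.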


> [I] downloaded [AggregatePlaintextFastUmlsProcessor.xml] from the svn and 
> replaced the old cTAKES 3.2.1 ...
I think that this should be fine.  Java code for each annotator may have 
changed, but I don't think that any class names (by which annotators are 
called) have changed.  The best way to know for certain is to run it, and if 
you haven't seen any problems then I think that you are in good shape.

Sean

-Original Message-
From: Tom Devel [mailto:deve...@gmail.com] 
Sent: Friday, March 06, 2015 3:20 PM
To: dev@ctakes.apache.org
Subject: Is it necessary to put UMLS login into files when passing them with -D 
to the JVM?

Hi,

in AggregatePlaintextFastUMLSProcessor.xml of cTAKES it states that:

[...] Please update DictionaryLookupAnnotatorUMLS.xml file with your UMLS 
username and password.

Similarly, in AggregatePlaintextFastUMLSProcessor.xml from 
https://issues.apache.org/jira/browse/CTAKES-344
 

[...] Please update
resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml file with 
your UMLS username and password

I am passing my UMLS login and password on startup as arguments when starting 
either CVD/CPE or org.apache.uima.examples.cpe.SimpleRunCPE, with 
arguments such as:

"-Dctakes.umlsuser=myusername -Dctakes.umlspw=mypassword"

In such a case, is it still necessary to modify the file(s) above?

Additional question: It seems that the
AggregatePlaintextFastUMLSProcessor.xml from 
https://issues.apache.org/jira/browse/CTAKES-344 
has some nice improvements (using DrugNER and default fast pipeline). I just 
downloaded it from the svn and replaced the old cTAKES 3.2.1 file with this 
one, and it seems to run just fine and cTAKES does annotations. Can one of the 
devs or users tell me whether this manual replacement step is OK and does 
not break anything that I am not aware of?

Many thanks for answers on any of my questions, Tom


RE: Questions about dictionary-lookup and dictionary-lookup-fast

2015-03-10 Thread Finan, Sean
Hi Maite,

> Does anyone know why is it [UmlsDictionaryLookupAnnotator] so slow?
The top 5 reasons (1-3 are 90% of the problem):
1.  The dictionary database is bloated with unwanted entries
2.  The dictionary database indexing is sub-optimal
3.  The second drug lookup with orangebook filtering takes extra time
4.  The matching algorithm does a little more work than is necessary
5.  There is some redundancy

> my interest is to be able to create my own HsqlDb-based dictionary
If you want to build a database using a subset of UMLS, check out the 
Dictionary Tool in Sandbox.  It can build custom hsqldb dictionaries in both 
the new (-fast) and old format using sources, tuis, filters, etc. that you 
specify in plaintext parameter files.  Several types of default setups are 
already available.  It is fully functional, but it has been a work-in-progress 
during my off-hours, so functionality changes and documentation is lacking.  
There is, however, a howto.txt in the dictionarytool/doc/ directory.

*NOTE: if your custom dictionaries are small (~1000 entries?) then it would 
probably be easier to just throw them into a bar-separated value (bsv) file.  
There are examples in the dictionary-fast-res example/bsv/ directory.  
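
Before pointing a descriptor at a hand-made file, it may help to sanity-check it. The sketch below assumes the two- or three-column layout (CUI|Text or CUI|TUI|Text); the function names are hypothetical and this is not the parser cTAKES itself uses.

```python
def parse_bsv_line(line):
    """Split one bar-separated line into (cui, tui, text); tui is None for 2 columns."""
    parts = [part.strip() for part in line.rstrip("\n").split("|")]
    if len(parts) == 2:
        cui, text = parts
        tui = None
    elif len(parts) == 3:
        cui, tui, text = parts
    else:
        raise ValueError("expected 2 or 3 columns, got %d: %r" % (len(parts), line))
    if not cui or not text:
        raise ValueError("empty CUI or text: %r" % line)
    return cui, tui, text


def load_bsv(lines):
    """Parse every non-blank line and return the list of entries."""
    return [parse_bsv_line(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Placeholder CUIs, for illustration only.
    sample = ["C9000001|feeling crappy\n", "\n", "C9000002|T184|feeling awful\n"]
    for entry in load_bsv(sample):
        print(entry)
```

A line with the wrong number of columns raises ValueError, so a malformed file is reported before the pipeline is ever started.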

Sean

-Original Message-
From: Maite Meseure Hugues [mailto:meseure.ma...@gmail.com] 
Sent: Tuesday, March 10, 2015 12:35 PM
To: dev@ctakes.apache.org
Subject: Questions about dictionary-lookup and dictionary-lookup-fast

Hi everyone,

1) I am currently working on BagOfCuisGenerator.java with the analysis engine 
'AggregatePlaintextUMLSProcessor.xml', but that process is very slow at that 
step:

INFO UmlsDictionaryLookupAnnotator - process(JCas)

Does anyone know why it is so slow?

2) I also tried with 'AggregatePlaintextFastUMLSProcessor.xml' and it's 
actually pretty fast, as its name suggests, but my interest is to be able to 
create my own HsqlDb-based dictionary, like we can do with a Lucene index, and 
integrate it in the process. Is this possible with the fast version? Do you have 
any pointers that could allow me to do that?

Thank you very much for your time.

--
--
 Maïté Meseure Hugues

