Re: ctakes model

2013-03-29 Thread Tim Miller
I emailed the author of svmlight and he was very clear that our usage is 
acceptable:


   Hi Tim,

   Using the models trained with my software in the way you describe is fine 
with me for the application you describe. Feel free to include the trained 
models in the distribution under any license you like...

   Cheers
   Thorsten

Tim

On 03/26/2013 10:04 AM, Tim Miller wrote:
We are using a software package called svmlight and a derivative 
called svmlight-tk to train machine learning models in research. If 
the results are good we would like to include the models in cTAKES. We 
thought we should consult legal because this software uses a custom 
license not covered by FAQs, and we just want to be on the safe side. 
The full text of the license is attached.
(Just in case the attachment doesn't work I was able to find it on the 
web here: http://morphix-nlp.berlios.de/manual/node48.html but not at 
the svmlight webpage.)


To be clear, we do not want to redistribute the svmlight or 
svmlight-tk source or binaries or even use maven to download them. 
There is code in clearTK which can read these models and process new 
data and is license compatible. We are only talking about distributing 
models trained with the svmlight and svmlight-tk packages.


Thanks,





best way to include old software

2013-04-01 Thread Tim Miller

Hello,
I want to use some classes from a BSD-licensed software called JaSpell 
(http://jaspell.sourceforge.net/). It looks like it hasn't been 
maintained in a bit (2010). It is not in Maven central, so should I:


1) Include jars in a lib directory and check in to ctakes svn
2) Copy over the source files I need (to allow for possible maintenance)
3) Put into maven myself (don't know how to do this)
4) ?

Tim


Re: cTakes with java web application

2013-04-02 Thread Tim Miller

Giri,
Your use case is probably one that will become more and more common, and 
cTAKES devs do similar things all the time. I think the hangup for new or 
non-dev users (and probably something we need to document better) is 
that cTAKES is built on top of UIMA, so the techniques for running 
pipelines and extracting information are actually UIMA and 
UIMAFit-based, and there is nothing like the traditional javadocs 
explaining a cTAKES API to rely on.


Pei's sample code is basically standard UIMA and UIMAFit code that 
points at cTAKES pipelines. Once that is working, the real cTAKES 
part is basically just understanding the type system so you know how to 
use UIMA API calls to extract the information you need. So maybe better 
documentation of the type system (maybe in javadoc style) is something 
that cTAKES should prioritize.


Tim

On 04/02/2013 10:45 AM, giri vara prasad nambari wrote:

Hi Pei,
Thanks for your time!
This is sort of what I am looking for. I will do some research on 
javadoc to see what I could do with the API.
May I ask you one more question? Isn't ctakes built to accommodate 
these types of requirements (like integrating with other applications)? 
Am I missing something important?
The reason is, I would need to read the output of ctakes and perform 
some other analysis using WEKA. If ctakes is not yet ready for these 
types of requirements I may need to go back and re-evaluate the software 
stack.

Thank you,
Giri


On Tue, Apr 2, 2013 at 10:20 AM, Chen, Pei wrote:


Hi Giri,

I presume, essentially, you’re planning to include the cTAKES
lib(s) (via mvn?) in your existing app and then:

1) Programmatically configure the pipeline

2) Pass in a document(s) to cTAKES for processing

3) Do XYZ with the output from the jCAS using the UIMA APIs (such
as writing to disk or saving it to a db)

It is not quite ready for prime time, but take a peek at the code
below (it uses uimaFIT to do the above):


http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-gui/src/main/java/org/chboston/cnlp/ctakes/gui/service/LauncherService.java

Essentially, it boils down to a few lines of code:

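// Combine the individual engines into a single aggregate analysis engine
AnalysisEngine aggregateAE = AnalysisEngineFactory.createAggregate(
   engines, componentNames, typeSystemDescription, null,
   new SofaMapping[0]);
// Create an empty CAS, set the document text, and run the pipeline on it
JCas jcas = aggregateAE.newJCas();
jcas.setDocumentText(doc.getText());
aggregateAE.process(jcas);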

*From:* giri vara prasad nambari [mailto:girinamb...@gmail.com]
*Sent:* Tuesday, April 02, 2013 10:04 AM
*To:* u...@ctakes.apache.org 
*Subject:* Re: cTakes with java web application

Hi Pei,

Thanks for your time on answering this.

Actually I am not looking for a pre-built web application or GUI.
I was expecting something like "include ctakes jars in my web
application (or, for that matter, any client java program)"
and start using the ctakes API. Is this possible with the ctakes API? If
so, is any sample ctakes client code available?

I am not moving towards any SOA or pre-built GUI.

I would be happy to contribute to the GUI, but first I need to finish
this ctakes integration task in my web application ASAP.

I hope my question is clearer this time.

Thank you,

Giri

On Tue, Apr 2, 2013 at 9:51 AM, Chen, Pei
<pei.c...@childrens.harvard.edu> wrote:

Hi Giri,

Apache cTAKES is mainly written in Java and built on top of the UIMA Framework.

Currently, there isn’t an out-of-the-box web application with cTAKES;
however, there is a GUI in the sandbox area, but it isn’t
quite ready for prime time yet.  Is this something that you might
be interested in contributing to?

http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-gui/

There are also some UIMA options that may point you in the right
direction.

UIMA-AS (If you’re gearing towards a Service Architecture for your
web app.)

http://uima.apache.org/d/uima-as-2.4.0/uima_async_scaleout.html

There is also a Simple Rest service (but runs in-process):

UIMA Simple Server

http://uima.apache.org/sandbox.html#simple-server

Thanks,

Pei

*From:* giri vara prasad nambari [mailto:girinamb...@gmail.com]
*Sent:* Tuesday, April 02, 2013 12:29 AM
*To:* ctakes-u...@incubator.apache.org

*Subject:* Fwd: cTakes with java web application

Hi Community,

I did a lot of googling for sample java code to integrate cTakes into a
web application. Can someone please point me in the right direction?

I would like to use the clinical pipeline with plain text instead of
XML documents.

Any help would be appreciated.

Thank you,

Giri






srl bug with one word sentences

2013-04-08 Thread Tim Miller
I filed a jira (https://issues.apache.org/jira/browse/CTAKES-184) for an 
issue in the srl component of the dependency parser. Here is the issue: 
The SRL component grabs the Dependency Node for each token using the 
token span. When there is a one-word sentence there are two such nodes, 
the "head" node which spans the sentence and the token node. There is a 
quick fix checked in that looks at the "id" field, but it would be 
preferable to fix it upstream so that users of the Dependency types 
don't need to know about this issue.


Here are a few possibilities, any other ideas?
1) Change the span of the head node, maybe to 0 length at the start of 
the sentence?

2) Create a separate type for DependencyTreeHead, perhaps spanless
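
For anyone who has not looked at the quick fix, the idea can be sketched roughly as follows. This is a minimal illustration and not the code checked in for CTAKES-184: the Node class and the tokenNodeFor method are hypothetical stand-ins for the real dependency type, and the convention that the sentence-level head node carries id 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class SrlNodeLookup {
    /** Hypothetical stand-in for a dependency node: a character span plus an id. */
    static final class Node {
        final int begin;
        final int end;
        final int id; // assumption: the sentence-level "head" node has id 0

        Node(int begin, int end, int id) {
            this.begin = begin;
            this.end = end;
            this.id = id;
        }
    }

    /**
     * Returns the node whose span matches the token span, skipping the head
     * node. In a one-word sentence both nodes share the same span, so the
     * span alone cannot tell them apart.
     */
    static Node tokenNodeFor(List<Node> nodes, int begin, int end) {
        for (Node n : nodes) {
            if (n.begin == begin && n.end == end && n.id != 0) {
                return n;
            }
        }
        return null; // no token node covers exactly this span
    }

    public static void main(String[] args) {
        // One-word sentence "Stable" at offsets 0-6: head and token share the span.
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node(0, 6, 0)); // head node spanning the sentence
        nodes.add(new Node(0, 6, 1)); // the token node
        System.out.println(tokenNodeFor(nodes, 0, 6).id);
    }
}
```

Either upstream option would make the id check unnecessary, because the head node would no longer share the token's span (option 1) or its type (option 2).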



--
Tim Miller, PhD
Postdoctoral Research Fellow
Children's Hospital Informatics Program
Boston Children's Hospital and Harvard Medical School
617-919-1223



Re: export runnable jar issue

2013-04-23 Thread Tim Miller

It may actually be a maven issue:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=284928

I'm a little swamped right now, but I'd like to try setting up a 
non-maven project with two source folders:

src/main/java
src/main/resources

like ours, and see if this issue persists.
Tim

On 04/23/2013 11:21 AM, Masanz, James J. wrote:

I'm having the same issue.
I tried just creating a jar instead of a runnable jar and get the same.

Tim, I don't know if it's 1, 2 or 3.

-- James




-Original Message-
From: dev-return-1533-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-1533-Masanz.James=mayo@ctakes.apache.org] On Behalf Of
Miller, Timothy
Sent: Sunday, April 21, 2013 7:23 AM
To: dev@ctakes.apache.org
Subject: export runnable jar issue

I've found it useful to "export runnable jar" to create a single file that
can be run with 'java -jar' on a server/cluster for a given run
configuration. There is an issue that may be an eclipse bug but also may
be a ctakes issue.

The issue is that files under src/main/java source folders export exactly
as expected, while files under src/main/resources folders export with the
"resources" folder prepended. So you will get the following classpaths in
the jar:

org/apache/ctakes/someproject/SomeCode.class
resources/org/apache/ctakes/someproject/some_model.bin

This obviously screws up important classpaths at runtime. I can work around
it, but does anyone have a sense of whether this issue is:
1) Clearly an eclipse issue with nothing we can do except try to notify
someone
2) An eclipse issue but there is something we're doing wrong (like a
convention for resources we're not following that would fix this)
3) Clearly a ctakes issue
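
For reference, a workaround of the kind mentioned above could look something like the sketch below: try the expected classpath location first, then fall back to the same path with the "resources/" prefix that the export adds. The class and method names are made up for illustration and are not part of cTAKES.

```java
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

final class ExportedResourceLookup {
    /** Paths to try, in order: the expected one, then the "resources/"-prefixed one. */
    static List<String> candidatePaths(String path) {
        return Arrays.asList(path, "resources/" + path);
    }

    /** Opens the first candidate found on the classpath, or returns null if none exists. */
    static InputStream open(String path) {
        ClassLoader loader = ExportedResourceLookup.class.getClassLoader();
        for (String candidate : candidatePaths(path)) {
            InputStream in = loader.getResourceAsStream(candidate);
            if (in != null) {
                return in;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(candidatePaths("org/apache/ctakes/someproject/some_model.bin"));
    }
}
```

This only hides the symptom; it does not answer which of the three cases applies.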

Tim




Re: CTAKES-143 - dictionary lookup iterates inefficiently

2013-04-24 Thread Tim Miller

Looks like I included it in this commit:

   r1442975 | tmill | 2013-02-06 09:15:47 -0500 (Wed, 06 Feb 2013) | 5
   lines

   Addresses ctakes-143. Adds interface method getSortedLookupTokens to
   allow/ensure optimal handling of sorted underlying types in
   dictionary lookup.
   Implemented changed versions of other classes implementing classes
   but did not test since they are not used in ctakes as far as I can tell.
   See discussion on ctakes-dev for further information
   
(http://mail-archives.apache.org/mod_mbox/incubator-ctakes-dev/201302.mbox/%3C51102B9B.3000206%40childrens.harvard.edu%3E)

I don't know if it wasn't included or just not mentioned for the release.
Tim

On 04/24/2013 03:42 PM, Masanz, James J. wrote:

Tim,  was an improvement equivalent to  "CTAKES-143 - dictionary lookup iterates 
inefficiently" put into trunk?  I see it's marked as resolved.

All,
I don't think we've discussed as a group how we'd like to handle the statuses 
on JIRA issues.
Should the person marking an issue Resolved indicate the Fix Version if the 
resolution involved a code change?
Or does a release manager fill in the Fix Version when creating the release?

I see CTAKES-143 is not included in the release notes for 3.0.0-incubating.
I want to make sure it gets included in the next release.

-- James







Re: switching existing svn working copy from incubating to tlp

2013-04-25 Thread Tim Miller
I think you just want svn switch without the relocate flag. My 
understanding is that --relocate is for when the repository has moved to a 
new server; we've just moved to a different directory on the same server.


On 04/25/2013 10:14 AM, Coarr, Matt wrote:

Can I use "svn relocate" or "svn switch --relocate" to update my old working 
copy that was checked out from the incubator svn to use the new top-level-project svn?  Or do I 
need to check out a clean copy?

My working copy is checked out from here:  
https://svn.apache.org/repos/asf/incubator/ctakes/trunk

I've tried to use "svn relocate" to update my working copy to point to the tlp 
svn, but I get the following error:

svn: E155024: Invalid relocation destination: 
'https://svn.apache.org/repos/asf/ctakes/trunk' (does not point to target)

Here are a couple of the commands that I've tried (both return the error 
mentioned above):

svn relocate https://svn.apache.org/repos/asf/incubator/ctakes 
https://svn.apache.org/repos/asf/ctakes

svn relocate https://svn.apache.org/repos/asf/incubator/ctakes/trunk 
https://svn.apache.org/repos/asf/ctakes/trunk

Matt




better place for umls credentials

2013-05-01 Thread Tim Miller
The developer guide lists 3 options for umls credentials, and they all 
have issues:

1) environment variable
-- tried this one, got errors in .bashrc for illegal variable 
names, maybe it works in windows only

2) vm arguments in run configuration
-- works, but then you have to add it for every new run configuration
3) edit the descriptor file
-- runs the risk of accidentally checking in your credentials to svn

I tried setting the values in eclipse.ini, that does not work.

I think now I have stumbled on a decent solution, better than the 
options we've had up till now. Open Window -> Preferences in 
Eclipse, select Java -> Installed JREs, select the JRE you use, and 
click Edit. A window pops up that lets you put in default VM 
arguments. I put in my UMLS credentials there and that seemed to work. In 
theory this should then work for all run configurations, so you shouldn't 
have to re-do it for new run configurations.
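
To illustrate why default VM arguments should carry over to every run configuration: anything passed as -Dname=value is visible to the code as a JVM system property, no matter which launch configuration started the JVM. A minimal sketch, assuming the property names ctakes.umlsuser and ctakes.umlspw (check the developer guide for the exact names your version expects); the class itself is hypothetical:

```java
final class UmlsCredentials {
    /**
     * Looks a setting up as a JVM system property first (set with -Dname=value),
     * then as an environment variable, and returns null if neither is set.
     */
    static String lookup(String name) {
        String value = System.getProperty(name);
        if (value == null) {
            value = System.getenv(name);
        }
        return value;
    }

    public static void main(String[] args) {
        // Run with e.g.: java -Dctakes.umlsuser=me -Dctakes.umlspw=secret UmlsCredentials
        String user = lookup("ctakes.umlsuser");
        System.out.println(user == null ? "UMLS user not set" : "UMLS user is set");
    }
}
```

Names containing dots are not valid shell variable names in bash, which would be consistent with the .bashrc errors in option 1.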


Can someone else please verify that this works? And if so should we make 
this the default way to do it in developer setup? Any downsides I'm missing?


Tim


files vs strings in collection reader

2013-05-07 Thread Tim Miller
The FilesInDirectoryCollectionReader creates an arraylist of 
java.io.File objects when it is initialized. For large datasets (~50k 
files) this is substantial time overhead and probably memory as well. 
Seems like it would be more efficient to use Strings instead of Files 
there and just open the File object when getNext() is called. It is 
pretty easy to implement, any downside to making this switch?
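
Roughly what the change would look like, as a standalone sketch rather than a patch to FilesInDirectoryCollectionReader (the class and method names below are invented): keep only the path strings at initialization and build the java.io.File when the next document is requested.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class LazyFileList {
    private final List<String> paths; // only strings are held up front
    private int next = 0;

    LazyFileList(List<String> paths) {
        this.paths = new ArrayList<>(paths);
    }

    /** Collects the entries of a directory as path strings, without keeping File objects. */
    static LazyFileList fromDirectory(File directory) {
        List<String> found = new ArrayList<>();
        String[] names = directory.list(); // null if this is not a readable directory
        if (names != null) {
            for (String name : names) {
                found.add(directory.getPath() + File.separator + name);
            }
        }
        return new LazyFileList(found);
    }

    boolean hasNext() {
        return next < paths.size();
    }

    /** The File object is created only here, one document at a time. */
    File getNext() {
        return new File(paths.get(next++));
    }

    public static void main(String[] args) {
        LazyFileList files = fromDirectory(new File(args.length > 0 ? args[0] : "."));
        while (files.hasNext()) {
            System.out.println(files.getNext().getName());
        }
    }
}
```

Whether this actually saves time or memory on ~50k files is exactly what would need measuring.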

Tim


Re: files vs strings in collection reader

2013-05-07 Thread Tim Miller
This sounds like a job for... science! I'll try some experiments and see 
if it makes a difference.

Tim

On 05/07/2013 03:42 PM, Masanz, James J. wrote:

Do you have any numbers on what sort of impact this will actually have?  It's not 
clear to me where the savings would come from, since objects are instantiated either way.  
Should we just be initializing the ArrayList to something other than the 
default size?

-- James



-Original Message-
From: dev-return-1580-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-1580-Masanz.James=mayo@ctakes.apache.org] On Behalf Of
Tim Miller
Sent: Tuesday, May 07, 2013 2:18 PM
To: dev@ctakes.apache.org
Subject: files vs strings in collection reader

The FilesInDirectoryCollectionReader creates an arraylist of java.io.File
objects when it is initialized. For large datasets (~50k
files) this is substantial time overhead and probably memory as well.
Seems like it would be more efficient to use Strings instead of Files
there and just open the File object when getNext() is called. It is pretty
easy to implement, any downside to making this switch?
Tim




converting annotators to uimafit

2013-05-15 Thread Tim Miller
I think the uimaFIT way of handling parameters for annotators is great. 
Are we considering converting? If so I'll be happy to start doing it for 
some of the main annotators. But even if we are, do we know what the 
status is of the uima/uimaFIT integration? Like, if we switched 
everything over to uimafit would we have to change everything as soon as 
there is another uima release?

Tim


Re: sentence detector newline behavior

2013-05-21 Thread Tim Miller
I think the whole reason to use a machine learning approach for sentence 
detection should be to help weigh evidence with these cases where hard 
rules cause problems, mainly 1) when a period does not end a sentence, 
but also 2) where a newline does and does not mean end of sentence. It 
is of course bad that in your example if you don't put a sentence break 
you will think that "extravascular findings" is negated. But it is also 
bad if you put a sentence break immediately after the word "and" at the 
end of a line and then you find that your language model thinks that 
"and " is a good bigram.


I will create a jira for the parameter thing, and try to implement it 
and see if it gets ok results with the existing model.

Tim

On 05/21/2013 10:11 AM, Masanz, James J. wrote:

+1 for adding a boolean parameter, or perhaps instead a list of section IDs

The sentence detector model was trained on data that always breaks at carriage 
returns.

It is important for text that is a list something like this:

Heart Rate: normal
ENT: negative
EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.

And without breaking on the line ending, the word negative would negate 
extravascular findings


-Original Message-
From: dev-return-1605-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-1605-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Miller, Timothy
Sent: Tuesday, May 21, 2013 7:07 AM
To: dev@ctakes.apache.org
Subject: sentence detector newline behavior

The sentence detector always ends a sentence where there are newlines.
This is a problem for some notes (e.g. MIMIC radiology notes) where a
line can wrap in the  middle of a sentence at specified character
offsets. In the comments for SentenceDetector, it seems to be split up
very logically in that it first runs the opennlp sentence detector, then
breaks any detected sentence wherever there is a newline. Questions:
1) Would it be good to add a boolean parameter for breaking on newlines?
2) If that section was removed/avoided, does the opennlp sentence
detector give good results given our model? Or is the model trained on
text that always breaks at carriage returns?
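
To make question 1 concrete, here is a small sketch of the proposed switch. It is not the SentenceDetector code: the strings in `detected` stand in for whatever the opennlp sentence detector returns (the real component works on spans), and the class, method, and breakOnNewlines flag are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NewlineSplitter {
    /**
     * Post-processes detected sentences. When breakOnNewlines is true, every
     * sentence is split again at line breaks; otherwise it is passed through.
     */
    static List<String> postProcess(List<String> detected, boolean breakOnNewlines) {
        List<String> result = new ArrayList<>();
        for (String sentence : detected) {
            if (!breakOnNewlines) {
                result.add(sentence);
                continue;
            }
            for (String piece : sentence.split("\\r?\\n")) {
                if (!piece.trim().isEmpty()) {
                    result.add(piece.trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> detected = new ArrayList<>();
        detected.add("The lungs are clear and\nthe heart is normal.");
        System.out.println(postProcess(detected, true));
        System.out.println(postProcess(detected, false));
    }
}
```

With the flag off, a sentence wrapped mid-line stays in one piece, which is the behavior wanted for notes that wrap at fixed character offsets.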

Tim




Re: sentence detector newline behavior

2013-05-23 Thread Tim Miller
OK I've started doing this, was able to get training working on a very 
small example, will try doing slightly bigger.

Tim

On 05/22/2013 08:03 AM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

That's awesome! It might be worth trying at least. How does the training
process change? Previously the training data would be one sentence per
line, but with newlines as possible mid-sentence characters that could
be trouble, is there a new representation for training data? Or would we
have to use the training api?


Good point, yes that will be a problem with the default training format,
but it shouldn't be hard to solve. In the format itself we could define a
new line tag to mark new lines. As a hack to make it work with 1.5.3 you
could instead use a special char as a replacement for the new line char.
When you pass the text down to the sentence detector a simple string
replace could be used to convert all new line chars to the special new
line marker char.

If things work out for you performance wise as well we will just integrate
it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line marker
char?

You should try to pick a char you can also pass in on a terminal, otherwise
you have to use the API to train the model. The built-in cross validation
could be used to evaluate the performance.
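
A sketch of the string-replace hack described above, using only the JDK. The marker character chosen here (U+00B6) is just an example and an assumption; any character that never occurs in the notes and can be typed on a terminal would do. The class and method names are hypothetical.

```java
final class NewlineMarker {
    /** Assumed marker: must not occur in the text being processed. */
    static final char MARKER = '\u00B6';

    /** Replaces every newline with the marker, one char for one char. */
    static String encode(String text) {
        return text.replace('\n', MARKER);
    }

    /** Restores the newlines after sentence detection. */
    static String decode(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "The lungs are clear and\nthe heart is normal.";
        String encoded = encode(note);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(note));
    }
}
```

Because the replacement is one character for one character, offsets reported on the encoded text still line up with the original document.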


Jörn




Re: srl bug with one word sentences

2013-05-29 Thread Tim Miller
There weren't any objections, but I'm not convinced that the new type 
will be an improvement over the fix checked in, so I'm going to mark 
this issue resolved unless someone wants to raise an argument.

Tim

On 04/11/2013 09:25 PM, Chen, Pei wrote:

My vote would be for 2) Create a DependencyTreeHead which would make it 
explicit similar to ClearTK's 
org.cleartk.syntax.dependency.type.TopDependencyNode.

Since there is already a temp fix, we can wait and see if there are any 
objections with adding this type first?

--Pei
____________
From: Tim Miller [timothy.mil...@childrens.harvard.edu]
Sent: Monday, April 08, 2013 3:54 PM
To: dev@ctakes.apache.org
Subject: srl bug with one word sentences

I filed a jira (https://issues.apache.org/jira/browse/CTAKES-184) for an
issue in the srl component of the dependency parser. Here is the issue:
The SRL component grabs the Dependency Node for each token using the
token span. When there is a one-word sentence there are two such nodes,
the "head" node which spans the sentence and the token node. There is a
quick fix checked in that looks at the "id" field, but it would be
preferable to fix it upstream so that users of the Dependency types
don't need to know about this issue.

Here are a few possibilities, any other ideas?
1) Change the span of the head node, maybe to 0 length at the start of
the sentence?
2) Create a separate type for DependencyTreeHead, perhaps spanless



--
Tim Miller, PhD
Postdoctoral Research Fellow
Children's Hospital Informatics Program
Boston Children's Hospital and Harvard Medical School
617-919-1223





Re: files vs strings in collection reader

2013-05-29 Thread Tim Miller
This collection reader latency issue was harder to test than expected -- 
the first run took ~20 minutes to load and the second took a negligible 
amount of time, presumably due to caching effects. But given our other 
conversation on a "big data" direction using UIMA-AS there is a 
potential solution out there.


UIMA-AS doesn't require Collection Readers -- you just deploy some 
number of pipelines, and then can write a bit of code that can create 
and add CAS's to a queue, asynchronously if desired. So when we get 
something like that up and running, then we can give users/devs a rule 
of thumb that says if you're regularly processing more than ~10k 
documents it's probably better to use UIMA-AS anyways, and then you'll 
get the benefits of the asynchronous methods.
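
The queue-based pattern can be pictured with plain JDK classes. This is only an analogy and does not use the UIMA-AS API: strings stand in for CAS's, worker threads stand in for deployed pipelines, and every name below is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class QueuePipelineSketch {
    // Stop marker; compared by identity so a document with the same text is still processed.
    private static final String STOP = new String("STOP");

    /** Feeds documents to a shared queue and returns how many the workers processed. */
    static int processAll(List<String> documents, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger processed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    // Each worker plays the role of one deployed pipeline.
                    for (String doc = queue.take(); doc != STOP; doc = queue.take()) {
                        processed.incrementAndGet(); // real analysis would happen here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            threads.add(worker);
        }
        try {
            for (String doc : documents) {
                queue.put(doc); // the client adds work directly, no collection reader
            }
            for (int i = 0; i < workers; i++) {
                queue.put(STOP); // one stop marker per worker
            }
            for (Thread worker : threads) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding the queue", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("note one");
        docs.add("note two");
        System.out.println(processAll(docs, 2));
    }
}
```

The point of the analogy is that the producer never has to enumerate the whole collection up front, which is where the collection reader latency came from.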


Tim

On 05/07/2013 03:49 PM, Tim Miller wrote:
This sounds like a job for... science! I'll try some experiments and 
see if it makes a difference.

Tim

On 05/07/2013 03:42 PM, Masanz, James J. wrote:
Do you have any numbers on what sort of impact this will actually 
have?  It's not clear to me where the savings would come from, since objects 
are instantiated either way.  Should we just be initializing the ArrayList to 
something other than the default size?


-- James



-Original Message-
From: dev-return-1580-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-1580-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
Tim Miller
Sent: Tuesday, May 07, 2013 2:18 PM
To: dev@ctakes.apache.org
Subject: files vs strings in collection reader

The FilesInDirectoryCollectionReader creates an arraylist of java.io.File
objects when it is initialized. For large datasets (~50k
files) this is substantial time overhead and probably memory as well.
Seems like it would be more efficient to use Strings instead of Files
there and just open the File object when getNext() is called. It is pretty
easy to implement, any downside to making this switch?
Tim






Re: Next cTAKES release (3.1)?

2013-05-31 Thread Tim Miller
Yes I think it can be done by then. But even if not, my understanding is 
that the version turned on by default is not cleartk-based and the 
cleartk one is still under development.

Tim

On 05/31/2013 03:25 PM, Masanz, James J. wrote:

I'll be release manager for 3.1 (unless someone else is anxious to be and just 
hasn't seen this thread yet)

I'd suggest we target Wed June 26 to have an RC built.

Steve, would that seem reasonable for the relation extractor changes due to 
[1], plus those that would be needed if we upgrade ClearTK dependency to 1.4.0?

Anyone know enough about assertion code to make an educated guess of whether the 
"upgrade ClearTK dependency to 1.4.0" could be done by then too?

-Original Message-
From: dev-return-1652-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-1652-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Steven Bethard
Sent: Friday, May 31, 2013 2:18 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

As a result of the CTAKES-190 changes to the dictionary lookup [1], the 
relation extractor needs some refactoring and retraining. Probably we won't 
have a chance to get to that until after NAACL (June 9-15). So it would be best 
for us to target the 3.1 release towards the end of June.

Steve

[1] https://issues.apache.org/jira/browse/CTAKES-190

On May 31, 2013, at 1:01 PM, "Chen, Pei"  wrote:


https://issues.apache.org/jira/browse/CTAKES/fixforversion/12323276#selectedTab=com.atlassian.jira.plugin.system.project%3Aversion-issues-panel
25/58 are either closed/resolved; there were a decent number of simple patch 
fixes I think.

To spread the knowledge, perhaps another committer could be the release manager 
(RM) for the next release.  Hint hint *James? ;)

--Pei


-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, April 12, 2013 4:41 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Next cTAKES release (3.1)?


The new CEM Instance Template population is not complete yet, but if 3.1 is
late May or June, it will be.

Also, is the GUI close enough to being ready for prime time that it would
have a chance to be in 3.1?

-- James



-Original Message-
From: dev-return-1506-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-1506-Masanz.James=mayo@ctakes.apache.org] On Behalf Of Chen, Pei
Sent: Thursday, April 11, 2013 7:56 PM
To: dev@ctakes.apache.org
Subject: Next cTAKES release (3.1)?

Hi,
I just wanted to gauge the interest in creating the next release of
cTAKES (3.1), which is currently marked for May in Jira.

There have already been 22/53 issues [1] marked as fixed or closed.
Plenty of bug fixes and new components including:
- New CEM Instance Template population
- New Dependency Parser/Semantic Role Labeler
- New optional Clear POSTagger
- New regression testing component

Should we wait for the Temporal component?

[1]
https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES




Re: upgrade ClearTK dependency to 1.4.0?

2013-05-31 Thread Tim Miller
Yeah, Pei, you're right, I was able to remove it w/o causing any errors. 
I'll check it in with the dependency and imports removed.

Tim

On 05/31/2013 03:39 PM, Chen, Pei wrote:

Regarding the cleartk-examples dependency from the assertion module:
I think there were some commits on this already [1].
There is only a reference to the example classes in the imports now, but it's 
actually never used.
Perhaps Matt could chime in.  Maybe it's as simple as just removing the examples 
from the pom and imports.

[1] 
http://mail-archives.apache.org/mod_mbox/incubator-ctakes-dev/201210.mbox/%3CF085F637EA9843429DA7A267BBB709A01E8402EF%40IMCMBX04.MITRE.ORG%3E



-Original Message-
From: Steven Bethard [mailto:steven.beth...@colorado.edu]
Sent: Thursday, May 30, 2013 6:46 PM
To: ctakes-...@incubator.apache.org
Subject: upgrade ClearTK dependency to 1.4.0?

I just released ClearTK 1.4.0 and there are a couple of reasons we should
probably consider updating the cTAKES dependency:

(1) ClearTK 1.4.0 can now load trained models from the classpath, so we
could get rid of the workaround
org.apache.ctakes.relationextractor.ae.RelationExtractorAnnotator.allowClassifierModelOnClasspath.

(2) ClearTK 1.4.0 has wrappers for multi-class classification with LIBLINEAR
which is orders of magnitude faster than LIBSVM.

The main downside is that models will have to be re-trained. (It's not
necessarily the case that all models would need to be retrained, depending
on exactly which classes they were using, but it's probably safer to do so.)

I believe this would mostly affect ctakes-temporal, ctakes-relation-extractor
and ctakes-assertion.

Thoughts?

Steve

P.S. I noticed that ctakes-assertion declares a dependency on
cleartk-examples. The cleartk-examples module was never intended for release,
and has not been released as part of ClearTK 1.4.0. Looking at the code, it
seems like the dependency on cleartk-examples is not needed, but perhaps a
ctakes-assertion person could weigh in on why this dependency was there?




Re: Next cTAKES release (3.1)?

2013-07-02 Thread Tim Miller
Agreed that you could definitely help out, and that would be a great way 
to do so. We don't really have "examples" right now, more like just 
short test sentences for showing simple results and verifying that 
nothing has been broken by changes. I think regular length fake but 
realistic notes would be very useful.

Tim

On 07/02/2013 05:19 PM, John Green wrote:

Hi all,

I've been following this mailing list for a couple of months. I'm a third year 
medical student rounding the bend toward my MD. I used to be a computer 
programmer, however, and continue my own projects. I'm very interested in 
contributing eventually to cTakes development. In the meantime, given the 
current talk of examples, if any domain-specific examples need to be generated, 
I am knowledgeable enough in the domain that I could pound out a few free-text 
notes made to order.

Let me know. You all may already have docs on hand willing to do this, but if 
not...

John Green

Sent from my iPhone

On Jun 28, 2013, at 8:59, "Chen, Pei"  wrote:


I completely agree with making cTAKES easier to use.  I think it is exciting to 
hear the different use cases here and to understand where some of the areas 
that need improvement are (which we hadn't thought about earlier).
I think Tim's suggestions and the 3 concrete actionable items make a lot of 
sense.  Hopefully this will attract new users, adopters, and perhaps more 
committers.


i) Make the typesystem forefront in documentation -- generate javadocs and
have as a link on the ctakes frontpage/sidebar
ii) Similar to the way that we are aiming to have tests in every module, also
have clearly labeled examples in every module that set up a pipeline, run on
sample notes (could be the same sample notes from the tests), and do
something with the results.
iii) Follow Giri's recommendation to have example training data for people
who want to take the next step and train their own models

I think Java developers are accustomed to including a library as a 
dependency/jar, having an API to pass input, and getting the results via pojos.  So 
the examples could initially shield the complexity of wiring a pipeline 
together etc.
If we can improve the APIs and how they get integrated with other apps, we can 
add any GUI/CLI tools on top of this afterwards.

--Pei


-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Friday, June 28, 2013 8:00 AM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

Very interesting discussion. I think Giri is right about giving example training
data in the format that our training code can read. While our ultimate goal
would be to build and release models that are completely domain-
independent, in the real world it is almost always better to use some
domain-specific data and we should think more about how to facilitate that.

As for making it easier to get started, it is not totally clear to me what this
means/how to do it so it might be useful to get specific about what this
means. I think our biggest hurdle is

1) Prerequisite of understanding UIMA/UIMAFit

Since UIMAFit is officially becoming part of UIMA that will be easier, and
hopefully people will just learn the easier (in my opinion) UIMAFit way than
the standard UIMA way of doing things. Is there something we can be doing
to make understanding UIMA easier? Or do we just need to say upfront that
this is a prerequisite and hope that people don't give up due to this thing that
is out of our control?

Another hurdle is:

2) cTAKES is a multi-purpose developer-aimed tool

So it's not just a matter of hiding complexity -- at some point people have to
understand their problem, understand cTAKES' capabilities, and start coding.
Pei's GUI will help for some common use cases but will not remove the
requirement that someone at the organization knows cTAKES.
I think one part of this problem is the fact that the typesystem is not well
documented. A developer needs to know what the output is (objects from
the typesystem), how to get them (which modules/pipelines), and what
information is in them. So maybe on this end my recommendation would be:
i) Make the typesystem forefront in documentation -- generate javadocs and
have as a link on the ctakes frontpage/sidebar
ii) Similar to the way that we are aiming to have tests in every module, also
have clearly labeled examples in every module that set up a pipeline, run on
sample notes (could be the same sample notes from the tests), and do
something with the results.
iii) Follow Giri's recommendation to have example training data for people
who want to take the next step and train their own models

This is quite a bit of developer overhead, so it's worth asking whether you
agree with my "diagnosis" and "treatment" or whether you think there are
different problems/solutions that should be higher priority.

Tim

On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:

Hi Vijay and Andy,

Thanks for sharing those examples.

"Trouble is, privacy requir

Re: apostrophe and sentence detector

2013-08-26 Thread Tim Miller


On 08/26/2013 12:05 PM, Masanz, James J. wrote:

The recently rebuilt sentence detector (currently in trunk and the 3.1.0 
branch) is sometimes taking the apostrophe as a sentence break where the 
ctakes-3.0.0-incubating model didn't.

The training data used for the recently rebuilt model contains only 7 
lines that end with an apostrophe (single quote)

Do you mean 7 sentences that end in a single apostrophe or 7 lines? The 
sentence detector will currently break on newlines no matter what, so 
the important number is how many sentences end mid-line with an 
apostrophe, right?

Tim


Re: apostrophe and sentence detector

2013-08-26 Thread Tim Miller
Ah, so we might suspect that some of those 7 lines in the file were 
indeed followed by newlines in the original training data. In the 
absence of more/better training data which would help us learn this I 
think it would be reasonable to restore the list of sentence-breaking 
characters to not include apostrophe. Seems like it is rare for a 
sentence to end on it, and my preference is to accidentally call 2 
sentences one sentence, rather than splitting one sentence in the 
middle. I think it's probably better for downstream processing.

Just my .02,
Tim
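
To make the trade-off above concrete, here is a minimal sketch. It is not the cTAKES sentence detector, which asks a trained model whether each candidate character really ends a sentence. The class name, the character sets, and the simplification that every candidate is accepted are assumptions for illustration only. It shows that newlines always break, and that dropping the apostrophe from the candidate set keeps quoted text in one sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: every candidate character is treated as a break,
// whereas a real detector would let a trained model decide per candidate.
final class BreakCandidateSketch {

    static List<String> split(String text, Set<Character> breakChars) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isNewline = c == '\n';
            // Newlines always end a sentence, whatever the candidate set is.
            if (isNewline || breakChars.contains(c)) {
                // Keep a break character with its sentence, but drop the newline.
                int end = isNewline ? i : i + 1;
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String note = "Pt reports 'sharp' pain.\nNo fever";
        // With the apostrophe as a candidate, the quoted word splits the sentence.
        System.out.println(split(note, Set.of('.', '?', '!', '\'')));
        // Without it, only the period and the newline break.
        System.out.println(split(note, Set.of('.', '?', '!')));
    }
}
```

With the smaller candidate set the worst case is two sentences merged into one, which is the failure mode preferred above.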

On 08/26/2013 12:29 PM, Masanz, James J. wrote:

The training data is one sentence per line.
That's how you feed data to the sentence detector.





Re: [VOTE} Release Apache cTAKES 3.1 - rc2

2013-08-29 Thread Tim Miller

James,
I second Dima's opinion to wait on temporal. There are some modules that 
probably perform well enough already to release but probably need a 
little more polish before they are ready for others to start using them.

Tim

On 08/29/2013 05:01 PM, Masanz, James J. wrote:

Thanks Pei,

I took the src zip, removed references to temporal from the pom, and was able 
to do a mvn clean compile without error, so at least there aren't any other 
compile issues.

I will wait until tomorrow to give a little more time for anyone to weigh in on 
whether to include ctakes-temporal, and then take action as I've described in

https://issues.apache.org/jira/browse/CTAKES-82

Then I'll create rc3

Meanwhile I'll do some other testing of rc2.

All,
Any other comments on rc2 appreciated before I create rc3. Thanks.

-- James

-Original Message-
From: dev-return-1909-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-1909-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Pei Chen
Sent: Wednesday, August 28, 2013 6:55 PM
To: dev@ctakes.apache.org
Subject: Re: [VOTE} Release Apache cTAKES 3.1 - rc2

One more thing- if I try to build from the src.tar.gz, I get the below.
It looks like the temporal project is defined as a module in the root
pom.xml but the src was not included.
Are we planning to include the source of that project in this release or
comment it out for a future release?

apache-ctakes-3.1.0-src pei$ mvn clean compile
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR]   The project org.apache.ctakes:ctakes:3.1.0
(/Users/pei/workspace/apache-ctakes/ctakes-3.1.0-rc2/apache-ctakes-3.1.0-src/pom.xml)
has 1 error
[ERROR] Child module
/Users/pei/workspace/apache-ctakes/ctakes-3.1.0-rc2/apache-ctakes-3.1.0-src/ctakes-temporal
of
/Users/pei/workspace/apache-ctakes/ctakes-3.1.0-rc2/apache-ctakes-3.1.0-src/pom.xml
does not exist
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
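
A check like the following could catch this kind of problem before a release candidate is cut. It is only a sketch under assumptions: the class name is made up, it treats every <module> entry as a directory relative to the root pom.xml (Maven also allows a module to point at a pom file), and it looks at all <module> elements, including any inside profiles.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists modules named in pom.xml whose directory is absent.
final class PomModuleCheck {

    static List<String> missingModules(Path projectDir) throws Exception {
        Document pom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(projectDir.resolve("pom.xml").toFile());
        // Commented-out modules are not elements, so they are skipped here.
        NodeList modules = pom.getElementsByTagName("module");
        List<String> missing = new ArrayList<>();
        for (int i = 0; i < modules.getLength(); i++) {
            String name = modules.item(i).getTextContent().trim();
            if (!Files.isDirectory(projectDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        List<String> missing = missingModules(dir);
        if (missing.isEmpty()) {
            System.out.println("All declared modules are present.");
        } else {
            System.out.println("Declared but missing: " + missing);
        }
    }
}
```

Run against the unpacked source archive, it would report any module that is declared in the root pom but was left out of the archive.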


On Wed, Aug 28, 2013 at 3:23 PM, Masanz, James J. wrote:


Hello cTAKES users and enthusiasts,


This is a call for a vote on releasing the following candidate as Apache
cTAKES 3.1.0. This will be our first release as a TLP.



For more detailed information on the changes/release notes, please visit:


https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12323276&projectId=12313621



The release was made using the cTAKES release process documented here:

http://ctakes.apache.org/ctakes-release-guide.html



The candidate is available at:


http://people.apache.org/~james-masanz/RCs/ctakes-3.1.0/rc2/apache-ctakes-3.1.0-src.tar.gz/.zip



The nexus staging repository:
org.apache.ctakes-121 (u:james-masanz, a:129.176.197.197)<
https://repository.apache.org/content/repositories/orgapachectakes-121>



The tag to be voted on:

http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.1.0-rc2/



The MD5 checksum of the tarball can be found at:


http://people.apache.org/~james-masanz/RCs/ctakes-3.1.0/rc2/apache-ctakes-3.1.0-src.tar.gz.md5/.zip.md5



The signature of the tarball can be found at:


http://people.apache.org/~james-masanz/RCs/ctakes-3.1.0/rc2/apache-ctakes-3.1.0-src.tar.gz.asc/.zip.asc



Apache cTAKES' KEYS file, containing the PGP keys used to sign the release:

http://svn.apache.org/repos/asf/ctakes/tags/ctakes-3.1.0-rc2/KEYS



Please vote on releasing these packages as Apache cTAKES 3.1.0. The vote
is open for at least the next 72 hours.



Also, the convenience binary is:


http://people.apache.org/~james-masanz/RCs/ctakes-3.1.0/rc2/apache-ctakes-3.1.0-bin.tar.gz/.zip



Only votes from the cTAKES PMC are binding, but folks are welcome to check
the release candidate and voice their approval or disapproval. The vote
passes if at least three binding +1 votes are cast.



[ ] +1 Release the packages as Apache cTAKES 3.1.0

[ ] -1 Do not release the packages because...



Thanks!

James Masanz
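
For anyone checking the candidate, the MD5 step can be reproduced with the standard library alone. This is a minimal sketch; the class name and the command-line arguments are assumptions, not part of the release process. It only covers the checksum: the .asc signature still has to be verified separately with a PGP tool against the KEYS file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the MD5 of a stream as lowercase hex.
final class Md5Check {

    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read;
        // Stream the file so large archives need not fit in memory.
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive
        // args[1]: expected checksum copied from the published .md5 file
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            String actual = md5Hex(in);
            boolean ok = actual.equalsIgnoreCase(args[1].trim());
            System.out.println(ok ? "checksum matches" : "MISMATCH: " + actual);
        }
    }
}
```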








Re: [VOTE} Release Apache cTAKES 3.1 - rc2

2013-08-29 Thread Tim Miller
Oh, also, I have a re-trained constituency parse model which I thought I 
had checked in but I guess I didn't. Is it too late to include for the 
next release candidate?

Tim









Re: cTAKES DATA DICTIONARY

2013-10-02 Thread Tim Miller

I still get an error from this link.
Firefox says:

   Error loading stylesheet: An XSLT stylesheet does not have an XML
   mimetype:
   
https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystemDescription.xsl


Chrome displays the xml with this message:

   This XML file does not appear to have any style information
   associated with it. The document tree is shown below.




This is under ubuntu.
Tim

On 09/30/2013 04:57 PM, Chen, Pei wrote:

It's fixed:
https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystem.xml
Thanks to infra- I just had to set the svn:mime-type property in svn...
Now we just need to beef up the docs a bit :)



-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Monday, September 30, 2013 4:57 PM
To: 'dev@ctakes.apache.org'
Subject: RE: cTAKES DATA DICTIONARY

I think it's a bad idea to have a copy on the website.
Too easy to get out of sync with SVN.

I can view the copy in SVN OK with IE 8, but not with Chrome (only 2
browsers I've tried).


-Original Message-
From: dev-return-2058-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-2058-Masanz.James=mayo@ctakes.apache.org] On
Behalf Of Chen, Pei
Sent: Monday, September 30, 2013 3:44 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES DATA DICTIONARY

http://ctakes.apache.org/docs/TypeSystem.xml

It's kind of weird that the XSL works behind the apache site/web servers, but
not within the svn repo.  Does it ring a bell to anyone? I have a feeling that
the svn web servers are doing something weird and not allowing the
transformation.
But either way, I've made a copy of it on to the site;  I think the descriptions
need to be beefed up a little bit though...

Any objections in removing the "Equivalent to edu.mayo.bmi. etc. etc."  It
probably made sense in the past when we were upgrading to the common
type system, but I think it probably makes more sense to put the intended
meanings there now...

--Pei


-Original Message-
From: Richard Eckart de Castilho [mailto:r...@apache.org]
Sent: Monday, September 30, 2013 12:08 PM
To: dev@ctakes.apache.org
Subject: Re: cTAKES DATA DICTIONARY

Sounds cool, unfortunately as it is, it doesn't seem to work in
Safari, Firefox, or Chrome :( (OS X)

-- Richard

On 30.09.2013, at 18:01, "Chen, Pei" 
wrote:


Thanks Murali!
It's actually pretty cool to have a quick reference to look up what all the
different fields mean.

FYI: I've made the commits to:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystem.xml
which references the XSL template[1].
The xslt looks okay from IE.

Troy, do you know if Confluence has an XSLT plugin rather than relying on
individual browsers?  It would be nice to just pull the descriptions
of the fields directly from the TypeSystem descriptions so it can be
maintained in 1 place...

[1]
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystemDescription.xsl
--Pei


-Original Message-
From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Saturday, September 28, 2013 2:15 PM
To: 
Cc: dev@ctakes.apache.org
Subject: Re: cTAKES DATA DICTIONARY

+1
That sounds like a good idea...
If you like, feel free to create a Jira item for this and attach any
code/patches to it:

https://issues.apache.org/jira/browse/ctakes
I believe anyone should be able to create an account.

Also, if you subscribe to the list, it doesn't require moderator approval...
Sent from my iPad

On Sep 28, 2013, at 12:42 PM, "Murali Nagendranath" <mmin...@gmail.com> wrote:

Hi folks,

One thing I think would be really helpful for new users would be to
have a data dictionary that describes the definition of each Type
and its Fields.

This could be a simple html table that we could link to.

Since there is already a TypeSystem.xml that contains the features
and descriptions, I was thinking we could create an easy xslt>html
template and have things maintained in one place.

If people think this is a good idea, I can try to take a first stab at it.

--Murali
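
One way to avoid depending on each browser's XSLT support, as discussed in this thread, would be to run the transform once and publish the resulting HTML. The sketch below uses the JDK's built-in javax.xml.transform API. The XML and the stylesheet are simplified stand-ins invented for illustration, not the real TypeSystem.xml or TypeSystemDescription.xsl; in particular, real descriptors declare an XML namespace, so the select expressions would need a prefix bound to it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: turns a (simplified) type description into an HTML table.
final class DataDictionarySketch {

    // Stand-in input: one type with a placeholder name and description.
    static final String TYPES_XML =
        "<typeSystemDescription><types>"
        + "<typeDescription><name>example.Type</name>"
        + "<description>A placeholder description.</description></typeDescription>"
        + "</types></typeSystemDescription>";

    // Stand-in stylesheet: one table row per typeDescription element.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='/'>"
        + "<table><xsl:for-each select='//typeDescription'>"
        + "<tr><td><xsl:value-of select='name'/></td>"
        + "<td><xsl:value-of select='description'/></td></tr>"
        + "</xsl:for-each></table>"
        + "</xsl:template></xsl:stylesheet>";

    static String toHtml(String xml, String xsl) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml(TYPES_XML, XSL));
    }
}
```

Generating the page from the descriptor at build time would keep the descriptions maintained in one place while serving plain HTML to every browser.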




Re: cTAKES DATA DICTIONARY

2013-10-02 Thread Tim Miller
FWIW I got it to work on the iPad, but it still doesn't work in Firefox even 
if I clear my history first. Can anyone else confirm in Firefox whether 
it's something on my end or on the server end?


Looks nice on the ipad!
Tim

On 10/02/2013 11:05 AM, Chen, Pei wrote:

Try refreshing now.
https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-type-system/src/main/resources/org/apache/ctakes/typesystem/types/TypeSystem.xml
Not sure what happened, but the change seems to have been lost/clobbered, so I just re-added it...
--Pei



relation types in typesystem

2013-10-17 Thread Tim Miller
I noticed yesterday that the ctakes type system is missing some of the 
types from the SHARP annotations. This causes an error when we try to 
read in data with those types with our reader. I created a JIRA for the 
issue here:

https://issues.apache.org/jira/browse/CTAKES-250

Any thoughts on whether I should just implement them or is there some 
reason to have a more detailed discussion?


--
Tim Miller, PhD
Postdoctoral Research Fellow
Children's Hospital Informatics Program
Boston Children's Hospital and Harvard Medical School
617-919-1223



Re: relation types in typesystem

2013-10-21 Thread Tim Miller
I'm not sure I understand the distinction between BinaryTextRelation and 
ElementRelation types. Is it the difference between the abstract 
relation and a particular text instantiation?


On 10/17/2013 02:33 PM, Wu, Stephen T., Ph.D. wrote:

+1 creation of additional relation types, both BinaryTextRelation and
ElementRelation.

stephen







Re: relation types in typesystem

2013-10-22 Thread Tim Miller
One more issue is with the conditional attribute of relations. It is in the 
SHARP annotation guidelines I have access to but not in the implemented 
typesystem nor in the typesystem documents I have. Should there be a 
conditional field alongside polarity and uncertainty for Relation?

Tim

On 10/21/2013 03:28 PM, Wu, Stephen T., Ph.D. wrote:

That's right.
BinaryTextRelations connect Annotations.
ElementRelations connect Elements.

In practice we have only ever used Annotations and BinaryTextRelations.

stephen








Re: relation types in typesystem

2013-10-22 Thread Tim Miller
Well, it's in the annotation guidelines and the typesystem. Whether 
there are any examples annotated like that I couldn't tell you. But 
there does seem to be at least one conditional relation in the 
stratified corpus, since it broke the code that reads the knowtator xml.

Tim

On 10/22/2013 11:06 AM, Dmitriy Dligach wrote:
Is polarity/uncertainty annotated for relations in the SHARP gold 
standard?


Dima










Re: Sundry

2013-10-30 Thread Tim Miller

Thanks for bumping this Pei, it reminds me I meant to respond to it.

The OPQRST does sound like a great ML project. At a glance I might think 
a sequence model over sentences (like a CRF) would be a good model.
But I'm wondering what the end use case is? Is it for teaching OPQRST to 
new clinicians? Or maybe as a sort of middleware for other projects 
where it might be a useful feature? Without a physician's intuition I 
sometimes suffer from a failure of imagination on these things.


Tim


On 10/30/2013 09:59 AM, Chen, Pei wrote:

Hi John,
I was away for a little bit and finally got a chance to catch up on emails...


2) I work for the DoD and have latched on to several IRB approved projects
within that community where I'll be using cTAKES, though minimally at first.
This is just a statement, a bug in the ear of the community of what people
are up to.

This is really news!  Looking forward to hearing more...


has anyone considered (and maybe the components already do this in some way I
haven't explored yet - time is ever limited) adding an OPQRST classifier?

I'm not too familiar with how OPQRST would be determined from the patient's 
record.
Just curious, how is it determined manually now?  Is it a single 
score determined by a formula/rule(s)?
Seems like another good use case for cTAKES output-- clinically focused.
--Pei




getContextMap() question

2013-11-12 Thread Tim Miller
I'm running the default pipeline on some large files and trying to fix 
some of the slower annotators. I changed ChunkAdjuster to use UimaFit 
selectors which dramatically improves speed on large files. I removed 
the OverlapAnnotator, with its complicated interface and extreme 
generality, from my pipeline altogether and replaced it with a 3-line 
static annotator. I think we should consider doing that for the default 
pipeline even if we think there are good reasons to keep the 
general-purpose annotator around.


Anyways, now I'm at the dictionary lookup which I suspect will be the 
slowest component. One call is to getContextMap() which seems especially 
slow. It is called for every LookupWindow, and given the span of that 
window, iterates over all LookupWindows looking for one with the 
equivalent span. So in the end you give it a lookup window and it gives 
you the same one back basically. Of course the code is written very 
generally so there may be use cases where the types are different, but 
for the default case it seems a little weird for something doing nothing 
to take so long.


So, my question is, does anyone know what the engineering goals of this 
setup are? I think it can be optimized even within the super-general 
framework it is trying to maintain, but I don't want to break anything 
by making assumptions that aren't valid.


Thanks
Tim
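
To illustrate the kind of optimization meant here, the sketch below replaces a per-window scan over all windows with a one-time index keyed by span. It is not cTAKES or UIMA code: the Window class, the key encoding, and the method names are hypothetical stand-ins, and the real code may have requirements (such as differing context types) that this ignores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: look up annotations with an identical span in O(1)
// after a single pass, instead of scanning every window for every window.
final class SpanIndexSketch {

    // Stand-in for an annotation: begin/end offsets plus a label.
    static final class Window {
        final int begin;
        final int end;
        final String label;

        Window(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    // Packs both offsets into one long so the pair can serve as a map key.
    static long key(int begin, int end) {
        return ((long) begin << 32) | (end & 0xFFFFFFFFL);
    }

    // One pass over all windows builds the index.
    static Map<Long, List<Window>> indexBySpan(List<Window> windows) {
        Map<Long, List<Window>> index = new HashMap<>();
        for (Window w : windows) {
            index.computeIfAbsent(key(w.begin, w.end), k -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Each later lookup is a hash probe rather than a full scan.
    static List<Window> sameSpan(Map<Long, List<Window>> index, int begin, int end) {
        return index.getOrDefault(key(begin, end), List.of());
    }

    public static void main(String[] args) {
        List<Window> windows = List.of(
                new Window(0, 5, "a"), new Window(6, 10, "b"), new Window(0, 5, "c"));
        Map<Long, List<Window>> index = indexBySpan(windows);
        System.out.println(sameSpan(index, 0, 5).size());
    }
}
```

For n windows this turns roughly n scans of n items into one pass plus n hash lookups, which matters most on the large files described above.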



publicly browsable javadoc

2013-12-03 Thread Tim Miller
I could've sworn at one point I had accessed a publicly available 
javadoc for cTAKES on the web, which importantly included the 
typesystem. Can anyone verify I'm not crazy and possibly point me to it? 
Having trouble finding it through the website and google.


Tim


Re: cTAKES Groovy...

2013-12-04 Thread Tim Miller
Very cool. I was noticing that it was downloading the umls resources 
which the parser itself doesn't need -- so I made a change to not grab 
clinical-pipeline and grab directly the things it was getting through 
that reference and now it runs even faster with only a 35M initial download.


I'd like to check in my change -- should we keep working out of sandbox 
or can we maybe put groovy scripts somewhere alongside the projects they 
belong to? Maybe in the scripts/ directory or scripts/groovy, 
scripts/perl, etc.? Any opinions on this?


Tim


On 11/27/2013 12:19 PM, Chen, Pei wrote:

The sample constituency parser printer should be working now...
Just copy and paste the text to parser.groovy and make it executable.
All you should need is groovy installed on your machine.
http://svn.apache.org/repos/asf/ctakes/sandbox/groovy/parser.groovy
$ parser.groovy input
Reading from directory: input
  (TOP (S (NP-SBJ (NN patient)) (VP (VBD took) (NP (NP (NNS 50mg)) (PP (IN of) 
(NP (NP (NN aspirin)) (PP (IN for) (NP (NP (NN pain)) (PP-LOC (IN in) (NP (NN 
knee)(. .)))

Maybe we could create one that will output UMLS CUI/Codes... and then others 
could easily modify to their needs.

--Pei

-Original Message-
From: William Karl Thompson [mailto:w...@northwestern.edu]
Sent: Tuesday, November 26, 2013 10:46 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES Groovy...

That is very cool!

Since we're talking Groovy, I'd just like to make a plug for Gradle, a fantastic
build/deployment/dependency management tool that is in many ways much
nicer to work with than Maven, though it plays nicely with Maven (for
example, it can use Maven repositories). Gradle is also proven technology:
it's the build tool for the Android operating system.

From: Chen, Pei [pei.c...@childrens.harvard.edu]
Sent: Tuesday, November 26, 2013 4:13 PM
To: dev@ctakes.apache.org
Subject: cTAKES Groovy...

Tim had a good end user use case:
I just want to use the ctakes constituency parser and output the tree text to
console.
So I was inspired by Richard's Groovy example...
Check out:
http://svn.apache.org/repos/asf/ctakes/sandbox/groovy/parser.groovy

The groovy script will "automagically" download the required
classes, jars, and resources and then run.
No longer requires the user to have any knowledge of UIMA, cTAKES, etc.
Sample:
$ parser.groovy input
Reading from directory: input
patient took 50mg of aspirin for pain in knee.
begin:0 end:48

Pretty cool, 'eh...
--Pei




Re: Proposal: Using JIRA to track and request changes to documentation

2013-12-04 Thread Tim Miller
Sounds like a good idea to me. Good for tracking major issues, 
especially for targeting future releases and making sure we get the 
things done we say we will. Unless there is any way this dramatically 
violates some convention I don't see why not!

Tim

On 12/04/2013 03:11 PM, Andrew McMurry wrote:

Hi all
I'll have an update about the VM situation shortly (positive news) but in the 
meantime I propose a new issue type in JIRA: doc.

The ctakes docs are very good, and James deserves a lot of credit.  
User docs are as important as code, sometimes even more so.
It is therefore appropriate to track how documentation is being updated with 
release versions.

Examples of DOC issues worth tracking in JIRA:
* "Confluence home page still refers to version 3.0 by default"
* "User FAQ should state recommended JVM memory size"
* "User FAQ should point to UMLS setup instructions"

As an added benefit, each time we do a release we can see if the docs need to 
be updated accordingly.
I am *NOT* proposing that every change to documentation requires a JIRA ticket.
But we should have a mechanism to record doc issues.

Do you agree with the proposal?

--AndyMC




Re: cTAKES Groovy...

2013-12-06 Thread Tim Miller
Sure, I checked it in under 
ctakes-constituency-parser/scripts/groovy/parser.groovy per my 
understanding of the thread from a few days ago about where to put these 
things.

Tim

On 12/06/2013 12:03 PM, Masanz, James J. wrote:

Tim, could you check in that change you made to not download the big resources, 
or post it somewhere temporarily?

I'm having this issue when trying to run the  groovy script (I'm on Windows 7, 
if that makes a difference) and having it faster might help debug.

C:\using-groovy> groovy  parser.groovy   test-data-for-groovy
Reading from directory: test-data-for-groovy
Downloading: 
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-core-res/src/main/resources/org/apache/ctakes/core/sentdetect/sd-med-model.zip
Caught: groovy.lang.MissingMethodException: No signature of method: 
java.io.BufferedOutputStream.rightShift() is applicable for argument types: 
(sun.net.www.protocol.http.HttpURLConnection$HttpInputStream) values: 
[sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@74be95bf]
Possible solutions: leftShift(java.lang.Object), 
leftShift(java.io.InputStream), leftShift([B)
groovy.lang.MissingMethodException: No signature of method: 
java.io.BufferedOutputStream.rightShift() is applicable for argument types: 
(sun.net.www.protocol.http.HttpURLConnection$HttpInputStream) values: 
[sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@74be95bf]
Possible solutions: leftShift(java.lang.Object), 
leftShift(java.io.InputStream), leftShift([B)
 at parser.downloadFile(parser.groovy:99)
 at parser.run(parser.groovy:64)

Anyone run into such an error from groovy? Anyone else running groovy on Win7?

-- James


-Original Message-
From: dev-return-2270-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-2270-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Tim Miller
Sent: Wednesday, December 04, 2013 9:09 AM
To: dev@ctakes.apache.org
Subject: Re: cTAKES Groovy...

Very cool. I was noticing that it was downloading the umls resources which the 
parser itself doesn't need -- so I made a change to not grab clinical-pipeline 
and grab directly the things it was getting through that reference and now it 
runs even faster with only a 35M initial download.

I'd like to check in my change -- should we keep working out of sandbox or can 
we maybe put groovy scripts somewhere alongside the projects they belong to? 
Maybe in the scripts/ directory or scripts/groovy, scripts/perl, etc.? Any 
opinions on this?

Tim


On 11/27/2013 12:19 PM, Chen, Pei wrote:

The sample constituency parser printer should be working now...
Just copy and paste the text to parser.groovy and make it executable.
All you should need is groovy installed on your machine.
http://svn.apache.org/repos/asf/ctakes/sandbox/groovy/parser.groovy
$ parser.groovy input
Reading from directory: input
   (TOP (S (NP-SBJ (NN patient)) (VP (VBD took) (NP (NP (NNS 50mg)) (PP
(IN of) (NP (NP (NN aspirin)) (PP (IN for) (NP (NP (NN pain)) (PP-LOC
(IN in) (NP (NN knee)(. .)))

Maybe we could create one that will output UMLS CUI/Codes... and then others 
could easily modify to their needs.

--Pei





Re: cTAKES Groovy...

2013-12-06 Thread Tim Miller
I have it in my .m2 directory timestamped October 2012. I believe the 
most recent versions of grape will look in m2 and grab from there if it 
exists.

Tim

On 12/06/2013 02:14 PM, Masanz, James J. wrote:

Thanks Sean.

Something doesn't seem to be working for me related to getting dependencies.

I did a wget of the parser.groovy that Tim just checked in today.

Then trying to run that groovy script I get this error:

$  groovy parser.groovy  test-data-for-groovy/
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: Error grabbing Grapes -- [unresolved 
dependency: jwnl#jwnl;1.3.3: not found]
java.lang.RuntimeException: Error grabbing Grapes -- [unresolved dependency: 
jwnl#jwnl;1.3.3: not found]

So I tried this (I'm no grape expert but a google search led to this 
suggestion) but it fails:
$ grape -V resolve   jwnl  jwnl 1.3.3

I see the following issue was created by opennlp that looks related
https://issues.apache.org/jira/browse/OPENNLP-510
saying that jwnl:jwnl 1.3.3 is no longer available (!)

What I don't get is why no one else is seeing this error.
Maybe everyone else already had that in their local maven repos? Hard to 
believe though given OPENNLP-510 is from May 2012.

Fyi:

$groovy --version
Groovy Version: 1.8.6 JVM: 1.6.0_27 Vendor: Sun Microsystems Inc. OS: Linux


-Original Message-
From: dev-return-2288-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-2288-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Finan, Sean
Sent: Friday, December 06, 2013 11:19 AM
To: dev@ctakes.apache.org
Subject: RE: cTAKES Groovy...

Aside from a crash course almost 10 years ago, I haven't touched groovy very much.  
However, if you are having issues with "shifts" and files, you can look here:

http://blog.retep.org/category/development/java/groovy/

He defines what he calls shift operators for the file operations.

For all I know this is where Pei got his code, but it might be worth checking 
if anybody runs into errors.

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, December 06, 2013 12:13 PM
To: 'dev@ctakes.apache.org'
Subject: RE: cTAKES Groovy...

FYI, the groovy error I was getting was a typo on my part

I had this:
out >> new URL(url).openStream()
instead of
out << new URL(url).openStream()

so it was trying to do a shift operation of some sort
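
For readers less familiar with Groovy: the stack trace itself lists
leftShift(java.io.InputStream) as the applicable method, so "out << in" pipes
the input stream into the output stream, while ">>" (rightShift) is not
defined for BufferedOutputStream. A minimal plain-Java sketch of the same
copy is below. The class and method names are hypothetical, chosen only for
illustration, and this is not the code in parser.groovy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a plain-Java equivalent of piping a stream with "out << in".
final class StreamCopy {
    /** Copies everything readable from in to out and returns the byte count. */
    static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
            out.flush();
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] data = "model bytes".getBytes(StandardCharsets.UTF_8);
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied + " bytes copied");
    }
}
```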


Re: cTAKES Groovy...

2013-12-12 Thread Tim Miller
I was able to replicate the error after removing the findstruct 
directories from my .groovy and .m2 repositories.


On 12/12/2013 12:22 PM, Masanz, James J. wrote:

Shouldn't be firewall - other grapes download fine.

I created a short groovy script to just grab findstructapi - I copy/pasted the 
@grab line from the Groovy Grape section of
http://search.maven.org/#artifactdetails%7Cedu.mit.findstruct%7Cfindstructapi%7C0.0.1%7Cjar

And I still get

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: Error grabbing Grapes -- [download failed: 
edu.mit.findstruct#findstructapi;0.0.1!findstructapi.jar]

java.lang.RuntimeException: Error grabbing Grapes -- [download failed: 
edu.mit.findstruct#findstructapi;0.0.1!findstructapi.jar]

Very odd.

My script is simply:

#!/usr/bin/env groovy
@Grab(group='edu.mit.findstruct', module='findstructapi', version='0.0.1')
import java.io.File;

if(args.length < 1) {
System.out.println("Please specify input directory");
System.exit(1);
}
System.out.println("Input parm is: " + args[0]);
System.exit(0);


-Original Message-
From: dev-return-2305-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-2305-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
William Karl Thompson
Sent: Thursday, December 12, 2013 11:06 AM
To: dev@ctakes.apache.org
Subject: RE: cTAKES Groovy...

Seems unlikely to be the source of your problem, but could it be a firewall 
issue?

-Original Message-
From: Richard Eckart de Castilho [mailto:r...@apache.org]
Sent: Thursday, December 12, 2013 11:04 AM
To: dev@ctakes.apache.org
Subject: Re: cTAKES Groovy...

Might be a temporary network problem. The artifact is on Maven Central:

http://search.maven.org/#artifactdetails%7Cedu.mit.findstruct%7Cfindstructapi%7C0.0.1%7Cjar

-- Richard

On 12.12.2013, at 15:01, "Masanz, James J."  wrote:


The story continues:

The @GrabResolver line from Richard did the trick for jwnl.

But I cleared my .groovy/grapes and  .m2/repository and tried running 
parser.groovy and get the following:

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: Error grabbing Grapes -- [download
failed: edu.mit.findstruct#findstructapi;0.0.1!findstructapi.jar]

java.lang.RuntimeException: Error grabbing Grapes -- [download failed:
edu.mit.findstruct#findstructapi;0.0.1!findstructapi.jar]

FYI. I will take a look but if anyone has any hints, don't be shy


-Original Message-
From: dev-return-2299-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-2299-Masanz.James=mayo@ctakes.apache.org] On
Behalf Of Finan, Sean
Sent: Friday, December 06, 2013 2:38 PM
To: dev@ctakes.apache.org
Subject: RE: cTAKES Groovy...

Good stuff -  Thanks Richard

-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Friday, December 06, 2013 3:30 PM
To: 'dev@ctakes.apache.org'
Subject: RE: cTAKES Groovy...

Thanks Richard! That did the trick

I'll create a JIRA and update the script, including adding a comment that the 
@GrabResolver is only needed for pre-1.5.3 OpenNLP and should be removed when we upgrade 
to 1.5.3+, and I'll update CTAKES-191 "Update Apache OpenNLP dependency to 
1.5.3" with a reminder to update the script.

Trunk of cTAKES still uses 1.5.2-incubating

-Original Message-
From: dev-return-2297-Masanz.James=mayo@ctakes.apache.org
[mailto:dev-return-2297-Masanz.James=mayo@ctakes.apache.org] On
Behalf Of Richard Eckart de Castilho
Sent: Friday, December 06, 2013 2:12 PM
To: dev@ctakes.apache.org
Subject: Re: cTAKES Groovy...

On 06.12.2013, at 18:01, "Masanz, James J."  wrote:


I have not solved my issues on my ubuntu server yet where "Error
grabbing Grapes -- [unresolved dependency: jwnl#jwnl;1.3.3: not found]"

This has also already been fixed in OpenNLP 1.5.3, so there must be some 
dependency on OpenNLP 1.5.(1|2)-incubating.

Anyway, you should be able to fix it by adding this to the beginning of your 
Groovy script, in front of the Grapes:

@GrabResolver(name='opennlp.sf.net',
  root='http://opennlp.sourceforge.net/maven2')

-- Richard





Re: sentence detector newline behavior

2014-01-23 Thread Tim Miller
Just an FYI, a while back I did some of these annotations myself on 
MIMIC to get around this issue. I replaced the newline character with a 
special (non-English) character, then pre-processed ctakes input to 
replace newlines with that character, then did sentence detection, then 
added the newlines back in. I would be happy to share these annotations 
and my code modifications.
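
A minimal sketch of the masking step described above, assuming a one-for-one
character swap so that every offset in the masked text still points at the
same position in the original note. The class name, method names, and the
choice of placeholder character are hypothetical and are not taken from the
actual modifications.

```java
// Hypothetical helper illustrating the newline-masking workaround.
final class NewlinePlaceholder {
    // Assumed placeholder (the pilcrow sign); any character that never
    // occurs in the notes would work equally well.
    static final char PLACEHOLDER = '\u00B6';

    /** Replaces each newline with the placeholder before sentence detection. */
    static String mask(String text) {
        if (text.indexOf(PLACEHOLDER) >= 0) {
            throw new IllegalArgumentException("placeholder already occurs in text");
        }
        return text.replace('\n', PLACEHOLDER);
    }

    /** Puts the newlines back once sentence spans have been computed. */
    static String unmask(String text) {
        return text.replace(PLACEHOLDER, '\n');
    }

    public static void main(String[] args) {
        String note = "The patient has\nbeen prescribed two\nmedications.";
        String masked = mask(note);
        // Lengths match, so sentence offsets found on the masked text
        // can be applied to the original note unchanged.
        System.out.println(masked.length() == note.length());
        System.out.println(unmask(masked).equals(note));
    }
}
```

Because the swap is one character for one character, no offset bookkeeping is
needed when mapping detected sentences back onto the original document.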

Tim


On 01/23/2014 04:01 PM, Karthik Sarma wrote:

We could possibly add some additional datasets for training. MIMIC data
does come to mind -- I can't remember off the top of my head if the MIMIC
dataset has sentences spanning lines or not.





--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksa...@ksarma.com
gchat: ksa...@gmail.com
linkedin: www.linkedin.com/in/ksarma


On Thu, Jan 23, 2014 at 4:22 AM, vijay garla  wrote:


Just to clarify - with the YTEX branch there are 2 sentence splitters - the
original ctakes sentence splitter that splits on newlines, and the ytex sentence
splitter that doesn't.  the changes to other components in the ytex branch
(dependency parser, assertion) work with both sentence splitters.

I think it would be great if the intelligence regarding how to split was in
the opennlp model, but this requires training data.  I don't know what the
training data is, or if the training data has sentences that cross newline
boundaries (if not, won't buy us anything).

vijay




On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:


On my end it looks like my email was reformatted and some of my -newline-
removed in those last examples ...

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James


but then no typical sentence ending punctuation at the end of the line

Gotcha.


So simply using Lines would not suffice in those cases because it
would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence
breaks- in addition to -newline-.  In other words, a Sentence being what
cTakes detects by ignoring CR/LF, and Lines being those Sentences
subdivided by -newline-.  Perhaps "Line" is a horrible moniker.
Regardless, it doesn't solve the problem of inappropriately missing
punctuation.  I was focused a little more on the difference between
persistent auto-line wrapping and structured information like lists,
where the first benefits from Sentence and the second from Line.

"The Patient has
  been prescribed two
  medications."

"Prescriptions:
   Advil
   Tylenol
   No Aspirin"
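
The Sentence/Line idea above can be sketched roughly as follows: take a
sentence span that was detected while ignoring CR/LF and subdivide it at
newlines into trimmed Line spans. This is only an illustration in plain Java;
LineSplitter and splitAtNewlines are hypothetical names, and no cTAKES or UIMA
types are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: subdivides a sentence span into per-line spans.
final class LineSplitter {
    /**
     * Returns {begin, end} offsets of the non-blank lines found inside
     * the span [begin, end) of text, with surrounding whitespace trimmed.
     */
    static List<int[]> splitAtNewlines(String text, int begin, int end) {
        List<int[]> lines = new ArrayList<>();
        int start = begin;
        for (int i = begin; i <= end; i++) {
            if (i == end || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                int b = start;
                int e = i;
                while (b < e && Character.isWhitespace(text.charAt(b))) b++;
                while (e > b && Character.isWhitespace(text.charAt(e - 1))) e--;
                if (b < e) lines.add(new int[] {b, e});
                start = i + 1;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String list = "Prescriptions:\n   Advil\n   Tylenol\n   No Aspirin";
        for (int[] span : splitAtNewlines(list, 0, list.length())) {
            System.out.println(list.substring(span[0], span[1]));
        }
    }
}
```

An annotator that cares about list items could call something like this on
each Sentence, while annotators that want wrapped prose kept together would
use the Sentence as is.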


However, when it comes to the problem that you mention, there is no
benefit to a Line.

"The patient has been seen six times in the past week.  Pain has been
persistent for ten days Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines


"The patient has been seen six times in the past week.
Pain has been persistent for ten days
Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines

"The patient has been seen six times in
  the past week.  Pain has been persistent  for ten days  Advil and

Tylenol

have been prescribed"
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.




-Original Message-
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 3:07 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior


I know there are notes where there are multiple sentences on a line, but
then no typical sentence ending punctuation at the end of the line (or no
punctuation at all at the end of the line). And in those sections, negation
can be important.  So simply using Lines would not suffice in those cases
because it would run together sentences where there are more than one on a
line. And using sentences alone (as found by OpenNLP 1.5) would not suffice
because it would run together sentences from different lines.

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one
direction or the other, we can take a poll of when & where
cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a
Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it
make more sense to have Sentence ignore -newline- and negation detection
itself split the Sentence into line items?  If an annotator is interested
in list items, each of which may be on a distinct -line-, then it can split
up th

Re: sentence detector newline behavior

2014-01-27 Thread Tim Miller
OK, with the most recent version I am able to replicate the performance 
I was getting before. Thanks a lot Jörn!


Assuming this is in the next incremental release of opennlp, how quickly 
can we get a re-trained model into cTAKES? I heard from a researcher at 
AMIA who tried cTAKES and because of this bug in the way we handle 
sentences was trying to find an outside sentence detector as a 
preprocess to cTAKES, and frankly that is insane. We should be able to 
get something this simple right. And I think this is the kind of thing 
that can leave new users scratching their heads and doubting our overall 
competence.


James, I believe you are usually the one who rebuilds the models? What 
would be the best way to incorporate the data I have that has some 
instances of non-sentence terminating newlines?


Tim


On 01/27/2014 06:10 AM, Jörn Kottmann wrote:

On 01/26/2014 11:29 PM, Miller, Timothy wrote:

Yes, this fixes the whitespace sentence issue but the evaluation issue
remains. I believe the problem is in SentenceSampleStream, where in the
following block the whitespace trim happens before the  character is
replaced with the \n character. So test sentences that ended with 
will be one character longer than they should be.


>   sentence = sentence.trim();
>   sentence = replaceNewLineEscapeTags(sentence);
>   sentencesString.append(sentence);
>   int end = sentencesString.length();
>   sentenceSpans.add(new Span(begin, end));
>   sentencesString.append(' ');


Yes, that must be the issue. During training the newline is included 
in the span, and during detection the whitespace remover creates a span 
without the newline char.
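
The off-by-one described above can be reproduced outside OpenNLP with a small
sketch. The escape tag name is missing from the message as archived, so the
"<LF>" used here is an assumption, and TrimOrderDemo with its helper methods is
hypothetical. The second ordering is shown only for contrast; the change that
was actually committed makes the evaluator ignore whitespace differences.

```java
// Hypothetical demo of why trimming before unescaping lengthens the span.
final class TrimOrderDemo {
    // Assumed escape tag standing for a newline in the training file.
    static final String NEWLINE_TAG = "<LF>";

    static String replaceNewLineEscapeTags(String s) {
        return s.replace(NEWLINE_TAG, "\n");
    }

    /** Order from the quoted block: trim first, then unescape. */
    static int spanLengthTrimFirst(String sentence) {
        // trim() cannot remove the newline because it is still a tag here.
        return replaceNewLineEscapeTags(sentence.trim()).length();
    }

    /** Contrast case: unescape first, so trim() also drops the trailing newline. */
    static int spanLengthUnescapeFirst(String sentence) {
        return replaceNewLineEscapeTags(sentence).trim().length();
    }

    public static void main(String[] args) {
        String sample = "He took aspirin" + NEWLINE_TAG;
        System.out.println(spanLengthTrimFirst(sample));
        System.out.println(spanLengthUnescapeFirst(sample));
    }
}
```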


I suggest that the evaluator just ignore whitespace differences 
between sentences. My test case then has the expected performance numbers.

What do you think?

Anyway, I committed the change. Please give it a try.

Jörn




Re: sentence detector newline behavior

2014-01-27 Thread Tim Miller


On 01/27/2014 02:35 PM, Masanz, James J. wrote:

Tim, is the training data something you can share publicly? Or privately?  I 
can't publicly share the data that has been used to train the sentence 
detector, I can only share the models that get built. And you can't build a 
model from an existing model + more data, you need all the training data 
together.


It is from the MIMIC corpus which I definitely can't share publicly, but 
it's worth looking into whether I could share it privately with another 
person who has a signed data use agreement.



Regarding how quickly we can get this out there, I can train a new sentence 
detector in a day or two. But that's just the first step - to really 
incorporate this, I would suggest this be a point release.   We would need a 
release manager for that.  Right now I don't have time for that.  I haven't 
heard a consensus saying whether this should be the new behavior.

Yeah I suppose this is subject to the scale of the changes we make.

From what I remember we are going to need code changes to make optional the 
code that splits at line breaks, or was your test replacing the existing cTAKES 
sentence detector and just using OpenNLP directly?


That is a good point, and something I was wondering about. Having now 
looked at both the ctakes and opennlp code for the sentence splitter it 
seems like there is a lot of overlap. I would've thought it was just a 
matter of converting annotations into our type system. So I'm curious if 
there is some justification for why there seems to be duplication (or if 
I'm hallucinating it).


Tim




-- James





Re: sentence detector newline behavior

2014-01-27 Thread Tim Miller


On 01/27/2014 06:03 PM, vijay garla wrote:

For clarity, I'd like to stress that the opennlp sentence model distributed
with ctakes today does 'work' with sentences that span newlines - as I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model).
Well, it depends on your definition of "works" :). It doesn't throw an 
exception but it automatically splits sentences at newlines. It is 
relatively normal to have text that "wraps" at ~80 characters with 
newlines added. It will look like this (this is made up text):


   The patient was having difficulty
   getting out of bed and was taking
   aspirin in the morning. He has
   returned today for a prescription
   for something stronger.


This style will cause multiple sentence fragments to be encoded which, 
as we've seen, will wreak havoc with negation detection.




I believe the improvements Tim and others are suggesting are for a new
sentence model + feature representation that takes advantage of newlines as
features.
To be precise, I'm proposing adding newlines to the set of characters 
that are candidates for end of sentences (i.e. decision points for the 
classifier), instead of having the hard constraint of splitting at all 
newlines.
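
To make the distinction concrete, here is a rough sketch in plain Java. With
the hard constraint, the wrapped example above is cut into one fragment per
line. With newlines treated as candidates, each newline merely becomes one
more position at which a classifier is asked whether a sentence ends there.
EosCandidates and its methods are hypothetical names, and no classifier is
included; this only enumerates the decision points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of hard newline splits vs. candidate positions.
final class EosCandidates {
    /** Offsets where a classifier would be asked "does a sentence end here?". */
    static List<Integer> candidateOffsets(String text, String eosChars) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0) offsets.add(i);
        }
        return offsets;
    }

    /** The hard constraint: every non-blank line becomes its own fragment. */
    static int fragmentsFromHardSplit(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            if (!line.trim().isEmpty()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String note = "The patient was having difficulty\n"
                + "getting out of bed and was taking\n"
                + "aspirin in the morning. He has\n"
                + "returned today for a prescription\n"
                + "for something stronger.";
        System.out.println("hard split fragments: " + fragmentsFromHardSplit(note));
        System.out.println("candidates with newline: " + candidateOffsets(note, ".?!\n").size());
    }
}
```

In the wrapped example, a well-trained model would be expected to reject the
four newline candidates and accept the two periods, giving two sentences
instead of five fragments.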




Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it.  To that
end:
* If we upgrade to the newest version of opennlp, will the old model work
(and produce the same results)?
I definitely think we shouldn't release a new model that doesn't perform 
well in some absolute sense. But I think this change generalizes the old 
model, so that given that it meets that absolute standard a user should 
only see improvements. Specifically they should see fewer incorrect 
sentence fragments if they give us text with newlines in mid-sentence. 
IMHO, that kind of change doesn't require 'backwards compatibility' per 
se. Maybe we can make it an option to have a hard constraint that breaks 
on newlines but I think it should default to not do so.



* If a contributor trains a new model that uses a different feature
representation, I believe that should go into a new Sentence Detector
AnalysisEngine (or the same AE but with different configuration
parameters), so users have a choice between the old and the new.
Yeah, I think having configuration parameters is fine as long as we 
have smart defaults.


Thanks for your input VJ.
Tim


-vj


On Mon, Jan 27, 2014 at 1:09 PM, digital paula wrote:




Tim,

I just had to chime in on a comment you made. My deadline has been
extended a bit on my pressing issue but I do intend to get back to testing
per VJ's fix or maybe another fix is in the works based on latest
emails...I need to read them again since a lot has been stated on the issue.

Okay, as a new user (working w/cTAKES since October) I have never thought
what you had stated:

  "And I think this is the kind of thing that can leave new users
scratching their heads and doubting our overall competence."

Yeah, the sentence-spanning-newline issue was a problem so I just brought
attention to it by my post of inquiry earlier this month on VJ's fix from
last month and worked around it with treating narrative as one string.

Anyone who's looked at the code would appreciate and acknowledge that
cTAKES is a powerful and complex application.  I'm overall impressed with
it and I intend to continue to use it, improve it, and grow with it.  I've
been delving deeper into cTAKES on the machine learning aspect...I'm
struggling a bit with it and if anything I scratch my head and doubt my
competence. ;-)

Regards,
Paula



Re: sentence detector newline behavior

2014-01-27 Thread Tim Miller


On 01/27/2014 06:36 PM, vijay garla wrote:

The opennlp model doesn't split on newlines - that is being done in the
analysis engine. An alternative implementation of the analysis engine in
the ytex branch is available that does not split on newlines.

It works as in really really works.  The hard constraint you refer to is
not in the opennlp model.
Oh yeah for sure. I'm talking about fixing the ctakes issue though so 
that the default out of the box behavior is what a user expects. If that 
means using the ytex sentence splitter as the default that would be fine 
with me.

Tim



Vj



On Monday, January 27, 2014, Tim Miller <
timothy.mil...@childrens.harvard.edu> wrote:


On 01/27/2014 06:03 PM, vijay garla wrote:


For clarity, I'd like to stress that the opennlp sentence model distributed
with ctakes today does 'work' with sentences that span newlines - as I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model).


Well, it depends on your definition of "works" :). It doesn't throw an
exception but it automatically splits sentences at newlines. It is
relatively normal to have text that "wraps" at ~80 characters with newlines
added. It will look like this (this is made up text):

The patient was having difficulty
getting out of bed and was taking
aspirin in the morning. He has
returned today for a prescription
for something stronger.


This style will cause multiple sentence fragments to be encoded which, as
we've seen, will wreak havoc with negation detection.


I believe the improvements Tim and others are suggesting are for a new
sentence model + feature representation that takes advantage of newlines as
features.


To be precise, I'm proposing adding newlines to the set of characters that
are candidates for end of sentences (i.e. decision points for the
classifier), instead of having the hard constraint of splitting at all
newlines.


Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it. To that end:
* If we upgrade to the newest version of opennlp, will the old model work
(and produce the same results)?


I definitely think we shouldn't release a new model that doesn't perform
well in some absolute sense. But I think this change generalizes the old
model, so that, provided it meets that absolute standard, a user should
only see improvements. Specifically, they should see fewer incorrect
sentence fragments if they give us text with newlines in mid-sentence.
IMHO, that kind of change doesn't require 'backwards compatibility' per se.
Maybe we can make it an option to have a hard constraint that breaks on
newlines, but I think it should default to not doing so.

* If a contributor trains a new model that uses a different feature
representation, I believe that should go into a new Sentence Detector
AnalysisEngine (or the same AE but with different configuration
parameters), so users have a choice between the old and the new.


Yeah, I think having configuration parameters is fine as long as we have
smart defaults.
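
As a rough illustration of a parameter with a smart default (a Python sketch, not the cTAKES API): `SentenceDetectorConfig`, `break_on_newlines` and `segment` are hypothetical names, and the regular expression is only a toy stand-in for the model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceDetectorConfig:
    # Hypothetical parameter; the default is the proposed behavior
    # (no forced split at newlines).
    break_on_newlines: bool = False


def segment(text, config=SentenceDetectorConfig()):
    """Toy segmenter: the old hard constraint is opt-in via the config."""
    if config.break_on_newlines:
        blocks = text.split("\n")          # old behavior: newline always splits
    else:
        blocks = [" ".join(text.split())]  # new default: newlines act as spaces
    sentences = []
    for block in blocks:
        # Toy rule standing in for the model: split after ., ? or !
        # when followed by whitespace.
        parts = re.split(r"(?<=[.?!])\s+", block.strip())
        sentences.extend(part for part in parts if part)
    return sentences


text = "He was taking\naspirin. He returned."
default_result = segment(text)
legacy_result = segment(text, SentenceDetectorConfig(break_on_newlines=True))
```

Users who depend on the old behavior set the flag explicitly; everyone else gets the version without fragments and never has to touch the configuration.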

Thanks for your input VJ.
Tim

  -vj


On Mon, Jan 27, 2014 at 1:09 PM, digital paula 
wrote:



Tim,

I just had to chime in on a comment you made. My deadline has been
extended a bit on my pressing issue but I do intend to get back to testing
per VJ's fix or maybe another fix is in the works based on latest
emails...I need to read them again since a lot has been stated on the
issue.

Okay, as a new user (working w/cTAKES since October) I have never thought
what you had stated:

   "And I think this is the kind of thing that can leave new users
scratching their heads and doubting our overall competence."

Yeah, the sentence-spanning-newline issue was a problem, so I brought
attention to it with my inquiry earlier this month about VJ's fix from
last month, and worked around it by treating the narrative as one string.

Anyone who's looked at the code would appreciate and acknowledge that
cTAKES is a powerful and complex application.  I'm overall impressed with
it and I intend to continue to use it, improve it, and grow with it.  I've
been delving deeper into cTAKES on the machine learning aspect...I'm
struggling a bit with it and if anything I scratch my head and doubt my
competence. ;-)

Regards,
Paula

  Date: Mon, 27 Jan 2014 09:52:00 -0500
From: timothy.mil...@childrens.harvard.edu
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

OK, with the most recent version I am able to replicate the performance
I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of opennlp, how quickly
can we get a re-trained model into cTAKES? I heard from a researcher at
AMIA who tried cTAKES and because of this bug in the way we handle
sentences was tr

Re: AssertionDepUtils - missing import - in GenerateDependencyRepresentation

2014-02-07 Thread Tim Miller
OK thanks -- that sounds right but that's very frustrating because I 
tried an svn up on another workspace after I checked it in and it worked 
fine!

Tim

On 02/07/2014 03:04 PM, Masanz, James J. wrote:

Hi Tim,
I'm getting a compile error (unable to resolve import). It looks like 
org.apache.ctakes.assertion.util.AssertionDepUtils needs to be checked in. 
It is used by GenerateDependencyRepresentation.java.

Thanks,
James






Re: AssertionDepUtils - missing import - in GenerateDependencyRepresentation

2014-02-07 Thread Tim Miller
OK, now I'm seeing in my clean room that there are in fact a number of 
errors now that I 'fixed' that one. Please bear with me as I figure out 
the smallest compilable unit I can check in.

Tim

On 02/07/2014 03:15 PM, Tim Miller wrote:
OK thanks -- that sounds right but that's very frustrating because I 
tried an svn up on another workspace after I checked it in and it 
worked fine!

Tim

On 02/07/2014 03:04 PM, Masanz, James J. wrote:

Hi Tim,
I'm getting a compile error (unable to resolve import). It looks like 
org.apache.ctakes.assertion.util.AssertionDepUtils needs to be 
checked in. It is used by GenerateDependencyRepresentation.java.


Thanks,
James








training data for sentence detector

2014-02-07 Thread Tim Miller

James,
We were discussing the sentence detector thing in person here the other 
day and Pei had a thought that depending on what sources you were using 
for training the sentence detector, we might be able to do something 
equivalent here in Boston by using SHARP, THYME, MIPACQ data which are 
largely from Mayo and probably similar to what you use, then augmenting 
with the little bit of MIMIC that I annotated. I don't know how that 
compares size-wise to the dataset that you are using. Is it quite large, 
or do you think we will be fine if we use data derived from those other 
projects? What do you think of this plan? Anyone else?

Tim