Re: sentence detector newline behavior

2014-01-29 Thread Jörn Kottmann

On 01/27/2014 08:44 PM, Tim Miller wrote:


That is a good point, and something I was wondering about. Having now 
looked at both the ctakes and opennlp code for the sentence splitter 
it seems like there is a lot of overlap. I would've thought it was 
just a matter of converting annotations into our type system. So I'm 
curious if there is some justification for why there seems to be 
duplication (or if I'm hallucinating it). 


It should be possible (and if not we should make it possible) to 
directly use the opennlp-uima integration. It supports dynamic types 
which can be mapped in the descriptor.
This would also give you a smooth transition, your existing integration 
could be labeled as deprecated and be removed in one of the future releases.


Jörn


RE: sentence detector newline behavior

2014-01-29 Thread Chen, Pei
+1
There's an example of the configs here :)
https://issues.apache.org/jira/browse/CTAKES-98

I think we should be able to use OpenNLP's Sentence Annotator directly if we no 
longer need the custom newline rule(s) 
[Or if we find that a fixed rule is still required, perhaps OpenNLP can support 
it via config as well- there doesn't seem to be anything cTAKES specific about 
it].
Pending the results of Tim's retraining/evaluation of the new models??

--Pei


Re: sentence detector newline behavior

2014-01-29 Thread Jörn Kottmann

On 01/27/2014 03:52 PM, Tim Miller wrote:
OK, with the most recent version I am able to replicate the 
performance I was getting before. Thanks a lot Jörn!


Assuming this is in the next incremental release of opennlp, how 
quickly can we get a re-trained model into cTAKES?



I am currently working full time on the next release, and will very likely 
be able to keep the pace until it is out.
We are doing bigger changes this time and the next version will be 
1.6.0. One of the interesting changes for you
here could be the pluggable machine learning support (e.g. support for 
liblinear, mallet etc.).


Jörn


Re: sentence detector newline behavior

2014-01-27 Thread Jörn Kottmann

On 01/26/2014 11:29 PM, Miller, Timothy wrote:

Yes, this fixes the whitespace sentence issue but the evaluation issue
remains. I believe the problem is in SentenceSampleStream, where in the
following block the whitespace trim happens before the <LF> tag is
replaced with the \n character. So test sentences that ended with <LF>
will be one character longer than they should be.


   sentence = sentence.trim();
   sentence = replaceNewLineEscapeTags(sentence);
   sentencesString.append(sentence);
   int end = sentencesString.length();
   sentenceSpans.add(new Span(begin, end));
   sentencesString.append(' ');


Yes, that must be the issue. During training the new line is included in 
the span, and during detection the white space remover creates a span 
without the new line char.

I suggest that the evaluator just ignore white space differences 
between sentences. My test case then has the expected performance numbers.
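
[Editor's sketch] The suggestion above, comparing sentence spans while ignoring surrounding white space, could look roughly like the following. This is a minimal illustration and not the change that was committed to OpenNLP; the class and method names are hypothetical, and spans are plain {begin, end} pairs rather than OpenNLP Span objects.

```java
import java.util.Arrays;

final class WhitespaceInsensitiveSpans {

    // Shrinks [begin, end) so it neither starts nor ends on whitespace.
    // An all-whitespace range collapses to begin == end.
    static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans count as the same sentence if they agree after trimming.
    static boolean sameIgnoringWhitespace(CharSequence text, int[] ref, int[] pred) {
        return Arrays.equals(trimSpan(text, ref[0], ref[1]),
                             trimSpan(text, pred[0], pred[1]));
    }

    public static void main(String[] args) {
        String text = "Pain persists.\nAdvil given.";
        int[] reference = {0, 15}; // includes the trailing newline
        int[] predicted = {0, 14}; // newline already removed
        System.out.println(sameIgnoringWhitespace(text, reference, predicted)); // true
    }
}
```

With a comparison like this, a reference span that includes the trailing newline and a predicted span that excludes it are scored as a match.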

What do you think?

Anyway, I committed the change. Please give it a try.

Jörn


Re: sentence detector newline behavior

2014-01-27 Thread Tim Miller
OK, with the most recent version I am able to replicate the performance 
I was getting before. Thanks a lot Jörn!


Assuming this is in the next incremental release of opennlp, how quickly 
can we get a re-trained model into cTAKES? I heard from a researcher at 
AMIA who tried cTAKES and because of this bug in the way we handle 
sentences was trying to find an outside sentence detector as a 
preprocess to cTAKES, and frankly that is insane. We should be able to 
get something this simple right. And I think this is the kind of thing 
that can leave new users scratching their heads and doubting our overall 
competence.


James, I believe you are usually the one who rebuilds the models? What 
would be the best way to incorporate the data I have that has some 
instances of non-sentence terminating newlines?


Tim






RE: sentence detector newline behavior

2014-01-27 Thread digital paula



Tim,
 
I just had to chime in on a comment you made. My deadline has been extended 
a bit on my pressing issue but I do intend to get back to testing per VJ's fix 
or maybe another fix is in the works based on latest emails...I need to read 
them again since a lot has been stated on the issue. 
 
Okay, as a new user (working w/cTAKES since October) I have never thought what 
you had stated:
 
 And I think this is the kind of thing that can leave new users scratching 
their heads and doubting our overall competence.  
 
Yeah, the sentence-spanning-newline issue was a problem so I just brought 
attention to it by my post of inquiry earlier this month on VJ's fix from last 
month and worked around it with treating narrative as one string.  
 
Anyone who's looked at the code would appreciate and acknowledge that cTAKES is 
a powerful and complex application.  I'm overall impressed with it and I intend 
to continue to use it, improve it, and grow with it.  I've been delving deeper 
into cTAKES on the machine learning aspect...I'm struggling a bit with it and 
if anything I scratch my head and doubt my competence. ;-)  
 
Regards,
Paula
 
 

  

RE: sentence detector newline behavior

2014-01-27 Thread Masanz, James J.

Tim, is the training data something you can share publicly? Or privately?  I 
can't publicly share the data that has been used to train the sentence 
detector, I can only share the models that get built. And you can't build a 
model from an existing model + more data, you need all the training data 
together.

Regarding how quickly we can get this out there, I can train a new sentence 
detector in a day or two. But that's just the first step - to really 
incorporate this, I would suggest this be a point release.   We would need a 
release manager for that.  Right now I don't have time for that.  I haven't 
heard a consensus saying whether this should be the new behavior. 

From what I remember we are going to need code changes to make optional the 
code that splits at line breaks. Or was your test replacing the existing 
cTAKES sentence detector and just using OpenNLP directly?

-- James




RE: sentence detector newline behavior

2014-01-27 Thread Masanz, James J.
I didn't write the cTAKES sentence detector so I can't answer definitively but 
I do know it was originally written using what is now a pretty old version of 
OpenNLP and needed some things you couldn't get from the out-of-the-box OpenNLP 
at the time. From  what I remember the things specific to it were 
- the list of end of sentence candidate characters 
- and the handling of newlines

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Monday, January 27, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior


On 01/27/2014 02:35 PM, Masanz, James J. wrote:
 Tim, is the training data something you can share publicly? Or privately?  I 
 can't publicly share the data that has been used to train the sentence 
 detector, I can only share the models that get built. And you can't build a 
 model from an existing model + more data, you need all the training data 
 together.

It is from the MIMIC corpus which I definitely can't share publicly, but 
it's worth looking into whether I could share it privately with another 
person who has a signed data use agreement.

 Regarding how quickly we can get this out there, I can train a new sentence 
 detector in a day or two. But that's just the first step - to really 
 incorporate this, I would suggest this be a point release.   We would need a 
 release manager for that.  Right now I don't have time for that.  I haven't 
 heard a consensus saying whether this should be the new behavior.

Yeah I suppose this is subject to the scale of the changes we make.

 From what I remember we are going to need code changes to make optional the 
 code that splits at line breaks, or was your test replacing the existing 
 cTAKES sentence detector and just using OpenNLP directly.

That is a good point, and something I was wondering about. Having now 
looked at both the ctakes and opennlp code for the sentence splitter it 
seems like there is a lot of overlap. I would've thought it was just a 
matter of converting annotations into our type system. So I'm curious if 
there is some justification for why there seems to be duplication (or if 
I'm hallucinating it).

Tim






Re: sentence detector newline behavior

2014-01-27 Thread vijay garla
For clarity, I'd like to stress that the opennlp sentence model distributed
with ctakes today does 'work' with sentences that span newlines - as I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model).

I believe the improvements Tim and others are suggesting are for a new
sentence model + feature representation that takes advantage of newlines as
features.

Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it.  To that
end:
* If we upgrade to the newest version of opennlp, will the old model work
(and produce the same results)?
* If a contributor trains a new model that uses a different feature
representation, I believe that should go into a new Sentence Detector
AnalysisEngine (or the same AE but with different configuration
parameters), so users have a choice between the old and the new.

-vj


 





Re: sentence detector newline behavior

2014-01-27 Thread Tim Miller


On 01/27/2014 06:03 PM, vijay garla wrote:

For clarity, I'd like to stress that the opennlp sentence model distributed
with ctakes today does 'work' with sentences that span newlines - as I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model).

Well, it depends on your definition of works :). It doesn't throw an 
exception but it automatically splits sentences at newlines. It is 
relatively normal to have text that wraps at ~80 characters with 
newlines added. It will look like this (this is made up text):


   The patient was having difficulty
   getting out of bed and was taking
   aspirin in the morning. He has
   returned today for a prescription
   for something stronger.


This style will cause multiple sentence fragments to be encoded which, 
as we've seen, will wreak havoc with negation detection.




I believe the improvements Tim and others are suggesting are for a new
sentence model + feature representation that takes advantage of newlines as
features.

To be precise, I'm proposing adding newlines to the set of characters 
that are candidates for end of sentences (i.e. decision points for the 
classifier), instead of having the hard constraint of splitting at all 
newlines.
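
[Editor's sketch] To make the distinction concrete, here is a small self-contained comparison of the two behaviors: a hard split at every newline versus treating newlines as candidate end-of-sentence characters that a classifier may accept or reject. The class, the method names, and the toy decision rule are all hypothetical; a trained OpenNLP model would take the place of the rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class NewlineCandidateSketch {

    // Hard constraint: every newline ends a sentence.
    static List<String> splitAtEveryNewline(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("\n")) {
            if (!piece.trim().isEmpty()) {
                out.add(piece.trim());
            }
        }
        return out;
    }

    // Candidate approach: characters in eosChars are only decision points;
    // endsSentenceAt stands in for the classifier's yes/no answer.
    static List<String> splitAtAcceptedCandidates(String text, String eosChars,
                                                  IntPredicate endsSentenceAt) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0 && endsSentenceAt.test(i)) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    out.add(sentence);
                }
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            out.add(tail);
        }
        return out;
    }

    // Toy rule (an assumption, not a real model): a period always ends a
    // sentence; a newline does so only if the next line starts in upper case.
    static boolean toyDecision(String text, int i) {
        char c = text.charAt(i);
        if (c == '.') {
            return true;
        }
        return c == '\n' && i + 1 < text.length()
                && Character.isUpperCase(text.charAt(i + 1));
    }

    public static void main(String[] args) {
        String text = "The patient was having difficulty\n"
                + "getting out of bed and was taking\n"
                + "aspirin in the morning. He has\n"
                + "returned today for a prescription\n"
                + "for something stronger.";
        System.out.println(splitAtEveryNewline(text).size()); // 5 fragments
        System.out.println(splitAtAcceptedCandidates(
                text, ".\n", i -> toyDecision(text, i)).size()); // 2 sentences
    }
}
```

On the made-up note above, the hard split yields five fragments, while the candidate approach with this toy rule yields the two intended sentences.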




Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it.  To that
end:
* If we upgrade to the newest version of opennlp, will the old model work
(and produce the same results)?

I definitely think we shouldn't release a new model that doesn't perform 
well in some absolute sense. But I think this change generalizes the old 
model, so that given that it meets that absolute standard a user should 
only see improvements. Specifically they should see fewer incorrect 
sentence fragments if they give us text with newlines in mid-sentence. 
IMHO, that kind of change doesn't require 'backwards compatibility' per 
se. Maybe we can make it an option to have a hard constraint that breaks 
on newlines but I think it should default to not do so.



* If a contributor trains a new model that uses a different feature
representation, I believe that should go into a new Sentence Detector
AnalysisEngine (or the same AE but with different configuration
parameters), so users have a choice between the old and the new.

Yeah, I think having configuration parameters is fine as long as we 
have smart defaults.


Thanks for your input VJ.
Tim



Re: sentence detector newline behavior

2014-01-26 Thread Jörn Kottmann

On 01/25/2014 10:03 PM, Miller, Timothy wrote:

On 01/25/2014 12:24 PM, Jörn Kottmann wrote:

The code which computes the spans tries to remove white space from it.
Removing the white space from a whitespace only sentence is causing
the exception you are seeing. Which response would you expect from
the sentence detector? Should a white space only sentence be returned?

I would say no.


In case a sentence is terminated by a new line. Should the new line
char be included in the sentence span or not?

I would also say no.


I made a quick patch for this issue -- now it runs but scores really
poorly compared to my model file (30 vs 75 or so). I suspect something
is wrong with the evaluation, the spans being slightly off somehow.


The evaluation should ignore white spaces. I have now committed my fix; it 
would be nice if you could test it.

There might still be something wrong. In my test data I replaced all 
question marks with white spaces, and the result is slightly worse than 
with the original data.

Jörn


Re: sentence detector newline behavior

2014-01-26 Thread Miller, Timothy

On 01/26/2014 09:59 AM, Jörn Kottmann wrote:

 The evaluation should ignore white spaces. I committed now my fix, it 
 would be nice if you can
 test it.

 There might be still something wrong. In my test data I replaced all 
 question marks with white spaces, and the result
 is slightly worse than with the original data.

 Jörn

Yes, this fixes the whitespace sentence issue but the evaluation issue
remains. I believe the problem is in SentenceSampleStream, where in the
following block the whitespace trim happens before the <LF> tag is
replaced with the \n character. So test sentences that ended with <LF>
will be one character longer than they should be.

   sentence = sentence.trim();
   sentence = replaceNewLineEscapeTags(sentence);
   sentencesString.append(sentence);
   int end = sentencesString.length();
   sentenceSpans.add(new Span(begin, end));
   sentencesString.append(' ');
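
[Editor's sketch] The ordering effect described above can be reproduced in isolation. The helper below is a hypothetical stand-in for replaceNewLineEscapeTags, and the "<LF>"/"<CR>" tag spellings are assumptions; the reversed order is shown only for comparison and is not presented as the fix that was applied.

```java
final class TrimOrderSketch {

    // Hypothetical stand-in for replaceNewLineEscapeTags; the tag spellings
    // "<LF>" and "<CR>" are assumptions made for this illustration.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    // Same order as the quoted block: trim, then expand tags. A trailing tag
    // is not whitespace, so trim() keeps it and it becomes a real newline.
    static String trimThenReplace(String line) {
        return replaceNewLineEscapeTags(line.trim());
    }

    // Reversed order, for comparison only: the expanded newline is trimmed.
    static String replaceThenTrim(String line) {
        return replaceNewLineEscapeTags(line).trim();
    }

    public static void main(String[] args) {
        String line = "No acute distress.<LF>";
        System.out.println(trimThenReplace(line).length()); // 19
        System.out.println(replaceThenTrim(line).length()); // 18
    }
}
```

The one-character difference is the trailing newline that ends up inside the reference span.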



Re: sentence detector newline behavior

2014-01-25 Thread Miller, Timothy
Thanks Joern,
I'll try it. My understanding is I just need to give it my training
data, with the special character I used replaced with the literal string
<LF> and each line in the file is an example sentence.

Just thinking about the cTAKES wrapper -- do your changes make it so
that we wouldn't need to add the special characters (<LF>, <CR>) to a
document within the cTAKES sentence detector wrapper? It sounds like we
would need to add <CR> and <LF> to our eosChars value. It's early (for
my brain), but I wonder whether that could be a default on the opennlp end?

Tim

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:
 The changes are now committed.

 To train a model which can recognize new lines, the new lines must be encoded
 with the <CR> or <LF> tags (or both).

 The same tags are used to pass in the eos chars to the command line trainer.
 For example:
 SentenceDetectorCrossValidator  -lang en -data /home/xyz/eos-cr.all 
 -encoding ISO-8859-15 -eosChars .!?:<LF>
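
[Editor's sketch] Preparing training data in this form might look like the following: one sentence per line, with any newline inside or at the end of a sentence written as a tag. The class and method names are hypothetical, and the tag spellings "<LF>" and "<CR>" are assumed here rather than verified against the OpenNLP source.

```java
import java.util.Arrays;
import java.util.List;

final class NewlineTagEncoder {

    // Replaces raw line breaks with tags so each sentence fits on one line.
    // The tag spellings are assumptions made for this illustration.
    static String encode(String sentence) {
        return sentence.replace("\r", "<CR>").replace("\n", "<LF>");
    }

    // Builds the training file content: one encoded sentence per line.
    static String toTrainingData(List<String> sentences) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sentences.size(); i++) {
            if (i > 0) {
                sb.append('\n');
            }
            sb.append(encode(sentences.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "He has\nreturned today.\n",
                "Pain persists.");
        System.out.println(toTrainingData(sentences));
    }
}
```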

 Tim, it would be nice if you could test this with your annotations.

 Jörn

 On 01/23/2014 10:06 PM, Tim Miller wrote:
 Just an FYI, a while back I did some of these annotations myself on 
 MIMIC to get around this issue. I replaced the newline character with 
 a special (non-English) character, then pre-processed ctakes input to 
 replace newlines with that character, then did sentence detection, 
 then added the newlines back in. I would be happy to share these 
 annotations and my code modifications.
 Tim
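
[Editor's sketch] The workaround described above (swap newlines for a placeholder, run sentence detection, then restore them) relies on a one-for-one character replacement, so offsets found on the modified text still index the original. The placeholder character and the toy detector below are hypothetical; the real pipeline would call the cTAKES/OpenNLP sentence detector instead.

```java
import java.util.ArrayList;
import java.util.List;

final class PlaceholderRoundTrip {

    // Hypothetical placeholder; any character absent from the input would do.
    static final char PLACEHOLDER = '\u00A7';

    static String hideNewlines(String text) {
        return text.replace('\n', PLACEHOLDER);
    }

    static String restoreNewlines(String text) {
        return text.replace(PLACEHOLDER, '\n');
    }

    // Toy detector standing in for the real one: ends a sentence at each period.
    static List<int[]> toySpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '.') {
                spans.add(new int[] {start, i + 1});
                start = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String original = "He has\nreturned today. Pain persists.";
        String hidden = hideNewlines(original);
        // Offsets computed on the hidden text apply unchanged to the original.
        for (int[] span : toySpans(hidden)) {
            System.out.println(original.substring(span[0], span[1]).trim());
        }
    }
}
```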


 On 01/23/2014 04:01 PM, Karthik Sarma wrote:
 We could possibly add some additional datasets for training. MIMIC data
 does come to mind -- I can't remember off the top of my head if the 
 MIMIC
 dataset has sentences spanning lines or not.






Re: sentence detector newline behavior

2014-01-25 Thread Miller, Timothy
I'm running into one issue: it gets tripped up on sentences with
line-ending spaces.  I could easily remove them with a script, but by
default they are in there. It happens when a sentence example ends:

...BILAT HEMATOMAS.  <LF>

(There is a period, then 2 spaces, then the line feed character.) I am
pretty sure this is the root cause because when I fix this example to be .<LF>
it gets tripped up in another place instead (with the same error). The
specific error I get is this:

 Exception in thread "main" java.lang.IllegalArgumentException: start
 index must not be larger than end index: start=8842, end=8839
 at opennlp.tools.util.Span.<init>(Span.java:47)
 at opennlp.tools.util.Span.<init>(Span.java:63)
 at opennlp.tools.sentdetect.SentenceDetectorME.sentPosDetect(SentenceDetectorME.java:244)
 at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:56)
 at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:1)
 at opennlp.tools.util.eval.Evaluator.evaluateSample(Evaluator.java:82)
 at opennlp.tools.util.eval.Evaluator.evaluate(Evaluator.java:109)
 at opennlp.tools.sentdetect.SDCrossValidator.evaluate(SDCrossValidator.java:130)
 at opennlp.tools.cmdline.sentdetect.SentenceDetectorCrossValidatorTool.run(SentenceDetectorCrossValidatorTool.java:78)
 at opennlp.tools.cmdline.CLI.main(CLI.java:214)

I thought I'd let you know since you might be able to fix it in 2
minutes but if I don't hear from you today I'll probably take a look at
it later today to try to fix it myself.
Tim

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:
 The changes are now committed.

 To train a model which can recognize new lines the new lines must be encoded
 with the <CR> or <LF> tags (or both).

 The same tags are used to pass in the eos chars to the command line trainer.
 For example:
 SentenceDetectorCrossValidator -lang en -data /home/xyz/eos-cr.all
 -encoding ISO-8859-15 -eosChars .!?:<LF>

 Tim, it would be nice if you could test this with your annotations.

 Jörn

 On 01/23/2014 10:06 PM, Tim Miller wrote:
 Just an FYI, a while back I did some of these annotations myself on 
 MIMIC to get around this issue. I replaced the newline character with 
 a special (non-English) character, then pre-processed ctakes input to 
 replace newlines with that character, then did sentence detection, 
 then added the newlines back in. I would be happy to share these 
 annotations and my code modifications.
 Tim



Re: sentence detector newline behavior

2014-01-25 Thread Jörn Kottmann

On 01/25/2014 01:33 PM, Miller, Timothy wrote:

Thanks Joern,
I'll try it. My understanding is I just need to give it my training
data, with the special character I used replaced with the literal string
<LF> and each line in the file is an example sentence.


Yes, exactly.


Just thinking about the cTAKES wrapper -- do your changes make it so
that we wouldn't need to add the special characters (<LF>,<CR>) to a
document within the cTAKES sentence detector wrapper?


Right, the sentence detector expects the chars as input, not the tags.

For example:
This is a sentence terminated by a new line\nAnd this is one more sentence.



It sounds like we
would need to add <CR> and <LF> to our eosChars value, it's early (for
my brain) but I wonder whether that could be a default on the opennlp end?


If you pass them in during the training they are stored in the model
package. All you need to do is to instantiate the Sentence Detector and
it should be ready to use.

BTW, there is also a UIMA integration in opennlp-uima, maybe that could
work quite well for ctakes.


Jörn




Re: sentence detector newline behavior

2014-01-25 Thread Jörn Kottmann

On 01/25/2014 03:03 PM, Miller, Timothy wrote:

I'm running into one issue, it gets tripped up on sentences with
line-ending spaces.  I could easily remove them with a script but by
default they are in there. It happens when a sentence example ends:

...BILAT HEMATOMAS.  <LF>

(There is a period, then 2 spaces, then the line feed character.) I am
pretty sure this is the root because when I fix this example to be .<LF>
it gets tripped up in another place instead (with the same error). The
specific error I get is this:



What happens here is probably that two sentences are detected. It wants
to split on the dot, and on the LF.

The sentence detector classifies every eos char as to whether it is a
split or not. On the other hand, the user expects to get a span (with
begin and end offset) per sentence. The code which computes the spans
tries to remove white space from them.

Removing the white space from a whitespace-only sentence is causing the
exception you are seeing.


Which response would you expect from the sentence detector? Should a
whitespace-only sentence be returned?


In case a sentence is terminated by a new line, should the new line char
be included in the sentence span or not?


Jörn


Re: sentence detector newline behavior

2014-01-25 Thread Miller, Timothy

On 01/25/2014 12:24 PM, Jörn Kottmann wrote:
 The code which computes the spans tries to remove white space from them.
 Removing the white space from a whitespace-only sentence is causing
 the exception you are seeing. Which response would you expect from
 the sentence detector? Should a whitespace-only sentence be returned?
I would say no.

 In case a sentence is terminated by a new line, should the new line
 char be included in the sentence span or not?
I would also say no.
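
The two answers above can be sketched as follows. This is an illustration in Python, not the OpenNLP Java code: the function name and the convention that split positions are offsets just past each end-of-sentence char are assumptions made for the example. Each candidate span is trimmed of white space (which also drops a terminating new line), and candidates that are empty after trimming are skipped instead of producing a span whose start is larger than its end.

```python
def trimmed_spans(text, split_positions):
    """Turn split positions into (start, end) sentence spans.

    split_positions: offsets just past each end-of-sentence char, as a
    hypothetical classifier might return them (an assumption of this sketch).
    """
    spans = []
    start = 0
    # Always close the final candidate at the end of the text.
    for end in list(split_positions) + [len(text)]:
        s, e = start, end
        # Move both edges inward over white space without letting them cross.
        while s < e and text[s].isspace():
            s += 1
        while e > s and text[e - 1].isspace():
            e -= 1
        if s < e:  # skip whitespace-only candidates
            spans.append((s, e))
        start = end
    return spans


text = "BILAT HEMATOMAS.  \nNo fracture.\n"
# Candidate splits after the period (offset 16) and after the line feed (offset 19).
for s, e in trimmed_spans(text, [16, 19]):
    print(repr(text[s:e]))
```

The candidate between the period and the line feed contains only white space, so it is dropped, and neither returned span includes the new line char.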


I made a quick patch for this issue -- now it runs but scores really
poorly compared to my model file (30 vs 75 or so). I suspect something
is wrong with the evaluation, the spans being slightly off somehow.


Re: sentence detector newline behavior

2014-01-24 Thread Jörn Kottmann

On 01/23/2014 10:06 PM, Tim Miller wrote:
Just an FYI, a while back I did some of these annotations myself on 
MIMIC to get around this issue. I replaced the newline character with 
a special (non-English) character, then pre-processed ctakes input to 
replace newlines with that character, then did sentence detection, 
then added the newlines back in. I would be happy to share these 
annotations and my code modifications.


I would be really happy to get access to your annotations so I can test 
the new line support in OpenNLP with it.


Instead of a special char you would now have to use tags (<CR> and <LF>) 
to encode the new lines in the training data.
The tags only need to be inserted into the training data, for the actual 
sentence detection the document string can be passed in as it is.
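
A minimal sketch of that encoding step, in Python for brevity. The helper names and the assumption that gold sentence start offsets are available are hypothetical; the tag spellings are the ones used in this thread, and the exact file layout the trainer expects should be checked against the OpenNLP documentation. Each output line holds one sentence, with any new line chars it contains (including a terminating one) written as tags.

```python
def encode_newlines(sentence):
    """Replace raw new line chars with the tags used in the training file."""
    return sentence.replace("\r", "<CR>").replace("\n", "<LF>")


def to_training_lines(text, sentence_starts):
    """Cut text at the gold sentence start offsets; one encoded sentence per line."""
    bounds = list(sentence_starts) + [len(text)]
    return [encode_newlines(text[a:b]) for a, b in zip(bounds, bounds[1:])]


note = "Pain has been\npersistent for ten days\nAdvil was prescribed.\n"
# Two gold sentences: the first one spans a line break but has no final punctuation.
for line in to_training_lines(note, [0, note.index("Advil")]):
    print(line)
```

Only the training file is transformed this way; at detection time the raw document string, with its real new line chars, is what gets passed in.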


Jörn


Re: sentence detector newline behavior

2014-01-23 Thread vijay garla
Just to clarify - with the YTEX branch there are 2 sentence splitters: the
original ctakes sentence splitter that splits on newlines, and the ytex sentence
splitter that doesn't.  The changes to other components in the ytex branch
(dependency parser, assertion) work with both sentence splitters.

I think it would be great if the intelligence regarding how to split was in
the opennlp model, but this requires training data.  I don't know what the
training data is, or if the training data has sentences that cross newline
boundaries (if not, it won't buy us anything).

vijay




On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean 
sean.fi...@childrens.harvard.edu wrote:

 On my end it looks like my email was reformatted and some of my -newline-
 markers were removed in those last examples ...

 -Original Message-
 From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
 Sent: Wednesday, January 22, 2014 3:42 PM
 To: dev@ctakes.apache.org
 Subject: RE: sentence detector newline behavior

 Thanks James

  but then no typical sentence ending punctuation at the end of the line

 Gotcha.

  So simply using Lines would not suffice in those cases because it
  would run together sentences where there are more than one on a line

 I was actually thinking about something like a Line using -sentence
 breaks- in addition to -newline-.  In other words, a Sentence being what
 cTakes detects by ignoring CR/LF, and Lines being those Sentences
 subdivided by -newline-.  Perhaps Line is a horrible moniker.
 Regardless, it doesn't solve the problem of inappropriately missing
 punctuation.  I was focused a little more on the difference between
 persistent auto- line wrapping and structured information like lists, where
 the first benefits from Sentence and the second from Line.

 The Patient has
  been prescribed two
  medications.

 Prescriptions:
   Advil
   Tylenol
   No Aspirin


 However, when it comes to the problem that you mention, there is no
 benefit to a Line.

 The patient has been seen six times in the past week.  Pain has been
 persistent for ten days Advil and Tylenol have been prescribed
 -- 2 sentences, 3 lines


 The patient has been seen six times in the past week.
 Pain has been persistent for ten days
 Advil and Tylenol have been prescribed
 -- 2 sentences, 3 lines

 The patient has been seen six times in
  the past week.  Pain has been persistent  for ten days  Advil and Tylenol
 have been prescribed
 -- 2 sentences, 5 lines

 Nothing can really be done for the last bit where punctuation is missing.





Re: sentence detector newline behavior

2014-01-23 Thread Karthik Sarma
We could possibly add some additional datasets for training. MIMIC data
does come to mind -- I can't remember off the top of my head if the MIMIC
dataset has sentences spanning lines or not.





--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksa...@ksarma.com
gchat: ksa...@gmail.com
linkedin: www.linkedin.com/in/ksarma



Re: sentence detector newline behavior

2014-01-23 Thread Tim Miller
Just an FYI, a while back I did some of these annotations myself on 
MIMIC to get around this issue. I replaced the newline character with a 
special (non-English) character, then pre-processed ctakes input to 
replace newlines with that character, then did sentence detection, then 
added the newlines back in. I would be happy to share these annotations 
and my code modifications.
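
The workaround described above might look roughly like this. It is a sketch in Python, not the actual code modifications: the placeholder char, the function names, and the toy detector standing in for a trained model are all assumptions. Because one char is swapped for one char, the spans returned on the masked text are already valid offsets into the original text, so "adding the newlines back in" amounts to slicing the original string.

```python
PLACEHOLDER = "\u00b6"  # hypothetical stand-in char; any char absent from the notes works


def detect_with_placeholder(text, detect):
    """Run `detect` on text whose new lines are swapped for PLACEHOLDER.

    `detect` is any callable returning (start, end) spans. The swap is one
    char for one char, so the spans index the original text unchanged.
    """
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the input")
    masked = text.replace("\n", PLACEHOLDER)
    return detect(masked)


def toy_detect(masked):
    """Toy stand-in for a trained model: break after '.', never at PLACEHOLDER."""
    spans, start = [], 0
    for i, ch in enumerate(masked):
        if ch == ".":
            spans.append((start, i + 1))
            start = i + 1
    # Keep a trailing fragment only if it has real content.
    if masked[start:].strip(" " + PLACEHOLDER):
        spans.append((start, len(masked)))
    return spans


text = "The patient has\nbeen seen. Pain persists.\n"
for s, e in detect_with_placeholder(text, toy_detect):
    print(repr(text[s:e].strip()))
```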

Tim



RE: sentence detector newline behavior

2014-01-22 Thread Masanz, James J.
The only rule I know of is that cTAKES (prior to ytex integration) always 
forces a sentence break at a newline.
This was because the clinical notes cTAKES originally processed never had 
newlines in the middle of a sentence, but did need sentence breaks to occur at 
end of sentence for good negation detection on those notes.
I think Guergana earlier mentioned other EMRs also have this need, but it seems 
to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off 
this forcing of sentence breaks at newlines (or depending on how you look at 
it, an option to turn on the forcing of sentence breaks if we change the 
default behavior)

I think we (cTAKES) need to decide the following:
 - do we want to do this for entire notes, or would it be  worth it to have it 
be on a section-by-section basis.
 - what do we make the default behavior - to force or not to force newlines to 
be sentence breaks
 - what data (that contains newlines) will we use for training the sentence 
detector

Regardless of those answers, I think OpenNLP support for including newlines in 
training data would be valuable for those others who have sentences that span 
lines.  And having an option on OpenNLP to always break at newline would be 
useful for at least some cTAKES users (and we could remove the cTAKES code that 
does that)
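
To make the proposed option concrete, here is a minimal sketch in Python of a post-processing step with a toggle. It is not the existing cTAKES or OpenNLP code; the function name and the `force_breaks` flag are hypothetical. Given spans from any sentence detector, it either leaves them untouched or subdivides each one at new line chars, dropping blank lines and surrounding white space.

```python
def apply_newline_rule(text, spans, force_breaks=True):
    """Optionally subdivide sentence spans at new line chars.

    spans: (start, end) pairs from any sentence detector.
    force_breaks=False returns the spans untouched.
    """
    if not force_breaks:
        return list(spans)
    result = []
    for start, end in spans:
        pos = start
        # splitlines also handles "\r\n" and a few other line separators.
        for line in text[start:end].splitlines(keepends=True):
            body = line.strip()
            if body:  # drop blank lines
                lead = len(line) - len(line.lstrip())
                result.append((pos + lead, pos + lead + len(body)))
            pos += len(line)
    return result


text = "Prescriptions:\n  Advil\n  Tylenol\n  No Aspirin"
whole = [(0, len(text))]  # what a detector that ignores new lines might return
for s, e in apply_newline_rule(text, whole, force_breaks=True):
    print(repr(text[s:e]))
```

The same function could in principle be applied per section rather than per note by calling it only on the spans that fall inside the sections where line-level breaks matter.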

-- James

-Original Message-
From: dev-return-2390-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-2390-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model 
which can use a new line as a end-of-sentence character.

In case you have certain rules to split sentences we should have a look 
at them. The Sentence Detector could be extended to support
a user provided rule based splitter. If there is an interest in that we 
could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:
 I presume Joern was suggesting that if he supports new lines in the opennlp 
 SentenceDetector (either part of the trained models or post processing with 
 some rules?) cTAKES will be able to use it out of the box and we should be 
 able to remove any additional custom logic that we currently have- which seems 
 like a good idea.

 [but when to use within cTAKES individual components such as negation might 
 be another discussion?]
 --Pei

 On Jan 20, 2014, at 12:46 PM, vijay garla vnga...@gmail.com wrote:

 The sentence detection opennlp model used by ctakes does not split
 sentences at newlines - there is additional logic in the ctakes sentence
 splitter that does this (and an alternative impl that doesn't is in the
 ytex branch). Afaik no retraining / change to the feature representation is
 necessary.

 Vj

 On Monday, January 20, 2014, Jörn Kottmann kottm...@gmail.com wrote:

 Hi all,

 currently I have quite a bit of time to work on OpenNLP, and would like to
 help you
 out with this issue.

 Here is the follow up issue for this change:
 https://issues.apache.org/jira/browse/OPENNLP-602

 I am still trying to figure out what would be the best option to implement
 this.
 In the training data a user could just use a special tag to identify the
 chars.

 Instead of <NEWLINE> it might be better to use <CR> and <LF> to encode
 these two chars
 in the training data. Any thoughts?

 I am planning to release this as part of OpenNLP 1.6.0.

 Thanks,
 Jörn

 On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

 On 05/22/2013 01:17 PM, Miller, Timothy wrote:

 That's awesome! It might be worth trying at least. How does the training
 process change? Previously the training data would be one sentence per
 line, but with newlines as possible mid-sentence characters that could
 be trouble, is there a new representation for training data? Or would we
 have to use the training api?

 Good point, yes that will be a problem with the default training format,
 but it shouldn't be hard
 to solve. In the format itself we could define a new line tag e.g.
 <NEWLINE> to mark new lines.
 As a hack to make it work with 1.5.3 you could instead use a special char
 as a replacement
 for the new line char.
 When you pass the text down to the sentence detector a simple string
 replace could be used to
 convert all new line chars to the special new line marker char.

 If things work out for you performance wise as well we will just
 integrate it properly into OpenNLP
 for the next release.

 Could you produce a sentence detector training file with a new line
 marker char?

 You should try to pick a char you can also pass in on a terminal
 otherwise you have to use the
 API to train the model. The built-in cross validation could be used to
 evaluate the performance.

 Jörn




RE: sentence detector newline behavior

2014-01-22 Thread Masanz, James J.

I know there are notes where there are multiple sentences on a line, but then 
no typical sentence ending punctuation at the end of the line (or no 
punctuation at all at the end of the line). And in those sections, negation can 
be important.  So simply using Lines would not suffice in those cases because 
it would run together sentences where there are more than one on a line. And 
using sentences alone (as found by OpenNLP 1.5) would not suffice because it 
would run together sentences from different lines.

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one 
direction or the other, we can take a poll of when and where 
cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a 
Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it make 
more sense to have Sentence ignore -newline- and negation detection itself 
split the Sentence into line items?  If an annotator is interested in list 
items, each of which may be on a distinct -line-, then it can split up the 
Sentence as needed.  I think that James hints that cTakes code already does 
this in some places.  

If a good deal of functionality requires -newline- delimited types, would it 
make sense to introduce a type Line?  If something uses a structured list it 
could iterate through Line types, while something using pure text could iterate 
through Sentence types.  This facilitates section-by-section different 
behavior, does not require any decision on global defaults, and makes data 
selection for training Sentence a non-issue wrt line breaks.  However, it adds 
to the system and would require a per-use decision by developers OR a 
toggle by users (back to the default decision).   Perhaps this has already been 
tried?

Sean


-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] 
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always 
forces a sentence break at a newline.
This was because the clinical notes cTAKES originally processed never had 
newlines in the middle of a sentence, but did need sentence breaks to occur at 
end of sentence for good negation detection on those notes.
I think Guergana earlier mentioned other EMRs also have this need, but it seems 
to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off 
this forcing of sentence breaks at newlines (or depending on how you look at 
it, an option to turn on the forcing of sentence breaks if we change the 
default behavior)

I think we (cTAKES) need to decide the following:
 - do we want to do this for entire notes, or would it be worth it to have it 
be on a section-by-section basis?
 - what do we make the default behavior: to force or not to force newlines to 
be sentence breaks?
 - what data (that contains newlines) will we use for training the sentence 
detector?

Regardless of those answers, I think OpenNLP support for including newlines in 
training data would be valuable for those others who have sentences that span 
lines.  And having an option on OpenNLP to always break at newline would be 
useful for at least some cTAKES users (and we could remove the cTAKES code that 
does that)

-- James
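
One way to picture the option discussed above (a default behavior plus a possible section-by-section override) is the sketch below. Everything in it is an assumption made for illustration: the class `NewlineBreakConfig`, its method names, and the section identifiers are hypothetical and do not correspond to existing cTAKES parameters.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: decides whether a newline forces a sentence break in a given section. */
final class NewlineBreakConfig {
    private final boolean defaultForce;
    private final Map<String, Boolean> perSection = new HashMap<>();

    /** @param defaultForce behavior for sections without an explicit override */
    NewlineBreakConfig(boolean defaultForce) {
        this.defaultForce = defaultForce;
    }

    /** Registers an override for one section and returns this config for chaining. */
    NewlineBreakConfig forSection(String sectionId, boolean force) {
        perSection.put(sectionId, force);
        return this;
    }

    /** Returns the override for the section if there is one, otherwise the default. */
    boolean forceBreakAtNewline(String sectionId) {
        return perSection.getOrDefault(sectionId, defaultForce);
    }
}
```

With `new NewlineBreakConfig(true).forSection("HISTORY", false)`, newlines would still force breaks everywhere except in the (hypothetical) HISTORY section, which keeps the current default while letting wrapped narrative text opt out.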

-----Original Message-----
From: dev-return-2390-Masanz.James=mayo@ctakes.apache.org 
[mailto:dev-return-2390-Masanz.James=mayo@ctakes.apache.org] On Behalf Of 
Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model which can 
use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look at 
them. The Sentence Detector could be extended to support a user-provided 
rule-based splitter. If there is an interest in that we could probably get it 
into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:
 I presume Joern was suggesting that if he supports new lines in the opennlp 
 SentenceDetector (either part of the trained models or post processing with 
 some rules?) cTAKES will be able to use it out of the box and we should be 
 able to remove any additional custom logic that we currently have, which 
 seems like a good idea.

 [but when to use within cTAKES individual components such as negation 
 might be another discussion?] --Pei

 On Jan 20, 2014, at 12:46 PM, vijay garla vnga...@gmail.com wrote:

 The sentence detection opennlp model used by ctakes does not split 
 sentences at newlines - there is additional logic in the ctakes

RE: sentence detector newline behavior

2014-01-22 Thread Finan, Sean
On my end it looks like my email was reformatted and some of my -newline- 
characters were removed in those last examples ... 

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

 but then no typical sentence ending punctuation at the end of the line

Gotcha.  

 So simply using Lines would not suffice in those cases because it 
 would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence breaks- in 
addition to -newline-.  In other words, a Sentence being what cTAKES detects by 
ignoring CR/LF, and Lines being those Sentences subdivided by -newline-.  
Perhaps Line is a horrible moniker.   Regardless, it doesn't solve the 
problem of inappropriately missing punctuation.  I was focused a little more on 
the difference between persistent automatic line wrapping and structured 
information like lists, where the first benefits from Sentence and the second 
from Line.

The Patient has
 been prescribed two
 medications. 

Prescriptions:
  Advil
  Tylenol
  No Aspirin


However, when it comes to the problem that you mention, there is no benefit to 
a Line.

The patient has been seen six times in the past week.  Pain has been 
persistent for ten days Advil and Tylenol have been prescribed
-- 2 sentences, 3 lines


The patient has been seen six times in the past week.  
Pain has been persistent for ten days
Advil and Tylenol have been prescribed
-- 2 sentences, 3 lines

The patient has been seen six times in
 the past week.  Pain has been persistent  for ten days  Advil and Tylenol have 
been prescribed
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.
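
The Sentence/Line relationship described in this message (Lines being Sentences subdivided by -newline-) could be sketched roughly as follows. This is an illustration under assumptions, not cTAKES code: `LineSplitter` is a hypothetical helper, spans are plain `[start, end)` offset pairs rather than UIMA annotations, and blank pieces are simply dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: subdivides one Sentence span into newline-delimited Line spans. */
final class LineSplitter {
    /**
     * Returns [start, end) offsets of each non-blank line inside the sentence
     * span [start, end) of text. Leading and trailing whitespace is trimmed.
     */
    static List<int[]> lines(String text, int start, int end) {
        List<int[]> out = new ArrayList<>();
        int pieceStart = start;
        for (int i = start; i <= end; i++) {
            // Close the current piece at a line feed or at the end of the sentence.
            if (i == end || text.charAt(i) == '\n') {
                int s = pieceStart;
                int e = i;
                while (s < e && Character.isWhitespace(text.charAt(s))) s++;
                while (e > s && Character.isWhitespace(text.charAt(e - 1))) e--;
                if (s < e) out.add(new int[] {s, e});
                pieceStart = i + 1;
            }
        }
        return out;
    }
}
```

Applied to the "Prescriptions:" list above treated as one Sentence, this yields four Line spans (the header plus the three items), so a negation annotator iterating over Lines would see "No Aspirin" on its own. As noted in the message, it does nothing for text where punctuation is missing within a single line.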




Re: sentence detector newline behavior

2014-01-21 Thread Jörn Kottmann
Yes, exactly, OPENNLP-602 is about training a sentence detector model 
which can use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look 
at them. The Sentence Detector could be extended to support
a user-provided rule-based splitter. If there is an interest in that we 
could probably get it into 1.6.0 as well.


Jörn
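
To make the idea of a user-provided rule-based splitter concrete, here is a minimal sketch. It is not the OpenNLP API and not a proposal for its final shape: the `BreakRule` interface, the two example rules, and the `RuleBasedSplitter` class are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: splits text wherever any user-supplied rule reports a break. */
final class RuleBasedSplitter {
    /** Hypothetical rule: true if a sentence ends right after text.charAt(index). */
    interface BreakRule {
        boolean breaksAfter(CharSequence text, int index);
    }

    /** The cTAKES-style rule: every line feed ends a sentence. */
    static final BreakRule NEWLINE = (text, i) -> text.charAt(i) == '\n';

    /** A naive punctuation rule: a period followed by whitespace ends a sentence. */
    static final BreakRule PERIOD_THEN_SPACE = (text, i) ->
            text.charAt(i) == '.'
                    && i + 1 < text.length()
                    && Character.isWhitespace(text.charAt(i + 1));

    static List<String> split(String text, List<BreakRule> rules) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            boolean fired = false;
            for (BreakRule rule : rules) {
                if (rule.breaksAfter(text, i)) {
                    fired = true;
                    break;
                }
            }
            // Always flush the remaining text at the last character.
            if (fired || i == text.length() - 1) {
                String piece = text.substring(start, i + 1).trim();
                if (!piece.isEmpty()) sentences.add(piece);
                start = i + 1;
            }
        }
        return sentences;
    }
}
```

Passing both rules reproduces the forced break at newlines; passing only `PERIOD_THEN_SPACE` lets a sentence span lines. In a real integration the punctuation rule would presumably be replaced by the trained model's decisions, with rules such as `NEWLINE` layered on top.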

On 01/20/2014 10:02 PM, Chen, Pei wrote:

I presume Joern was suggesting that if he supports new lines in the opennlp 
SentenceDetector (either part of the trained models or post processing with 
some rules?) cTAKES will be able to use it out of the box and we should be able 
to remove any additional custom logic that we currently have, which seems like 
a good idea.

[but when to use within cTAKES individual components such as negation might be 
another discussion?]
--Pei


On Jan 20, 2014, at 12:46 PM, vijay garla vnga...@gmail.com wrote:

The sentence detection opennlp model used by ctakes does not split
sentences at newlines - there is additional logic in the ctakes sentence
splitter that does this (and an alternative impl that doesn't is in the
ytex branch). AFAIK no retraining / change to the feature representation is
necessary.

Vj



Re: sentence detector newline behavior

2014-01-20 Thread Jörn Kottmann

Hi all,

currently I have quite a bit of time to work on OpenNLP, and would like 
to help you out with this issue.

Here is the follow-up issue for this change:
https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to 
implement this.
In the training data a user could just use a special tag to identify the 
chars.

Instead of NEWLINE it might be better to use CR and LF to encode 
these two chars in the training data. Any thoughts?

I am planning to release this as part of OpenNLP 1.6.0.

Thanks,
Jörn

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

That's awesome! It might be worth trying at least. How does the training
process change? Previously the training data would be one sentence per
line, but with newlines as possible mid-sentence characters that could
be trouble, is there a new representation for training data? Or would we
have to use the training api?


Good point, yes that will be a problem with the default training 
format, but it shouldn't be hard to solve. In the format itself we could 
define a new line tag, e.g. NEWLINE, to mark new lines.
As a hack to make it work with 1.5.3 you could instead use a special 
char as a replacement for the new line char.
When you pass the text down to the sentence detector, a simple string 
replace could be used to convert all new line chars to the special new 
line marker char.

If things work out for you performance-wise as well, we will just 
integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line 
marker char?

You should try to pick a char you can also pass in on a terminal; 
otherwise you have to use the API to train the model. The built-in cross 
validation could be used to evaluate the performance.


Jörn
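
The training-data question raised in the quoted exchange (one sentence per line, yet sentences may contain line breaks) could be handled along these lines. This is a sketch under assumptions: the thread only names the tags NEWLINE, CR and LF, so the angle-bracket spellings `<CR>` and `<LF>` and the class `TrainingLineWriter` are hypothetical and are not the format OPENNLP-602 eventually defines.

```java
import java.util.List;

/** Sketch only: writes gold sentences in a one-sentence-per-line training layout. */
final class TrainingLineWriter {
    // Hypothetical tag spellings; the thread does not fix the exact syntax.
    static final String CR_TAG = "<CR>";
    static final String LF_TAG = "<LF>";

    /** Encodes line breaks inside one sentence so it fits on a single line. */
    static String toTrainingLine(String sentence) {
        return sentence.replace("\r", CR_TAG).replace("\n", LF_TAG);
    }

    /** Joins the encoded sentences, one per line, each terminated by a real newline. */
    static String toTrainingData(List<String> sentences) {
        StringBuilder sb = new StringBuilder();
        for (String s : sentences) {
            sb.append(toTrainingLine(s)).append('\n');
        }
        return sb.toString();
    }
}
```

Encoding CR and LF separately, as suggested in the message above, keeps Windows and Unix line endings distinguishable in the training data; for the 1.5.3 hack the tags would instead be a single marker char.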




Re: sentence detector newline behavior

2013-05-23 Thread Tim Miller
OK, I've started doing this. I was able to get training working on a very 
small example, and will try a slightly bigger one next.

Tim

On 05/22/2013 08:03 AM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

That's awesome! It might be worth trying at least. How does the training
process change? Previously the training data would be one sentence per
line, but with newlines as possible mid-sentence characters that could
be trouble, is there a new representation for training data? Or would we
have to use the training api?


Good point, yes that will be a problem with the default training 
format, but it shouldn't be hard to solve. In the format itself we could 
define a new line tag, e.g. NEWLINE, to mark new lines.
As a hack to make it work with 1.5.3 you could instead use a special 
char as a replacement for the new line char.
When you pass the text down to the sentence detector, a simple string 
replace could be used to convert all new line chars to the special new 
line marker char.

If things work out for you performance-wise as well, we will just 
integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line 
marker char?

You should try to pick a char you can also pass in on a terminal; 
otherwise you have to use the API to train the model. The built-in cross 
validation could be used to evaluate the performance.


Jörn
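
When comparing a retrained model against the current one, the figure of interest is how well predicted sentence ends match the gold ones. The sketch below is not OpenNLP's evaluator or cross validator; it only illustrates the metric, assuming sentence ends are represented as character offsets. The class `BoundaryScore` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch only: precision, recall and F1 over sentence-end offsets. */
final class BoundaryScore {
    /** Returns {precision, recall, f1} for predicted vs. gold sentence-end offsets. */
    static double[] score(Set<Integer> gold, Set<Integer> predicted) {
        // A hit is a predicted boundary that is also a gold boundary.
        Set<Integer> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        double p = predicted.isEmpty() ? 0.0 : (double) hits.size() / predicted.size();
        double r = gold.isEmpty() ? 0.0 : (double) hits.size() / gold.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }
}
```

Running this per fold and averaging gives a rough picture of whether the newline marker helps; a model that breaks at every newline in wrapped text would show up as low precision, and one that never breaks at newlines in list-like sections as low recall.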