Re: sentence detector newline behavior

Tim Miller Thu, 23 Jan 2014 13:08:07 -0800

Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed cTAKES input to replace newlines with that character, then did sentence detection, then added the newlines back in. I would be happy to share these annotations and my code modifications.
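
A minimal sketch of the round trip described above, in Python rather than the cTAKES Java code. Everything here is an assumption made for illustration: the marker character, the function names, and the regex-based `toy_detector`, which only stands in for a trained sentence detection model. It is not the code or the annotations offered above.

```python
import re

# Assumption: the pilcrow never occurs in the notes; any unused character works.
NEWLINE_MARKER = "\u00b6"

# Toy boundary rule standing in for a trained model: ., ! or ? followed by
# whitespace, the marker, or end of text.
_BOUNDARY = re.compile(r"[.!?](?=\s|%s|$)" % re.escape(NEWLINE_MARKER))


def toy_detector(marked_text):
    """Return (start, end) character spans of sentences in marked_text."""
    spans, start = [], 0
    for match in _BOUNDARY.finditer(marked_text):
        spans.append((start, match.end()))
        start = match.end()
    if marked_text[start:].strip():
        spans.append((start, len(marked_text)))
    return spans


def detect_across_newlines(text, detector=toy_detector):
    """Swap newlines for the marker, detect, then slice the original text."""
    # One character replaces one character, so spans found on the marked text
    # index directly into the original, which still contains its newlines.
    marked = text.replace("\n", NEWLINE_MARKER)
    pieces = (text[s:e].strip() for s, e in detector(marked))
    return [p for p in pieces if p]


note = "The patient has\nbeen prescribed two\nmedications. Pain persists."
sentences = detect_across_newlines(note)
```

Because the replacement is one character for one character, "adding the newlines back" reduces to slicing the untouched original text with the detected offsets.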

Tim


On 01/23/2014 04:01 PM, Karthik Sarma wrote:

We could possibly add some additional datasets for training. MIMIC data
does come to mind -- I can't remember off the top of my head if the MIMIC
dataset has sentences spanning lines or not.





--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksa...@ksarma.com
gchat: ksa...@gmail.com
linkedin: www.linkedin.com/in/ksarma


On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <vnga...@gmail.com> wrote:

Just to clarify - with the YTEX branch there are 2 sentence splitters - the
original cTAKES sentence splitter that splits on newlines, and the YTEX
sentence splitter that doesn't.  The changes to other components in the YTEX
branch (dependency parser, assertion) work with both sentence splitters.

I think it would be great if the intelligence regarding how to split was in
the opennlp model, but this requires training data.  I don't know what the
training data is, or if the training data has sentences that cross newline
boundaries (if not, won't buy us anything).

vijay




On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

On my end it looks like my email was reformatted and some of my -newline-
were removed in those last examples ...

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

but then no typical sentence ending punctuation at the end of the line

Gotcha.

So simply using Lines would not suffice in those cases because it
would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence
breaks- in addition to -newline-.  In other words, a Sentence being what
cTakes detects by ignoring CR/LF, and Lines being those Sentences
subdivided by -newline-.  Perhaps "Line" is a horrible moniker.
Regardless, it doesn't solve the problem of inappropriately missing
punctuation.  I was focused a little more on the difference between
persistent auto line-wrapping and structured information like lists,
where the first benefits from Sentence and the second from Line.

"The Patient has
  been prescribed two
  medications."

"Prescriptions:
   Advil
   Tylenol
   No Aspirin"


However, when it comes to the problem that you mention, there is no
benefit to a Line.

"The patient has been seen six times in the past week.  Pain has been
persistent for ten days Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines


"The patient has been seen six times in the past week.
Pain has been persistent for ten days
Advil and Tylenol have been prescribed"
-- 2 sentences, 3 lines

"The patient has been seen six times in
  the past week.  Pain has been persistent  for ten days  Advil and

Tylenol

have been prescribed"
-- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.
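
The Sentence/Line idea above can be sketched roughly as follows, in Python for brevity. This assumes sentence spans have already been produced by a detector that ignores CR/LF; the function name `lines_within_sentences` is hypothetical and is not an existing cTAKES type or annotator.

```python
def lines_within_sentences(text, sentence_spans):
    """Subdivide each (start, end) sentence span at newlines.

    Returns one list of trimmed (start, end) line spans per sentence.
    """
    result = []
    for start, end in sentence_spans:
        lines = []
        pos = start
        for piece in text[start:end].split("\n"):
            stripped = piece.strip()  # also drops a trailing "\r"
            if stripped:
                lead = len(piece) - len(piece.lstrip())
                lines.append((pos + lead, pos + lead + len(stripped)))
            pos += len(piece) + 1  # +1 for the newline consumed by split
        result.append(lines)
    return result


wrapped = "The Patient has\n  been prescribed two\n  medications."
listing = "Prescriptions:\n   Advil\n   Tylenol\n   No Aspirin"

wrapped_lines = lines_within_sentences(wrapped, [(0, len(wrapped))])
listing_lines = lines_within_sentences(listing, [(0, len(listing))])
listing_text = [listing[s:e] for s, e in listing_lines[0]]
```

A component that works on running text would iterate over the sentence spans, while one that works on list items would iterate over the nested line spans. As noted above, neither view recovers a sentence boundary where the punctuation is simply missing.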




-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 3:07 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior


I know there are notes where there are multiple sentences on a line, but
then no typical sentence ending punctuation at the end of the line (or no
punctuation at all at the end of the line). And in those sections, negation
can be important.  So simply using Lines would not suffice in those cases
because it would run together sentences where there are more than one on a
line. And using sentences alone (as found by OpenNLP 1.5) would not suffice
because it would run together sentences from different lines.

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one
direction or the other, we can take a poll of when & where
cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a
Line (CR/LF delimited PLUS -sentence-)

If some capabilities like negation detection require -lines- then would it
make more sense to have Sentence ignore -newline- and negation detection
itself split the Sentence into line items?  If an annotator is interested
in list items, each of which may be on a distinct -line-, then it can split
up the Sentence as needed.  I think that James hints that cTakes code
already does this in some places.

If a good deal of functionality requires -newline- delimited types, would
it make sense to introduce a type Line?  If something uses a structured
list it could iterate through Line types, while something using pure text
could iterate through Sentence types.  This facilitates section-by-section
different behavior, does not require any decision on global defaults, and
makes data selection for training Sentence a non-issue wrt line breaks.
However, it adds to the system and would require a per-use choice by
developers OR a toggle by users (back to the default decision).
Perhaps this has already been tried?

Sean

-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always
forces a sentence break at a newline.
This was because the clinical notes cTAKES originally processed never had
newlines in the middle of a sentence, but did need sentence breaks to occur
at end of sentence for good negation detection on those notes.
I think Guergana earlier mentioned other EMRs also have this need, but it
seems to not be ubiquitous.
I think Guergana earlier mentioned other EMRs also have this need, but it
seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn
off this forcing of sentence breaks at newlines (or depending on how you
look at it, an option to turn on the forcing of sentence breaks if we
change the default behavior)

I think we (cTAKES) need to decide the following:
  - do we want to do this for entire notes, or would it be  worth it to
have it be on a section-by-section basis.
  - what do we make the default behavior - to force or not to force
newlines to be sentence breaks
  - what data (that contains newlines) will we use for training the
sentence detector

Regardless of those answers, I think OpenNLP support for including
newlines in training data would be valuable for those others who have
sentences that span lines.  And having an option on OpenNLP to always break
at newline would be useful for at least some cTAKES users (and we could
remove the cTAKES code that does that)
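
As a rough illustration of the option discussed above, here is a Python sketch of a break-at-newline toggle with a per-section override. The section names, the default, the function names, and the regex `model_sentences` stand-in are all hypothetical; none of them are existing cTAKES or OpenNLP settings.

```python
import re


def model_sentences(text):
    """Toy stand-in for the model: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def detect_sentences(text, break_at_newline):
    """Optionally force a sentence break at every newline."""
    if not break_at_newline:
        return model_sentences(text)
    sentences = []
    for line in text.splitlines():
        if line.strip():
            sentences.extend(model_sentences(line))
    return sentences


# Hypothetical per-section policy; unlisted sections use the default.
SECTION_POLICY = {
    "Medications": True,                   # list-like: one item per line
    "History of Present Illness": False,   # wrapped running text
}
DEFAULT_BREAK_AT_NEWLINE = True


def detect_in_section(section_name, text):
    """Apply the section's newline policy, falling back to the default."""
    policy = SECTION_POLICY.get(section_name, DEFAULT_BREAK_AT_NEWLINE)
    return detect_sentences(text, policy)


meds = detect_in_section("Medications", "Advil\nTylenol\nNo Aspirin")
hpi = detect_in_section(
    "History of Present Illness",
    "The patient has\nbeen seen six times. Pain persists.",
)
```

Keeping the default in a single constant makes the "force or not to force" decision a one-line change, while the per-section table covers the section-by-section case.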

-- James

-----Original Message-----
From: dev-return-2390-Masanz.James=mayo....@ctakes.apache.org [mailto:
dev-return-2390-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
Jörn Kottmann
Sent: Tuesday, January 21, 2014 4:29 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model
which can use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look at
them. The Sentence Detector could be extended to support a user provided
rule based splitter. If there is an interest in that we could probably get
it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:

I presume Joern was suggesting that if he supports new lines in the
opennlp SentenceDetector (either part of the trained models or post
processing with some rules?) cTAKES will be able to use it out of the box
and we should be able to remove any additional custom logic that we
currently have, which seems like a good idea.

[but when to use within cTAKES individual components such as negation
might be another discussion?] --Pei

On Jan 20, 2014, at 12:46 PM, "vijay garla" <vnga...@gmail.com> wrote:

The sentence detection opennlp model used by ctakes does not split
sentences at newlines - there is additional logic in the cTAKES
sentence splitter that does this (and an alternative impl that
doesn't is in the ytex branch). Afaik no retraining / change to the
feature representation is necessary.

Vj

On Monday, January 20, 2014, Jörn Kottmann <kottm...@gmail.com> wrote:

Hi all,

currently I have quite a bit of time to work on OpenNLP, and would
like to help you out with this issue.

Here is the follow up issue for this change:
https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to
implement this.
In the training data a user could just use a special tag to identify
the chars.

Instead of <NEWLINE> it might be better to use <CR> and <LF> to
encode these two chars in the training data. Any thoughts?
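
A small Python sketch of what the tag-based encoding could look like for a one-sentence-per-line training file. The `<CR>` and `<LF>` tag names come from the proposal above; whether OpenNLP accepts them is not established here, and the function names are hypothetical.

```python
CR_TAG = "<CR>"
LF_TAG = "<LF>"


def encode_sentence(sentence):
    """Replace carriage returns and line feeds inside a sentence with tags."""
    return sentence.replace("\r", CR_TAG).replace("\n", LF_TAG)


def decode_sentence(line):
    """Inverse of encode_sentence, assuming the tags never occur literally."""
    return line.replace(CR_TAG, "\r").replace(LF_TAG, "\n")


def to_training_text(sentences):
    """One encoded sentence per line, as in the default training format."""
    return "".join(encode_sentence(s) + "\n" for s in sentences)


sentences = ["The patient has\r\nbeen seen six times.", "Pain persists."]
training_text = to_training_text(sentences)
round_trip = [decode_sentence(l) for l in training_text.splitlines()]
```

Separate tags for the two characters keep Windows-style and Unix-style line endings distinguishable, at the cost of assuming that neither tag appears as literal text in the notes.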

I am planning to release this as part of OpenNLP 1.6.0.

Thanks,
Jörn

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

That's awesome! It might be worth trying at least. How does the
training process change? Previously the training data would be one
sentence per line, but with newlines as possible mid-sentence
characters that could be trouble, is there a new representation
for training data? Or would we have to use the training api?

Good point, yes that will be a problem with the default training
format, but it shouldn't be hard to solve. In the format itself we
could define a new line tag, e.g. <NEWLINE>, to mark new lines.
As a hack to make it work with 1.5.3 you could instead use a
special char as a replacement for the new line char.
When you pass the text down to the sentence detector a simple
string replace could be used to convert all new line chars to the
special new line marker char.

If things work out for you performance wise as well we will just
integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line
marker char?

You should try to pick a char you can also pass in on a terminal,
otherwise you have to use the API to train the model. The built-in
cross validation could be used to evaluate the performance.
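
One way to choose such a marker, sketched in Python: scan the corpus and take the first candidate that never occurs in it. The candidate list and function names are hypothetical, and this only prepares the training lines; it does not call any OpenNLP tool.

```python
# Hypothetical candidates, roughly ordered by how easy they are to type.
CANDIDATE_MARKERS = ["|", "~", "^", "`", "\u00b6"]


def pick_newline_marker(documents, candidates=CANDIDATE_MARKERS):
    """Return the first candidate character that occurs in no document."""
    used = set()
    for doc in documents:
        used.update(doc)
    for candidate in candidates:
        if candidate not in used:
            return candidate
    raise ValueError("every candidate marker already occurs in the corpus")


def to_training_lines(sentences, marker):
    """Replace in-sentence line breaks so each sentence fits on one line."""
    return [s.replace("\r\n", marker).replace("\n", marker) for s in sentences]


corpus = ["BP 120|80 today", "approx ~3 weeks"]
marker = pick_newline_marker(corpus)
lines = to_training_lines(["The patient has\nbeen seen.", "Advil"], marker)
```

The same marker then has to be used when text is passed to the trained detector, otherwise the model never sees the character it was trained on.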

Jörn
