Re: sentence detector newline behavior

Karthik Sarma Thu, 23 Jan 2014 13:03:07 -0800

We could possibly add some additional datasets for training. MIMIC data
does come to mind -- I can't remember off the top of my head if the MIMIC
dataset has sentences spanning lines or not.
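
One rough way to check any candidate corpus for this, as a minimal sketch only: count the line breaks that look like they fall mid-sentence. The function name and the punctuation set below are my own assumptions, not anything from cTAKES, OpenNLP or MIMIC, and the heuristic will miss cases and flag false positives.

```python
# Punctuation treated as "this line is finished" (an assumption; tune per corpus).
LINE_FINAL_PUNCTUATION = (".", "!", "?", ":", ";")


def count_spanning_candidates(text):
    """Count line breaks that look like they fall in the middle of a sentence.

    Heuristic: the line before the break does not end in one of the
    punctuation marks above, and the next line starts with a lowercase letter.
    """
    lines = text.splitlines()
    count = 0
    for current, following in zip(lines, lines[1:]):
        current, following = current.strip(), following.strip()
        if not current or not following:
            continue  # a blank line is treated as a hard break
        if not current.endswith(LINE_FINAL_PUNCTUATION) and following[0].islower():
            count += 1
    return count
```

If the count is close to zero for a corpus, training on it probably would not teach the model anything about sentences that cross newlines.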






--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
[email protected]
gchat: [email protected]
linkedin: www.linkedin.com/in/ksarma


On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <[email protected]> wrote:

> Just to clarify - with the YTEX branch there are 2 sentence splitters - the
> original ctakes sentence splitter that splits on newlines, and the ytex
> sentence splitter that doesn't.  The changes to other components in the ytex
> branch (dependency parser, assertion) work with both sentence splitters.
>
> I think it would be great if the intelligence regarding how to split was in
> the opennlp model, but this requires training data.  I don't know what the
> training data is, or if the training data has sentences that cross newline
> boundaries (if not, it won't buy us anything).
>
> vijay
>
>
>
>
> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
> [email protected]> wrote:
>
> > On my end it looks like my email was reformatted and some of my -newline-
> > removed in those last examples ...
> >
> > -----Original Message-----
> > From: Finan, Sean [mailto:[email protected]]
> > Sent: Wednesday, January 22, 2014 3:42 PM
> > To: [email protected]
> > Subject: RE: sentence detector newline behavior
> >
> > Thanks James
> >
> > > but then no typical sentence ending punctuation at the end of the line
> >
> > Gotcha.
> >
> > > So simply using Lines would not suffice in those cases because it
> > > would run together sentences where there are more than one on a line
> >
> > I was actually thinking about something like a Line using -sentence
> > breaks- in addition to -newline-.  In other words, a Sentence being what
> > cTakes detects by ignoring CR/LF, and Lines being those Sentences
> > subdivided by -newline-.  Perhaps "Line" is a horrible moniker.
> > Regardless, it doesn't solve the problem of inappropriately missing
> > punctuation.  I was focused a little more on the difference between
> > persistent automatic line wrapping and structured information like lists,
> > where the first benefits from Sentence and the second from Line.
> >
> > "The Patient has
> >  been prescribed two
> >  medications."
> >
> > "Prescriptions:
> >   Advil
> >   Tylenol
> >   No Aspirin"
> >
> >
> > However, when it comes to the problem that you mention, there is no
> > benefit to a Line.
> >
> > "The patient has been seen six times in the past week.  Pain has been
> > persistent for ten days Advil and Tylenol have been prescribed"
> > -- 2 sentences, 3 lines
> >
> >
> > "The patient has been seen six times in the past week.
> > Pain has been persistent for ten days
> > Advil and Tylenol have been prescribed"
> > -- 2 sentences, 3 lines
> >
> > "The patient has been seen six times in
> >  the past week.  Pain has been persistent  for ten days  Advil and
> Tylenol
> > have been prescribed"
> > -- 2 sentences, 5 lines
> >
> > Nothing can really be done for the last bit where punctuation is missing.
> >
> >
> >
> >
> > -----Original Message-----
> > From: Masanz, James J. [mailto:[email protected]]
> > Sent: Wednesday, January 22, 2014 3:07 PM
> > To: '[email protected]'
> > Subject: RE: sentence detector newline behavior
> >
> >
> > I know there are notes where there are multiple sentences on a line, but
> > then no typical sentence ending punctuation at the end of the line (or no
> > punctuation at all at the end of the line). And in those sections,
> > negation can be important.  So simply using Lines would not suffice in
> > those cases because it would run together sentences where there are more
> > than one on a line. And using sentences alone (as found by OpenNLP 1.5)
> > would not suffice because it would run together sentences from different
> > lines.
> >
> > -----Original Message-----
> > From: Finan, Sean [mailto:[email protected]]
> > Sent: Wednesday, January 22, 2014 1:33 PM
> > To: [email protected]
> > Subject: RE: sentence detector newline behavior
> >
> > Just whistling in the wind here ...
> >
> > Perhaps before any changes are made to universally toggle cTakes in one
> > direction or the other, we can take a poll of when & where
> > cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed
> > to a Line (CR/LF delimited PLUS -sentence-)
> >
> > If some capabilities like negation detection require -lines- then would
> > it make more sense to have Sentence ignore -newline- and negation
> > detection itself split the Sentence into line items?  If an annotator is
> > interested in list items, each of which may be on a distinct -line-, then
> > it can split up the Sentence as needed.  I think that James hints that
> > cTakes code already does this in some places.
> >
> > If a good deal of functionality requires -newline- delimited types, would
> > it make sense to introduce a type Line?  If something uses a structured
> > list it could iterate through Line types, while something using pure text
> > could iterate through Sentence types.  This facilitates
> > section-by-section different behavior, does not require any decision on
> > global defaults, and makes the choice of training data for Sentence a
> > non-issue with respect to line breaks.  However, it adds to the system and
> > would require a per-use decision by developers OR a toggle by users (back
> > to the default decision).
> > Perhaps this has already been tried?
> >
> > Sean
> >
> >
> > -----Original Message-----
> > From: Masanz, James J. [mailto:[email protected]]
> > Sent: Wednesday, January 22, 2014 1:06 PM
> > To: '[email protected]'
> > Subject: RE: sentence detector newline behavior
> >
> > The only rule I know of is that cTAKES (prior to ytex integration) always
> > forces a sentence break at a newline.
> > This was because the clinical notes cTAKES originally processed never had
> > newlines in the middle of a sentence, but did need sentence breaks to
> > occur at end of sentence for good negation detection on those notes.
> > I think Guergana earlier mentioned other EMRs also have this need, but it
> > seems to not be ubiquitous.
> >
> > From others' posts, it seems that we could use an option in cTAKES to
> > turn off this forcing of sentence breaks at newlines (or depending on how
> > you look at it, an option to turn on the forcing of sentence breaks if we
> > change the default behavior).
> >
> > I think we (cTAKES) need to decide the following:
> >  - do we want to do this for entire notes, or would it be worth it to
> >    have it be on a section-by-section basis?
> >  - what do we make the default behavior - to force or not to force
> >    newlines to be sentence breaks?
> >  - what data (that contains newlines) will we use for training the
> >    sentence detector?
> >
> > Regardless of those answers, I think OpenNLP support for including
> > newlines in training data would be valuable for those others who have
> > sentences that span lines.  And having an option on OpenNLP to always
> > break at newline would be useful for at least some cTAKES users (and we
> > could remove the cTAKES code that does that).
> >
> > -- James
> >
> > -----Original Message-----
> > From: [email protected] [mailto:
> > [email protected]] On Behalf Of
> > Jörn Kottmann
> > Sent: Tuesday, January 21, 2014 4:29 AM
> > To: [email protected]
> > Subject: Re: sentence detector newline behavior
> >
> > Yes, exactly, OPENNLP-602 is about training a sentence detector model
> > which can use a new line as an end-of-sentence character.
> >
> > In case you have certain rules to split sentences we should have a look
> > at them. The Sentence Detector could be extended to support a
> > user-provided rule-based splitter. If there is an interest in that we
> > could probably get it into 1.6.0 as well.
> >
> > Jörn
> >
> > On 01/20/2014 10:02 PM, Chen, Pei wrote:
> > > I presume Joern was suggesting that if he supports new lines in the
> > > opennlp SentenceDetector (either part of the trained models or post
> > > processing with some rules?) cTAKES will be able to use it out of the
> > > box and we should be able to remove any additional custom logic that we
> > > currently have - which seems like a good idea.
> > >
> > > [but when to use within cTAKES individual components such as negation
> > > might be another discussion?] --Pei
> > >
> > >> On Jan 20, 2014, at 12:46 PM, "vijay garla" <[email protected]> wrote:
> > >>
> > >> The sentence detection opennlp model used by ctakes does not split
> > >> sentences at newlines - there is additional logic in the ctakes
> > >> sentence splitter that does this (and an alternative impl that
> > >> doesn't is in the ytex branch). Afaik no retraining / change to the
> > >> feature representation is necessary.
> > >>
> > >> Vj
> > >>
> > >>> On Monday, January 20, 2014, Jörn Kottmann <[email protected]> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> currently I have quite a bit of time to work on OpenNLP, and would
> > >>> like to help you out with this issue.
> > >>>
> > >>> Here is the follow up issue for this change:
> > >>> https://issues.apache.org/jira/browse/OPENNLP-602
> > >>>
> > >>> I am still trying to figure out what would be the best option to
> > >>> implement this.
> > >>> In the training data a user could just use a special tag to identify
> > >>> the chars.
> > >>>
> > >>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
> > >>> encode these two chars in the training data. Any thoughts?
> > >>>
> > >>> I am planning to release this as part of OpenNLP 1.6.0.
> > >>>
> > >>> Thanks,
> > >>> Jörn
> > >>>
> > >>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
> > >>>>
> > >>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
> > >>>>>
> > >>>>> That's awesome! It might be worth trying at least. How does the
> > >>>>> training process change? Previously the training data would be one
> > >>>>> sentence per line, but with newlines as possible mid-sentence
> > >>>>> characters that could be trouble, is there a new representation
> > >>>>> for training data? Or would we have to use the training api?
> > >>>> Good point, yes that will be a problem with the default training
> > >>>> format, but it shouldn't be hard to solve. In the format itself we
> > >>>> could define a new line tag, e.g. <NEWLINE>, to mark new lines.
> > >>>> As a hack to make it work with 1.5.3 you could instead use a
> > >>>> special char as a replacement for the new line char.
> > >>>> When you pass the text down to the sentence detector a simple
> > >>>> string replace could be used to convert all new line chars to the
> > >>>> special new line marker char.
> > >>>>
> > >>>> If things work out for you performance-wise as well, we will just
> > >>>> integrate it properly into OpenNLP for the next release.
> > >>>>
> > >>>> Could you produce a sentence detector training file with a new line
> > >>>> marker char?
> > >>>>
> > >>>> You should try to pick a char you can also pass in on a terminal,
> > >>>> otherwise you have to use the API to train the model. The built-in
> > >>>> cross validation could be used to evaluate the performance.
> > >>>>
> > >>>> Jörn
> > >>>
> >
> >
>
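
For reference, the marker-character hack Jörn describes at the bottom of the quoted thread could look roughly like the sketch below. It is plain string handling and calls no OpenNLP API. The marker character and the function names are hypothetical choices of mine, and the one-sentence-per-line layout is taken from Timothy's description of the training data.

```python
NEWLINE_MARKER = "\u00b6"  # pilcrow; a hypothetical choice, any char absent from the notes works


def encode_newlines(text, marker=NEWLINE_MARKER):
    """Replace each line break with one marker char (CRLF counts as a single break)."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", marker)


def decode_newlines(text, marker=NEWLINE_MARKER):
    """Undo encode_newlines, restoring every line break as LF."""
    return text.replace(marker, "\n")


def to_training_lines(sentences, marker=NEWLINE_MARKER):
    """Write gold sentences one per line, with their internal line breaks encoded."""
    return "\n".join(encode_newlines(sentence, marker) for sentence in sentences)
```

The same `encode_newlines` call would have to be applied to a note before it is handed to the sentence detector, so that training and runtime input agree. Collapsing CRLF into one marker shifts character offsets; encoding CR and LF as two separate markers, along the lines of the <CR> and <LF> idea above, would keep them aligned.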
