Re: sentence detector newline behavior

Miller, Timothy Sat, 25 Jan 2014 06:04:01 -0800

I'm running into one issue: it gets tripped up on sentences with
line-ending spaces.  I could easily remove them with a script, but by
default they are in there. It happens when a sentence example ends like this:


...BILAT HEMATOMAS.  <LF>

(There is a period, then 2 spaces, then the line feed character.) I am
pretty sure this is the root cause because when I fix this example to be .<LF>
it gets tripped up in another place instead (with the same error). The
specific error I get is this:

> Exception in thread "main" java.lang.IllegalArgumentException: start index must not be larger than end index: start=8842, end=8839
>     at opennlp.tools.util.Span.<init>(Span.java:47)
>     at opennlp.tools.util.Span.<init>(Span.java:63)
>     at opennlp.tools.sentdetect.SentenceDetectorME.sentPosDetect(SentenceDetectorME.java:244)
>     at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:56)
>     at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:1)
>     at opennlp.tools.util.eval.Evaluator.evaluateSample(Evaluator.java:82)
>     at opennlp.tools.util.eval.Evaluator.evaluate(Evaluator.java:109)
>     at opennlp.tools.sentdetect.SDCrossValidator.evaluate(SDCrossValidator.java:130)
>     at opennlp.tools.cmdline.sentdetect.SentenceDetectorCrossValidatorTool.run(SentenceDetectorCrossValidatorTool.java:78)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:214)
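
For reference, the script-based cleanup mentioned above could look roughly like the following. This is a minimal sketch, assuming the training file holds one sentence example per line with newlines encoded as literal <CR>/<LF> tags; the function names are hypothetical. It only removes the trigger in the data and does not fix the underlying bug, and whether it avoids the exception in every case is untested here.

```python
import re

# Spaces or tabs that sit directly before a final run of <CR>/<LF> tags.
_WS_BEFORE_TAGS = re.compile(r"[ \t]+((?:<CR>|<LF>)+)$")


def strip_line_ending_spaces(line: str) -> str:
    """Drop trailing whitespace, including whitespace before trailing newline tags."""
    line = line.rstrip(" \t\r\n")
    # ".  <LF>" becomes ".<LF>"; lines without a trailing tag are left alone.
    return _WS_BEFORE_TAGS.sub(r"\1", line)


def clean_training_text(text: str) -> str:
    """Apply the cleanup to every sentence example (one per physical line)."""
    return "\n".join(strip_line_ending_spaces(line) for line in text.splitlines())
```

Only whitespace at the very end of an example is touched; spaces inside a sentence, including those before a mid-sentence tag, are kept as they are.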

I thought I'd let you know since you might be able to fix it in 2
minutes, but if I don't hear from you I'll probably take a look at
it later today and try to fix it myself.
Tim

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:
> The changes are now committed.
>
> To train a model which can recognize new lines the new lines must be encoded
> with the <CR> or <LF> tags (or both).
>
> The same tags are used to pass in the eos chars to the command line trainer.
> For example:
> SentenceDetectorCrossValidator  -lang en -data /home/xyz/eos-cr.all 
> -encoding ISO-8859-15 -eosChars .!?:<LF>
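
A minimal sketch of the encoding step described above, assuming the annotated sentences are available as plain strings that still contain raw newline characters, and that the training file keeps the usual one-sentence-per-line layout. The function names are hypothetical and not part of OpenNLP; only the <CR> and <LF> tags come from the message above.

```python
def encode_newlines(sentence: str) -> str:
    """Replace raw CR and LF characters with literal <CR> and <LF> tags."""
    return sentence.replace("\r", "<CR>").replace("\n", "<LF>")


def to_training_lines(sentences):
    """Return one encoded training line per annotated sentence."""
    return [encode_newlines(s) for s in sentences]
```

After encoding, no example contains a physical newline, so each sentence can be written on its own line even when it originally spanned several.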
>
> Tim, it would be nice if you could test this with your annotations.
>
> Jörn
>
> On 01/23/2014 10:06 PM, Tim Miller wrote:
>> Just an FYI, a while back I did some of these annotations myself on 
>> MIMIC to get around this issue. I replaced the newline character with 
>> a special (non-English) character, then pre-processed ctakes input to 
>> replace newlines with that character, then did sentence detection, 
>> then added the newlines back in. I would be happy to share these 
>> annotations and my code modifications.
>> Tim
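
The workaround described above (swap newlines for a special character, detect sentences, put the newlines back) can be sketched as follows. This is only an illustration under stated assumptions: the placeholder character is an arbitrary choice, and `detect_spans` is a hypothetical callable standing in for whatever sentence detector is used; it is assumed to return (start, end) character offsets.

```python
PLACEHOLDER = "\u00b6"  # assumption: a character that never occurs in the notes


def detect_with_placeholder(text, detect_spans):
    """Run sentence detection on text whose newlines are masked by a placeholder.

    detect_spans is a hypothetical stand-in for the sentence detector: it takes
    a string and returns a list of (start, end) character offsets.
    """
    if PLACEHOLDER in text:
        raise ValueError("placeholder character already occurs in the input")
    masked = text.replace("\n", PLACEHOLDER)
    spans = detect_spans(masked)
    # The swap is one character for one character, so the offsets are still
    # valid for the original text and the newlines come back by slicing it.
    return [text[start:end] for start, end in spans]
```

Because the replacement preserves length, no separate step is needed to add the newlines back; a multi-character marker would require remapping the offsets.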
>>
>>
>> On 01/23/2014 04:01 PM, Karthik Sarma wrote:
>>> We could possibly add some additional datasets for training. MIMIC data
>>> does come to mind -- I can't remember off the top of my head if the MIMIC
>>> dataset has sentences spanning lines or not.
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Karthik Sarma
>>> UCLA Medical Scientist Training Program Class of 20??
>>> Member, UCLA Medical Imaging & Informatics Lab
>>> Member, CA Delegation to the House of Delegates of the American Medical
>>> Association
>>> [email protected]
>>> gchat: [email protected]
>>> linkedin: www.linkedin.com/in/ksarma
>>>
>>>
>>> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <[email protected]> wrote:
>>>
>>>> Just to clarify - with the YTEX branch there are 2 sentence splitters:
>>>> the original ctakes sentence splitter that splits on newlines, and the
>>>> ytex sentence splitter that doesn't.  The changes to other components
>>>> in the ytex branch (dependency parser, assertion) work with both
>>>> sentence splitters.
>>>>
>>>> I think it would be great if the intelligence regarding how to split was in
>>>> the opennlp model, but this requires training data.  I don't know what the
>>>> training data is, or if the training data has sentences that cross newline
>>>> boundaries (if not, it won't buy us anything).
>>>>
>>>> vijay
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
>>>> [email protected]> wrote:
>>>>
>>>>> On my end it looks like my email was reformatted and some of my
>>>>> -newline- removed in those last examples ...
>>>>>
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:[email protected]]
>>>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>>>> To: [email protected]
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> Thanks James
>>>>>
>>>>>> but then no typical sentence ending punctuation at the end of the line
>>>>> Gotcha.
>>>>>
>>>>>> So simply using Lines would not suffice in those cases because it
>>>>>> would run together sentences where there are more than one on a line
>>>>> I was actually thinking about something like a Line using -sentence
>>>>> breaks- in addition to -newline-.  In other words, a Sentence being
>>>>> what cTakes detects by ignoring CR/LF, and Lines being those Sentences
>>>>> subdivided by -newline-.  Perhaps "Line" is a horrible moniker.
>>>>> Regardless, it doesn't solve the problem of inappropriately missing
>>>>> punctuation.  I was focused a little more on the difference between
>>>>> persistent auto line-wrapping and structured information like lists,
>>>>> where the first benefits from Sentence and the second from Line.
>>>>>
>>>>> "The Patient has
>>>>>   been prescribed two
>>>>>   medications."
>>>>>
>>>>> "Prescriptions:
>>>>>    Advil
>>>>>    Tylenol
>>>>>    No Aspirin"
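
The Sentence/Line distinction above can be sketched as a post-processing step. This is a minimal illustration, not cTAKES code: it assumes sentences are given as (start, end) character offsets into the note text, and the function name is hypothetical.

```python
def split_into_lines(text, sentence_spans):
    """Subdivide each sentence span at newlines, returning trimmed line spans."""
    line_spans = []
    for start, end in sentence_spans:
        pos = start
        for piece in text[start:end].split("\n"):
            piece_end = pos + len(piece)
            s, e = pos, piece_end
            # Trim surrounding whitespace (indentation, stray CR characters).
            while s < e and text[s].isspace():
                s += 1
            while e > s and text[e - 1].isspace():
                e -= 1
            if s < e:  # skip pieces that are empty after trimming
                line_spans.append((s, e))
            pos = piece_end + 1  # step over the newline itself
    return line_spans
```

A component that cares about list items could iterate over the returned Line spans, while one that wants running text keeps using the Sentence spans. Pieces that contain only whitespace are dropped, so no span ever has its start past its end.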
>>>>>
>>>>>
>>>>> However, when it comes to the problem that you mention, there is no
>>>>> benefit to a Line.
>>>>>
>>>>> "The patient has been seen six times in the past week.  Pain has been
>>>>> persistent for ten days Advil and Tylenol have been prescribed"
>>>>> -- 2 sentences, 3 lines
>>>>>
>>>>>
>>>>> "The patient has been seen six times in the past week.
>>>>> Pain has been persistent for ten days
>>>>> Advil and Tylenol have been prescribed"
>>>>> -- 2 sentences, 3 lines
>>>>>
>>>>> "The patient has been seen six times in
>>>>>   the past week.  Pain has been persistent  for ten days Advil and
>>>> Tylenol
>>>>> have been prescribed"
>>>>> -- 2 sentences, 5 lines
>>>>>
>>>>> Nothing can really be done for the last bit where punctuation is missing.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:[email protected]]
>>>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>>>> To: '[email protected]'
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>>
>>>>> I know there are notes where there are multiple sentences on a line, but
>>>>> then no typical sentence ending punctuation at the end of the line (or no
>>>>> punctuation at all at the end of the line). And in those sections, negation
>>>>> can be important.  So simply using Lines would not suffice in those cases
>>>>> because it would run together sentences where there are more than one on a
>>>>> line. And using sentences alone (as found by OpenNLP 1.5) would not suffice
>>>>> because it would run together sentences from different lines.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:[email protected]]
>>>>> Sent: Wednesday, January 22, 2014 1:33 PM
>>>>> To: [email protected]
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> Just whistling in the wind here ...
>>>>>
>>>>> Perhaps before any changes are made to universally toggle cTakes in one
>>>>> direction or the other, we can take a poll of when & where
>>>>> cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a
>>>>> Line (CR/LF delimited PLUS -sentence-)
>>>>>
>>>>> If some capabilities like negation detection require -lines- then would it
>>>>> make more sense to have Sentence ignore -newline- and negation detection
>>>>> itself split the Sentence into line items?  If an annotator is interested
>>>>> in list items, each of which may be on a distinct -line-, then it can split
>>>>> up the Sentence as needed.  I think that James hints that cTakes code
>>>>> already does this in some places.
>>>>>
>>>>> If a good deal of functionality requires -newline- delimited types, would
>>>>> it make sense to introduce a type Line?  If something uses a structured
>>>>> list it could iterate through Line types, while something using pure text
>>>>> could iterate through Sentence types.  This facilitates section-by-section
>>>>> different behavior, does not require any decision on global defaults, and
>>>>> makes data selection for training Sentence a non-issue wrt line breaks.
>>>>> However, it adds to the system and would require a per-use choice decision
>>>>> by developers OR a toggle by users (back to the default decision).
>>>>> Perhaps this has already been tried?
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:[email protected]]
>>>>> Sent: Wednesday, January 22, 2014 1:06 PM
>>>>> To: '[email protected]'
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> The only rule I know of is that cTAKES (prior to ytex integration) always
>>>>> forces a sentence break at a newline.
>>>>> This was because the clinical notes cTAKES originally processed never had
>>>>> newlines in the middle of a sentence, but did need sentence breaks to occur
>>>>> at end of sentence for good negation detection on those notes.
>>>>> I think Guergana earlier mentioned other EMRs also have this need, but it
>>>>> seems to not be ubiquitous.
>>>>>
>>>>> From others' posts, it seems that we could use an option in cTAKES to turn
>>>>> off this forcing of sentence breaks at newlines (or depending on how you
>>>>> look at it, an option to turn on the forcing of sentence breaks if we
>>>>> change the default behavior)
>>>>>
>>>>> I think we (cTAKES) need to decide the following:
>>>>>   - do we want to do this for entire notes, or would it be worth it to
>>>>> have it be on a section-by-section basis.
>>>>>   - what do we make the default behavior - to force or not to force
>>>>> newlines to be sentence breaks
>>>>>   - what data (that contains newlines) will we use for training the
>>>>> sentence detector
>>>>>
>>>>> Regardless of those answers, I think OpenNLP support for including
>>>>> newlines in training data would be valuable for those others who have
>>>>> sentences that span lines.  And having an option on OpenNLP to always break
>>>>> at newline would be useful for at least some cTAKES users (and we could
>>>>> remove the cTAKES code that does that)
>>>>>
>>>>> -- James
>>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected] [mailto:
>>>>> [email protected]] On Behalf Of
>>>>> Jörn Kottmann
>>>>> Sent: Tuesday, January 21, 2014 4:29 AM
>>>>> To: [email protected]
>>>>> Subject: Re: sentence detector newline behavior
>>>>>
>>>>> Yes, exactly, OPENNLP-602 is about training a sentence detector model
>>>>> which can use a new line as an end-of-sentence character.
>>>>>
>>>>> In case you have certain rules to split sentences we should have a look at
>>>>> them. The Sentence Detector could be extended to support a user provided
>>>>> rule based splitter. If there is an interest in that we could probably get
>>>>> it into 1.6.0 as well.
>>>>>
>>>>> Jörn
>>>>>
>>>>> On 01/20/2014 10:02 PM, Chen, Pei wrote:
>>>>>> I presume Joern was suggesting that if he supports new lines in the
>>>>>> opennlp SentenceDetector (either part of the trained models or post
>>>>>> processing with some rules?) cTAKES will be able to use it out of the box
>>>>>> and we should be able to remove any additional custom logic that we
>>>>>> currently have, which seems like a good idea.
>>>>>> [but when to use within cTAKES individual components such as negation
>>>>>> might be another discussion?] --Pei
>>>>>>
>>>>>>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <[email protected]> wrote:
>>>>>>> The sentence detection opennlp model used by ctakes does not split
>>>>>>> sentences at newlines - there is additional logic in the ctakes
>>>>>>> sentence splitter that does this (and an alternative impl that
>>>>>>> doesn't is in the ytex branch). Afaik no retraining / change to the
>>>>>>> feature representation is necessary.
>>>>>>>
>>>>>>> Vj
>>>>>>>
>>>>>>>> On Monday, January 20, 2014, Jörn Kottmann <[email protected]> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> currently I have quite a bit of time to work on OpenNLP, and would
>>>>>>>> like to help you out with this issue.
>>>>>>>>
>>>>>>>> Here is the follow up issue for this change:
>>>>>>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>>>>>>
>>>>>>>> I am still trying to figure out what would be the best option to
>>>>>>>> implement this.
>>>>>>>> In the training data a user could just use a special tag to identify
>>>>>>>> the chars.
>>>>>>>>
>>>>>>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>>>>>>> encode these two chars in the training data. Any thoughts?
>>>>>>>>
>>>>>>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jörn
>>>>>>>>
>>>>>>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>>>>>>
>>>>>>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>>>>>>
>>>>>>>>>> That's awesome! It might be worth trying at least. How does the
>>>>>>>>>> training process change? Previously the training data would be one
>>>>>>>>>> sentence per line, but with newlines as possible mid-sentence
>>>>>>>>>> characters that could be trouble. Is there a new representation
>>>>>>>>>> for training data? Or would we have to use the training api?
>>>>>>>>> Good point, yes that will be a problem with the default training
>>>>>>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>>>>>>> could define a new line tag, e.g. <NEWLINE>, to mark new lines.
>>>>>>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>>>>>>> special char as a replacement for the new line char.
>>>>>>>>> When you pass the text down to the sentence detector a simple
>>>>>>>>> string replace could be used to convert all new line chars to the
>>>>>>>>> special new line marker char.
>>>>>>>>>
>>>>>>>>> If things work out for you performance wise as well we will just
>>>>>>>>> integrate it properly into OpenNLP for the next release.
>>>>>>>>>
>>>>>>>>> Could you produce a sentence detector training file with a new line
>>>>>>>>> marker char?
>>>>>>>>>
>>>>>>>>> You should try to pick a char you can also pass in on a terminal,
>>>>>>>>> otherwise you have to use the API to train the model. The built-in
>>>>>>>>> cross validation could be used to evaluate the performance.
>>>>>>>>>
>>>>>>>>> Jörn
>
