I'm running into one issue: it gets tripped up on sentences with line-ending spaces. I could easily remove them with a script, but by default they are in there. It happens when a sentence example ends:
...BILAT HEMATOMAS.  <LF>

(There is a period, then 2 spaces, then the line feed character.) I am pretty sure this is the root because when I fix this example to be .<LF> it gets tripped up in another place instead (with the same error). The specific error I get is this:

> Exception in thread "main" java.lang.IllegalArgumentException: start index must not be larger than end index: start=8842, end=8839
>     at opennlp.tools.util.Span.<init>(Span.java:47)
>     at opennlp.tools.util.Span.<init>(Span.java:63)
>     at opennlp.tools.sentdetect.SentenceDetectorME.sentPosDetect(SentenceDetectorME.java:244)
>     at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:56)
>     at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:1)
>     at opennlp.tools.util.eval.Evaluator.evaluateSample(Evaluator.java:82)
>     at opennlp.tools.util.eval.Evaluator.evaluate(Evaluator.java:109)
>     at opennlp.tools.sentdetect.SDCrossValidator.evaluate(SDCrossValidator.java:130)
>     at opennlp.tools.cmdline.sentdetect.SentenceDetectorCrossValidatorTool.run(SentenceDetectorCrossValidatorTool.java:78)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:214)

I thought I'd let you know since you might be able to fix it in 2 minutes but if I don't hear from you today I'll probably take a look at it later today to try to fix it myself.

Tim

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:
> The changes are now committed.
>
> To train a model which can recognize new lines the new lines must be encoded
> with the <CR> or <LF> tags (or both).
>
> The same tags are used to pass in the eos chars to the command line trainer.
> For example:
> SentenceDetectorCrossValidator -lang en -data /home/xyz/eos-cr.all
> -encoding ISO-8859-15 -eosChars .!?:<LF>
>
> Tim, it would be nice if you could test this with your annotations.
>
> Jörn
>
> On 01/23/2014 10:06 PM, Tim Miller wrote:
>> Just an FYI, a while back I did some of these annotations myself on
>> MIMIC to get around this issue. I replaced the newline character with
>> a special (non-English) character, then pre-processed ctakes input to
>> replace newlines with that character, then did sentence detection,
>> then added the newlines back in. I would be happy to share these
>> annotations and my code modifications.
>> Tim
>>
>> On 01/23/2014 04:01 PM, Karthik Sarma wrote:
>>> We could possibly add some additional datasets for training. MIMIC data
>>> does come to mind -- I can't remember off the top of my head if the
>>> MIMIC dataset has sentences spanning lines or not.
>>>
>>> --
>>> Karthik Sarma
>>> UCLA Medical Scientist Training Program Class of 20??
>>> Member, UCLA Medical Imaging & Informatics Lab
>>> Member, CA Delegation to the House of Delegates of the American Medical
>>> Association
>>> ksa...@ksarma.com
>>> gchat: ksa...@gmail.com
>>> linkedin: www.linkedin.com/in/ksarma
>>>
>>> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <vnga...@gmail.com> wrote:
>>>
>>>> Just to clarify - with the YTEX branch there are 2 sentence splitters -
>>>> the original ctakes sentence splitter that splits on newlines, and the
>>>> ytex sentence splitter that doesn't. The changes to other components in
>>>> the ytex branch (dependency parser, assertion) work with both sentence
>>>> splitters.
>>>>
>>>> I think it would be great if the intelligence regarding how to split
>>>> was in the opennlp model, but this requires training data. I don't know
>>>> what the training data is, or if the training data has sentences that
>>>> cross newline boundaries (if not, won't buy us anything).
>>>>
>>>> vijay
>>>>
>>>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>>>>
>>>>> On my end it looks like my email was reformatted and some of my
>>>>> -newline- removed in those last examples ...
>>>>>
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> Thanks James
>>>>>
>>>>>> but then no typical sentence ending punctuation at the end of the
>>>>>> line
>>>>> Gotcha.
>>>>>
>>>>>> So simply using Lines would not suffice in those cases because it
>>>>>> would run together sentences where there are more than one on a line
>>>>> I was actually thinking about something like a Line using -sentence
>>>>> breaks- in addition to -newline-. In other words, a Sentence being
>>>>> what cTakes detects by ignoring CR/LF, and Lines being those Sentences
>>>>> subdivided by -newline-. Perhaps "Line" is a horrible moniker.
>>>>> Regardless, it doesn't solve the problem of inappropriately missing
>>>>> punctuation. I was focused a little more on the difference between
>>>>> persistent auto- line wrapping and structured information like lists,
>>>>> where the first benefits from Sentence and the second from Line.
>>>>>
>>>>> "The Patient has
>>>>> been prescribed two
>>>>> medications."
>>>>>
>>>>> "Prescriptions:
>>>>> Advil
>>>>> Tylenol
>>>>> No Aspirin"
>>>>>
>>>>> However, when it comes to the problem that you mention, there is no
>>>>> benefit to a Line.
>>>>>
>>>>> "The patient has been seen six times in the past week. Pain has been
>>>>> persistent for ten days Advil and Tylenol have been prescribed"
>>>>> -- 2 sentences, 3 lines
>>>>>
>>>>> "The patient has been seen six times in the past week.
>>>>> Pain has been persistent for ten days
>>>>> Advil and Tylenol have been prescribed"
>>>>> -- 2 sentences, 3 lines
>>>>>
>>>>> "The patient has been seen six times in
>>>>> the past week. Pain has been persistent for ten days Advil and
>>>> Tylenol
>>>>> have been prescribed"
>>>>> -- 2 sentences, 5 lines
>>>>>
>>>>> Nothing can really be done for the last bit where punctuation is
>>>>> missing.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>>>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>>>> To: 'dev@ctakes.apache.org'
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> I know there are notes where there are multiple sentences on a
>>>>> line, but then no typical sentence ending punctuation at the end of
>>>>> the line (or no punctuation at all at the end of the line). And in
>>>>> those sections, negation can be important. So simply using Lines
>>>>> would not suffice in those cases because it would run together
>>>>> sentences where there are more than one on a line. And using
>>>>> sentences alone (as found by OpenNLP 1.5) would not suffice
>>>>> because it would run together sentences from different lines.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>>>> Sent: Wednesday, January 22, 2014 1:33 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> Just whistling in the wind here ...
>>>>>
>>>>> Perhaps before any changes are made to universally toggle cTakes in
>>>>> one direction or the other, we can take a poll of when & where
>>>>> cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed
>>>>> to a Line (CR/LF delimited PLUS -sentence-)
>>>>>
>>>>> If some capabilities like negation detection require -lines- then
>>>>> would it make more sense to have Sentence ignore -newline- and
>>>>> negation detection itself split the Sentence into line items? If an
>>>>> annotator is interested in list items, each of which may be on a
>>>>> distinct -line-, then it can split up the Sentence as needed. I think
>>>>> that James hints that cTakes code already does this in some places.
>>>>>
>>>>> If a good deal of functionality requires -newline- delimited types,
>>>>> would it make sense to introduce a type Line? If something uses a
>>>>> structured list it could iterate through Line types, while something
>>>>> using pure text could iterate through Sentence types. This facilitates
>>>>> section-by-section different behavior, does not require any decision
>>>>> on global defaults, and makes data selection for training Sentence a
>>>>> nonesuch wrt line breaks.
>>>>> However, it adds to the system and would require a per-use choice
>>>>> decision by developers OR a toggle by users (back to the default
>>>>> decision).
>>>>> Perhaps this has already been tried?
>>>>>
>>>>> Sean
>>>>>
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>>>>> Sent: Wednesday, January 22, 2014 1:06 PM
>>>>> To: 'dev@ctakes.apache.org'
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> The only rule I know of is that cTAKES (prior to ytex integration)
>>>>> always forces a sentence break at a newline.
>>>>> This was because the clinical notes cTAKES originally processed never
>>>>> had newlines in the middle of a sentence, but did need sentence breaks
>>>>> to occur at end of sentence for good negation detection on those notes.
>>>>> I think Guergana earlier mentioned other EMRs also have this need,
>>>>> but it seems to not be ubiquitous.
>>>>>
>>>>> From others' posts, it seems that we could use an option in cTAKES to
>>>>> turn off this forcing of sentence breaks at newlines (or depending on
>>>>> how you look at it, an option to turn on the forcing of sentence
>>>>> breaks if we change the default behavior)
>>>>>
>>>>> I think we (cTAKES) need to decide the following:
>>>>> - do we want to do this for entire notes, or would it be worth it to
>>>>>   have it be on a section-by-section basis.
>>>>> - what do we make the default behavior - to force or not to force
>>>>>   newlines to be sentence breaks
>>>>> - what data (that contains newlines) will we use for training the
>>>>>   sentence detector
>>>>>
>>>>> Regardless of those answers, I think OpenNLP support for including
>>>>> newlines in training data would be valuable for those others who have
>>>>> sentences that span lines. And having an option on OpenNLP to always
>>>>> break at newline would be useful for at least some cTAKES users (and
>>>>> we could remove the cTAKES code that does that)
>>>>>
>>>>> -- James
>>>>>
>>>>> -----Original Message-----
>>>>> From: dev-return-2390-Masanz.James=mayo....@ctakes.apache.org [mailto:
>>>>> dev-return-2390-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
>>>>> Jörn Kottmann
>>>>> Sent: Tuesday, January 21, 2014 4:29 AM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: sentence detector newline behavior
>>>>>
>>>>> Yes, exactly, OPENNLP-602 is about training a sentence detector model
>>>>> which can use a new line as an end-of-sentence character.
>>>>>
>>>>> In case you have certain rules to split sentences we should have a
>>>>> look at them. The Sentence Detector could be extended to support a
>>>>> user provided rule based splitter. If there is an interest in that we
>>>>> could probably get it into 1.6.0 as well.
>>>>>
>>>>> Jörn
>>>>>
>>>>> On 01/20/2014 10:02 PM, Chen, Pei wrote:
>>>>>> I presume Joern was suggesting that if he supports new lines in the
>>>>>> opennlp SentenceDetector (either part of the trained models or post
>>>>>> processing with some rules?) cTAKES will be able to use it out of
>>>>>> the box and we should be able to remove any additional custom logic
>>>>>> that we currently have - which seems like a good idea.
>>>>>> [but when to use within cTAKES individual components such as negation
>>>>>> might be another discussion?] --Pei
>>>>>>
>>>>>>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vnga...@gmail.com> wrote:
>>>>>>> The sentence detection opennlp model used by ctakes does not split
>>>>>>> sentences at newlines - there is additional logic in the ctakes
>>>>>>> sentence splitter that does this (and an alternative impl that
>>>>>>> doesn't is in the ytex branch). Afaik no retraining / change to the
>>>>>>> feature representation is necessary.
>>>>>>>
>>>>>>> Vj
>>>>>>>
>>>>>>>> On Monday, January 20, 2014, Jörn Kottmann <kottm...@gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> currently I have quite a bit of time to work on OpenNLP, and would
>>>>>>>> like to help you out with this issue.
>>>>>>>>
>>>>>>>> Here is the follow up issue for this change:
>>>>>>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>>>>>>
>>>>>>>> I am still trying to figure out what would be the best option to
>>>>>>>> implement this.
>>>>>>>> In the training data a user could just use a special tag to
>>>>>>>> identify the chars.
>>>>>>>>
>>>>>>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
>>>>>>>> encode these two chars in the training data. Any thoughts?
>>>>>>>>
>>>>>>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jörn
>>>>>>>>
>>>>>>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>>>>>>
>>>>>>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>>>>>>
>>>>>>>>>> That's awesome! It might be worth trying at least. How does the
>>>>>>>>>> training process change? Previously the training data would be
>>>>>>>>>> one sentence per line, but with newlines as possible mid-sentence
>>>>>>>>>> characters that could be trouble, is there a new representation
>>>>>>>>>> for training data? Or would we have to use the training api?
>>>>>>>>> Good point, yes that will be a problem with the default training
>>>>>>>>> format, but it shouldn't be hard to solve. In the format itself we
>>>>>>>>> could define a new line tag e.g. <NEWLINE> to mark new lines.
>>>>>>>>> As a hack to make it work with 1.5.3 you could instead use a
>>>>>>>>> special char as a replacement for the new line char.
>>>>>>>>> When you pass the text down to the sentence detector a simple
>>>>>>>>> string replace could be used to convert all new line chars to the
>>>>>>>>> special new line marker char.
>>>>>>>>>
>>>>>>>>> If things work out for you performance wise as well we will just
>>>>>>>>> integrate it properly into OpenNLP for the next release.
>>>>>>>>>
>>>>>>>>> Could you produce a sentence detector training file with a new
>>>>>>>>> line marker char?
>>>>>>>>>
>>>>>>>>> You should try to pick a char you can also pass in on a terminal,
>>>>>>>>> otherwise you have to use the API to train the model. The built-in
>>>>>>>>> cross validation could be used to evaluate the performance.
>>>>>>>>>
>>>>>>>>> Jörn
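
The marker-character workaround described in the thread (replace each newline with a special character before sentence detection, then put the newlines back afterwards) can be sketched roughly as below. This is only an illustration under stated assumptions: the class name NewlineMarker, its method names, and the choice of the pilcrow sign as the marker are hypothetical and are not taken from cTAKES or OpenNLP, and no OpenNLP calls are shown. Only line feeds are handled; carriage returns are left untouched.

```java
// Sketch only: NewlineMarker and MARKER are hypothetical, not part of cTAKES or OpenNLP.
final class NewlineMarker {

    // Assumed marker character (pilcrow sign). It must not occur in the real input.
    static final char MARKER = '\u00B6';

    // Replace every line feed with the marker. The swap is one char for one char,
    // so character offsets in the encoded text match offsets in the original text.
    static String encode(String text) {
        if (text.indexOf(MARKER) >= 0) {
            throw new IllegalArgumentException("input already contains the marker char");
        }
        return text.replace('\n', MARKER);
    }

    // Restore the line feeds after sentence detection.
    static String decode(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "The patient has\nbeen seen six times.\nPain persists.";
        String encoded = encode(note);
        // A sentence detector would be run on `encoded` at this point.
        // Any spans it returns can be applied to `note` directly,
        // because the two strings have the same length.
        System.out.println(encoded.length() == note.length());
        System.out.println(decode(encoded).equals(note));
    }
}
```

Because the replacement keeps the text length unchanged, sentence offsets computed on the encoded text stay valid for the original note, which is why a single-character marker is used here rather than a multi-character tag.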