We could possibly add some additional datasets for training. MIMIC data does come to mind -- I can't remember off the top of my head if the MIMIC dataset has sentences spanning lines or not.
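A rough way to check whether a candidate corpus has sentences spanning lines is to count line breaks where the line before has no closing punctuation and the line after starts in lowercase. The sketch below is only a heuristic under that assumption; `count_midsentence_breaks` is a hypothetical helper, not something from cTAKES, OpenNLP, or MIMIC tooling.

```python
import re

# Hypothetical helper (not part of cTAKES or OpenNLP): estimate how often
# a line break falls in the middle of a sentence.
_SENTENCE_END = re.compile(r"[.!?:;]\W*$")


def count_midsentence_breaks(text):
    """Return (suspect_breaks, total_breaks) over adjacent non-empty lines."""
    lines = text.splitlines()
    suspect = 0
    total = 0
    for prev, nxt in zip(lines, lines[1:]):
        prev, nxt = prev.rstrip(), nxt.lstrip()
        if not prev or not nxt:
            continue  # blank lines are paragraph breaks, not wraps
        total += 1
        # Heuristic: no closing punctuation before the break and a
        # lowercase start after it suggests a wrapped sentence.
        if not _SENTENCE_END.search(prev) and nxt[0].islower():
            suspect += 1
    return suspect, total
```

A high ratio of suspect breaks would suggest the corpus is worth using as training data for newline handling; list-like sections with capitalized items are deliberately not counted.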
--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical Association
ksa...@ksarma.com
gchat: ksa...@gmail.com
linkedin: www.linkedin.com/in/ksarma

On Thu, Jan 23, 2014 at 4:22 AM, vijay garla <vnga...@gmail.com> wrote:

> Just to clarify - with the YTEX branch there are 2 sentence splitters - the
> original ctakes sentence splitter that splits on newlines, and the ytex
> sentence splitter that doesn't. The changes to other components in the ytex
> branch (dependency parser, assertion) work with both sentence splitters.
>
> I think it would be great if the intelligence regarding how to split was in
> the opennlp model, but this requires training data. I don't know what the
> training data is, or if the training data has sentences that cross newline
> boundaries (if not, it won't buy us anything).
>
> vijay
>
> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > On my end it looks like my email was reformatted and some of my -newline-
> > removed in those last examples ...
> >
> > -----Original Message-----
> > From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> > Sent: Wednesday, January 22, 2014 3:42 PM
> > To: dev@ctakes.apache.org
> > Subject: RE: sentence detector newline behavior
> >
> > Thanks James
> >
> > > but then no typical sentence ending punctuation at the end of the line
> >
> > Gotcha.
> >
> > > So simply using Lines would not suffice in those cases because it
> > > would run together sentences where there are more than one on a line
> >
> > I was actually thinking about something like a Line using -sentence
> > breaks- in addition to -newline-. In other words, a Sentence being what
> > cTakes detects by ignoring CR/LF, and Lines being those Sentences
> > subdivided by -newline-. Perhaps "Line" is a horrible moniker.
> > Regardless, it doesn't solve the problem of inappropriately missing
> > punctuation. I was focused a little more on the difference between
> > persistent auto line wrapping and structured information like lists, where
> > the first benefits from Sentence and the second from Line.
> >
> > "The Patient has
> > been prescribed two
> > medications."
> >
> > "Prescriptions:
> > Advil
> > Tylenol
> > No Aspirin"
> >
> > However, when it comes to the problem that you mention, there is no
> > benefit to a Line.
> >
> > "The patient has been seen six times in the past week. Pain has been
> > persistent for ten days Advil and Tylenol have been prescribed"
> > -- 2 sentences, 3 lines
> >
> > "The patient has been seen six times in the past week.
> > Pain has been persistent for ten days
> > Advil and Tylenol have been prescribed"
> > -- 2 sentences, 3 lines
> >
> > "The patient has been seen six times in
> > the past week. Pain has been persistent for ten days Advil and Tylenol
> > have been prescribed"
> > -- 2 sentences, 5 lines
> >
> > Nothing can really be done for the last bit where punctuation is missing.
> >
> > -----Original Message-----
> > From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> > Sent: Wednesday, January 22, 2014 3:07 PM
> > To: 'dev@ctakes.apache.org'
> > Subject: RE: sentence detector newline behavior
> >
> > I know there are notes where there are multiple sentences on a line, but
> > then no typical sentence ending punctuation at the end of the line (or no
> > punctuation at all at the end of the line). And in those sections, negation
> > can be important. So simply using Lines would not suffice in those cases
> > because it would run together sentences where there are more than one on a
> > line. And using sentences alone (as found by OpenNLP 1.5) would not suffice
> > because it would run together sentences from different lines.
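The Sentence/Line distinction in the messages above can be sketched roughly as follows. This is a minimal illustration, assuming a naive punctuation rule as a stand-in for a trained sentence detector; `sentences` and `lines_of` are hypothetical names, not cTAKES types or OpenNLP calls.

```python
import re

# Naive stand-in for a trained sentence detector: break after . ! or ?
# followed by whitespace. A newline on its own never ends a Sentence here.
_BREAK = re.compile(r"(?<=[.!?])\s+")


def sentences(text):
    """Sentences found by punctuation only, ignoring CR/LF."""
    return [s for s in _BREAK.split(text.strip()) if s]


def lines_of(sentence):
    """Lines: a single Sentence subdivided at its newlines."""
    return [ln.strip() for ln in sentence.splitlines() if ln.strip()]
```

With this split, a component that wants wrapped prose iterates over `sentences(text)`, while one that wants list items iterates over `lines_of(s)` for each sentence. As noted in the thread, neither view recovers a boundary where punctuation is missing in the middle of a line.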
> >
> > -----Original Message-----
> > From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> > Sent: Wednesday, January 22, 2014 1:33 PM
> > To: dev@ctakes.apache.org
> > Subject: RE: sentence detector newline behavior
> >
> > Just whistling in the wind here ...
> >
> > Perhaps before any changes are made to universally toggle cTakes in one
> > direction or the other, we can take a poll of when & where
> > cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a
> > Line (CR/LF delimited PLUS -sentence-)
> >
> > If some capabilities like negation detection require -lines- then would it
> > make more sense to have Sentence ignore -newline- and negation detection
> > itself split the Sentence into line items? If an annotator is interested
> > in list items, each of which may be on a distinct -line-, then it can split
> > up the Sentence as needed. I think that James hints that cTakes code
> > already does this in some places.
> >
> > If a good deal of functionality requires -newline- delimited types, would
> > it make sense to introduce a type Line? If something uses a structured
> > list it could iterate through Line types, while something using pure text
> > could iterate through Sentence types. This facilitates section-by-section
> > different behavior, does not require any decision on global defaults, and
> > makes data selection for training Sentence a non-issue wrt line breaks.
> > However, it adds to the system and would require a per-use choice decision
> > by developers OR a toggle by users (back to the default decision).
> > Perhaps this has already been tried?
> >
> > Sean
> >
> > -----Original Message-----
> > From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
> > Sent: Wednesday, January 22, 2014 1:06 PM
> > To: 'dev@ctakes.apache.org'
> > Subject: RE: sentence detector newline behavior
> >
> > The only rule I know of is that cTAKES (prior to ytex integration) always
> > forces a sentence break at a newline.
> > This was because the clinical notes cTAKES originally processed never had
> > newlines in the middle of a sentence, but did need sentence breaks to occur
> > at end of sentence for good negation detection on those notes.
> > I think Guergana earlier mentioned other EMRs also have this need, but it
> > seems to not be ubiquitous.
> >
> > From others' posts, it seems that we could use an option in cTAKES to turn
> > off this forcing of sentence breaks at newlines (or depending on how you
> > look at it, an option to turn on the forcing of sentence breaks if we
> > change the default behavior)
> >
> > I think we (cTAKES) need to decide the following:
> > - do we want to do this for entire notes, or would it be worth it to
> >   have it be on a section-by-section basis.
> > - what do we make the default behavior - to force or not to force
> >   newlines to be sentence breaks
> > - what data (that contains newlines) will we use for training the
> >   sentence detector
> >
> > Regardless of those answers, I think OpenNLP support for including
> > newlines in training data would be valuable for those others who have
> > sentences that span lines. And having an option on OpenNLP to always break
> > at newline would be useful for at least some cTAKES users (and we could
> > remove the cTAKES code that does that)
> >
> > -- James
> >
> > -----Original Message-----
> > From: dev-return-2390-Masanz.James=mayo....@ctakes.apache.org [mailto:
> > dev-return-2390-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
> > Jörn Kottmann
> > Sent: Tuesday, January 21, 2014 4:29 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > Yes, exactly, OPENNLP-602 is about training a sentence detector model
> > which can use a new line as an end-of-sentence character.
> >
> > In case you have certain rules to split sentences we should have a look at
> > them. The Sentence Detector could be extended to support a user provided
> > rule based splitter.
> > If there is an interest in that we could probably get
> > it into 1.6.0 as well.
> >
> > Jörn
> >
> > On 01/20/2014 10:02 PM, Chen, Pei wrote:
> > > I presume Joern was suggesting that if he supports new lines in the
> > > opennlp SentenceDetector (either part of the trained models or post
> > > processing with some rules?) cTAKES will be able to use it out of the
> > > box and we should be able to remove any additional custom logic that we
> > > currently have - which seems like a good idea.
> > >
> > > [but when to use within cTAKES individual components such as negation
> > > might be another discussion?] --Pei
> > >
> > >> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vnga...@gmail.com> wrote:
> > >>
> > >> The sentence detection opennlp model used by ctakes does not split
> > >> sentences at newlines - there is additional logic in the ctakes
> > >> sentence splitter that does this (and an alternative impl that
> > >> doesn't is in the ytex branch). Afaik no retraining / change to the
> > >> feature representation is necessary.
> > >>
> > >> Vj
> > >>
> > >>> On Monday, January 20, 2014, Jörn Kottmann <kottm...@gmail.com> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> currently I have quite a bit of time to work on OpenNLP, and would
> > >>> like to help you out with this issue.
> > >>>
> > >>> Here is the follow up issue for this change:
> > >>> https://issues.apache.org/jira/browse/OPENNLP-602
> > >>>
> > >>> I am still trying to figure out what would be the best option to
> > >>> implement this.
> > >>> In the training data a user could just use a special tag to identify
> > >>> the chars.
> > >>>
> > >>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to
> > >>> encode these two chars in the training data. Any thoughts?
> > >>>
> > >>> I am planning to release this as part of OpenNLP 1.6.0.
> > >>>
> > >>> Thanks,
> > >>> Jörn
> > >>>
> > >>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
> > >>>>
> > >>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
> > >>>>>
> > >>>>> That's awesome! It might be worth trying at least. How does the
> > >>>>> training process change? Previously the training data would be one
> > >>>>> sentence per line, but with newlines as possible mid-sentence
> > >>>>> characters that could be trouble, is there a new representation
> > >>>>> for training data? Or would we have to use the training api?
> > >>>> Good point, yes that will be a problem with the default training
> > >>>> format, but it shouldn't be hard to solve. In the format itself we
> > >>>> could define a new line tag e.g.
> > >>>> <NEWLINE> to mark new lines.
> > >>>> As a hack to make it work with 1.5.3 you could instead use a
> > >>>> special char as a replacement for the new line char.
> > >>>> When you pass the text down to the sentence detector a simple
> > >>>> string replace could be used to convert all new line chars to the
> > >>>> special new line marker char.
> > >>>>
> > >>>> If things work out for you performance wise as well we will just
> > >>>> integrate it properly into OpenNLP for the next release.
> > >>>>
> > >>>> Could you produce a sentence detector training file with a new line
> > >>>> marker char?
> > >>>>
> > >>>> You should try to pick a char you can also pass in on a terminal
> > >>>> otherwise you have to use the API to train the model. The built-in
> > >>>> cross validation could be used to evaluate the performance.
> > >>>>
> > >>>> Jörn
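The marker-character hack described at the bottom of the thread (string-replace every newline char with a special char before the text reaches the sentence detector) might look something like the sketch below. The two marker characters and all function names are assumptions chosen for illustration; the only requirement taken from the thread is that the markers never occur in the notes. Using one marker per char, for CR and LF separately, keeps the encoded text the same length, so offsets reported on the encoded text should still line up with the original note.

```python
# Illustrative marker choices (assumptions): any two characters that never
# occur in the corpus would do. One marker per char keeps offsets aligned.
CR_MARKER = "\u240d"  # SYMBOL FOR CARRIAGE RETURN
LF_MARKER = "\u240a"  # SYMBOL FOR LINE FEED

_ENCODE = str.maketrans({"\r": CR_MARKER, "\n": LF_MARKER})
_DECODE = str.maketrans({CR_MARKER: "\r", LF_MARKER: "\n"})


def encode_newlines(text):
    """Swap CR and LF for marker chars before sentence detection."""
    return text.translate(_ENCODE)


def decode_newlines(text):
    """Restore the original CR and LF chars afterwards."""
    return text.translate(_DECODE)


def to_training_line(sentence):
    """One sentence per physical line, with inner newlines encoded."""
    return encode_newlines(sentence.strip())
```

The same `encode_newlines` call would have to be applied both when building the training file and when passing text to the detector at run time. Markers outside ASCII may be awkward to type on a terminal, which is the practical concern raised above about having to fall back to the training API.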