Re: sentence detector newline behavior
On 01/27/2014 08:44 PM, Tim Miller wrote:
> That is a good point, and something I was wondering about. Having now looked at both the ctakes and opennlp code for the sentence splitter it seems like there is a lot of overlap. I would've thought it was just a matter of converting annotations into our type system. So I'm curious if there is some justification for why there seems to be duplication (or if I'm hallucinating it).

It should be possible (and if not we should make it possible) to directly use the opennlp-uima integration. It supports dynamic types which can be mapped in the descriptor.

This would also give you a smooth transition: your existing integration could be labeled as deprecated and removed in one of the future releases.

Jörn
RE: sentence detector newline behavior
+1

There's an example of the configs here :) https://issues.apache.org/jira/browse/CTAKES-98

I think we should be able to use OpenNLP's Sentence Annotator directly if we no longer need the custom newline rule(s). [Or, if we find that a fixed rule is still required, perhaps OpenNLP can support it via config as well; there doesn't seem to be anything cTAKES-specific about it.]

Pending the results of Tim's retraining/evaluation of the new models?

--Pei

-----Original Message-----
From: Jörn Kottmann [mailto:kottm...@gmail.com]
Sent: Wednesday, January 29, 2014 3:55 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On 01/27/2014 08:44 PM, Tim Miller wrote:
> That is a good point, and something I was wondering about. Having now looked at both the ctakes and opennlp code for the sentence splitter it seems like there is a lot of overlap. I would've thought it was just a matter of converting annotations into our type system. So I'm curious if there is some justification for why there seems to be duplication (or if I'm hallucinating it).

It should be possible (and if not we should make it possible) to directly use the opennlp-uima integration. It supports dynamic types which can be mapped in the descriptor.

This would also give you a smooth transition: your existing integration could be labeled as deprecated and removed in one of the future releases.

Jörn
Re: sentence detector newline behavior
On 01/27/2014 03:52 PM, Tim Miller wrote:
> OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn! Assuming this is in the next incremental release of opennlp, how quickly can we get a re-trained model into cTAKES?

I am currently working full time on the next release, and will very likely be able to keep up the pace until it is out. We are making bigger changes this time, and the next version will be 1.6.0.

One of the interesting changes for you here could be the pluggable machine learning support (e.g. support for liblinear, mallet etc.).

Jörn
Re: sentence detector newline behavior
On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> Yes, this fixes the whitespace sentence issue but the evaluation issue remains. I believe the problem is in SentenceSampleStream, where in the following block the whitespace trim happens before the LF character is replaced with the \n character. So test sentences that ended with LF will be one character longer than they should be.
>
>     sentence = sentence.trim();
>     sentence = replaceNewLineEscapeTags(sentence);
>     sentencesString.append(sentence);
>     int end = sentencesString.length();
>     sentenceSpans.add(new Span(begin, end));
>     sentencesString.append(' ');

Yes, that must be the issue. During training the new line is included in the span, and during detection the white space remover creates a span without the new line char.

I suggest that the evaluator just ignore white space differences between sentences. My test case then has the expected performance numbers. What do you think?

Anyway, I committed the change. Please give it a try.

Jörn
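The suggestion above (have the evaluator ignore white space differences between sentence spans) can be sketched roughly as follows. This is only an illustration in plain Java, not the change that was committed to OpenNLP; the class and method names are made up for the example, and spans are represented as simple `{begin, end}` int pairs with an exclusive end.

```java
import java.util.Arrays;

// Illustration only: hypothetical helper, not part of OpenNLP.
class SpanCompareDemo {

    // Shrinks the span [begin, end) so that it neither starts nor ends on whitespace.
    static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans count as the same sentence if they match after whitespace trimming.
    static boolean sameIgnoringWhitespace(CharSequence text, int[] a, int[] b) {
        return Arrays.equals(trimSpan(text, a[0], a[1]), trimSpan(text, b[0], b[1]));
    }

    public static void main(String[] args) {
        String text = "One.\nTwo.";
        int[] gold = {0, 5};      // reference span that includes the trailing newline
        int[] predicted = {0, 4}; // detected span without the newline
        System.out.println(sameIgnoringWhitespace(text, gold, predicted));
    }
}
```

With a comparison like this, a reference span that still carries the trailing newline and a predicted span that does not would be scored as a match, which is the behavior the thread is asking for.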
Re: sentence detector newline behavior
OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of opennlp, how quickly can we get a re-trained model into cTAKES? I heard from a researcher at AMIA who tried cTAKES and, because of this bug in the way we handle sentences, was trying to find an outside sentence detector as a preprocess to cTAKES, and frankly that is insane. We should be able to get something this simple right. And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.

James, I believe you are usually the one who rebuilds the models? What would be the best way to incorporate the data I have that has some instances of non-sentence terminating newlines?

Tim

On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> Yes, that must be the issue. During training the new line is included in the span, and during detection the white space remover creates a span without the new line char.
>
> I suggest that the evaluator just ignore white space differences between sentences. My test case then has the expected performance numbers. What do you think?
>
> Anyway, I committed the change. Please give it a try.
>
> Jörn
RE: sentence detector newline behavior
Tim,

I just had to chime in on a comment you made. My deadline has been extended a bit on my pressing issue, but I do intend to get back to testing per VJ's fix, or maybe another fix is in the works based on the latest emails... I need to read them again since a lot has been stated on the issue.

Okay, as a new user (working w/cTAKES since October) I have never thought what you had stated:

> And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.

Yeah, the sentence-spanning-newline issue was a problem, so I just brought attention to it by my post of inquiry earlier this month on VJ's fix from last month, and worked around it by treating the narrative as one string. Anyone who's looked at the code would appreciate and acknowledge that cTAKES is a powerful and complex application. I'm overall impressed with it and I intend to continue to use it, improve it, and grow with it. I've been delving deeper into cTAKES on the machine learning aspect... I'm struggling a bit with it, and if anything I scratch my head and doubt my competence. ;-)

Regards,
Paula
RE: sentence detector newline behavior
Tim, is the training data something you can share publicly? Or privately?

I can't publicly share the data that has been used to train the sentence detector; I can only share the models that get built. And you can't build a model from an existing model + more data; you need all the training data together.

Regarding how quickly we can get this out there, I can train a new sentence detector in a day or two. But that's just the first step. To really incorporate this, I would suggest this be a point release. We would need a release manager for that. Right now I don't have time for that. I haven't heard a consensus saying whether this should be the new behavior.

From what I remember, we are going to need code changes to make optional the code that splits at line breaks. Or was your test replacing the existing cTAKES sentence detector and just using OpenNLP directly?

-- James
RE: sentence detector newline behavior
I didn't write the cTAKES sentence detector so I can't answer definitively, but I do know it was originally written using what is now a pretty old version of OpenNLP and needed some things you couldn't get from the out-of-the-box OpenNLP at the time.

From what I remember the things specific to it were
- the list of end of sentence candidate characters
- and the handling of newlines

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Monday, January 27, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On 01/27/2014 02:35 PM, Masanz, James J. wrote:
> Tim, is the training data something you can share publicly? Or privately? I can't publicly share the data that has been used to train the sentence detector, I can only share the models that get built. And you can't build a model from an existing model + more data, you need all the training data together.

It is from the MIMIC corpus, which I definitely can't share publicly, but it's worth looking into whether I could share it privately with another person who has a signed data use agreement.

> Regarding how quickly we can get this out there, I can train a new sentence detector in a day or two. But that's just the first step - to really incorporate this, I would suggest this be a point release. We would need a release manager for that. Right now I don't have time for that. I haven't heard a consensus saying whether this should be the new behavior.

Yeah, I suppose this is subject to the scale of the changes we make.

> From what I remember we are going to need code changes to make optional the code that splits at line breaks, or was your test replacing the existing cTAKES sentence detector and just using OpenNLP directly.

That is a good point, and something I was wondering about. Having now looked at both the ctakes and opennlp code for the sentence splitter it seems like there is a lot of overlap. I would've thought it was just a matter of converting annotations into our type system. So I'm curious if there is some justification for why there seems to be duplication (or if I'm hallucinating it).

Tim
Re: sentence detector newline behavior
For clarity, I'd like to stress that the opennlp sentence model distributed with ctakes today does 'work' with sentences that span newlines - as I understand it, this model ignores newline tokens (or newlines are not provided as features to that model).

I believe the improvements Tim and others are suggesting are for a new sentence model + feature representation that takes advantage of newlines as features.

Whatever we do, I believe we need backwards compatibility - those who are using the current sentence model may need to continue using it. To that end:

* If we upgrade to the newest version of opennlp, will the old model work (and produce the same results)?
* If a contributor trains a new model that uses a different feature representation, I believe that should go into a new Sentence Detector AnalysisEngine (or the same AE but with different configuration parameters), so users have a choice between the old and the new.

-vj

On Mon, Jan 27, 2014 at 1:09 PM, digital paula cybersat...@hotmail.com wrote:
> Tim,
>
> I just had to chime in on a comment you made. My deadline has been extended a bit on my pressing issue but I do intend to get back to testing per VJ's fix or maybe another fix is in the works based on latest emails... I need to read them again since a lot has been stated on the issue.
>
> Okay, as a new user (working w/cTAKES since October) I have never thought what you had stated:
>
>> And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.
>
> Yeah, the sentence-spanning-newline issue was a problem so I just brought attention to it by my post of inquiry earlier this month on VJ's fix from last month and worked around it with treating narrative as one string. Anyone who's looked at the code would appreciate and acknowledge that cTAKES is a powerful and complex application. I'm overall impressed with it and I intend to continue to use it, improve it, and grow with it. I've been delving deeper into cTAKES on the machine learning aspect... I'm struggling a bit with it and if anything I scratch my head and doubt my competence. ;-)
>
> Regards,
> Paula
Re: sentence detector newline behavior
On 01/27/2014 06:03 PM, vijay garla wrote:
> For clarity, I'd like to stress that the opennlp sentence model distributed with ctakes today does 'work' with sentences that span newlines - as I understand it, this model ignores newline tokens (or newlines are not provided as features to that model).

Well, it depends on your definition of works :). It doesn't throw an exception, but it automatically splits sentences at newlines. It is relatively normal to have text that wraps at ~80 characters with newlines added. It will look like this (this is made up text):

The patient was having difficulty getting out of bed and was taking aspirin in
the morning. He has returned today for a prescription for something stronger.

This style will cause multiple sentence fragments to be encoded which, as we've seen, will wreak havoc with negation detection.

> I believe the improvements Tim and others are suggesting are for a new sentence model + feature representation that takes advantage of newlines as features.

To be precise, I'm proposing adding newlines to the set of characters that are candidates for end of sentences (i.e. decision points for the classifier), instead of having the hard constraint of splitting at all newlines.

> Whatever we do, I believe we need backwards compatibility - those who are using the current sentence model may need to continue using it. To that end:
> * If we upgrade to the newest version of opennlp, will the old model work (and produce the same results)?

I definitely think we shouldn't release a new model that doesn't perform well in some absolute sense. But I think this change generalizes the old model, so that, given that it meets that absolute standard, a user should only see improvements. Specifically, they should see fewer incorrect sentence fragments if they give us text with newlines in mid-sentence. IMHO, that kind of change doesn't require 'backwards compatibility' per se. Maybe we can make it an option to have a hard constraint that breaks on newlines, but I think it should default to not doing so.

> * If a contributor trains a new model that uses a different feature representation, I believe that should go into a new Sentence Detector AnalysisEngine (or the same AE but with different configuration parameters), so users have a choice between the old and the new.

Yeah, I think having configuration parameters is fine as long as we have smart defaults.

Thanks for your input VJ.

Tim
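To make the difference between the two behaviors discussed in this message concrete, here is a small sketch in plain Java. It is not cTAKES or OpenNLP code: `toyDecision` is a hand-written rule standing in for the trained classifier, and all names are hypothetical. It contrasts a hard split at every newline with treating the newline as just one more candidate end-of-sentence character, using the made-up wrapped text from the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Illustration only: a toy rule replaces the real statistical model.
class NewlineCandidateDemo {

    // Splits at every position whose character is in eosChars AND that the
    // decision function accepts as a sentence boundary.
    static List<String> split(String text, String eosChars,
                              BiPredicate<String, Integer> isBoundary) {
        List<String> sentences = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0 && isBoundary.test(text, i)) {
                addIfNotBlank(sentences, text.substring(begin, i + 1));
                begin = i + 1;
            }
        }
        addIfNotBlank(sentences, text.substring(begin));
        return sentences;
    }

    // Hard constraint: break at every newline first, then split each line on punctuation.
    static List<String> hardNewlineSplit(String text,
                                         BiPredicate<String, Integer> isBoundary) {
        List<String> sentences = new ArrayList<>();
        for (String line : text.split("\n")) {
            sentences.addAll(split(line, ".!?", isBoundary));
        }
        return sentences;
    }

    // Toy stand-in for the classifier (an assumption made for this example only).
    static boolean toyDecision(String text, int i) {
        if (text.charAt(i) == '\n') {
            // A newline ends a sentence only if the text before it ends in punctuation.
            int j = i - 1;
            while (j >= 0 && text.charAt(j) == ' ') {
                j--;
            }
            return j >= 0 && ".!?".indexOf(text.charAt(j)) >= 0;
        }
        // Punctuation ends a sentence if followed by whitespace or the end of the text.
        return i + 1 == text.length() || Character.isWhitespace(text.charAt(i + 1));
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String text = "The patient was having difficulty getting out of bed and was taking aspirin in\n"
                + "the morning. He has returned today for a prescription for something stronger.";
        // Hard split: the first sentence is cut into two fragments at the newline.
        System.out.println(hardNewlineSplit(text, NewlineCandidateDemo::toyDecision).size());
        // Newline as a candidate: the mid-sentence newline is rejected as a boundary.
        System.out.println(split(text, ".!?\n", NewlineCandidateDemo::toyDecision).size());
    }
}
```

In the proposal, the accept/reject decision at each candidate would come from a model trained on data that contains mid-sentence newlines, rather than from a fixed rule like the one above.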
Re: sentence detector newline behavior
On 01/25/2014 10:03 PM, Miller, Timothy wrote:
> On 01/25/2014 12:24 PM, Jörn Kottmann wrote:
>> The code which computes the spans tries to remove white space from it. Removing the white space from a whitespace only sentence is causing the exception you are seeing.
>>
>> Which response would you expect from the sentence detector? Should a white space only sentence be returned?
>
> I would say no.
>
>> In case a sentence is terminated by a new line, should the new line char be included in the sentence span or not?
>
> I would also say no. I made a quick patch for this issue -- now it runs but scores really poorly compared to my model file (30 vs 75 or so). I suspect something is wrong with the evaluation, the spans being slightly off somehow.

The evaluation should ignore white spaces. I have now committed my fix; it would be nice if you could test it.

There might still be something wrong. In my test data I replaced all question marks with white spaces, and the result is slightly worse than with the original data.

Jörn
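As a rough illustration of how stripping white space from a whitespace-only span can leave the start index behind the end index, consider the following plain Java sketch. It is not the OpenNLP implementation; both methods are hypothetical and only show the failure mode and one possible guard (skipping spans that contain nothing but white space).

```java
// Illustration only: NOT the OpenNLP code, just the general failure mode.
class WhitespaceSpanDemo {

    // Naive trimming: each boundary moves without looking at the other one,
    // so on a whitespace-only span the two indices cross.
    static int[] naiveTrim(String text, int start, int end) {
        while (start < text.length() && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {start, end};
    }

    // Guarded trimming: the boundaries never cross, and a span that holds
    // only whitespace is reported as null so the caller can skip it.
    static int[] guardedTrim(String text, int start, int end) {
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return start == end ? null : new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "A.  \nB.";            // period, two spaces, line feed
        int[] naive = naiveTrim(text, 2, 5);  // candidate span covering only the whitespace
        System.out.println("start=" + naive[0] + ", end=" + naive[1]);
        System.out.println(guardedTrim(text, 2, 5) == null);
    }
}
```

A span built from the naive result would have its start after its end, which matches the kind of IllegalArgumentException reported elsewhere in this thread.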
Re: sentence detector newline behavior
On 01/26/2014 09:59 AM, Jörn Kottmann wrote:
> The evaluation should ignore white spaces. I committed now my fix, it would be nice if you can test it.
>
> There might be still something wrong. In my test data I replaced all question marks with white spaces, and the result is slightly worse than with the original data.
>
> Jörn

Yes, this fixes the whitespace sentence issue but the evaluation issue remains. I believe the problem is in SentenceSampleStream, where in the following block the whitespace trim happens before the LF character is replaced with the \n character. So test sentences that ended with LF will be one character longer than they should be.

    sentence = sentence.trim();
    sentence = replaceNewLineEscapeTags(sentence);
    sentencesString.append(sentence);
    int end = sentencesString.length();
    sentenceSpans.add(new Span(begin, end));
    sentencesString.append(' ');
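The ordering problem described above can be reproduced in isolation. The sketch below is plain Java with no OpenNLP dependency. The `replaceNewLineEscapeTags` here is a stand-in written for this example, and it assumes the escape tags are spelled `<LF>` and `<CR>` (the archived text shows them as bare LF and CR), so treat both as assumptions.

```java
// Illustration only: mimics the order of operations, not the real SentenceSampleStream.
class TrimOrderDemo {

    // Stand-in helper; the tag spellings <LF> and <CR> are an assumption.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    // Order from the quoted block: trim first, then unescape.
    // The trailing tag survives the trim, so the newline ends up inside the sentence.
    static int lengthTrimThenReplace(String sentence) {
        return replaceNewLineEscapeTags(sentence.trim()).length();
    }

    // Alternative order: unescape first, then trim, so the newline is stripped.
    static int lengthReplaceThenTrim(String sentence) {
        return replaceNewLineEscapeTags(sentence).trim().length();
    }

    public static void main(String[] args) {
        String sample = "BILAT HEMATOMAS.<LF>";
        System.out.println(lengthTrimThenReplace(sample)); // prints 17
        System.out.println(lengthReplaceThenTrim(sample)); // prints 16
    }
}
```

The one-character difference is the trailing newline. Whether the better fix is to reorder these two steps or to make the evaluator ignore white space is the design question discussed in this thread; the sketch only shows where the extra character comes from.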
Re: sentence detector newline behavior
Thanks Joern, I'll try it. My understanding is that I just need to give it my training data, with the special character I used replaced with the literal string LF, and each line in the file is an example sentence.

Just thinking about the cTAKES wrapper -- do your changes make it so that we wouldn't need to add the special characters (LF, CR) to a document within the cTAKES sentence detector wrapper? It sounds like we would need to add CR and LF to our eosChars value. It's early (for my brain), but I wonder whether that could be a default on the opennlp end?

Tim

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:
> The changes are now committed.
>
> To train a model which can recognize new lines, the new lines must be encoded with the CR or LF tags (or both). The same tags are used to pass in the eos chars to the command line trainer.
>
> For example:
>
>     SentenceDetectorCrossValidator -lang en -data /home/xyz/eos-cr.all -encoding ISO-8859-15 -eosChars .!?:LF
>
> Tim, it would be nice if you could test this with your annotations.
>
> Jörn
>
> On 01/23/2014 10:06 PM, Tim Miller wrote:
>> Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed ctakes input to replace newlines with that character, then did sentence detection, then added the newlines back in. I would be happy to share these annotations and my code modifications.
>>
>> Tim
>>
>> On 01/23/2014 04:01 PM, Karthik Sarma wrote:
>>> We could possibly add some additional datasets for training. MIMIC data does come to mind -- I can't remember off the top of my head if the MIMIC dataset has sentences spanning lines or not.
>>>
>>> --
>>> Karthik Sarma
>>> UCLA Medical Scientist Training Program Class of 20??
>>> Member, UCLA Medical Imaging Informatics Lab
>>> Member, CA Delegation to the House of Delegates of the American Medical Association
>>> ksa...@ksarma.com
>>> gchat: ksa...@gmail.com
>>> linkedin: www.linkedin.com/in/ksarma
>>>
>>> On Thu, Jan 23, 2014 at 4:22 AM, vijay garla vnga...@gmail.com wrote:
>>>> Just to clarify - with the YTEX branch there are 2 sentence splitters: the original ctakes sentence splitter that splits on newlines, and the ytex sentence splitter that doesn't. The changes to other components in the ytex branch (dependency parser, assertion) work with both sentence splitters.
>>>>
>>>> I think it would be great if the intelligence regarding how to split was in the opennlp model, but this requires training data. I don't know what the training data is, or if the training data has sentences that cross newline boundaries (if not, it won't buy us anything).
>>>>
>>>> vijay
>>>>
>>>> On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:
>>>>> On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ...
>>>>>
>>>>> -----Original Message-----
>>>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>>>> Sent: Wednesday, January 22, 2014 3:42 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> Thanks James
>>>>>
>>>>>> but then no typical sentence ending punctuation at the end of the line
>>>>>
>>>>> Gotcha. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line.
>>>>>
>>>>> I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto-line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line.
>>>>>
>>>>> The Patient has been prescribed two medications. Prescriptions: Advil Tylenol No Aspirin
>>>>>
>>>>> However, when it comes to the problem that you mention, there is no benefit to a Line.
>>>>>
>>>>> The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines
>>>>>
>>>>> The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines
>>>>>
>>>>> The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 5 lines
>>>>>
>>>>> Nothing can really be done for the last bit where punctuation is missing.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>>>>> Sent: Wednesday, January 22, 2014 3:07 PM
>>>>> To: 'dev@ctakes.apache.org'
>>>>> Subject: RE: sentence detector newline behavior
>>>>>
>>>>> I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line
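The workaround Tim describes earlier in this chain (swap each newline for a special character before sentence detection, then put the newlines back) relies on a one-for-one character swap so that sentence offsets stay valid. A minimal sketch follows, in plain Java; the placeholder character is an arbitrary choice for this example and not the one used in the actual annotations, and the class name is hypothetical.

```java
// Illustration only: hypothetical pre- and post-processing around a sentence detector.
class PlaceholderNewlineDemo {

    // Arbitrary stand-in; any single character that never occurs in the notes would do.
    static final char PLACEHOLDER = '\u00A7';

    // Replaces every newline with the placeholder. The text length is unchanged,
    // so spans computed on the encoded text also apply to the original text.
    static String encode(String text) {
        if (text.indexOf(PLACEHOLDER) >= 0) {
            throw new IllegalArgumentException("placeholder already occurs in the text");
        }
        return text.replace('\n', PLACEHOLDER);
    }

    // Restores the newlines after sentence detection.
    static String decode(String text) {
        return text.replace(PLACEHOLDER, '\n');
    }

    public static void main(String[] args) {
        String note = "Pain has been persistent\nfor ten days.";
        String encoded = encode(note);
        System.out.println(encoded.length() == note.length());
        System.out.println(decode(encoded).equals(note));
    }
}
```

With the newline tags supported directly in the trainer, as described in Jörn's message, this extra round trip should in principle no longer be needed.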
Re: sentence detector newline behavior
I'm running into one issue, it gets tripped up on sentences with line-ending spaces. I could easily remove them with a script but by default they are in there. It happens when a sentence example ends:

...BILAT HEMATOMAS.  LF

(There is a period, then 2 spaces, then the line feed character.) I am pretty sure this is the root because when I fix this example to be .LF it gets tripped up in another place instead (with the same error). The specific error I get is this:

Exception in thread "main" java.lang.IllegalArgumentException: start index must not be larger than end index: start=8842, end=8839
    at opennlp.tools.util.Span.<init>(Span.java:47)
    at opennlp.tools.util.Span.<init>(Span.java:63)
    at opennlp.tools.sentdetect.SentenceDetectorME.sentPosDetect(SentenceDetectorME.java:244)
    at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:56)
    at opennlp.tools.sentdetect.SentenceDetectorEvaluator.processSample(SentenceDetectorEvaluator.java:1)
    at opennlp.tools.util.eval.Evaluator.evaluateSample(Evaluator.java:82)
    at opennlp.tools.util.eval.Evaluator.evaluate(Evaluator.java:109)
    at opennlp.tools.sentdetect.SDCrossValidator.evaluate(SDCrossValidator.java:130)
    at opennlp.tools.cmdline.sentdetect.SentenceDetectorCrossValidatorTool.run(SentenceDetectorCrossValidatorTool.java:78)
    at opennlp.tools.cmdline.CLI.main(CLI.java:214)

I thought I'd let you know since you might be able to fix it in 2 minutes but if I don't hear from you today I'll probably take a look at it later today to try to fix it myself.

Tim

On 01/24/2014 04:14 PM, Jörn Kottmann wrote:

The changes are now committed. To train a model which can recognize new lines the new lines must be encoded with the CR or LF tags (or both). The same tags are used to pass in the eos chars to the command line trainer.
For example:

SentenceDetectorCrossValidator -lang en -data /home/xyz/eos-cr.all -encoding ISO-8859-15 -eosChars .!?:LF

Tim, it would be nice if you could test this with your annotations.

Jörn

On 01/23/2014 10:06 PM, Tim Miller wrote:

Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed ctakes input to replace newlines with that character, then did sentence detection, then added the newlines back in. I would be happy to share these annotations and my code modifications.

Tim

On 01/23/2014 04:01 PM, Karthik Sarma wrote:

We could possibly add some additional datasets for training. MIMIC data does come to mind -- I can't remember off the top of my head if the MIMIC dataset has sentences spanning lines or not.

-- Karthik Sarma UCLA Medical Scientist Training Program Class of 20?? Member, UCLA Medical Imaging Informatics Lab Member, CA Delegation to the House of Delegates of the American Medical Association ksa...@ksarma.com gchat: ksa...@gmail.com linkedin: www.linkedin.com/in/ksarma

On Thu, Jan 23, 2014 at 4:22 AM, vijay garla vnga...@gmail.com wrote:

Just to clarify - with the YTEX branch there are 2 sentence splitter - the original ctakes sentence that splits on newlines, and the ytex sentence splitter that doesn't. the changes to other components in the ytex branch (dependency parser, assertion) work with both sentence splitters. I think it would be great if the intelligence regarding how to split was in the opennlp model, but this requires training data. I don't know what the training data is, or if the training data has sentences that cross newline boundaries (if not, won't buy us anything).

vijay

On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote:

On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ...
-Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 3:42 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Thanks James but then no typical sentence ending punctuation at the end of the line Gotcha. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto- line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line. The Patient has been prescribed two medications. Prescriptions
Re: sentence detector newline behavior
On 01/25/2014 01:33 PM, Miller, Timothy wrote:

> Thanks Joern, I'll try it. My understanding is I just need to give it my training data, with the special character I used replaced with the literal string LF and each line in the file is an example sentence.

Yes, exactly.

> Just thinking about the cTAKES wrapper -- do your changes make it so that we wouldn't need to add the special characters (LF,CR) to a document within the cTAKES sentence detector wrapper?

Right, the sentence detector expects the chars as input, not the tags. For example:

This is a sentence terminated by a new line\nAnd this is one more sentence.

> It sounds like we would need to add CR and LF to our eosChars value, it's early (for my brain) but I wonder whether that could be a default on the opennlp end?

If you pass them in during training, they are stored in the model package. All you need to do is instantiate the Sentence Detector and it should be ready to use.

BTW, there is also a UIMA integration in opennlp-uima; maybe that could work quite well for cTAKES.

Jörn
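To make the eosChars point a bit more concrete, here is a minimal sketch in plain Java (it is not OpenNLP code) that lists the candidate split positions in the example string above, once without and once with the line feed in the end-of-sentence set. The class name EosCandidates and the method candidatePositions are made up for illustration. In the real detector a trained model would then decide, for each candidate, whether it actually ends a sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the OpenNLP implementation.
class EosCandidates {

    // Returns the index of every character in text that occurs in eosChars.
    static List<Integer> candidatePositions(String text, String eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "This is a sentence terminated by a new line\nAnd this is one more sentence.";
        // Punctuation only: the line feed is never considered as a split point.
        System.out.println(candidatePositions(text, ".!?:"));
        // With the raw line feed char in the set, the newline becomes a candidate too.
        System.out.println(candidatePositions(text, ".!?:\n"));
    }
}
```

Note that the text passed in contains the raw newline character, not a tag, which matches the point that tags are only needed in the training data and on the trainer command line.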
Re: sentence detector newline behavior
On 01/25/2014 03:03 PM, Miller, Timothy wrote:

> I'm running into one issue, it gets tripped up on sentences with line-ending spaces. I could easily remove them with a script but by default they are in there. It happens when a sentence example ends: ...BILAT HEMATOMAS.  LF (There is a period, then 2 spaces, then the line feed character.) I am pretty sure this is the root because when I fix this example to be .LF it gets tripped up in another place instead (with the same error). The specific error I get is this:

What happens here is probably that two sentences are detected: it wants to split on the dot and on the LF.

The sentence detector classifies every eos char as either a split or not a split. On the other hand, the user expects to get a span (with begin and end offset) per sentence. The code which computes the spans tries to remove white space from them. Removing the white space from a whitespace-only sentence is causing the exception you are seeing.

Which response would you expect from the sentence detector? Should a whitespace-only sentence be returned?

In case a sentence is terminated by a new line, should the new line char be included in the sentence span or not?

Jörn
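The failure mode described above can be reproduced with a small stand-alone sketch. This is a simplified illustration, not the OpenNLP source: TrimDemo, span and naiveTrim are hypothetical names, and the trimming logic is only my guess at the general shape of the problem. It trims both ends of a range independently, so a range containing only whitespace ends up with its start past its end.

```java
// Simplified illustration of the failure mode; not the OpenNLP source.
class TrimDemo {

    // Minimal stand-in for a span type that rejects start > end.
    static int[] span(int start, int end) {
        if (start > end) {
            throw new IllegalArgumentException(
                "start index must not be larger than end index: start=" + start + ", end=" + end);
        }
        return new int[] {start, end};
    }

    // Trims whitespace from both ends of [start, end) without checking for an empty result.
    static int[] naiveTrim(String text, int start, int end) {
        while (start < text.length() && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return span(start, end);
    }

    public static void main(String[] args) {
        String text = "BILAT HEMATOMAS.  \nNext sentence.";
        // First range ends right after the period: trims cleanly.
        int[] first = naiveTrim(text, 0, 16);
        System.out.println(text.substring(first[0], first[1]));
        // Second range is two spaces plus the line feed: start overtakes end.
        try {
            naiveTrim(text, 16, 19);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the start and end cross by exactly three characters (two spaces and the line feed). The reported offsets, start=8842 and end=8839, also differ by three, which is at least consistent with this explanation.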
Re: sentence detector newline behavior
On 01/25/2014 12:24 PM, Jörn Kottmann wrote:

> The code which computes the spans tries to remove white space from it. Removing the white space from a whitespace only sentence is causing the exception you are seeing. Which response would you expect from the sentence detector? Should a white space only sentence be returned?

I would say no.

> In case a sentence is terminated by a new line, should the new line char be included in the sentence span or not?

I would also say no. I made a quick patch for this issue -- now it runs but scores really poorly compared to my model file (30 vs 75 or so). I suspect something is wrong with the evaluation, the spans being slightly off somehow.
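For reference, the two answers above (no whitespace-only sentences, no trailing newline inside a span) could be sketched roughly as follows. This is not the quick patch mentioned in the message, whose contents are not shown in the thread; SafeSpans and toSpans are hypothetical names, and the boundaries array is assumed to hold the offsets just after each accepted end-of-sentence char, in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the agreed policy; not the actual patch from the thread.
class SafeSpans {

    // Builds trimmed [start, end) spans from split offsets, dropping whitespace-only ranges.
    static List<int[]> toSpans(String text, int[] boundaries) {
        List<int[]> spans = new ArrayList<>();
        int rangeStart = 0;
        for (int i = 0; i <= boundaries.length; i++) {
            int rangeEnd = (i < boundaries.length) ? boundaries[i] : text.length();
            int s = rangeStart;
            int e = rangeEnd;
            // Each loop is bounded by the other index, so s can never pass e.
            while (s < e && Character.isWhitespace(text.charAt(s))) {
                s++;
            }
            while (e > s && Character.isWhitespace(text.charAt(e - 1))) {
                e--;
            }
            if (s < e) {
                spans.add(new int[] {s, e}); // newline and spaces fall outside the span
            }
            rangeStart = rangeEnd;
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "BILAT HEMATOMAS.  \nNext sentence.";
        // Splits accepted after the period (offset 16) and after the line feed (offset 19).
        for (int[] span : toSpans(text, new int[] {16, 19})) {
            System.out.println("[" + text.substring(span[0], span[1]) + "]");
        }
    }
}
```

If evaluation scores drop after a change like this, one thing worth checking is whether the reference spans and the predicted spans trim whitespace in the same way, since an off-by-one at either end makes an otherwise correct sentence count as a miss.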
Re: sentence detector newline behavior
On 01/23/2014 10:06 PM, Tim Miller wrote:

> Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed ctakes input to replace newlines with that character, then did sentence detection, then added the newlines back in. I would be happy to share these annotations and my code modifications.

I would be really happy to get access to your annotations so I can test the new line support in OpenNLP with them. Instead of a special char you would now have to use tags (CR and LF) to encode the new lines in the training data. The tags only need to be inserted into the training data; for the actual sentence detection the document string can be passed in as it is.

Jörn
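Converting existing marker-based annotations to the tag encoding is a one-line string replacement per training sentence. The sketch below shows one way to do it. Assumptions: the class and method names are made up, the marker char is whatever was used in the original annotations (the pilcrow here is only a placeholder), and the tag spellings "<LF>" and "<CR>" are my guess at the form that this archive renders as bare LF and CR. Check the OpenNLP documentation for the exact strings before relying on them.

```java
// Hypothetical converter; tag spellings and marker char are assumptions, see the note above.
class MarkerToTags {

    static final String LF_TAG = "<LF>"; // assumed spelling of the line feed tag
    static final String CR_TAG = "<CR>"; // assumed spelling of the carriage return tag

    // Rewrites one training line that used a single marker char in place of a line feed.
    static String markerToTag(String trainingLine, char marker) {
        return trainingLine.replace(String.valueOf(marker), LF_TAG);
    }

    // Encodes raw newline chars as tags, for building training lines from raw text.
    static String encodeNewlines(String sentence) {
        return sentence.replace("\r", CR_TAG).replace("\n", LF_TAG);
    }

    public static void main(String[] args) {
        char marker = '\u00B6'; // placeholder marker char
        System.out.println(markerToTag("BILAT HEMATOMAS." + marker, marker));
        System.out.println(encodeNewlines("Pain has been persistent\r\nfor ten days."));
    }
}
```

Only the training file needs this treatment. At detection time the document text keeps its real newline characters.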
Re: sentence detector newline behavior
Just to clarify - with the YTEX branch there are 2 sentence splitter - the original ctakes sentence that splits on newlines, and the ytex sentence splitter that doesn't. the changes to other components in the ytex branch (dependency parser, assertion) work with both sentence splitters. I think it would be great if the intelligence regarding how to split was in the opennlp model, but this requires training data. I don't know what the training data is, or if the training data has sentences that cross newline boundaries (if not, won't buy us anything). vijay On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ... -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 3:42 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Thanks James but then no typical sentence ending punctuation at the end of the line Gotcha. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto- line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line. The Patient has been prescribed two medications. Prescriptions: Advil Tylenol No Aspirin However, when it comes to the problem that you mention, there is no benefit to a Line. The patient has been seen six times in the past week. 
Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 5 lines Nothing can really be done for the last bit where punctuation is missing. -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 3:07 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline behavior I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 1:33 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Just whistling in the wind here ... Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-) If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed. 
I think that James hints that cTakes code already does this in some places. If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line? If something uses a structured list it could iterate through Line types, while something using pure text could iterate through Sentence types. This facilitates section-by-section different behavior, does not require any decision on global defaults, and makes data selection for training Sentence a nonesuch wrt line breaks. However, it adds to the system and would require a per-use choice decision by developers OR a toggle by users (back to the default decision). Perhaps this has already been tried? Sean -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 1:06 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline
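The Line idea from the quoted message (a Sentence detected while ignoring CR/LF, and Lines being those Sentences subdivided at newlines) could look roughly like the sketch below. LineSplitter and splitAtNewlines are hypothetical names, not cTAKES types, and the newline positions in the medication list are assumed, since the archive lost the original line breaks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of subdividing a sentence span into line spans; not cTAKES code.
class LineSplitter {

    // Splits the sentence span [start, end) at CR/LF chars and skips empty pieces.
    static List<int[]> splitAtNewlines(String text, int start, int end) {
        List<int[]> lines = new ArrayList<>();
        int lineStart = start;
        for (int i = start; i <= end; i++) {
            boolean atBreak = (i == end) || text.charAt(i) == '\n' || text.charAt(i) == '\r';
            if (atBreak) {
                if (i > lineStart) {
                    lines.add(new int[] {lineStart, i});
                }
                lineStart = i + 1;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed layout of the list example: one item per line.
        String text = "Prescriptions:\nAdvil\nTylenol\nNo Aspirin";
        for (int[] line : splitAtNewlines(text, 0, text.length())) {
            System.out.println(text.substring(line[0], line[1]));
        }
    }
}
```

A component that cares about list items (or negation scope per line) could iterate over these line spans, while one that works on running text could keep using the sentence spans as they are.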
Re: sentence detector newline behavior
We could possibly add some additional datasets for training. MIMIC data does come to mind -- I can't remember off the top of my head if the MIMIC dataset has sentences spanning lines or not. -- Karthik Sarma UCLA Medical Scientist Training Program Class of 20?? Member, UCLA Medical Imaging Informatics Lab Member, CA Delegation to the House of Delegates of the American Medical Association ksa...@ksarma.com gchat: ksa...@gmail.com linkedin: www.linkedin.com/in/ksarma On Thu, Jan 23, 2014 at 4:22 AM, vijay garla vnga...@gmail.com wrote: Just to clarify - with the YTEX branch there are 2 sentence splitter - the original ctakes sentence that splits on newlines, and the ytex sentence splitter that doesn't. the changes to other components in the ytex branch (dependency parser, assertion) work with both sentence splitters. I think it would be great if the intelligence regarding how to split was in the opennlp model, but this requires training data. I don't know what the training data is, or if the training data has sentences that cross newline boundaries (if not, won't buy us anything). vijay On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ... -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 3:42 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Thanks James but then no typical sentence ending punctuation at the end of the line Gotcha. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. 
Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto- line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line. The Patient has been prescribed two medications. Prescriptions: Advil Tylenol No Aspirin However, when it comes to the problem that you mention, there is no benefit to a Line. The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 5 lines Nothing can really be done for the last bit where punctuation is missing. -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 3:07 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline behavior I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 1:33 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Just whistling in the wind here ... 
Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-) If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed. I think that James hints that cTakes code already does this in some places. If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line? If something uses a structured list it could iterate
Re: sentence detector newline behavior
Just an FYI, a while back I did some of these annotations myself on MIMIC to get around this issue. I replaced the newline character with a special (non-English) character, then pre-processed ctakes input to replace newlines with that character, then did sentence detection, then added the newlines back in. I would be happy to share these annotations and my code modifications. Tim On 01/23/2014 04:01 PM, Karthik Sarma wrote: We could possibly add some additional datasets for training. MIMIC data does come to mind -- I can't remember off the top of my head if the MIMIC dataset has sentences spanning lines or not. -- Karthik Sarma UCLA Medical Scientist Training Program Class of 20?? Member, UCLA Medical Imaging Informatics Lab Member, CA Delegation to the House of Delegates of the American Medical Association ksa...@ksarma.com gchat: ksa...@gmail.com linkedin: www.linkedin.com/in/ksarma On Thu, Jan 23, 2014 at 4:22 AM, vijay garla vnga...@gmail.com wrote: Just to clarify - with the YTEX branch there are 2 sentence splitter - the original ctakes sentence that splits on newlines, and the ytex sentence splitter that doesn't. the changes to other components in the ytex branch (dependency parser, assertion) work with both sentence splitters. I think it would be great if the intelligence regarding how to split was in the opennlp model, but this requires training data. I don't know what the training data is, or if the training data has sentences that cross newline boundaries (if not, won't buy us anything). vijay On Wed, Jan 22, 2014 at 3:47 PM, Finan, Sean sean.fi...@childrens.harvard.edu wrote: On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ... 
-Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 3:42 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Thanks James but then no typical sentence ending punctuation at the end of the line Gotcha. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent auto- line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line. The Patient has been prescribed two medications. Prescriptions: Advil Tylenol No Aspirin However, when it comes to the problem that you mention, there is no benefit to a Line. The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 5 lines Nothing can really be done for the last bit where punctuation is missing. -Original Message- From: Masanz, James J. 
[mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 3:07 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline behavior I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 1:33 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Just whistling in the wind here ... Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-) If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up
RE: sentence detector newline behavior
The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline. This was because the clinical notes cTAKES originally processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes. I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior).

I think we (cTAKES) need to decide the following:
- do we want to do this for entire notes, or would it be worth it to have it be on a section-by-section basis.
- what do we make the default behavior - to force or not to force newlines to be sentence breaks
- what data (that contains newlines) will we use for training the sentence detector

Regardless of those answers, I think OpenNLP support for including newlines in training data would be valuable for those others who have sentences that span lines. And having an option on OpenNLP to always break at newline would be useful for at least some cTAKES users (and we could remove the cTAKES code that does that).

-- James

-Original Message- From: dev-return-2390-Masanz.James=mayo@ctakes.apache.org [mailto:dev-return-2390-Masanz.James=mayo@ctakes.apache.org] On Behalf Of Jörn Kottmann Sent: Tuesday, January 21, 2014 4:29 AM To: dev@ctakes.apache.org Subject: Re: sentence detector newline behavior

Yes, exactly, OPENNLP-602 is about training a sentence detector model which can use a new line as an end-of-sentence character. In case you have certain rules to split sentences we should have a look at them. The Sentence Detector could be extended to support a user-provided rule-based splitter. If there is an interest in that we could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:

I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDetector (either part of the trained models or post processing with some rules?) cTAKES will be able to use it out of the box and we should be able to remove any additional custom logic that we currently have - which seems like a good idea. [but when to use within cTAKES individual components such as negation might be another discussion?]

--Pei

On Jan 20, 2014, at 12:46 PM, vijay garla vnga...@gmail.com wrote:

The sentence detection opennlp model used by ctakes does not split sentences at newlines - there is additional logic in the cTAKES sentence splitter that does this (and an alternative impl that doesn't is in the ytex branch). Afaik no retraining / change to the feature representation is necessary.

Vj

On Monday, January 20, 2014, Jörn Kottmann kottm...@gmail.com wrote:

Hi all, currently I have quite a bit of time to work on OpenNLP, and would like to help you out with this issue. Here is the follow up issue for this change: https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to implement this. In the training data a user could just use a special tag to identify the chars. Instead of NEWLINE it might be better to use CR and LF to encode these two chars in the training data. Any thoughts? I am planning to release this as part of OpenNLP 1.6.0.

Thanks, Jörn

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

> That's awesome! It might be worth trying at least. How does the training process change? Previously the training data would be one sentence per line, but with newlines as possible mid-sentence characters that could be trouble, is there a new representation for training data? Or would we have to use the training api?

Good point, yes that will be a problem with the default training format, but it shouldn't be hard to solve. In the format itself we could define a new line tag, e.g. NEWLINE, to mark new lines.

As a hack to make it work with 1.5.3 you could instead use a special char as a replacement for the new line char. When you pass the text down to the sentence detector a simple string replace could be used to convert all new line chars to the special new line marker char. If things work out for you performance-wise as well we will just integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line marker char? You should try to pick a char you can also pass in on a terminal, otherwise you have to use the API to train the model. The built-in cross validation could be used to evaluate the performance.

Jörn
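The 1.5.3 workaround described above (swap each new line char for a marker char before detection, then map back afterwards) can be sketched in a few lines. This is only an illustration: NewlineMarker is a hypothetical helper, and the pilcrow is an arbitrary placeholder, not the char anyone in the thread actually used. Because the replacement is one char for one char, any offsets computed on the encoded text are also valid on the original text.

```java
// Hypothetical helper for the marker-char workaround; the marker choice is an assumption.
class NewlineMarker {

    static final char MARKER = '\u00B6'; // placeholder marker; pick one that never occurs in the notes

    // Replaces every line feed with the marker so the detector sees it as an ordinary char.
    static String encode(String text) {
        if (text.indexOf(MARKER) >= 0) {
            throw new IllegalArgumentException("marker char already occurs in the text");
        }
        return text.replace('\n', MARKER);
    }

    // Restores the line feeds after sentence detection.
    static String decode(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "Pain has been persistent\nfor ten days.";
        String encoded = encode(note);
        // Same length, so sentence offsets found on the encoded text apply to the note as is.
        System.out.println(encoded.length() == note.length());
        System.out.println(decode(encoded).equals(note));
    }
}
```

The same marker then has to appear in the training file and in the eos chars given to the trainer, which is why a char that can be typed on a terminal is the more convenient choice.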
RE: sentence detector newline behavior
I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines. -Original Message- From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] Sent: Wednesday, January 22, 2014 1:33 PM To: dev@ctakes.apache.org Subject: RE: sentence detector newline behavior Just whistling in the wind here ... Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-) If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed. I think that James hints that cTakes code already does this in some places. If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line? If something uses a structured list it could iterate through Line types, while something using pure text could iterate through Sentence types. This facilitates section-by-section different behavior, does not require any decision on global defaults, and makes data selection for training Sentence a nonesuch wrt line breaks. However, it adds to the system and would require a per-use choice decision by developers OR a toggle by users (back to the default decision). 
Perhaps this has already been tried? Sean -Original Message- From: Masanz, James J. [mailto:masanz.ja...@mayo.edu] Sent: Wednesday, January 22, 2014 1:06 PM To: 'dev@ctakes.apache.org' Subject: RE: sentence detector newline behavior The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline. This was because the clinical notes cTAKES original processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes. I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous. From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior) I think we (cTAKES) need to decide the following: - do we want to do this for entire notes, or would it be worth it to have it be on a section-by-section basis. - what do we make the default behavior - to force or not to force newlines to be sentence breaks - what data (that contains newlines) will we use for training the sentence detector Regardless of those answers, I think OpenNLP support for including newlines in training data would be valuable for those others who have sentences that span lines. And having an option on OpenNLP to always break at newline would be useful for at least some cTAKES users (and we could remove the cTAKES code that does that) -- James -Original Message- From: dev-return-2390-Masanz.James=mayo@ctakes.apache.org [mailto:dev-return-2390-Masanz.James=mayo@ctakes.apache.org] On Behalf Of Jörn Kottmann Sent: Tuesday, January 21, 2014 4:29 AM To: dev@ctakes.apache.org Subject: Re: sentence detector newline behavior Yes, exactly, OPENNLP-602 is about training a sentence detector model which can use a new line as a end-of-sentence character. 
In case you have certain rules to split sentences we should have a look at them. The Sentence Detector could be extended to support a user provided rule based splitter. If there is an interest in that we could probably get it into 1.6.0 as well. Jörn On 01/20/2014 10:02 PM, Chen, Pei wrote: I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDectector (either part of the trained models or post processing with some rules?) cTAKES will be able to use it out of the box and we should be able remove any additional custom logic that we currently have- which seems like a good idea. [but when to use within cTAKES individual components such as negation might be another discussion?] --Pei On Jan 20, 2014, at 12:46 PM, vijay garla vnga...@gmail.com wrote: The sentence detection opennlp model used by ctakes does not split sentences at newlines - there is additional logic in the takes
RE: sentence detector newline behavior
On my end it looks like my email was reformatted and some of my -newline- removed in those last examples ...

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 3:42 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Thanks James

> but then no typical sentence ending punctuation at the end of the line

Gotcha.

> So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line

I was actually thinking about something like a Line using -sentence breaks- in addition to -newline-. In other words, a Sentence being what cTakes detects by ignoring CR/LF, and Lines being those Sentences subdivided by -newline-. Perhaps Line is a horrible moniker. Regardless, it doesn't solve the problem of inappropriately missing punctuation. I was focused a little more on the difference between persistent automatic line wrapping and structured information like lists, where the first benefits from Sentence and the second from Line.

The Patient has been prescribed two medications. Prescriptions: Advil Tylenol No Aspirin

However, when it comes to the problem that you mention, there is no benefit to a Line.

The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines

The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 3 lines

The patient has been seen six times in the past week. Pain has been persistent for ten days Advil and Tylenol have been prescribed -- 2 sentences, 5 lines

Nothing can really be done for the last bit where punctuation is missing.

-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 3:07 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

I know there are notes where there are multiple sentences on a line, but then no typical sentence ending punctuation at the end of the line (or no punctuation at all at the end of the line). And in those sections, negation can be important. So simply using Lines would not suffice in those cases because it would run together sentences where there are more than one on a line. And using sentences alone (as found by OpenNLP 1.5) would not suffice because it would run together sentences from different lines.

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, January 22, 2014 1:33 PM
To: dev@ctakes.apache.org
Subject: RE: sentence detector newline behavior

Just whistling in the wind here ...

Perhaps before any changes are made to universally toggle cTakes in one direction or the other, we can take a poll of when and where cTakes/Ytex/OpenNLP/Omaha needs a Sentence (ignoring CR/LF) as opposed to a Line (CR/LF delimited PLUS -sentence-).

If some capabilities like negation detection require -lines- then would it make more sense to have Sentence ignore -newline- and negation detection itself split the Sentence into line items? If an annotator is interested in list items, each of which may be on a distinct -line-, then it can split up the Sentence as needed. I think that James hints that cTakes code already does this in some places.

If a good deal of functionality requires -newline- delimited types, would it make sense to introduce a type Line? If something uses a structured list it could iterate through Line types, while something using pure text could iterate through Sentence types. This facilitates section-by-section different behavior, does not require any decision on global defaults, and makes data selection for training Sentence a non-issue wrt line breaks. However, it adds to the system and would require a per-use choice by developers OR a toggle by users (back to the default decision).

Perhaps this has already been tried?

Sean

-----Original Message-----
From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
Sent: Wednesday, January 22, 2014 1:06 PM
To: 'dev@ctakes.apache.org'
Subject: RE: sentence detector newline behavior

The only rule I know of is that cTAKES (prior to ytex integration) always forces a sentence break at a newline. This was because the clinical notes cTAKES originally processed never had newlines in the middle of a sentence, but did need sentence breaks to occur at end of sentence for good negation detection on those notes. I think Guergana earlier mentioned other EMRs also have this need, but it seems to not be ubiquitous.

From others' posts, it seems that we could use an option in cTAKES to turn off this forcing of sentence breaks at newlines (or, depending on how you look at it, an option to turn on the forcing of sentence breaks if we change the default behavior).

I think we
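A minimal sketch of the Line idea from Sean's message, assuming sentences are available as character offsets into the note text. The `Span` class and the `split_into_lines` function are hypothetical names used only for illustration; they are not cTAKES or OpenNLP types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation: character offsets into the note text."""
    begin: int
    end: int  # exclusive


def split_into_lines(text, sentence):
    """Subdivide one Sentence span at CR/LF characters, skipping empty pieces."""
    lines = []
    start = sentence.begin
    for i in range(sentence.begin, sentence.end):
        if text[i] in "\r\n":
            if i > start:
                lines.append(Span(start, i))
            start = i + 1
    if sentence.end > start:
        lines.append(Span(start, sentence.end))
    return lines


# A list-like passage that a newline-ignoring detector returned as one sentence.
text = "Prescriptions:\nAdvil\r\nTylenol\nNo Aspirin"
for line in split_into_lines(text, Span(0, len(text))):
    print(repr(text[line.begin:line.end]))
```

An annotator that cares about list items could iterate over these Line spans, while one that works on running text could keep using the Sentence spans unchanged.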
Re: sentence detector newline behavior
Yes, exactly, OPENNLP-602 is about training a sentence detector model which can use a new line as an end-of-sentence character.

In case you have certain rules to split sentences we should have a look at them. The Sentence Detector could be extended to support a user-provided rule-based splitter. If there is an interest in that we could probably get it into 1.6.0 as well.

Jörn

On 01/20/2014 10:02 PM, Chen, Pei wrote:

I presume Joern was suggesting that if he supports new lines in the opennlp SentenceDetector (either as part of the trained models or as post-processing with some rules?) cTAKES will be able to use it out of the box and we should be able to remove any additional custom logic that we currently have, which seems like a good idea. [but when to use it within cTAKES individual components such as negation might be another discussion?]

--Pei

On Jan 20, 2014, at 12:46 PM, vijay garla vnga...@gmail.com wrote:

The sentence detection opennlp model used by ctakes does not split sentences at newlines - there is additional logic in the ctakes sentence splitter that does this (and an alternative impl that doesn't is in the ytex branch). Afaik no retraining / change to the feature representation is necessary.

Vj

On Monday, January 20, 2014, Jörn Kottmann kottm...@gmail.com wrote:

Hi all,

currently I have quite a bit of time to work on OpenNLP, and would like to help you out with this issue. Here is the follow-up issue for this change: https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to implement this. In the training data a user could just use a special tag to identify the chars. Instead of NEWLINE it might be better to use CR and LF to encode these two chars in the training data. Any thoughts?

I am planning to release this as part of OpenNLP 1.6.0.

Thanks,
Jörn

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

> That's awesome! It might be worth trying at least. How does the training process change? Previously the training data would be one sentence per line, but with newlines as possible mid-sentence characters that could be trouble. Is there a new representation for training data? Or would we have to use the training API?

Good point, yes that will be a problem with the default training format, but it shouldn't be hard to solve. In the format itself we could define a new line tag, e.g. NEWLINE, to mark new lines.

As a hack to make it work with 1.5.3 you could instead use a special char as a replacement for the new line char. When you pass the text down to the sentence detector a simple string replace could be used to convert all new line chars to the special new line marker char. If things work out for you performance-wise as well we will just integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line marker char? You should try to pick a char you can also pass in on a terminal, otherwise you have to use the API to train the model. The built-in cross validation could be used to evaluate the performance.

Jörn
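A minimal sketch of the string-replace step in the 1.5.3 hack described above. The marker character, the function names, and the `detector` callable are assumptions made for illustration; the callable stands in for a trained sentence detector that returns character offsets, and the toy version below only breaks after periods.

```python
# Hypothetical marker: any single character that never occurs in the notes.
NEWLINE_MARKER = "\u00b6"


def mask_newlines(text, marker=NEWLINE_MARKER):
    """Replace every CR and LF with the marker, one character for one character."""
    if marker in text:
        raise ValueError("marker character already occurs in the text")
    return text.replace("\r", marker).replace("\n", marker)


def detect_sentences(text, detector):
    """Run the detector on masked text, then slice sentences from the original.

    detector is an assumed stand-in for a trained model: a callable that takes
    a string and returns a list of (begin, end) character offsets.
    """
    masked = mask_newlines(text)
    return [text[begin:end] for begin, end in detector(masked)]


def toy_detector(masked):
    """Stand-in only: ends a sentence after every period."""
    spans, start = [], 0
    for i, ch in enumerate(masked):
        if ch == ".":
            spans.append((start, i + 1))
            start = i + 1
    if start < len(masked):
        spans.append((start, len(masked)))
    return spans


note = "Pain has been persistent\nfor ten days. Advil prescribed."
print(detect_sentences(note, toy_detector))
```

Because each newline character is replaced by exactly one marker character, the offsets found on the masked text can be applied to the original note without any adjustment.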
Re: sentence detector newline behavior
Hi all,

currently I have quite a bit of time to work on OpenNLP, and would like to help you out with this issue. Here is the follow-up issue for this change: https://issues.apache.org/jira/browse/OPENNLP-602

I am still trying to figure out what would be the best option to implement this. In the training data a user could just use a special tag to identify the chars. Instead of NEWLINE it might be better to use CR and LF to encode these two chars in the training data. Any thoughts?

I am planning to release this as part of OpenNLP 1.6.0.

Thanks,
Jörn

On 05/22/2013 02:03 PM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

> That's awesome! It might be worth trying at least. How does the training process change? Previously the training data would be one sentence per line, but with newlines as possible mid-sentence characters that could be trouble. Is there a new representation for training data? Or would we have to use the training API?

Good point, yes that will be a problem with the default training format, but it shouldn't be hard to solve. In the format itself we could define a new line tag, e.g. NEWLINE, to mark new lines.

As a hack to make it work with 1.5.3 you could instead use a special char as a replacement for the new line char. When you pass the text down to the sentence detector a simple string replace could be used to convert all new line chars to the special new line marker char. If things work out for you performance-wise as well we will just integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line marker char? You should try to pick a char you can also pass in on a terminal, otherwise you have to use the API to train the model. The built-in cross validation could be used to evaluate the performance.

Jörn
Re: sentence detector newline behavior
OK, I've started doing this. I was able to get training working on a very small example and will try a slightly bigger one.

Tim

On 05/22/2013 08:03 AM, Jörn Kottmann wrote:

On 05/22/2013 01:17 PM, Miller, Timothy wrote:

> That's awesome! It might be worth trying at least. How does the training process change? Previously the training data would be one sentence per line, but with newlines as possible mid-sentence characters that could be trouble. Is there a new representation for training data? Or would we have to use the training API?

Good point, yes that will be a problem with the default training format, but it shouldn't be hard to solve. In the format itself we could define a new line tag, e.g. NEWLINE, to mark new lines.

As a hack to make it work with 1.5.3 you could instead use a special char as a replacement for the new line char. When you pass the text down to the sentence detector a simple string replace could be used to convert all new line chars to the special new line marker char. If things work out for you performance-wise as well we will just integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a new line marker char? You should try to pick a char you can also pass in on a terminal, otherwise you have to use the API to train the model. The built-in cross validation could be used to evaluate the performance.

Jörn
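A minimal sketch of how a training file with a new line marker char might be produced, assuming gold-standard sentences are available as strings that may contain embedded newlines. The function name and the marker are hypothetical; the only thing taken from the thread is that the default training format holds one sentence per line. Whether a newline that ends a sentence should be kept as a trailing marker is a design choice this sketch leaves to the caller.

```python
def to_training_text(sentences, marker="\u00b6"):
    """Encode sentences one per line, with embedded CR/LF replaced by the marker.

    The marker (hypothetical choice) must not occur in the data, and the same
    replacement has to be applied to text later passed to the detector.
    """
    encoded = []
    for sentence in sentences:
        if marker in sentence:
            raise ValueError("marker character already occurs in a sentence")
        encoded.append(sentence.replace("\r", marker).replace("\n", marker))
    return "\n".join(encoded) + "\n"


gold = [
    "Pain has been persistent\nfor ten days.",
    "Advil and Tylenol have been prescribed.",
]
print(to_training_text(gold), end="")
```

The resulting text can be written to a file and used with whichever training route is chosen (command line or API), since every physical line is again exactly one sentence.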