Re: SentenceDetector [EXTERNAL]

2018-04-06 Thread Miller, Timothy

The changes were mainly meant to adapt the OpenNLP model to
idiosyncrasies of clinical text, but you're right that they have some
shortcomings.

The newline rule comes from the data sources originally used to build
the model: there were frequent cases of headings and sentence fragments
by themselves on a line, and _no_ cases of mid-sentence newlines. That,
combined with the fact that OpenNLP's training file format (at the time)
itself used newlines as a separator, led to the creation of that simple
rule rather than retraining with newline as a candidate sentence
splitter. I created a different training file format and annotator that
does what you suggest, and built an alternative sentence splitter
model, here:
org/apache/ctakes/core/ae/SentenceDetectorAnnotatorBIO.java

It operates at the character level and splits a document into
sentences. For some people it works better. For data where there are
potentially mid-sentence newlines (like MIMIC), it is probably the only
model with usable results. Its typical failure mode is to lump two
sentences together, while the default annotator does the opposite.
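
To illustrate the "simple rule" described above, here is a minimal
sketch in plain Java. It is not the cTAKES implementation: the class
name NewlinePostSplit and the int[] {begin, end} span representation are
assumptions made for illustration. It takes sentence spans proposed by
some model and unconditionally splits each one again at every newline,
which is why a mid-sentence newline always yields two fragments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of cTAKES: shows the effect of treating
// every newline as a definite sentence break after a model has run.
public class NewlinePostSplit {

    // Each span is {begin, end}, half-open character offsets into text.
    public static List<int[]> splitAtNewlines(String text, List<int[]> modelSpans) {
        List<int[]> out = new ArrayList<>();
        for (int[] span : modelSpans) {
            int pieceStart = span[0];
            for (int i = span[0]; i <= span[1]; i++) {
                // Close the current piece at a newline or at the end of the span.
                if (i == span[1] || text.charAt(i) == '\n') {
                    addTrimmed(text, pieceStart, i, out);
                    pieceStart = i + 1;
                }
            }
        }
        return out;
    }

    // Trims surrounding whitespace and skips pieces that become empty.
    private static void addTrimmed(String text, int begin, int end, List<int[]> out) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) begin++;
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) end--;
        if (begin < end) out.add(new int[] {begin, end});
    }
}
```

With headings on their own lines this behaves well; with line-wrapped
prose it cuts sentences at every wrap, which is the case where a model
that treats newline as only a candidate break is the better fit.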

Tim

On Fri, 2018-04-06 at 02:11 +, Ewan Mellor wrote:
> I'm looking at SentenceDetector from ctakes-core.  It has a surprising
> idea of what counts as a "sentence".  Before I delve any deeper, I
> wanted to ask whether there is a reason for what it's doing, in
> particular whether there's anything in the clinical pipeline that's
> depending on its behavior specifically.
> 
> The main problem I have is that it's splitting on characters like
> colon and semicolon, which aren't usually considered sentence
> separators, with the result that it often ends up tagging phrases
> rather than whole sentences.
> 
> It's using SentenceDetectorCtakes and EndOfSentenceScannerImpl, which
> seem to be derived from equivalents in OpenNLP, but with changes that
> I can't track (they date from the original edu.mayo import as far as I
> can tell).  Other than the additional separator characters, I can't
> tell whether these classes are doing anything important that you
> wouldn't equally get from OpenNLP's SentenceDetectorME, so I don't
> know why they're being used.
> 
> SentenceDetector is also splitting on newlines after passing the text
> through the max entropy sentence model.  I don't see the point in this
> -- if you're going to split on newlines anyway, then why not do that
> before passing through the entropy model?  Or just have newline as one
> of the potential EOS characters and treat it as a possible break point
> rather than a definite one?
> 
> Any insight would be welcome.
> 
> Thanks,
> 
> Ewan.

RE: SentenceDetector [EXTERNAL] [SUSPICIOUS]

2018-04-06 Thread Finan, Sean

Hi Ewan,

We use Tim's SentenceDetectorAnnotatorBIO in a project that has run hundreds of 
notes that contain newline-spanning sentences and it does work very well.  As 
Tim wrote, it does sometimes lump lines together if they aren't prose.  One 
example is lines of text in a list.  There are a few ctakes annotators that can 
help correct this: ParagraphSentenceFixer and ListSentenceFixer.  

If you are running the default clinical pipeline, you can use the 
SectionedFastPipeline.piper in ctakes-clinical-pipeline-res instead of the 
DefaultFastPipeline.piper.  The difference between the two is that 
SectionedFastPipeline loads FullTokenizerPipeline.piper in ctakes-core-res 
instead of DefaultTokenizerPipeline.piper.  The FullTokenizer... has a 
reference to the Sentence...BIO, and you can enable it by swapping comment 
specifiers to:

// The sentence detector needs our custom model path, otherwise default values are used.
addLogged SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

// The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence.
// If that is not your case, then you may get better results using the more standard SentenceDetector
//add SentenceDetector

Sean

Re: SentenceDetector [EXTERNAL] [SUSPICIOUS]

2018-04-17 Thread Ewan Mellor

Thanks for the replies, both of you.  I have filed
https://issues.apache.org/jira/browse/CTAKES-507 with a patch to put
those details in the class comments.

I'm using double-newline as a paragraph separator, with mid-sentence single
newlines (the text is coming from an OCR phase so it's broken into lines).
I'm getting reasonable results now, by adding a custom annotator first that
adds Segment annotations for each of the paragraphs, and then using
SentenceDetectorAnnotatorBIO.  I then don't need ParagraphSentenceFixer
because my sentences can't span paragraphs as SentenceDetectorAnnotatorBIO
only sees one paragraph at a time.
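
A minimal sketch of the paragraph-splitting step, in plain Java. This is
an illustration under assumptions, not the annotator described above:
the class name ParagraphSpans is hypothetical, and wrapping each span in
a cTAKES Segment annotation is deliberately left out because that part
depends on the UIMA/cTAKES type system in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds paragraph spans separated by blank lines.
// Single newlines inside a paragraph (e.g. OCR line breaks) are kept.
public class ParagraphSpans {

    // A line break, optional spaces/tabs, another line break, then any
    // further whitespace before the next paragraph starts.
    private static final Pattern BREAK =
            Pattern.compile("\\r?\\n[ \\t]*\\r?\\n\\s*");

    // Returns {begin, end} half-open character offsets, one per paragraph.
    public static List<int[]> find(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int begin = 0;
        while (m.find()) {
            if (m.start() > begin) {
                spans.add(new int[] {begin, m.start()});
            }
            begin = m.end();
        }
        if (begin < text.length()) {
            spans.add(new int[] {begin, text.length()});
        }
        return spans;
    }
}
```

Each returned span could then become one segment, so that a sentence
detector run per segment cannot produce a sentence crossing a paragraph
boundary.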

I did find one place where it failed -- using the default tokenCounts.txt
it's breaking sentences at "Dr." which is a bit sad given the subject matter.
I'm hacking around that for now but if it proves to be an issue I may try
finding a tagged corpus to retrain with (I don't have a decent tagged
corpus myself, which is why I'm using the default models).
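
One possible shape for such a workaround, sketched in plain Java. This
is an assumption about what a post-processing fix could look like, not
the hack actually used: the class name AbbreviationMerger and the list
of titles are hypothetical. It merges a detected sentence into the next
one whenever its last token is a title such as "Dr.".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical post-processing step: rejoin sentences that were split
// right after a title abbreviation.
public class AbbreviationMerger {

    // Illustrative list only; extend as needed for the corpus at hand.
    private static final Set<String> TITLES = Set.of("Dr.", "Mr.", "Mrs.", "Ms.");

    // Spans are {begin, end} half-open offsets into text, in document order.
    public static List<int[]> merge(String text, List<int[]> sentences) {
        List<int[]> merged = new ArrayList<>();
        for (int[] s : sentences) {
            if (!merged.isEmpty()) {
                int[] prev = merged.get(merged.size() - 1);
                if (endsWithTitle(text, prev)) {
                    prev[1] = s[1]; // extend the previous sentence over this one
                    continue;
                }
            }
            merged.add(new int[] {s[0], s[1]}); // copy so the input is not modified
        }
        return merged;
    }

    // True if the last whitespace-delimited token of the span is a known title.
    private static boolean endsWithTitle(String text, int[] span) {
        String sentence = text.substring(span[0], span[1]);
        int lastBreak = Math.max(sentence.lastIndexOf(' '), sentence.lastIndexOf('\n'));
        return TITLES.contains(sentence.substring(lastBreak + 1));
    }
}
```

A rule like this will also merge the rare sentence that genuinely ends
in a title, so retraining on a tagged corpus remains the more principled
fix.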

Thanks,

Ewan.
