Hi Abad,
The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be
able to weigh in on some of this.
I haven't noticed SentenceDetectorAnnotatorBIO splitting sentences on decimals,
but with a trained model you never quite know what might happen.
You could easily make something like MrsDrSentenceJoiner that handles decimal
problems. Basically, make a copy of MrsDrSentenceJoiner and change the
condition at around line 62 from

   if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
         || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
         || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
        && i < sentenceCount - 1
        && !newlines.contains( sentence.getEnd() ) ) {

to something like

   // keep the bounds check so that sentences.get( i+1 ) always exists
   if ( i < sentenceCount - 1
        && text.length() > 1
        && text.charAt( text.length()-1 ) == '.'
        && Character.isDigit( text.charAt( text.length()-2 ) )
        && !sentences.get( i+1 ).getCoveredText().isEmpty()
        && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.
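
The condition above can be tried out without a cTAKES pipeline. The following is only a minimal sketch under stated assumptions: it works on plain strings instead of Sentence annotations, and the class and method names (DecimalJoinSketch, isSplitNumber, joinSplitNumbers) are hypothetical, not part of cTAKES. It joins two neighbouring pieces when the first ends in a digit followed by '.' and the second starts with a digit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch: plain strings stand in for cTAKES Sentence annotations.
final class DecimalJoinSketch {

    // True when "left" ends with <digit>'.' and "right" starts with a digit,
    // the pattern left behind when "5.5" or "01.01.2020" is cut at the period.
    static boolean isSplitNumber(String left, String right) {
        int n = left.length();
        return n > 1
                && left.charAt(n - 1) == '.'
                && Character.isDigit(left.charAt(n - 2))
                && !right.isEmpty()
                && Character.isDigit(right.charAt(0));
    }

    // Walks the pieces in order and glues a piece onto the previous one
    // whenever the pair looks like a single number that was cut in two.
    static List<String> joinSplitNumbers(List<String> pieces) {
        List<String> joined = new ArrayList<>();
        for (String piece : pieces) {
            int last = joined.size() - 1;
            if (last >= 0 && isSplitNumber(joined.get(last), piece)) {
                joined.set(last, joined.get(last) + piece);
            } else {
                joined.add(piece);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> pieces = Arrays.asList(
                "Dose was 5.", "5 mg on 01.", "01.", "2020.", "Next sentence.");
        System.out.println(joinSplitNumbers(pieces));
    }
}
```

One caveat of this heuristic: a real sentence that ends in a number and is followed by a sentence starting with a digit would also be joined. In an annotator, comparing the end offset of one Sentence with the begin offset of the next (no whitespace in between) may help to tell the two cases apart.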
Sean
________________________________________
From: [email protected] <[email protected]>
Sent: Friday, June 12, 2020 11:21 AM
To: [email protected]; [email protected]
Subject: RE: Sentence detector changes [EXTERNAL]
* External Email - Caution *
Hi Sean,
Thank you for your advice. We tried using 'SentenceDetectorAnnotatorBIO'
along with the piper file changes you mentioned, and we found that it is
splitting the sentences based on '.' only. We were actually able to get
similar output from 'SentenceDetectorAnnotator' itself by using '.' as the
only eosCandidate in the EOSScannerImpl class.
So, is 'SentenceDetectorAnnotatorBIO' able to extract sentences in some other
way? One problem we face is that 'SentenceDetectorAnnotatorBIO' splits the
sentence whenever it sees a decimal point, as in 5.5, or a date separated by
'.', as in 01.01.2020.
Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve these issues,
where sentences are split on decimals or '.'-separated dates? If so, what
changes do we need to make in the piper file to incorporate them?
Thanks & Regards
Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028
-----Original Message-----
From: Finan, Sean <[email protected]>
Sent: Thursday, June 11, 2020 9:14 PM
To: [email protected]; [email protected]
Subject: Re: Sentence detector changes [EXTERNAL]
[External]
Thank you for clarifying the contents of the second image. That changes
everything.
You are using the original SentenceDetector. So, somewhere in your piper file
you've got:
add SentenceDetector
I was under the impression that you are using the newer alternative,
SentenceDetectorAnnotatorBIO. While the original SentenceDetector is more of a
"splitter", the BIO version is more of a "lumper".
I would switch the detector and see how the results change using:
add SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar
Don't forget to comment out the "add SentenceDetector".
After you have looked at results from the BIO version, then you can consider
which better fits your data and any needs for further adjustment of Sentences.
Sean
________________________________________
From: [email protected] <[email protected]>
Sent: Thursday, June 11, 2020 11:17 AM
To: [email protected]; [email protected]
Subject: RE: Sentence detector changes [EXTERNAL]
Thank you, Sean, for the response. Sorry that the images are not visible to
you; we also forgot to mention that we are using version 4.0.
To reiterate:
The first image showed how the Sentence object looks in the CAS viewer. The
second image was the list of end-of-sentence candidates in the class
'EOSScannerImpl', as below:
private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };
So any modification to sentence extraction has an impact on every downstream
module, right? We will definitely look into the AEs you mentioned. Do you
mean that adding the AEs EolSentenceFixer and MrsDrSentenceJoiner would
refine the sentence extraction?
Thanks & Regards
Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028
-----Original Message-----
From: Finan, Sean <[email protected]>
Sent: Thursday, June 11, 2020 8:20 PM
To: [email protected]; [email protected]
Subject: Re: Sentence detector changes [EXTERNAL]
Hi Abad,
None of your embedded images are visible to me, so I don't have whatever
information is contained within those images.
It sounds like you are using the SentenceDetectorBIO. Very cool.
It does have a few idiosyncrasies, one of which you have identified.
There are two helper AEs in ctakes-core that might be useful for you. They are
not in the released (4.0) version of ctakes, only in ctakes trunk.
EolSentenceFixer
Re-annotates Sentences based upon short lines, preventing a Sentence from
spanning over an intentional line break.
The BIO will often lump short (intentionally separated) lines into a single
sentence. This attempts to detect such intentionally short lines and split
them.
MrsDrSentenceJoiner
Joins Sentences with person titles Mr. Mrs. Dr. that have been split by
SentenceDetectorBIO.
You can peek at the code in MrsDrSentenceJoiner and do something similar to
repair cases in which other characters like ')' have caused improper splits.
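
As an illustration of that idea, here is a minimal sketch for the ')' case. It is not the MrsDrSentenceJoiner code: it works on plain strings, the names (ParenJoinSketch, isSplitAfterParen, joinAfterParen) are hypothetical, and the rule itself is an assumption, namely that a piece ending in ')' followed by a piece starting with a lowercase letter was probably one sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch: plain strings stand in for cTAKES Sentence annotations.
final class ParenJoinSketch {

    // Assumed heuristic: ")" at the end of one piece and a lowercase letter at the
    // start of the next suggests the sentence simply continued after the parenthesis.
    static boolean isSplitAfterParen(String left, String right) {
        return left.endsWith(")")
                && !right.isEmpty()
                && Character.isLowerCase(right.charAt(0));
    }

    // Glues each piece onto the previous one (with a single space) when the
    // heuristic fires; otherwise the piece starts a new sentence.
    static List<String> joinAfterParen(List<String> pieces) {
        List<String> joined = new ArrayList<>();
        for (String piece : pieces) {
            int last = joined.size() - 1;
            if (last >= 0 && isSplitAfterParen(joined.get(last), piece)) {
                joined.set(last, joined.get(last) + " " + piece);
            } else {
                joined.add(piece);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> pieces = Arrays.asList(
                "Patient was taking Paracetamol (650 mg)", "thrice daily");
        System.out.println(joinAfterParen(pieces));
    }
}
```

The lowercase test is a deliberate choice to avoid joining a new sentence that happens to follow a parenthetical remark; it will miss continuations that start with a number or a capitalized drug name.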
Because Sentence boundaries are often used in downstream processing (Mentions,
Relations), it is very important that they be properly assigned.
Sean
________________________________
From: [email protected] <[email protected]>
Sent: Thursday, June 11, 2020 10:17 AM
To: [email protected]; [email protected]
Subject: Sentence detector changes [EXTERNAL]
Hi Team,
We are trying to utilize the maximum potential of cTAKES to meet the
requirements for our profile, where we have a requirement to extract the
sentences from the medical document. We have seen cTAKES already providing the
list of sentences in the clinical text within the object as below
[cid:[email protected]]
We also noticed that sentences are delimited based on the predefined
delimiters below, which is a problem for our requirement because sentences
are segregated whenever one of these tokens is encountered.
[cid:[email protected]]
For example, “Patient was taking Paracetamol (650 mg) thrice daily” was split
into two different sentences (because a ‘)’ was encountered):
1. Patient was taking Paracetamol (650 mg)
2. thrice daily
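
The effect described above can be reproduced with a small sketch. It is a simplification under stated assumptions: it cuts the text at every candidate character, whereas the actual detector may apply further logic at each candidate, and the names (CandidateSplitSketch, splitOn) are hypothetical and not taken from cTAKES.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch: splits at EVERY candidate character.
final class CandidateSplitSketch {

    // Cuts "text" after each character that appears in "candidates",
    // trims the pieces and drops empty ones.
    static List<String> splitOn(String text, String candidates) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (candidates.indexOf(text.charAt(i)) >= 0) {
                String piece = text.substring(start, i + 1).trim();
                if (!piece.isEmpty()) {
                    pieces.add(piece);
                }
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            pieces.add(tail);
        }
        return pieces;
    }

    public static void main(String[] args) {
        String text = "Patient was taking Paracetamol (650 mg) thrice daily";
        System.out.println(splitOn(text, ".)")); // ')' is a candidate: two pieces
        System.out.println(splitOn(text, "."));  // only '.': one piece
        System.out.println(splitOn("Temp 5.5 today.", ".")); // decimals still split
    }
}
```

Reducing the candidates to '.' alone keeps the parenthesis example intact, but decimals and '.'-separated dates are still cut, which is why some repair step after sentence detection is still needed.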
So we tried to customize it by removing some of the defined delimiters to meet
our requirement. We tried with just ‘.’ as the delimiter and found that
sentences are split whenever a ‘.’ is encountered. Since this is a change to
the core module, we would like to know whether it will affect the clinical
token identification process or the information already provided, such as
tlink, timex or any other critical attribute.
Kindly advise.
Thanks & Regards
Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028
This e-mail and any files transmitted with it are for the sole use of the
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient(s), please reply to the sender and
destroy all copies of the original message. Any unauthorized review, use,
disclosure, dissemination, forwarding, printing or copying of this email,
and/or any action taken in reliance on the contents of this e-mail is strictly
prohibited and may be unlawful. Where permitted by applicable law, this e-mail
and other e-mail communications sent to and from Cognizant e-mail addresses may
be monitored.