Just in case somebody else has already done this or has any ideas, I am forwarding the question and one answer to:

> How do we find out which entity has more relevance to the document? I need
> this, as we need to limit our outputs to max 10 terms for one clinical
> document.
Sean

From: Finan, Sean
Sent: Tuesday, January 02, 2018 12:31 PM
To: 'Ratan Sharma'
Subject: RE: Facing issues with cTakes confidence score [EXTERNAL]

Hi Ratan,

There are a couple of things that you can do, but getting down to 10 terms per note will be difficult.

The first thing to do is go into your resources and edit your dictionary’s setup xml file. It is either in ctakes-dictionary-lookup-fast-res/ or resources/ depending upon how you are running. Go all the way to the end of org/ctakes/dictionary/lookup/fast/. At the bottom of the xml file you will see a couple of commented lines, one with “PrecisionTermConsumer”. Uncomment that line and comment out the line with “DefaultTermConsumer”. This will limit mentions so that you will get things like “lung cancer” instead of both “lung cancer” and “cancer”, “lung cancer” being the more specific disease. You will still get “lung” in each case as an anatomical site.

The second thing that you can do is build up a map of counts per CUI. You can get a map of cuis and the number of times they appear in the document (Map<String,Long>) with the following command:

   OntologyConceptUtil.getCuiCounts( jCas )

You can sort by the number of appearances and grab the top 10.

Another thing that might help is filtering out the negated concepts. Something like:

   Map<String,Long> topTenYes = JCasUtil.select( jCas, IdentifiedAnnotation.class ).stream()
         .filter( ia -> ia.getPolarity() != CONST.NE_POLARITY_NEGATION_PRESENT )
         .map( OntologyConceptUtil::getCuiCounts )

Another thing to do would be to filter by subject. For each identified annotation, keep it only when CONST.ATTR_SUBJECT_PATIENT.equals( annotation.getSubject() ).

Related to subject, you can filter out identified annotations in sections like family history. Use JCasUtil.indexCovered( jCas, Segment.class, IdentifiedAnnotation.class ) to get the annotations in each section, and filter by checking each segment’s .getPreferredText().
If the preferred text is “Family Medical History” then you can probably discount everything in that section. Likewise, if the mentions are in things like “Patient History” then they may not have to do with the current encounter. You can find section names in the ctakes-core-res DefaultSectionRegex.bsv file. You will need to have the BsvRegexSectionizer in your pipeline. I would use the SectionedFastPipeline piper in ctakes-clinical-pipeline-res and add your custom filtering annotator to the end of it.

Lastly, if you use the temporal modules you can filter by the time relative to the document time (doc time rel) being overlap or before/overlap. Use the SectionedTemporalPipeline piper in ctakes-temporal-res. Then some code like the following:

   if ( annotation instanceof EventMention ) {
      final Event event = ((EventMention)annotation).getEvent();
      if ( event != null ) {
         final EventProperties properties = event.getProperties();
         if ( properties != null ) {
            final String doctimerel = properties.getDocTimeRel();
            final boolean keepThisAnnotation = doctimerel != null && doctimerel.contains( "Overlap" );
         }
      }
   }

That should give you a start. I am not sure how much each will help, but they are suggestions of things that you can try.

Sean

From: Ratan Sharma [mailto:ratanc...@gmail.com]
Sent: Tuesday, January 02, 2018 11:20 AM
To: Finan, Sean
Subject: Re: Facing issues with cTakes confidence score [EXTERNAL]

Thanks Sean for the reply.

So is there no way we can assign relevance/confidence to entities? How do we find out which entity has more relevance to the document? I need this, as we need to limit our outputs to max 10 terms for one clinical document.

Thanks for your time on this. Really appreciate it.

On Tue, Jan 2, 2018 at 9:00 PM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

Hi Ratan,

What Tim said is absolutely correct. Those mentions are all discovered by dictionary lookup procedures.
The default procedure is strict lookup against a term in the dictionary database and no lookup has any more validity than any other, so “confidence” is pretty meaningless. Confidence can be introduced by other modules and for various reasons, but for creation of mentions using standard ctakes that value is never set.

Sean

From: Ratan Sharma [mailto:ratanc...@gmail.com]
Sent: Saturday, December 30, 2017 5:23 AM
To: Finan, Sean
Subject: Facing issues with cTakes confidence score [EXTERNAL]

Hi Sean,

Can you please add your thoughts to this query:
http://ctakes.markmail.org/search/?q=#query:+page:1+mid:czbrt7itywvvjqnm+state:results

I am looking for a way to distinguish which entity has higher weightage than others, like a relevance score for each entity.

Is it possible we can have a meeting to discuss this? Any time that suits you is fine with me.

Thank you.
Ratan
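
For anyone trying the suggestions in the 12:31 PM reply above, here is a minimal sketch of how the filters and the per-CUI counting might fit together. It is plain Java with no cTAKES or uimaFIT calls. The Mention class, its fields, and the string values used for subject, section, and doc time rel are hypothetical stand-ins for what you would read from each IdentifiedAnnotation in a real pipeline. The case-insensitive "overlap" check and the alphabetical tie-break are also assumptions, not cTAKES behavior.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical stand-in for the values read from one cTAKES IdentifiedAnnotation. */
final class Mention {
   final String cui;         // concept unique identifier
   final boolean negated;    // true when the mention's polarity marks it as negated
   final String subject;     // assumed values such as "patient" or "family_member"
   final String section;     // preferred text of the covering section (assumed)
   final String docTimeRel;  // assumed values such as "OVERLAP" or "BEFORE"; may be null

   Mention( final String cui, final boolean negated, final String subject,
            final String section, final String docTimeRel ) {
      this.cui = cui;
      this.negated = negated;
      this.subject = subject;
      this.section = section;
      this.docTimeRel = docTimeRel;
   }
}

final class TopCuiSketch {
   private TopCuiSketch() {}

   /** Keep mentions that are not negated, concern the patient, sit outside the
    *  family history section, and have a doc time rel containing "overlap". */
   static boolean isRelevant( final Mention m ) {
      return !m.negated
            && "patient".equals( m.subject )
            && !"Family Medical History".equals( m.section )
            && m.docTimeRel != null
            && m.docTimeRel.toLowerCase().contains( "overlap" );
   }

   /** Count the surviving mentions per CUI and return the most frequent CUIs.
    *  Ties are broken alphabetically by CUI (an arbitrary choice for this sketch). */
   static List<String> topCuis( final List<Mention> mentions, final int limit ) {
      final Map<String, Long> counts = mentions.stream()
            .filter( TopCuiSketch::isRelevant )
            .collect( Collectors.groupingBy( m -> m.cui, Collectors.counting() ) );
      return counts.entrySet().stream()
            .sorted( Map.Entry.<String, Long>comparingByValue( Comparator.reverseOrder() )
                  .thenComparing( Map.Entry.<String, Long>comparingByKey() ) )
            .limit( limit )
            .map( Map.Entry::getKey )
            .collect( Collectors.toList() );
   }
}
```

Calling TopCuiSketch.topCuis( mentions, 10 ) would give at most ten CUIs, most frequent first. Frequency is only a rough proxy for relevance, and as Sean says it is not clear how much each filter will help, so it may be worth turning the conditions in isRelevant on one at a time.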