Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL] [EXTERNAL]

Peter Abramowitsch Sun, 02 Aug 2020 14:17:21 -0700

>It would have two completely different applications:  a superior way of
finding the values of findings and a way of validating/pruning the polarity
status of concepts that are in a semi-grammatical or improperly punctuated
sentence
-- Cool.  I expect to see it by end of business tomorrow.


... if only.

Peter

On Sun, Aug 2, 2020 at 10:46 AM Finan, Sean <
[email protected]> wrote:

> For Peter and Jeff:
>
> > are the vocabulary & tui selections that one finds as defaults in the
> dictionary creator something set by the creator as a ctakes optimization
> -- Good question, and the answer is "no."  Those vocabularies and semantic
> types were chosen simply because they contain clinical terms of interest to
> previously done national studies.  The other semantic types and vocabulary
> terms, while present in notes, are often not of interest to "standard"
> clinical studies.  Adding more terms from other vocabularies and semantic
> types should not slow down processing to any noticeable degree.
> > are the defaults governed by information the creator reads from the UMLS
> release
> -- As far as I know there are no recommendations of this sort made by the
> NLM.
>
> >It would have two completely different applications:  a superior way of
> finding the values of findings and a way of validating/pruning the polarity
> status of concepts that are in a semi-grammatical or improperly punctuated
> sentence
> -- Cool.  I expect to see it by end of business tomorrow.
>
> >I recently created a dictionary based off of UMLS 2020AA and did not see
> 'bed' or 'soft' mapped as synonyms to those terms in my .script file. They
> are there, but mapped to other cuis (for example, the cui for an actual bed
> from SNOMED). I think the difference is that I select all of the available
> TUIs on the right and when I do that 'bed' and 'soft' get assigned to a
> different CUIs (with TUIs of "manufactured object" and "quantitative
> concept" respectively) and the CUI synonyms for the more clinical TUIs are
> skipped. I selected all the TUIs because the defaults seemed to be missing
> some things people might be interested in, but I did not expect the
> behavior where it would change how identical terms from other TUIs get
> included (maybe this is some kind of WSD?)
> -- Yes, there is some horribly simple "WSD" being done before the
> dictionary is written.
> What you are seeing is that SOFT only exists as two synonym entries under
> "Short Stature ...", while it exists as 2++ synonym entries for "bed"
> and/or it is the preferred text for "bed" (probably not), or something like
> that.
>
> >but I imagine it could cause other misses.
> -- True.  It is really difficult to make the perfect dictionary for any
> purpose.  So, we just go for the best coverage and fewest extraneous
> entries - or fewest frequently discovered extraneous entries.  "Bed" may
> not be a problem for notes on outpatient visits.  For inpatient notes it
> would be a different story.
>
> And of course, once you get a great set of terms, you get to play with the
> valid parts of speech.  You decide on grabbing every term or only the
> longest overlapping terms.  Allow discontinuous spans or require continuous
> spans.
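
The "every term or only the longest overlapping terms" choice can be sketched roughly as below. This is a minimal Python illustration, not cTAKES code (cTAKES is Java); the function name and the (begin, end) tuple representation are assumptions made for the example.

```python
def keep_longest_spans(spans):
    """Drop any (begin, end) span that lies inside another span already kept."""
    # Sort by start offset, longest first among spans sharing a start,
    # so a containing span is always visited before the spans it contains.
    ordered = sorted(set(spans), key=lambda s: (s[0], -s[1]))
    kept = []
    for begin, end in ordered:
        if any(b <= begin and end <= e for b, e in kept):
            continue
        kept.append((begin, end))
    return kept


text = "Denies headache, abdominal pain"


def span_of(phrase, start=0):
    """Character offsets of the first occurrence of phrase at or after start."""
    begin = text.index(phrase, start)
    return (begin, begin + len(phrase))


headache = span_of("headache")
abdominal_pain = span_of("abdominal pain")
pain = span_of("pain", abdominal_pain[0])

all_terms = [headache, abdominal_pain, pain]
longest_only = keep_longest_spans(all_terms)
```

Here `all_terms` corresponds to grabbing every term, while `longest_only` drops "pain" because it sits inside "abdominal pain".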
>
> Fun.
>
>
>
> ________________________________________
> From: Peter Abramowitsch <[email protected]>
> Sent: Sunday, August 2, 2020 12:14 PM
> To: [email protected]
> Subject: Re: With custom dictionary - over-eager resolution of acronyms
> [EXTERNAL] [EXTERNAL]
>
> Many thanks Sean and Jeff.  You guys must both be on the East Coast,
> because my coffee has only just kicked in enough to digest your lucid
> replies.   Super helpful information.  It sounds like the quick and dirty
> solution is to rebuild the dictionary without the OMIM and MTH
> vocabularies.  So it’s not a case of a CUI being remapped - but that it’s
> being layered onto by a particular vocabulary adding a synonym (which in
> this case is probably very rarely used)
>
> One question related to that - are the vocabulary & tui selections that
> one finds as defaults in the dictionary creator something set by the
> creator as a ctakes optimization, or are the defaults governed by
> information the creator reads from the UMLS release?
>
> And thanks for mentioning the capitalization project.  I had been looking
> in vain for that functionality which I had assumed was already there.  You
> can tell that these are still my first experiences with dictionary building.
>
> I appreciate how difficult it is to find the time to build enhancements to
> the product when one is so busy just using it.   There’s an enhancement
> I’ve been prototyping for months which brings in some functionality from
> the Stanford NLP project.  But I just don’t have time or energy to productize
> it.   It would have two completely different applications:  a superior way
> of finding the values of findings and a way of validating/pruning the
> polarity status of concepts that are in a semi-grammatical or improperly
> punctuated sentence - such as “Denies headache, abdominal pain, temperature
> normal”
>
> Maybe one day....
>
> Thanks again
> Peter
>
> Sent from my iPad
>
> > On Aug 2, 2020, at 06:25, Finan, Sean <[email protected]>
> wrote:
> >
> > Hi Peter,
> >
> > I would guess that you are seeing things like "SOFT" because your new
dictionary has a vocabulary that was not included in sno_rx_16ab.
> > I don't remember if OMIM (which has the 'SOFT' synonym) was included in
> sno_rx_16ab.  Probably not, omim is a more -specialized- vocabulary for
> genetics.
> >
> > The term is only in the omim (and mth) vocabularies in the 2016AB umls
> release.
> >
https://uts.nlm.nih.gov/metathesaurus.html#C3542022;0;1;CUI;2016AB;WORD;CUI;*
> >
> > The term is in snomed in umls 2020AA, but only with the expanded
> full-text synonym.  It still has the abbreviation from omim.
> >
https://uts.nlm.nih.gov/metathesaurus.html#SHORT%20STATURE,%20ONYCHODYSPLASIA,%20FACIAL%20DYSMORPHISM,%20AND%20HYPOTRICHOSIS;0;1;TERM;2020AA;WORD;TERM;*
> >
> > As for finding terms in adjectives, the default parts of speech (POS)
that are excluded from term lookup are:
> >
> VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB
> >
> > You can see what these are here:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
> >
> > You can override this list.  In your piper file, set the variable
> "exclusionTags"
> >
> > // Default excluded parts of speech, plus various forms of adjective.
> > set exclusionTags="ADJ,JJ,JJR,JJS,VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"
> >
> > //  Annotate concepts based upon default algorithms.
> > add DefaultJCasTermAnnotator
> >
> >
> > You'll notice that I threw in 'ADJ' for good measure.  It should not
> break anything.
> >
> > I have modified this list many times for various projects.  In one I
> allow verbs for lookup.  For those notes the value of the true positives
outweighed the increased false positives.  In another I actually empty the
> entire list to allow everything (set exclusionTags="").  I did this because
> there is a lot of structured text in lists and tables, but the pos tagger
> is trying to resolve prose text.  The pos assigned on the structured text
> is all over the place, and terms are missed left and right.
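
The effect of the exclusionTags setting described above can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions: cTAKES is Java, the function names here are hypothetical, and the real annotator works on lookup windows rather than judging each token in isolation. The default tag list and the adjective additions are copied from this message.

```python
# Default exclusion list as quoted in this message.
DEFAULT_EXCLUSION_TAGS = (
    "VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,"
    "PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"
)


def parse_exclusion_tags(value):
    """Turn a comma-separated exclusionTags string into a set of POS tags."""
    return {tag.strip() for tag in value.split(",") if tag.strip()}


def lookup_candidates(tagged_tokens, exclusion_tags):
    """Return the words whose POS tag is not excluded from dictionary lookup."""
    excluded = parse_exclusion_tags(exclusion_tags)
    return [word for word, pos in tagged_tokens if pos not in excluded]


# Hand-tagged example tokens (the tags are assumed, not tagger output).
tagged = [("soft", "JJ"), ("tissue", "NN"), ("was", "VBD"), ("swollen", "VBN")]

default_hits = lookup_candidates(tagged, DEFAULT_EXCLUSION_TAGS)
no_adjectives = lookup_candidates(tagged, "ADJ,JJ,JJR,JJS," + DEFAULT_EXCLUSION_TAGS)
everything = lookup_candidates(tagged, "")
```

With the defaults, "soft" is still a lookup candidate because JJ is not excluded; adding the adjective tags removes it, and an empty list lets every token through.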
> >
> > So ... last but definitely not least, case-sensitivity.
> > I started working on this a while ago, but right now it sits unfinished.
> >
> > There is an additional table in the dictionary database, in which the
synonyms are all upper-case.
> > This second table is created with synonyms that exist in the umls as all
> upper-case.
> > The first  "classic" table is created using ONLY synonyms from the umls
> that are lower and/or mixed case.
> >
> > When the annotator engine iterates over the text, it checks one table
> (classic) or the other (caps) depending upon the case of the text in the
> note.
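
A rough Python sketch of the two-table idea, assuming plain in-memory dicts stand in for the database tables. The table names, the function name, and the "bed" entry in the classic table are hypothetical; the two upper-case entries use the preferred texts quoted elsewhere in this thread. The unfinished cTAKES work may differ in every detail.

```python
# "Classic" table: synonyms that appear in the UMLS in lower or mixed case.
CLASSIC_TABLE = {
    "bed": "Beds (illustrative placeholder entry)",
}

# "Caps" table: synonyms that appear in the UMLS only in upper case.
CAPS_TABLE = {
    "SOFT": "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, "
            "AND HYPOTRICHOSIS SYNDROME",
    "BED": "BORNHOLM EYE DISEASE",
}


def lookup(note_text):
    """Pick the table from the case of the note text, then look the text up."""
    if note_text.isupper():
        # All-caps text in the note is treated as a possible acronym.
        return CAPS_TABLE.get(note_text)
    # Anything else is matched case-insensitively against the classic table.
    return CLASSIC_TABLE.get(note_text.lower())


soft_lower = lookup("soft")
soft_caps = lookup("SOFT")
bed_mixed = lookup("Bed")
bed_caps = lookup("BED")
```

Under this scheme "soft" in "soft tissue" no longer resolves to the syndrome, while a capitalized "SOFT" still does.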
> >
> > It sounds like minor work, but it requires a new engine, new dictionary,
> and new dictionary creator.  None of this is difficult, but it requires
> time.
> >
> > Anyway, I hope that some of this helps.
> >
> > Sean
> >
> >
> > ________________________________________
> > From: Peter Abramowitsch <[email protected]>
> > Sent: Saturday, August 1, 2020 11:35 PM
> > To: [email protected]
> > Subject: Re: With custom dictionary - over-eager resolution of acronyms
> [EXTERNAL]
> >
> > Hi Jeff thanks for your suggestions,
> >
> > I spent some time in the script file and sure enough,  my 2020 UMLS
> > extraction actually has these two entries:
> >
> > INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
> > INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')
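
For checking more than one word, the grep can be replaced by a small script that pairs each matching CUI_TERMS row with its PREFTERM text. The Python sketch below assumes the INSERT statements look exactly like the two quoted above, each on a single line; it ignores the second and third numeric fields and the last quoted field, and the function name is hypothetical.

```python
import re

# Patterns modeled on the two statements quoted in this message.
# '' inside a quoted value is treated as an escaped single quote.
CUI_TERM = re.compile(
    r"INSERT INTO CUI_TERMS VALUES\((\d+),\d+,\d+,"
    r"'((?:[^']|'')*)','(?:[^']|'')*'\)"
)
PREFTERM = re.compile(r"INSERT INTO PREFTERM VALUES\((\d+),'((?:[^']|'')*)'\)")


def concepts_for_term(script_lines, term):
    """Return (cui, preferred_text) pairs for rows whose text equals term."""
    wanted = term.lower()  # case-insensitive, like grep -i
    cuis = []
    preferred = {}
    for line in script_lines:
        line = line.strip()
        match = CUI_TERM.match(line)
        if match:
            if match.group(2).lower() == wanted:
                cuis.append(int(match.group(1)))
            continue
        match = PREFTERM.match(line)
        if match:
            preferred[int(match.group(1))] = match.group(2).replace("''", "'")
    # Preferred text is None when the script has no PREFTERM row for a CUI.
    return [(cui, preferred.get(cui)) for cui in cuis]


sample = [
    "INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')",
    "INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, "
    "FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')",
]
hits = concepts_for_term(sample, "SOFT")
```

Against a real dictionary one would pass an open file object for the .script file instead of `sample`. Note that this sketch keeps every preferred term in memory, which may matter for a large dictionary.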
> >
> > It's unbelievable.  Either the UMLS entry is wrong, or I'm missing
> > something that says it only applies (as an acronym) if it's capitalized.
> >
> > In sno_rx  there is neither a CUI 3542022 nor the definition of "soft"
> as a
> > solitary word, nor even a mention of ONYCHODYSPLASIA or HYPOTRICHOSIS
> >
> > In any case, I would have thought that ctakes would only create an event
> > mention from a term tagged as an NN or NP, not an ADJ as in "soft
tissue"
> >
> > Anyway  Thanks!  Now I will keep poking around.
> >
> >
> > Peter
> >
> >> On Sat, Aug 1, 2020 at 5:06 PM Jeffrey Miller <[email protected]>
> wrote:
> >>
> >> Sorry, I meant to suggest searching for 'soft' in the dictionary file, not
> >> 'short'
> >>
> >> grep -i ,\'soft\', *.script
> >>
> >>> On Sat, Aug 1, 2020 at 7:47 PM Jeffrey Miller <[email protected]>
> wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> To my knowledge, there isn't any drastic difference in the behavior of
> >> the
> >>> dictionary gui creator and the way the sno_rx dictionary was created. I
> >>> originally thought there was, but I realized the difference was that I
> >> had
> >>> not installed all of UMLS to my machine (just the vocabularies I was
> >>> interested in) and I was missing synonyms. The first thing I would
> check,
> >>> are you able to find a matching entry in the .script file for your
> ctakes
> >>> dictionary when you do this:
> >>>
> >>> grep -i ,\'short\', *.script
> >>>
> >>> That would confirm whether or not you have a term in your dictionary
> made
> >>> up only of 'short' and whether it mapped to the CUI equal to "SHORT
> >>> STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS
> >> SYNDROME".
> >>> If it's not in there, something else is going on. You could do the same
> >> for
> >>> 'bed'.
> >>>
> >>> If not, another thing I might check is that I noticed you are using
> >>> the OverlapJCasTermAnnotator in your prior e-mail. I don't have much
> >>> experience with it, and I don't think it should cause this behavior,
> but
> >> I
> >>> wonder if that could be making the difference (as compared
> >>> to DefaultJCasTermAnnotator).
> >>>
> >>> Jeff
> >>>
> >>> On Sat, Aug 1, 2020 at 5:27 PM Peter Abramowitsch <
> >> [email protected]>
> >>> wrote:
> >>>
> >>>>
> >>>> Hi All
> >>>>
> >>>> Having created a new dictionary from the 2020AA UMLS and added Genes
> and
> >>>> Receptors to the dictionary-creator's default selections, I have a
> >> curious
> >>>> problem where cTakes now assigns the most bizarre acronyms to ordinary
> >>>> words used in POS contexts where it shouldn't  find <XXX>Mentions.
> >>>>
> >>>> Here are two examples:
> >>>>
> >>>> 1.   soft (in "soft tissue...")
> >>>> becomes   "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND
> >>>> HYPOTRICHOSIS SYNDROME",
> >>>>
> >>>> 2.   bed in ("The wound bed was...")
> >>>> becomes  "BORNHOLM EYE DISEASE"
> >>>>
> >>>> I have not changed the TermConsumer type in the descriptor XML.
> >>>>
> >>>> Are the DictionaryCreator's defaults equivalent to the default
> >>>> sno_rx that's delivered with the app?
> >>>>
> >>>> Attached is the vocab subsets list I used
> >>>>
> >>>>
> >>>> Peter
> >>>>
> >>>>
> >>>>
> >>
>
