>It would have two completely different applications: a superior way of finding the values of findings and a way of validating/pruning the polarity status of concepts that are in a semi-grammatical or improperly punctuated sentence
-- Cool. I expect to see it by end of business tomorrow.

... if only.

Peter

On Sun, Aug 2, 2020 at 10:46 AM Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:

> For Peter and Jeff:
>
> >are the vocabulary & tui selections that one finds as defaults in the dictionary creator something set by the creator as a ctakes optimization
> -- Good question, and the answer is "no." Those vocabularies and semantic types were chosen simply because they contain clinical terms of interest to previously done national studies. The other semantic types and vocabulary terms, while present in notes, are often not of interest to "standard" clinical studies. Adding more terms from other vocabularies and semantic types should not slow down processing to any noticeable degree.
>
> >are the defaults governed by information the creator reads from the UMLS release
> -- As far as I know there are no recommendations of this sort made by the NLM.
>
> >It would have two completely different applications: a superior way of finding the values of findings and a way of validating/pruning the polarity status of concepts that are in a semi-grammatical or improperly punctuated sentence
> -- Cool. I expect to see it by end of business tomorrow.
>
> >I recently created a dictionary based off of UMLS 2020AA and did not see 'bed' or 'soft' mapped as synonyms to those terms in my .script file. They are there, but mapped to other cuis (for example, the cui for an actual bed from SNOMED). I think the difference is that I select all of the available TUIs on the right and when I do that 'bed' and 'soft' get assigned to different CUIs (with TUIs of "manufactured object" and "quantitative concept" respectively) and the CUI synonyms for the more clinical TUIs are skipped. I selected all the TUIs because the defaults seemed to be missing some things people might be interested in, but I did not expect the behavior where it would change how identical terms from other TUIs get included (maybe this is some kind of WSD?)
> -- Yes, there is some horribly simple "WSD" being done before the dictionary is written.
> What you are seeing is that SOFT only exists as two synonym entries under "Short Stature ...", while it exists as 2++ synonym entries for "bed" and/or it is the preferred text for "bed" (probably not), or something like that.
>
> >but I imagine it could cause other misses.
> -- True. It is really difficult to make the perfect dictionary for any purpose. So, we just go for the best coverage and fewest extraneous entries - or fewest frequently discovered extraneous entries. "Bed" may not be a problem for notes on outpatient visits. For inpatient notes it would be a different story.
>
> And of course, once you get a great set of terms, you get to play with the valid parts of speech. You decide on grabbing every term or only the longest overlapping terms. Allow discontinuous spans or require continuous spans.
>
> Fun.
>
> ________________________________________
> From: Peter Abramowitsch <pabramowit...@gmail.com>
> Sent: Sunday, August 2, 2020 12:14 PM
> To: dev@ctakes.apache.org
> Subject: Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL] [EXTERNAL]
>
> Many thanks Sean and Jeff. You guys must be both on the East Coast, because my coffee has only just kicked in enough to digest your lucid replies. Super helpful information. It sounds like the quick and dirty solution is to rebuild the dictionary without the OMIM and MTH vocabularies. So it’s not a case of a CUI being remapped - but that it’s being layered onto by a particular vocabulary adding a synonym (which in this case is probably very rarely used)
>
> One question related to that - are the vocabulary & tui selections that one finds as defaults in the dictionary creator something set by the creator as a ctakes optimization, or are the defaults governed by information the creator reads from the UMLS release?
>
> And thanks for mentioning the capitalization project. I had been looking in vain for that functionality which I had assumed was already there. You can tell that these are still my first experiences with dictionary building.
>
> I appreciate how difficult it is to find the time to build enhancements to the product when one is so busy just using it. There’s an enhancement I’ve been prototyping for months which brings in some functionality from the Stanford NLP project. But I just don’t have time or energy to productize it. It would have two completely different applications: a superior way of finding the values of findings and a way of validating/pruning the polarity status of concepts that are in a semi-grammatical or improperly punctuated sentence - such as “Denies headache, abdominal pain, temperature normal”
>
> Maybe one day....
>
> Thanks again
> Peter
>
> Sent from my iPad
>
> > On Aug 2, 2020, at 06:25, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
> >
> > Hi Peter,
> >
> > I would guess that you are seeing things like "SOFT" because your new dictionary has a vocabulary that was not included in sno_rx_16ab.
> > I don't remember if OMIM (which has the 'SOFT' synonym) was included in sno_rx_16ab. Probably not, omim is a more -specialized- vocabulary for genetics.
> >
> > The term is only in the omim (and mth) vocabularies in the 2016AB umls release.
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__uts.nlm.nih.gov_metathesaurus.html-23C3542022-3B0-3B1-3BCUI-3B2016AB-3BWORD-3BCUI-3B-2A&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=K3tMfx4RCQyZOhhxqqooR8jr4rMTlawX51Pz_bjcInc&s=7znT93CZlVXo4x9Era3J3Lfx6KbtaPfylNmjOkGhs9E&e=
> >
> > The term is in snomed in umls 2020AA, but only with the expanded full-text synonym. It still has the abbreviation from omim.
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__uts.nlm.nih.gov_metathesaurus.html-23SHORT-2520STATURE-2C-2520ONYCHODYSPLASIA-2C-2520FACIAL-2520DYSMORPHISM-2C-2520AND-2520HYPOTRICHOSIS-3B0-3B1-3BTERM-3B2020AA-3BWORD-3BTERM-3B-2A&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=K3tMfx4RCQyZOhhxqqooR8jr4rMTlawX51Pz_bjcInc&s=Lg3VS4Doc0_jhCg-v-gZRwB87fZ76a4o7nr89b7EKN0&e=
> >
> > As for finding terms in adjectives, the default parts of speech (pos) that are excluded from term lookup are:
> > VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB
> >
> > You can see what these are here:
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ling.upenn.edu_courses_Fall-5F2003_ling001_penn-5Ftreebank-5Fpos.html&d=DwIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=K3tMfx4RCQyZOhhxqqooR8jr4rMTlawX51Pz_bjcInc&s=p6rKDBv8CR-mooZAqh3B-bR3foZ2DKgGy_LQHQwTNX8&e=
> >
> > You can override this list. In your piper file, set the variable "exclusionTags"
> >
> > // Default excluded parts of speech, plus various forms of adjective.
> > set exclusionTags="ADJ,JJ,JJR,JJS,VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"
> >
> > // Annotate concepts based upon default algorithms.
> > add DefaultJCasTermAnnotator
> >
> > You'll notice that I threw in 'ADJ' for good measure. It should not break anything.
> >
> > I have modified this list many times for various projects. In one I allow verbs for lookup. For those notes the value of the true positives outweighed the increased false positives. In another I actually empty the entire list to allow everything (set exclusionTags=""). I did this because there is a lot of structured text in lists and tables, but the pos tagger is trying to resolve prose text.
> > The pos assigned on the structured text is all over the place, and terms are missed left and right.
> >
> > So ... last but definitely not least, case-sensitivity.
> > I started working on this a while ago, but right now it sits unfinished.
> >
> > There is an additional table in the dictionary database, in which all synonyms are all upper-case.
> > This second table is created with synonyms that exist in the umls as all upper-case.
> > The first "classic" table is created using ONLY synonyms from the umls that are lower and/or mixed case.
> >
> > When the annotator engine iterates over the text, it checks one table (classic) or the other (caps) depending upon the case of the text in the note.
> >
> > It sounds like minor work, but it requires a new engine, new dictionary, and new dictionary creator. None of this is difficult, but it requires time.
> >
> > Anyway, I hope that some of this helps.
> >
> > Sean
> >
> > ________________________________________
> > From: Peter Abramowitsch <pabramowit...@gmail.com>
> > Sent: Saturday, August 1, 2020 11:35 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: With custom dictionary - over-eager resolution of acronyms [EXTERNAL]
> >
> > Hi Jeff thanks for your suggestions,
> >
> > I spent some time in the script file and sure enough, my 2020 UMLS extraction actually has these two entries:
> >
> > INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
> > INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')
> >
> > It's unbelievable.
> > The UMLS entry has got to be wrong or I'm missing something to say that it only applies (as an acronym) if it's capitalized.
> >
> > In sno_rx there is neither a CUI 3542022 nor the definition of "soft" as a solitary word, nor even a mention of ONYCHODYSPLASIA or HYPOTRICHOSIS.
> >
> > In any case, I would have thought that ctakes will only create an event mention from a term tagged as NN or NP slot, not an ADJ as in "soft tissue".
> >
> > Anyway Thanks! Now I will keep poking around.
> >
> > Peter
> >
> >> On Sat, Aug 1, 2020 at 5:06 PM Jeffrey Miller <jeff...@gmail.com> wrote:
> >>
> >> Sorry, I meant to suggest searching for 'soft' in the dictionary file, not 'short'
> >>
> >> grep -i ,\'soft\', *.script
> >>
> >>> On Sat, Aug 1, 2020 at 7:47 PM Jeffrey Miller <jeff...@gmail.com> wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> To my knowledge, there isn't any drastic difference in the behavior of the dictionary gui creator and the way the sno_rx dictionary was created. I originally thought there was, but I realized the difference was that I had not installed all of UMLS to my machine (just the vocabularies I was interested in) and I was missing synonyms. The first thing I would check: are you able to find a matching entry in the .script file for your ctakes dictionary when you do this:
> >>>
> >>> grep -i ,\'short\', *.script
> >>>
> >>> That would confirm whether or not you have a term in your dictionary made up only of 'short' and whether it mapped to the CUI equal to "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME". If it's not in there, something else is going on. You could do the same for 'bed'.
> >>>
> >>> If not, another thing I might check is that I noticed you are using the OverlapJCasTermAnnotator in your prior e-mail.
> >>> I don't have much experience with it, and I don't think it should cause this behavior, but I wonder if that could be making the difference (as compared to DefaultJCasTermAnnotator).
> >>>
> >>> Jeff
> >>>
> >>> On Sat, Aug 1, 2020 at 5:27 PM Peter Abramowitsch <pabramowit...@gmail.com> wrote:
> >>>
> >>>> Hi All
> >>>>
> >>>> Having created a new dictionary from the 2020AA UMLS and added Genes and Receptors to the dictionary-creator's default selections, I have a curious problem where cTakes now assigns the most bizarre acronyms to ordinary words used in POS contexts where it shouldn't find <XXX>Mentions.
> >>>>
> >>>> Here are two examples:
> >>>>
> >>>> 1. soft (in "soft tissue...") becomes "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME",
> >>>>
> >>>> 2. bed (in "The wound bed was...") becomes "BORNHOLM EYE DISEASE"
> >>>>
> >>>> I have not changed the TermConsumer type in the descriptor XML.
> >>>>
> >>>> Are the DictionaryCreator's defaults the equivalent of the default sno_rx that's delivered with the app?
> >>>>
> >>>> Attached is the vocab subsets list I used
> >>>>
> >>>> Peter
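
For readers following the case-sensitivity idea Sean describes in this thread (one "classic" table for lower/mixed-case synonyms, one "caps" table for synonyms that are all upper-case in the UMLS), here is a minimal Python sketch of how such a lookup might behave. It is not cTAKES code: the table contents, the labels, and the function name `lookup` are all assumptions made up for illustration, and the real feature was described as unfinished.

```python
from typing import Optional

# Hypothetical "classic" table: synonyms that are lower or mixed case in the
# source vocabulary, keyed in lower case. Labels are illustrative only.
CLASSIC_TABLE = {
    "soft": "soft (quantitative concept)",
    "bed": "bed (manufactured object)",
}

# Hypothetical "caps" table: synonyms that are all upper-case in the source
# vocabulary, i.e. acronyms. Keys are kept upper-case.
CAPS_TABLE = {
    "SOFT": "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME",
    "BED": "BORNHOLM EYE DISEASE",
}


def lookup(token: str) -> Optional[str]:
    """Pick the table based on the case of the token as written in the note."""
    if token.isupper():
        # All-caps text in the note is only matched against acronym entries.
        return CAPS_TABLE.get(token)
    # Lower or mixed-case text is only matched against the classic entries.
    return CLASSIC_TABLE.get(token.lower())
```

With tables split this way, "soft" in "soft tissue" and "bed" in "The wound bed was..." would never reach the acronym entries, while an all-caps "SOFT" in a genetics note still could.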