Hi Jeff,

The dictionary creator uses the CUI set from selected sources, but synonyms from all available sources for CUIs in that set.
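To make that concrete, here is a minimal Python sketch of the behaviour described above. It is not the dictionary creator's actual code: `collect_terms`, the sample rows and the `OTHER_SRC` vocabulary are made up for illustration, and only the column positions (CUI, LAT, SAB, STR) follow the pipe-delimited MRCONSO.RRF layout.

```python
from collections import defaultdict

# Column positions in a pipe-delimited MRCONSO.RRF row.
CUI, LAT, SAB, STR = 0, 1, 11, 14


def collect_terms(mrconso_lines, selected_sources):
    """Hypothetical helper: CUIs come from selected sources, synonyms from all."""
    rows = [line.rstrip("\n").split("|") for line in mrconso_lines if line.strip()]
    rows = [r for r in rows if len(r) > STR and r[LAT] == "ENG"]

    # Pass 1: only the selected vocabularies decide which CUIs are kept.
    wanted_cuis = {r[CUI] for r in rows if r[SAB] in selected_sources}

    # Pass 2: every installed vocabulary may contribute text for those CUIs.
    terms = defaultdict(set)
    for r in rows:
        if r[CUI] in wanted_cuis:
            terms[r[CUI]].add(r[STR].lower())
    return dict(terms)


# Made-up sample rows in MRCONSO layout (identifiers are placeholders).
SAMPLE = [
    "C0011849|ENG|P|L1|PF|S1|Y|A1||||SNOMEDCT_US|PT|73211009|Diabetes mellitus|9|N||",
    "C0011849|ENG|S|L2|PF|S2|Y|A2||||OTHER_SRC|SY|X1|DM|0|N||",
    "C0000001|ENG|P|L3|PF|S3|Y|A3||||OTHER_SRC|PT|X2|Some other concept|0|N||",
]
```

Calling `collect_terms(SAMPLE, {"SNOMEDCT_US", "RXNORM"})` keeps only the first CUI but picks up the extra synonym from the unselected vocabulary. Passing just the first row (as if only the selected sources were installed) yields the single "diabetes mellitus" term, which is the kind of difference a restricted UMLS install would produce under this assumption.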
I am not sure what is going on with the 's' in "diabetes". A grep for "diabetes mellitus" and "diabete mellitus" in the umls mrconso file might have a hint. Perhaps some code thinks that it is fixing a plural term?

Sean
________________________________________
From: Jeffrey Miller <[email protected]>
Sent: Tuesday, June 18, 2019 10:23 PM
To: [email protected]
Subject: Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge [EXTERNAL]

Thanks Sean. I actually think I figured out what is causing the difference. When I create the UMLS install on my machine, I only install RxNorm and SNOMEDCT_US, so when I use the dictionaryCreator GUI, there are only those two sources on the left. I noticed in the screenshots on the wiki page for the dictionary creator GUI that many sources were installed, but only SNOMEDCT_US and RxNorm were selected. So, I tried installing all of the active UMLS set (but still only selecting RxNorm and SNOMEDCT_US in the dictionaryCreator GUI) and it made a difference as to which terms appeared in the final cTAKES dictionary. As an example, I now get the "DM" entry for diabetes. I don't know why this should make a difference, but it appears that it does.

Another odd observation related to this. In the sno_rx_2016ab file, I noticed there seems to be an error:
INSERT INTO CUI_TERMS VALUES(11849,0,2,'diabete mellitus','diabete')

The 's' is missing from diabetes. When I created my dictionary (from the restricted UMLS install, but still 2016ab) the cTAKES dictionary entry for that term is correct:
INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')

When I created the dictionary from the full cTAKES install tonight, that error appeared again.

Jeff

On Mon, Jun 17, 2019 at 8:08 PM Finan, Sean <[email protected]> wrote:

> Hi Jeff,
>
> Thanks for doing the research. Since the sno_rx_16ab was made 3+ years ago I can't swear to any of those filter sets being exactly what was used.
>
> I think that the key to working with any project is to check the dictionary against a project's needs. Fill in the gaps by either editing the sql (.script) file or by adding a second dictionary. In smaller "focus" projects I usually end up augmenting the default dictionary with a small custom bsv dictionary to catch any known synonyms or terms that aren't represented in the default. In projects requiring larger nets I have built dictionaries that are horribly inclusive - 2 to 3 times the size of sno_rx_16ab.
>
> Sean
> ________________________________________
> From: Jeffrey Miller <[email protected]>
> Sent: Monday, June 17, 2019 4:39 PM
> To: [email protected]
> Subject: Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge [EXTERNAL]
>
> Thanks for following up Sean. I've looked into the links you sent along. There are different groups of filters and it appears that the dictionaryBuilder GUI is hardcoded to use the files in the "tiny" directory. I don't think this is the set of filters used to make sno_rx_16ab because the 'tiny' filter group contains "today" (today brand veterinary product. 310367) in "UnwantedTexts.txt", but the sno_rx_16ab.script file has "today" still in there. If you create a dictionary with the dictionary builder, it does not include that term.
>
> I thought maybe the set of files under the "default" filter directory might be the one used for the sno_rx_16ab package so I recompiled the dictionaryCreator GUI to use the "default" filter files and created a new snomed rxnorm dictionary from the 2016ab umls release, but the output is still quite different than the packaged sno_rx_16ab dictionary. From looking at diffs, it looks like there are a substantial number of additions to the sno_rx_16ab, so much so that I really must be missing something.
> For example, for CUI 12169 which describes a low sodium diet, there are about 27 CUI terms in sno_rx_16ab.script, but in the script generated by the dictionaryGUI there are only 7 (with the "tiny" or "default" filter groups).
>
> On Sun, Jun 16, 2019 at 3:27 PM Remy Sanouillet <[email protected]> wrote:
>
> > Thanks for the clarifications, Sean. That was very enlightening. I look forward to the documentation (even if it entails some suffering on your part.)
> >
> > If/when you stumble on some idle time allowing you to implement the manual edit panel, it would be nice to have it allow for re-partitioning the ontology. As you are very aware, UMLS CUIs and SNOMED do not always have a one-to-one correspondence resulting in a CUI matching multiple SNOMEDs or a SNOMED being mapped to several CUIs.
> >
> > In some cases, clinicians don't agree with that partitioning in specialized contexts and the inheritance that ensues and would like to re-assign them.
> >
> > Not holding my breath, but just something to keep in mind.
> >
> > Remy
> >
> > On Sun, Jun 16, 2019 at 7:16 AM Finan, Sean <[email protected]> wrote:
> >
> > > Hi Jeff,
> > >
> > > >1) ...
> > > There are several collections of filter sets here:
> > > ctakes-gui-res\src\main\resources\org\apache\ctakes\gui\dictionary\data\
> > >
> > > 2) ...
> > > There is additional logic within the dictionary creator code:
> > > ctakes-gui\src\main\java\org\apache\ctakes\gui\dictionary\
> > >
> > > I haven't gone through it in a really long time, and without doing so now I can't enumerate the filters. I have family visiting, otherwise my curiosity would force me to do so and get back to you. Honestly, it should be documented somewhere, but writing (especially technical) is pretty much my least favorite activity.
> > >
> > > Sean
> > >
> > >
> > > p.s.
> > > Please don't wait for it, but I am currently working on new dictionary code and plan to introduce that in ctakes. Again, please don't wait for it as it is mixed in with other work and will not be available for several months (if at all).
> > >
> > >
> > > ________________________________________
> > > From: Jeffrey Miller <[email protected]>
> > > Sent: Sunday, June 16, 2019 9:49 AM
> > > To: [email protected]
> > > Subject: Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge [EXTERNAL]
> > >
> > > Hi Sean,
> > >
> > > Thanks for your response. I had two follow-up questions that would be very helpful to understand if you have a few moments:
> > >
> > > 1) Are the specific filters used in the official sno_rx_16ab codified anywhere so that I could reproduce them?
> > >
> > > 2) Do these filters explain all the changes? For example, when I use the dictionary creator to export sno_med and rx_norm, I only get "diabetes mellitus" whereas sno_rx_16ab contains both "diabetes" and "dm". Especially with the addition of "dm" it feels like I must be missing a step or a setting somewhere.
> > >
> > > Thanks!
> > > Jeff
> > >
> > > On Sun, Jun 16, 2019 at 8:55 AM Finan, Sean <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > The contents of the sno_rx_16ab are a dump of the umls 2016AB snomed and rxnorm terms with certain semantic types. Nothing was added, but synonyms are filtered based upon various rules. For instance, unnecessary suffixes are removed ("Wart (Finding)" -> "Wart"), really long terms are excluded ("can walk straight line with only minimal assistance"), terms with dose or form are ignored and so forth.
> > > >
> > > > Some filters can be changed by adding/removing from prefix/suffix/contains lists in plaintext files or by modifying the dictionary creator code.
> > > >
> > > > There was no manual curation (or nothing major). As Remy mentioned that requires a lot of attention and time. The dictionary database was not intended to be perfect, just as good as possible without major investment - and reproducible with updates to the umls.
> > > >
> > > > As the dictionary is released as a sql database, you should be able to add and remove fairly easily if sql savvy. I have long wanted to add a "manual edit" panel to the dictionary gui, but haven't had the time. If anybody else would like to work on such a tool that would be tonic.
> > > >
> > > > Sean
> > > >
> > > >
> > > > ________________________________________
> > > > From: Harish Kulkarni <[email protected]>
> > > > Sent: Saturday, June 15, 2019 5:16 PM
> > > > To: [email protected]
> > > > Subject: Re: Differences in dictionary built with dictionaryBuilder and sno_rx16ab from sourceforge [EXTERNAL]
> > > >
> > > > unsubscribe
> > > >
> > > > On Sat, Jun 15, 2019 at 1:40 PM Remy Sanouillet <[email protected]> wrote:
> > > >
> > > > > Yes, I agree it would be nice because the tokenization that occurs when creating the dictionaries from the releases makes comparisons a bit tricky and is not 100% reversible. I would love to hear an answer to your quandary.
> > > > >
> > > > > Remy
> > > > >
> > > > > On Sat, Jun 15, 2019 at 1:23 PM Jeffrey Miller <[email protected]> wrote:
> > > > >
> > > > > > Thanks, I was curious if the cTAKES devs that created the sno_rx_16ab dictionary had put the differences applied to the default UMLS output into version control in some form.
> > > > > > I imagine the additions/synonyms/abbreviations that were added manually must have been collected over time somewhere prior to merging them with the 2016ab UMLS release? I basically want to recreate the default cTAKES 4.0.0 release with an additional ontology and the latest terms. I can likely come up with a diff myself but was wondering if this was already maintained as part of cTAKES.
> > > > > >
> > > > > > On Sat, Jun 15, 2019 at 12:24 PM Remy Sanouillet <[email protected]> wrote:
> > > > > >
> > > > > > > Yes, that's pretty much what we do too. Not only to enhance the dictionary, but to put in corrections because, lo and behold, there are some errors in there! As you know, an ontology is a constant curation job and that script, under SCM, allows you to isolate those changes and, if necessary, re-apply them to new versions.
> > > > > > >
> > > > > > > Remy
> > > > > > >
> > > > > > > On Sat, Jun 15, 2019 at 8:36 AM gandhi rajan <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Jeff,
> > > > > > > >
> > > > > > > > As far as I know, maintaining a separate SQL script to add additional entries should work seamlessly.
> > > > > > > >
> > > > > > > > On Saturday, June 15, 2019, Jeffrey Miller <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Thanks Remy. Does anyone know if these manually curated modifications/synonyms are tracked anywhere (aside from the dictionary itself) so they can be carried forward in future dictionary updates?
> > > > > > > > >
> > > > > > > > > On Fri, Jun 14, 2019 at 4:28 PM Remy Sanouillet <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > From my experience, it seems pretty obvious that sno_rx_16ab is a curated dictionary based on the SNOMED 2016AB release. It does not contain the full set but it has additional edits and synonyms that are pretty useful (including 'dm').
> > > > > > > > > >
> > > > > > > > > > We have had to manage those mods as an adjunct.
> > > > > > > > > >
> > > > > > > > > > Remy
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 14, 2019 at 1:03 PM Jeffrey Miller <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > > I have created a custom dictionary from the latest UMLS release with SNOMEDCT_US and RxNorm and I've noticed it seems to be generating a .script file with unexpected differences as compared to the sno_rx_16ab file available as part of the cTAKES release.
> > > > > > > > > > > Specifically, for diabetes, it is missing these two rows:
> > > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,0,1,'dm','dm')
> > > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,0,1,'diabetes','diabetes')
> > > > > > > > > > >
> > > > > > > > > > > and only has this one:
> > > > > > > > > > > INSERT INTO CUI_TERMS VALUES(11849,1,2,'diabetes mellitus','mellitus')
> > > > > > > > > > >
> > > > > > > > > > > The end result is that "diabetes" is not being picked up in the test text I am running through - it requires the full 'diabetes mellitus'.
> > > > > > > > > > >
> > > > > > > > > > > Is there any setting on the UMLS install side or the cTAKES dictionary creator that could account for missing alternative forms like this? I've tried downloading the 2016AB release (which I think is the one used to create the bundled sno_rx_16ab package?) and I am not getting the alternate forms in that dictionary either.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jeff
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > > Gandhi
> > > > > > > >
> > > > > > > > "The best way to find urself is to lose urself in the service of others !!!"
