P.S. Extra config bit: I also removed "CD" from the exclusionTags in the UmlsOverlapLookupAnnotator.
On Wed, Mar 7, 2018 at 10:58 AM, Kean Kaufmann <[email protected]> wrote:

> Hi Sean,
>
> I'm perplexed. It seems as if the number of tokens that the
> UmlsOverlapLookupAnnotator will skip varies with the content of the
> RareWordDictionary.
>
> Here's my setup. I think I've included enough information to replicate my
> perplexity, if you have time/inclination to do that; let me know if I've
> left anything out.
>
> I have a custom dictionary built from UMLS sources including SNOMEDCT_US:
>
>> sql> select cui,text from cui_terms where text='chronic kidney disease' or
>> cui in (2316786,2316787);
>> CUI     TEXT
>> ------- --------------------------------
>> 1561643 chronic kidney disease
>> 2316787 stage 3 chronic kidney disease
>> 2316787 chronic kidney disease stage 3
>> 2316787 chronic kidney disease , stage 3
>> 2316787 ckd stage 3
>> 2316786 chronic kidney disease stage 2
>> 2316786 chronic kidney disease , stage 2
>> 2316786 stage 2 chronic kidney disease
>> 2316786 ckd stage 2
>> Fetched 9 rows.
>> sql>
>
> My documents contain acronym expansions and Roman numerals for stages,
> like this:
>
>> Problem List:
>> CKD (chronic kidney disease), stage II
>> Decubitus ulcer - grade II
>
> So I create a BSV RareWordDictionary to capture the Roman numerals.
> I don't want to have to guess at all the possible punctuation variations,
> so I try to make my entries as general as safely possible,
> using the UmlsOverlapLookupAnnotator with consecutiveSkips set to 2.
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
>
> I add dictionary and dictionaryConceptPair entries for my BSV file to
> cTakesHsql.xml as shown in the example/ directory, using
> SemanticCleanupTermConsumer as rareWordConsumer.
>
> Success! Now "chronic kidney disease), stage II" gets annotated as a
> DiseaseDisorderMention with CUI C2316786.
>
> But a couple of things confuse me.
>
> *1. Removing an entry*
>
> If I remove the other BSV entry, "chronic kidney disease III",
> "chronic kidney disease), stage II" isn't identified anymore:
> suddenly it only annotates "chronic kidney disease", with C1561643.
>
> *2. Adding an entry*
>
> My documents also have staging language for ulcers, e.g. "Decubitus ulcer
> - grade II".
>
> If I add an entry for this to my BSV dictionary, so now I have:
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
> C1720518|decubitus ulcer II
>
> and annotate this text:
>
>> Problem List:
>> CKD (chronic kidney disease), stage II
>> Decubitus ulcer - grade II
>
> then "Decubitus ulcer - grade II" gets annotated as a
> DiseaseDisorderMention with C1720518, as hoped.
> But only "chronic kidney disease" is identified, as before... "stage II"
> gets left out.
>
> *3. Adding a comma*
>
> If I add an entry with a comma in it:
>
> C2316786|chronic kidney disease , II
>
> then "chronic kidney disease), stage II" gets picked up, no matter what.
>
> Without the comma entry, it's skipping three consecutive tokens... but
> sometimes it seems willing to do that, and sometimes it doesn't.
>
> Is this expected behavior?
> If so, can you help me understand what to expect?
> At this point I hesitate to add anything to the BSV dictionary!
>
> Many thanks,
> Kean
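
For anyone trying to reproduce the token counts above, here is a minimal Python sketch of the skip arithmetic. It is not cTAKES code: the tokenizer (each punctuation mark as its own token) and the greedy left-to-right alignment are assumptions, and `tokenize` and `max_consecutive_skips` are hypothetical names used only for illustration. It simply counts how many text tokens sit between consecutive matched dictionary tokens.

```python
import re


def tokenize(text):
    # Assumption: words and individual punctuation marks are separate tokens,
    # compared case-insensitively. The real cTAKES tokenizer may differ.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def max_consecutive_skips(term, text):
    """Align the term's tokens to the text's tokens greedily, left to right.

    Returns the largest number of text tokens skipped between two matched
    term tokens, or None if the term's tokens do not all occur in order.
    """
    text_tokens = tokenize(text)
    search_from = 0
    previous_index = None
    largest_gap = 0
    for token in tokenize(term):
        try:
            index = text_tokens.index(token, search_from)
        except ValueError:
            return None  # token not found after the previous match
        if previous_index is not None:
            largest_gap = max(largest_gap, index - previous_index - 1)
        previous_index = index
        search_from = index + 1
    return largest_gap


ckd_text = "chronic kidney disease), stage II"
ulcer_text = "Decubitus ulcer - grade II"

# Skips over ")", ",", "stage" between "disease" and "II"
print(max_consecutive_skips("chronic kidney disease II", ckd_text))    # 3
# The comma entry splits that gap into two gaps of one token each
print(max_consecutive_skips("chronic kidney disease , II", ckd_text))  # 1
# Skips over "-", "grade" between "ulcer" and "II"
print(max_consecutive_skips("decubitus ulcer II", ulcer_text))         # 2
```

Under these assumptions the ulcer entry and the comma entry both stay within consecutiveSkips = 2, while the plain "chronic kidney disease II" entry needs 3. The sketch does not explain why that entry matches in some dictionary configurations and not others.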
