Sean --

Thanks as always!

> It shouldn't be a problem, but is there an eol character after the " II"
> in your bsv? It shouldn't be necessary, but who knows.

Double checked: yeah, no. Yes, there's an eol; no, that doesn't seem to be it.

> Can you create a Jira item with this information?

https://issues.apache.org/jira/browse/CTAKES-498

Let me know if there's any other info you need.

On Wed, Mar 7, 2018 at 1:24 PM, Finan, Sean <[email protected]> wrote:

> Hi Kean,
>
> It does sound like you are getting some odd results. I will need to look
> into the code, but I won't have time to do so for a few days. My initial
> thoughts are below.
>
> >If I add an entry with a comma in it:
> >then "chronic kidney disease), stage II" gets picked up, no matter what.
> Well ... Commas do get special treatment so that items in a list do not
> spawn false overlapping terms. In other words, "A, B, C" is more likely to
> be 3 terms "A" "B" "C" than it is to be 2 terms "A B" "C". So, a
> dictionary entry that explicitly contains a comma provides a hint to the
> system that "A,B" is actually one term.
>
> >If I remove the other BSV entry, "chronic kidney disease III",
> It shouldn't be a problem, but is there an eol character after the " II"
> in your bsv? It shouldn't be necessary, but who knows.
>
> > then "Decubitus ulcer - grade II" gets annotated as a
> > DiseaseDisorderMention with C1720518, as hoped.
> > But only "chronic kidney disease" is identified, as before... "stage II"
> > gets left out.
> That is very strange. I have no idea why adding an entry would change the
> behavior. I will have to look at the code and run your examples. By the
> way, thank you for the explicit examples!
>
> >Is this expected behavior?
> No, and thanks for letting me know about it. Can you create a Jira item
> with this information?
>
> Thanks,
> Sean
>
> _______________________________________
> From: Kean Kaufmann <[email protected]>
> Sent: Wednesday, March 7, 2018 10:58 AM
> To: [email protected]
> Subject: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens
> skipped varies? [EXTERNAL]
>
> Hi Sean,
>
> I'm perplexed.
> It seems as if the number of tokens that the
> UmlsOverlapLookupAnnotator will skip varies with the content of the
> RareWordDictionary.
>
> Here's my setup. I think I've included enough information to replicate my
> perplexity, if you have time/inclination to do that; let me know if I've
> left anything out.
>
> I have a custom dictionary built from UMLS sources including SNOMEDCT_US:
>
> > sql> select cui,text from cui_terms where text='chronic kidney disease' or
> > cui in (2316786,2316787);
> > CUI     TEXT
> > ------- --------------------------------
> > 1561643 chronic kidney disease
> > 2316787 stage 3 chronic kidney disease
> > 2316787 chronic kidney disease stage 3
> > 2316787 chronic kidney disease , stage 3
> > 2316787 ckd stage 3
> > 2316786 chronic kidney disease stage 2
> > 2316786 chronic kidney disease , stage 2
> > 2316786 stage 2 chronic kidney disease
> > 2316786 ckd stage 2
> > Fetched 9 rows.
> > sql>
>
> My documents contain acronym expansions and Roman numerals for stages, like
> this:
>
> > Problem List:
> > CKD (chronic kidney disease), stage II
> > Decubitus ulcer - grade II
>
> So I create a BSV RareWordDictionary to capture the Roman numerals.
> I don't want to have to guess at all the possible punctuation variations,
> so I try to make my entries as general as safely possible,
> using the UmlsOverlapLookupAnnotator with consecutiveSkips set to 2.
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
>
> I add dictionary and dictionaryConceptPair entries for my BSV file to
> cTakesHsql.xml as shown in the example/ directory, using
> SemanticCleanupTermConsumer as rareWordConsumer.
>
> Success! Now "chronic kidney disease), stage II" gets annotated as a
> DiseaseDisorderMention with CUI C2316786.
>
> But a couple of things confuse me.
>
> *1. Removing an entry*
>
> If I remove the other BSV entry, "chronic kidney disease III",
> "chronic kidney disease), stage II" isn't identified anymore:
> suddenly it only annotates "chronic kidney disease", with C1561643.
>
> *2. Adding an entry*
>
> My documents also have staging language for ulcers, e.g. "Decubitus ulcer -
> grade II".
>
> If I add an entry for this to my BSV dictionary, so now I have:
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
> C1720518|decubitus ulcer II
>
> and annotate this text:
>
> > Problem List:
> > CKD (chronic kidney disease), stage II
> > Decubitus ulcer - grade II
>
> then "Decubitus ulcer - grade II" gets annotated as a
> DiseaseDisorderMention with C1720518, as hoped.
> But only "chronic kidney disease" is identified, as before... "stage II"
> gets left out.
>
> *3. Adding a comma*
>
> If I add an entry with a comma in it:
>
> C2316786|chronic kidney disease , II
>
> then "chronic kidney disease), stage II" gets picked up, no matter what.
>
> Without the comma entry, it's skipping three consecutive tokens... but
> sometimes it seems willing to do that, and sometimes it doesn't.
>
> Is this expected behavior?
> If so, can you help me understand what to expect?
> At this point I hesitate to add anything to the BSV dictionary!
>
> Many thanks,
> Kean
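
For reference, the token arithmetic behind "skipping three consecutive tokens" can be sketched in a few lines of Python. This is only a toy model and not the cTAKES implementation: the `tokenize` and `matches` helpers are hypothetical, the tokenizer is assumed to treat every punctuation mark as its own token, and the real annotator's lookup window and comma handling may behave differently.

```python
import re


def tokenize(text):
    # Assumption: words are tokens and each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def matches(term, text, max_skips):
    """True if the term's tokens occur in order in the text with at most
    max_skips unmatched tokens between any two consecutive matches."""
    term_toks = tokenize(term)
    text_toks = tokenize(text)

    def extend(term_idx, last_pos):
        # All term tokens matched: success.
        if term_idx == len(term_toks):
            return True
        # The next match may sit 0..max_skips tokens past the previous one.
        stop = min(last_pos + 2 + max_skips, len(text_toks))
        for pos in range(last_pos + 1, stop):
            if text_toks[pos] == term_toks[term_idx] and extend(term_idx + 1, pos):
                return True
        return False

    return any(
        tok == term_toks[0] and extend(1, i) for i, tok in enumerate(text_toks)
    )


line = "CKD (chronic kidney disease), stage II"
# Between "disease" and "II" lie ")", "," and "stage": three tokens.
print(matches("chronic kidney disease II", line, 2))    # False
print(matches("chronic kidney disease II", line, 3))    # True
# With the comma in the entry, no gap is longer than one token.
print(matches("chronic kidney disease , II", line, 2))  # True
print(matches("decubitus ulcer II", "Decubitus ulcer - grade II", 2))  # True
```

Under these assumptions, the comma-free entry needs a budget of three skips while the comma entry and the ulcer entry fit within two, which is consistent with the comma entry matching "no matter what"; it does not explain why the comma-free entry sometimes matches.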
