Re: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens skipped varies? [EXTERNAL]

Kean Kaufmann Wed, 07 Mar 2018 10:54:06 -0800

Sean -- Thanks as always!

> It shouldn't be a problem, but is there an eol character after the " II"
> in your bsv?  It shouldn't be necessary, but who knows.


Double checked:  yeah, no.  Yes, there's an eol; no, that doesn't seem to
be it.

> Can you create a Jira item with this information?

https://issues.apache.org/jira/browse/CTAKES-498

Let me know if there's any other info you need.

On Wed, Mar 7, 2018 at 1:24 PM, Finan, Sean <
[email protected]> wrote:

> Hi Kean,
>
> It does sound like you are getting some odd results.  I will need to look
> into the code, but I won't have time to do so for a few days.  My initial
> thoughts are below.
>
> >If I add an entry with a comma in it:
> >then "chronic kidney disease), stage II" gets picked up, no matter what.
> Well ... Commas do get special treatment so that items in a list do not
> spawn false overlapping terms.  In other words, "A, B, C" is more likely to
> be 3 terms "A" "B" "C" than it is to be 2 terms "A B" "C".  So, a
> dictionary entry that explicitly contains a comma provides a hint to the
> system that "A,B" is actually one term.
>
> >If I remove the other BSV entry, "chronic kidney disease III",
> It shouldn't be a problem, but is there an eol character after the " II"
> in your bsv?  It shouldn't be necessary, but who knows.
>
> > then "Decubitus ulcer - grade II" gets annotated as a
> > DiseaseDisorderMention with C1720518, as hoped.
> > But only "chronic kidney disease" is identified, as before... "stage II"
> > gets left out.
> That is very strange.  I have no idea why adding an entry would change the
> behavior.  I will have to look at the code and run your examples.  By the
> way, thank you for the explicit examples!
>
> >Is this expected behavior?
> No, and thanks for letting me know about it.  Can you create a Jira item
> with this information?
>
> Thanks,
> Sean
>
> _______________________________________
> From: Kean Kaufmann <[email protected]>
> Sent: Wednesday, March 7, 2018 10:58 AM
> To: [email protected]
> Subject: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens
> skipped varies? [EXTERNAL]
>
> Hi Sean,
>
> I'm perplexed. It seems as if the number of tokens that the
> UmlsOverlapLookupAnnotator will skip varies with the content of the
> RareWordDictionary.
>
> Here's my setup.  I think I've included enough information to replicate my
> perplexity, if you have time/inclination to do that; let me know if I've
> left anything out.
>
> I have a custom dictionary built from UMLS sources including SNOMEDCT_US:
>
> sql> select cui,text from cui_terms where text='chronic kidney disease' or
> > cui in (2316786,2316787);
> >     CUI  TEXT
> > -------  --------------------------------
> > 1561643  chronic kidney disease
> > 2316787  stage 3 chronic kidney disease
> > 2316787  chronic kidney disease stage 3
> > 2316787  chronic kidney disease , stage 3
> > 2316787  ckd stage 3
> > 2316786  chronic kidney disease stage 2
> > 2316786  chronic kidney disease , stage 2
> > 2316786  stage 2 chronic kidney disease
> > 2316786  ckd stage 2
> > Fetched 9 rows.
> > sql>
>
>
> My documents contain acronym expansions and Roman numerals for stages, like
> this:
>
> Problem List:
> > CKD (chronic kidney disease), stage II
> > Decubitus ulcer - grade II
>
>
> So I create a BSV RareWordDictionary to capture the Roman numerals.
> I don't want to have to guess at all the possible punctuation variations,
> so I try to make my entries as general as safely possible,
> using the UmlsOverlapLookupAnnotator with consecutiveSkips set to 2.
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
>
> I add dictionary and dictionaryConceptPair entries for my BSV file to
> cTakesHsql.xml as shown in the example/ directory, using
> SemanticCleanupTermConsumer as rareWordConsumer.
>
> Success! Now "chronic kidney disease), stage II" gets annotated as a
> DiseaseDisorderMention with CUI C2316786.
>
> But a couple of things confuse me.
>
> *1. Removing an entry*
>
> If I remove the other BSV entry, "chronic kidney disease III",
> "chronic kidney disease), stage II" isn't identified anymore:
> suddenly it only annotates "chronic kidney disease", with C1561643.
>
> *2. Adding an entry*
>
> My documents also have staging language for ulcers, e.g. "Decubitus ulcer -
> grade II".
>
> If I add an entry for this to my BSV dictionary, so now I have:
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
> C1720518|decubitus ulcer II
>
> and annotate this text:
>
> Problem List:
> > CKD (chronic kidney disease), stage II
> > Decubitus ulcer - grade II
>
>
> then "Decubitus ulcer - grade II" gets annotated as a
> DiseaseDisorderMention with C1720518, as hoped.
> But only "chronic kidney disease" is identified, as before... "stage II"
> gets left out.
>
> *3. Adding a comma*
>
> If I add an entry with a comma in it:
>
> C2316786|chronic kidney disease , II
>
> then "chronic kidney disease), stage II" gets picked up, no matter what.
>
> Without the comma entry, it's skipping three consecutive tokens... but
> sometimes it seems willing to do that, and sometimes it doesn't.
>
> Is this expected behavior?
> If so, can you help me understand what to expect?
> At this point I hesitate to add anything to the BSV dictionary!
>
> Many thanks,
> Kean
>
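
For anyone following the thread, here is a minimal toy model of skip-limited
matching, to show why the "three consecutive tokens" observation matters. It is
NOT the cTAKES implementation: the function name `matches_with_skips`, the
pre-split token lists, and the exact skip rule are all assumptions made for
illustration. Under this strict reading of consecutiveSkips = 2, the plain
entry should never match the CKD line, the comma entry should, and the ulcer
entry should, which is why the behavior reported above looks inconsistent.

```python
def matches_with_skips(entry, text, max_consecutive_skips=2):
    """Toy model: True if the entry tokens occur in order in the text tokens,
    with at most max_consecutive_skips unmatched text tokens between any two
    consecutive entry tokens. Hypothetical helper, not cTAKES code."""
    entry = [t.lower() for t in entry]
    text = [t.lower() for t in text]
    if not entry:
        return False

    def match_from(entry_idx, text_idx):
        # All entry tokens consumed: the whole entry matched.
        if entry_idx == len(entry):
            return True
        # The next entry token may sit at text_idx or up to
        # max_consecutive_skips tokens further along.
        last = min(len(text), text_idx + max_consecutive_skips + 1)
        for i in range(text_idx, last):
            if text[i] == entry[entry_idx] and match_from(entry_idx + 1, i + 1):
                return True
        return False

    # Try every position where the first entry token appears.
    return any(
        text[start] == entry[0] and match_from(1, start + 1)
        for start in range(len(text))
    )


# Token lists are split by hand here; real tokenization is assumed, not modeled.
ckd_line = ["CKD", "(", "chronic", "kidney", "disease", ")", ",", "stage", "II"]
ulcer_line = ["Decubitus", "ulcer", "-", "grade", "II"]

plain_entry = ["chronic", "kidney", "disease", "II"]
comma_entry = ["chronic", "kidney", "disease", ",", "II"]
ulcer_entry = ["decubitus", "ulcer", "II"]

print(matches_with_skips(plain_entry, ckd_line))    # False: ")", ",", "stage" is 3 skips
print(matches_with_skips(comma_entry, ckd_line))    # True: 1 skip, then 1 skip
print(matches_with_skips(ulcer_entry, ulcer_line))  # True: "-", "grade" is 2 skips
```

In this model the comma entry works because the comma splits one run of three
skipped tokens into two runs of one each. Whether cTAKES counts skips this way
is exactly the open question in CTAKES-498.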
