Having issues with LVG overgenerating false positives.

I've tried leaving LVG out altogether, but the dictionary has some stemming
stumbling blocks of its own, e.g.

sql> select c.cui, tui, text, prefterm
       from cui_terms c
       join tui t on t.cui = c.cui
       join prefterm p on p.cui = c.cui and c.cui = 11849;

  CUI  TUI  TEXT              PREFTERM
-----  ---  ----------------  -----------------
11849   47  dm                Diabetes Mellitus
11849   47  diabetes          Diabetes Mellitus
11849   47  diabete mellitus  Diabetes Mellitus


I've also added some particularly problematic words to the LvgAnnotator's
ExclusionSet, e.g.

    <!-- oth = C0449210, "OTH tumor staging notation" -->
    <string>other</string><string>Other</string><string>OTHER</string>
    <!-- moth = C1445661, "Moth antigen" -->
    <string>mother</string><string>Mother</string><string>MOTHER</string>
    <!-- plan = C0270724, "Infantile Neuroaxonal Dystrophy" -->
    <string>planning</string><string>Planning</string><string>PLANNING</string>
    <!-- not Attention Deficit Disorder -->
    <string>adding</string><string>Adding</string><string>ADDING</string>
    <!-- pas = C0030125, "p-Aminosalicylic acid" -->
    <string>pass</string><string>Pass</string><string>PASS</string>
    <string>passing</string><string>Passing</string><string>PASSING</string>
    <!-- bre = C2363129, "Benign Rolandic Epilepsy" ?! -->
    <string>bring</string><string>Bring</string><string>BRING</string>

But what I'd really like is more control over LVG's behavior: for instance,
blocking the "-er" suffixing rule completely, not letting the "-ing" rule
apply to a stem without vowels, and not letting the plural rule add "-s" to
stems ending in "s".  I've fiddled with the LVG rules under ctakes-lvg-res,
e.g. data/rules/dm.rul, but to no apparent effect.
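
To make those three constraints concrete, here is a rough sketch of the filter
I have in mind, written in Python purely for illustration. It is not LVG's rule
format or any cTAKES API: the function name allow_stem is hypothetical, treating
"y" as a vowel is my own assumption, and it only looks at a (surface form,
candidate stem) pair after the fact.

```python
VOWELS = set("aeiouy")  # assumption: "y" counts as a vowel here


def allow_stem(surface: str, stem: str) -> bool:
    """Return False if the surface -> stem mapping uses a rule I want blocked.

    Hypothetical post-filter, not part of LVG or cTAKES.
    """
    s, t = surface.lower(), stem.lower()

    # 1. Block the "-er" suffixing rule completely (other -> oth).
    if s == t + "er":
        return False

    # 2. Block "-ing" when what precedes it has no vowels (bring -> bre).
    if s.endswith("ing"):
        base = s[:-3]
        if base and t.startswith(base) and not (VOWELS & set(base)):
            return False

    # 3. Block plural "-s" added to a stem already ending in "s" (pass -> pas).
    if s == t + "s" and t.endswith("s"):
        return False

    return True
```

With this sketch, other/oth, mother/moth, bring/bre and pass/pas are all
rejected, while diabetes/diabete is still allowed. Note that planning/plan,
adding/add and passing/pas also get through, so those would still need the
ExclusionSet entries above.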

Has anybody done this? I've glanced at the cTAKES wiki, but the install
instructions don't seem to address this level of customization; and I've
skimmed the NLM documentation, but it doesn't seem to be intended for
developers.  Can anyone point me to more detailed docs?

And: Has anyone tried plugging in another stemmer? To play nicely with the
ctakes-dictionary-lookup-fast annotators, it seems all a replacement would
have to do is populate canonicalForm.

Happy Friday, and thanks for any help you can provide!

Kean Kaufmann
NLP Developer
RecordsOne, Inc.
