Those variants are not used by the dictionary lookup. I did look at them to see if it was worthwhile for the new dictionary, but they are all over the place so I passed.
________________________________________
From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical variants to a stem might be useful. Pei also mentioned that one intended use might have been searching the dictionary with lexical variants, but I don't think that is done. Looking at the precision of the variants, I think it's highly unlikely that any improvement in recall would be worth the speed tradeoff. Finally, at least in Eclipse, searching for references to the method that retrieves the lemma entries turns up nothing.

Tim

On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don’t know of any applications within cTAKES that make use of this… The
> reverse (mapping from these “variants” to the normal form) may be useful
> though.
>
> Dima
>
> On Apr 17, 2014, at 11:50, Miller, Timothy
> <timothy.mil...@childrens.harvard.edu> wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few
>>> examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy
>>> <timothy.mil...@childrens.harvard.edu> wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim