RE: lvg entries

Finan, Sean Thu, 17 Apr 2014 10:54:13 -0700

Those variants are not used by the dictionary lookup.  I did look at them to 
see if it was worthwhile for the new dictionary, but they are all over the 
place so I passed.  
________________________________________
From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries


Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely the speed tradeoff would be worth any
improvements in recall.

Finally, at least in Eclipse, a search for references to the method
that retrieves the lemma entries turns up nothing.

Tim


On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don’t know of any applications within cTAKES that make use of this… The 
> reverse (mapping from these “variants” to the normal form) may be useful 
> though.
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:50, Miller, Timothy 
> <timothy.mil...@childrens.harvard.edu> wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few 
>>> examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>>
>>>
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy 
>>> <timothy.mil...@childrens.harvard.edu> wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim
>>>>
>
