Hmm... I don't see normalizedForm filled in. I do see LVG filling in
canonicalForm; is it possible that's what you're thinking of?  (Not that
I know what the difference is or is supposed to be, I'm just going off
what I see in my XMI files.)
Tim


On 04/17/2014 06:23 PM, Masanz, James J. wrote:
> The normalizedForm field is filled in. It is used by dictionary lookup.
>
> So, for example, if the dictionary contains "lymph node" but not "lymph 
> nodes", a document with the text "lymph nodes" would match the dictionary 
> entry "lymph node", because "node", being the normalized form of "nodes", 
> would be used when searching dictionary entries (in addition to searching 
> dictionary entries for "nodes").
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
> Sent: Thursday, April 17, 2014 4:33 PM
> To: dev@ctakes.apache.org
> Subject: Re: lvg entries
>
> Quick follow-up since I was interested. The current dependency parser
> does have the option to use cTAKES lemmas or do its own lemmatizing, but
> that option doesn't use the lemma field; it uses the normalizedForm
> field. I'm not sure if that field is actually ever filled in -- on my
> example data it is always null.
>
> Tim
>
> On 04/17/2014 01:57 PM, Masanz, James J. wrote:
>> Offhand, I recall that at least one of the dependency parsers used the 
>> Lemma annotations at one point.
>> Not sure if it still does.
>>
>> There is an option for turning off the posting of the lemmas to the CAS.
>>
>> Hope that helps
>>
>> -----Original Message-----
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
>> Sent: Thursday, April 17, 2014 11:27 AM
>> To: dev@ctakes.apache.org
>> Subject: lvg entries
>>
>> The LVG annotator creates an enormous number of "lemmas" for every
>> WordToken in the CAS, and I'm wondering what the original purpose was.
>> I think this is probably a minor bottleneck for speed but mostly a
>> pretty big space hog (at least 50% of the size of the XMI files in my
>> tests).
>>
>> As of right now I'm not sure if any downstream components are using
>> these lemmas, and on a manual inspection the precision seems to be
>> pretty abysmal (meaning most of them are nonsensical as lexical
>> variants), so as I said, just wondering if we can revisit why cTAKES
>> generates so many and whether that component can be optimized.
>>
>> Thanks
>> Tim
>>
>>
>
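
For anyone reading the archive, here is a minimal sketch of the lookup
behaviour described in this thread for normalizedForm. It is not the cTAKES
dictionary lookup code: the (text, normalized form) token pairs and the
functions candidate_phrases and lookup are hypothetical names made up for
illustration, and the dictionary is just a Python set.

```python
# Hypothetical sketch only; this is NOT the cTAKES dictionary lookup code.
# A token is modelled as a (surface_text, normalized_form) pair, and
# normalized_form may be None when the field was never filled in.

def candidate_phrases(tokens):
    """Yield the surface phrase, then the normalized phrase if it differs."""
    surface = " ".join(text for text, _ in tokens)
    # Fall back to the surface text for tokens with no normalized form.
    normalized = " ".join(norm or text for text, norm in tokens)
    yield surface
    if normalized != surface:
        yield normalized


def lookup(tokens, dictionary):
    """Return the first dictionary entry matched by a candidate phrase, else None."""
    for phrase in candidate_phrases(tokens):
        if phrase in dictionary:
            return phrase
    return None


dictionary = {"lymph node"}                       # singular entry only
tokens = [("lymph", "lymph"), ("nodes", "node")]  # document text: "lymph nodes"
match = lookup(tokens, dictionary)                # found via the normalized phrase
```

In this sketch, a token whose normalized form is None contributes only its
surface text, so "lymph nodes" would not match "lymph node" when the field is
empty.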
