You are right, I was thinking of the field called canonicalForm.

normalizedForm is set by ExtractionPrepAnnotator.java, but if I remember right,
that happens at the end of the pipelines that include it. It's set to
either the canonicalForm (if there is one) or the coveredText.

Not sure what the intent there was.
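
For reference, the fallback described above can be sketched roughly as
follows. This is a minimal illustration written from memory, not the actual
ExtractionPrepAnnotator source: the class name, the method name, and the
choice to treat an empty canonical form as missing are all assumptions.

```java
// Hypothetical sketch; NOT the real ExtractionPrepAnnotator code.
class NormalizedFormSketch {

    /**
     * Returns the canonical form when one is available,
     * otherwise falls back to the token's covered text.
     * Treating an empty string as "no canonical form" is an assumption.
     */
    static String chooseNormalizedForm(String canonicalForm, String coveredText) {
        if (canonicalForm != null && !canonicalForm.isEmpty()) {
            return canonicalForm;
        }
        return coveredText;
    }

    public static void main(String[] args) {
        // LVG supplied a canonical form for the token "nodes".
        System.out.println(chooseNormalizedForm("node", "nodes"));  // prints "node"
        // No canonical form available, so the covered text is used.
        System.out.println(chooseNormalizedForm(null, "lymph"));    // prints "lymph"
    }
}
```

If that is right, normalizedForm would only be null in pipelines that never
run the annotator.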

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 11:16 AM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Hmm... I don't see normalizedForm filled in. I see LVG filling in
canonicalForm, is it possible that's what you're thinking of?  (Not that
I know what the difference is or is supposed to be, just going off what
I see in my xmis.)
Tim


On 04/17/2014 06:23 PM, Masanz, James J. wrote:
> The normalizedForm field is filled in. It is used by dictionary lookup.
>
> So, for example, if the dictionary contains "lymph node" but not "lymph 
> nodes", a document with the text "lymph nodes" would still match the 
> dictionary entry "lymph node", because "node", being the normalized form 
> of "nodes", is used when searching dictionary entries (in addition to 
> searching dictionary entries for "nodes").
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
> Sent: Thursday, April 17, 2014 4:33 PM
> To: dev@ctakes.apache.org
> Subject: Re: lvg entries
>
> Quick follow-up since I was interested. The current dependency parser
> does have the option to use cTAKES lemmas or do its own lemmatizing, but
> that doesn't use the lemma field, it uses the normalizedForm field. I'm
> not sure if that field is actually ever filled in -- on my example data
> it is always null.
>
> Tim
>
> On 04/17/2014 01:57 PM, Masanz, James J. wrote:
>> Offhand I recall at least one of the dependency parsers used the Lemma 
>> annotations at one point.
>> Not sure if it still does.
>>
>> There is an option for turning off the posting of the lemmas to the CAS.
>>
>> Hope that helps
>>
>> -----Original Message-----
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
>> Sent: Thursday, April 17, 2014 11:27 AM
>> To: dev@ctakes.apache.org
>> Subject: lvg entries
>>
>> The LVG annotator creates an enormous number of "lemmas" for every
>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>> think this is probably a minor bottleneck for speed but mostly a pretty
>> big space hog (at least 50% of the space of xmi files in my tests).
>>
>> As of right now I'm not sure if any downstream components are using
>> these lemmas, and on a manual inspection the precision seems to be
>> pretty abysmal (meaning most of them are nonsensical as lexical
>> variants), so as I said, just wondering if we can revisit why cTAKES
>> generates so many and whether that component can be optimized.
>>
>> Thanks
>> Tim
>>
>>
>
