Hmm... I don't see normalizedForm filled in. I see LVG filling in canonicalForm, is it possible that's what you're thinking of? (Not that I know what the difference is or is supposed to be, just going off what I see in my xmis.)

Tim

On 04/17/2014 06:23 PM, Masanz, James J. wrote:
> The normalizedForm field is filled in. It is used by dictionary lookup.
>
> So, for example, if the dictionary would contain "lymph node" but not "lymph
> nodes", a document with text of "lymph nodes" would match the dictionary
> entry "lymph node" because "node", being the normalized form of "nodes",
> would be used when searching dictionary entries (in addition to searching
> dictionary entries for "nodes")
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Thursday, April 17, 2014 4:33 PM
> To: dev@ctakes.apache.org
> Subject: Re: lvg entries
>
> Quick follow-up since I was interested. The current dependency parser
> does have the option to use ctakes lemmas or do its own lemmatizing, but
> that doesn't use the lemma field, it uses the normalizedForm field. I'm
> not sure if that field is actually ever filled in -- on my example data
> it is always null.
>
> Tim
>
> On 04/17/2014 01:57 PM, Masanz, James J. wrote:
>> Offhand I recall at least one of the dependency parsers used the Lemma
>> annotations at one point.
>> Not sure if still does.
>>
>> There is an option for turning off the posting of the lemmas to the cas.
>>
>> Hope that helps
>>
>> -----Original Message-----
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>> Sent: Thursday, April 17, 2014 11:27 AM
>> To: dev@ctakes.apache.org
>> Subject: lvg entries
>>
>> The LVG annotator creates an enormous number of "lemmas" for every
>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>> think this is probably a minor bottleneck for speed but mostly a pretty
>> big space hog (at least 50% of the space of xmi files in my tests).
>>
>> As of right now I'm not sure if any downstream components are using
>> these lemmas, and on a manual inspection the precision seems to be
>> pretty abysmal (meaning most of them are nonsensical as lexical
>> variants), so as I said, just wondering if we can revisit why cTAKES
>> generates so many and whether that component can be optimized.
>>
>> Thanks
>> Tim
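
For anyone reading the thread later: the dictionary-lookup behaviour James describes (text "lymph nodes" matching the entry "lymph node" because each token is searched under both its surface form and its normalized form) can be sketched roughly as below. This is only a toy illustration in Python, not cTAKES code. The function names and the one-entry normalization table are hypothetical stand-ins for whatever LVG actually produces, and the real lookup may differ in how it windows and indexes tokens.

```python
# Toy sketch of "search the dictionary with the surface form AND the
# normalized form of each token". Nothing here is the cTAKES API.

# Hypothetical stand-in for LVG output: surface form -> normalized form.
TOY_NORMALIZED_FORMS = {"nodes": "node"}


def normalized_form(token):
    """Return the normalized form of a token (assumed: falls back to the token itself)."""
    token = token.lower()
    return TOY_NORMALIZED_FORMS.get(token, token)


def token_variants(token):
    """Both forms are tried during lookup: the surface form and the normalized form."""
    surface = token.lower()
    return {surface, normalized_form(surface)}


def matches_entry(tokens, entry):
    """True if every token of the dictionary entry equals some variant of the text token."""
    entry_tokens = entry.lower().split()
    if len(tokens) != len(entry_tokens):
        return False
    return all(e in token_variants(t) for t, e in zip(tokens, entry_tokens))


def lookup(text, dictionary):
    """Return the dictionary entries matched by the (whitespace-tokenized) text."""
    tokens = text.split()
    return [entry for entry in dictionary if matches_entry(tokens, entry)]


if __name__ == "__main__":
    # The dictionary contains "lymph node" but not "lymph nodes".
    print(lookup("lymph nodes", ["lymph node"]))
```

Note that the sketch only normalizes the document side, so an entry "lymph nodes" would not be found by the text "lymph node"; whether cTAKES behaves the same way is not something this thread establishes.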