Re: lvg entries

2014-04-18 Thread Miller, Timothy
Hmm... I don't see normalizedForm filled in. I see LVG filling in
canonicalForm, is it possible that's what you're thinking of?  (Not that
I know what the difference is or is supposed to be, just going off what
I see in my xmis.)
Tim


On 04/17/2014 06:23 PM, Masanz, James J. wrote:
 The normalizedForm field is filled in. It is used by dictionary lookup.

 So, for example, if the dictionary would contain "lymph node" but not "lymph 
 nodes", a document with text of "lymph nodes" would match the dictionary 
 entry "lymph node" because "node", being the normalized form of "nodes", 
 would be used when searching dictionary entries (in addition to searching 
 dictionary entries for "nodes").
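
The matching described above can be sketched roughly as follows. This is only an illustration, not the cTAKES dictionary lookup code: the class NormalizedLookupSketch, the one-entry normalization table standing in for LVG, and the one-entry dictionary are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: look a phrase up by its surface form and by its normalized form.
class NormalizedLookupSketch {
    // Toy stand-in for LVG output: inflected token -> normalized token (assumption).
    static final Map<String, String> NORMALIZED = Map.of("nodes", "node");

    // Toy dictionary that contains the singular entry only (assumption).
    static final Set<String> DICTIONARY = Set.of("lymph node");

    // Returns the normalized form of a token, or the token itself if none is known.
    static String normalize(String token) {
        return NORMALIZED.getOrDefault(token, token);
    }

    // True if either the text as written or its normalized form is a dictionary entry.
    static boolean matches(List<String> tokens) {
        String surface = String.join(" ", tokens);
        List<String> normalizedTokens = new ArrayList<>();
        for (String token : tokens) {
            normalizedTokens.add(normalize(token));
        }
        String normalized = String.join(" ", normalizedTokens);
        return DICTIONARY.contains(surface) || DICTIONARY.contains(normalized);
    }

    public static void main(String[] args) {
        // "lymph nodes" is not an entry, but its normalized form "lymph node" is.
        System.out.println(matches(List.of("lymph", "nodes"))); // prints "true"
    }
}
```

Searching with both forms keeps exact matches working even when the normalizer produces something unexpected for a token.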

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
 Sent: Thursday, April 17, 2014 4:33 PM
 To: dev@ctakes.apache.org
 Subject: Re: lvg entries

 Quick follow-up since I was interested. The current dependency parser
 does have the option to use ctakes lemmas or do its own lemmatizing, but
 that doesn't use the lemma field, it uses the normalizedForm field. I'm
 not sure if that field is actually ever filled in -- on my example data
 it is always null.

 Tim

 On 04/17/2014 01:57 PM, Masanz, James J. wrote:
 Offhand I recall at least one of the dependency parsers used the Lemma 
 annotations at one point. Not sure if it still does.

 There is an option for turning off the posting of the lemmas to the cas.

 Hope that helps

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
 Sent: Thursday, April 17, 2014 11:27 AM
 To: dev@ctakes.apache.org
 Subject: lvg entries

 The LVG annotator creates an enormous number of lemmas for every
 WordToken in the CAS, and I'm wondering what the original purpose was? I
 think this is probably a minor bottleneck for speed but mostly a pretty
 big space hog (at least 50% of the space of xmi files in my tests).

 As of right now I'm not sure if any downstream components are using
 these lemmas, and on a manual inspection the precision seems to be
 pretty abysmal (meaning most of them are nonsensical as lexical
 variants), so as I said, just wondering if we can revisit why cTAKES
 generates so many and whether that component can be optimized.

 Thanks
 Tim






RE: lvg entries

2014-04-18 Thread Masanz, James J.

You are right, I was thinking of the field called canonicalForm.

normalizedForm is set by ExtractionPrepAnnotator.java, but if I remember right, 
that's at the end of the pipelines that it's included in. And it's set to 
either the canonicalForm (if there is one) or the coveredText.
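
As a rough sketch of that fallback (not the actual ExtractionPrepAnnotator code): the class and method names below are hypothetical, and treating an empty canonicalForm the same as a missing one is an assumption.

```java
// Hypothetical sketch of the fallback: prefer canonicalForm, else use the covered text.
class NormalizedFormSketch {
    // Returns canonicalForm when present, otherwise coveredText.
    // Assumption: a null or empty canonicalForm counts as "no canonical form".
    static String normalizedForm(String canonicalForm, String coveredText) {
        if (canonicalForm != null && !canonicalForm.isEmpty()) {
            return canonicalForm;
        }
        return coveredText;
    }

    public static void main(String[] args) {
        System.out.println(normalizedForm("node", "nodes")); // prints "node"
        System.out.println(normalizedForm(null, "nodes"));   // prints "nodes"
    }
}
```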

Not sure what the intent there was.

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 11:16 AM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Hmm... I don't see normalizedForm filled in. I see LVG filling in
canonicalForm, is it possible that's what you're thinking of?  (Not that
I know what the difference is or is supposed to be, just going off what
I see in my xmis.)
Tim








Re: lvg entries

2014-04-18 Thread Miller, Timothy
Thanks for tracking that down Andy.

I am making a pass at UimaFit-izing the configuration parameters for all
the annotators in the default pipeline, before I create the static
factory methods like we recently discussed. Should I go ahead and change
this to make default behavior be false?

Tim


On 04/18/2014 12:47 AM, andy mcmurry wrote:
 There is a lot of config handling; maybe PostLemmas is being set to true, or
 configInit() is not setting up the NLM wrapper correctly.

 ctakes-lvg *README*
 Note: as distributed, PostLemmas is set to false.  This is done to reduce
 the size of the CAS.
 Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma
 annotations added to the CAS.

 *LvgAnnotator.xml*
 PostLemmas = True

 *LvgAnnotator.java*
 if (postLemmas) {
  lvgResource.getLvgLex()
 }
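
A minimal sketch of what a default-off PostLemmas switch amounts to, written in plain Java rather than against the UIMA API. The class PostLemmasSketch and its method are hypothetical and are not taken from LvgAnnotator.java; the only thing borrowed from the README is that the flag defaults to false.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: lemma variants are only posted when the flag is enabled.
class PostLemmasSketch {
    private final boolean postLemmas;

    // Default behavior: do not post lemmas, which keeps the CAS small.
    PostLemmasSketch() {
        this(false);
    }

    PostLemmasSketch(boolean postLemmas) {
        this.postLemmas = postLemmas;
    }

    // Returns the variants that would be added to the CAS for one token.
    List<String> lemmasToPost(List<String> variants) {
        if (!postLemmas) {
            return List.of(); // flag off: nothing is posted
        }
        return new ArrayList<>(variants); // flag on: every variant is posted
    }

    public static void main(String[] args) {
        List<String> variants = List.of("node", "nodes", "nodal");
        System.out.println(new PostLemmasSketch().lemmasToPost(variants).size());     // prints "0"
        System.out.println(new PostLemmasSketch(true).lemmasToPost(variants).size()); // prints "3"
    }
}
```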













RE: lvg entries

2014-04-18 Thread Finan, Sean
+1 false

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks for tracking that down Andy.

I am making a pass at UimaFit-izing the configuration parameters for all the 
annotators in the default pipeline, before I create the static factory methods 
like we recently discussed. Should I go ahead and change this to make default 
behavior be false?

Tim

