
RE: new dictionary lookup {was RE: lvg entries]

2014-04-22 Thread Finan, Sean

Hi James,

>> Will the new dictionary lookup use the canonicalForm?

Yes, it uses WordToken.getCanonicalForm().
That value usually seems to be empty, but whenever it is present it is used.


-Original Message-
From: andy mcmurry [mailto:mcmurry.a...@gmail.com] 
Sent: Tuesday, April 22, 2014 4:23 AM
To: dev@ctakes.apache.org
Subject: Re: new dictionary lookup {was RE: lvg entries]

Highly Relevant

*DNorm: disease name normalization*
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/

"Disease names are often created by combining roots and affixes from Greek or 
Latin (e.g. ‘hemochromatosis’)" 







Re: new dictionary lookup {was RE: lvg entries]

2014-04-22 Thread andy mcmurry

Highly Relevant

*DNorm: disease name normalization*
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/

"Disease names are often created by combining roots and affixes from Greek
or Latin (e.g. ‘hemochromatosis’)" 






On Mon, Apr 21, 2014 at 8:57 AM, Masanz, James J. wrote:

> Sean,
>
> Will the new dictionary lookup use the canonicalForm? If not, perhaps you
> can remove LVG from at least some of the pipelines (drug-ner does not
> include the dependency parser)
>

new dictionary lookup {was RE: lvg entries]

2014-04-21 Thread Masanz, James J.

Sean,

Will the new dictionary lookup use the canonicalForm? If not, perhaps you can 
remove LVG from at least some of the pipelines (drug-ner does not include the 
dependency parser)

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Thursday, April 17, 2014 12:52 PM
To: dev@ctakes.apache.org
Subject: RE: lvg entries

Those variants are not used by the dictionary lookup.  I did look at them to 
see if it was worthwhile for the new dictionary, but they are all over the 
place so I passed.  

From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely that any improvement in recall would be worth
the speed cost.

Finally, at least in Eclipse, a search for references to the method
that retrieves the lemma entries turns up nothing.

Tim


On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don't know of any applications within cTAKES that make use of this... The 
> reverse (mapping from these "variants" to the normal form) may be useful 
> though.
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:50, Miller, Timothy wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few 
>>> examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>>
>>>
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim
>>>>
>

Re: lvg entries

2014-04-18 Thread andy mcmurry

+1 false ... I think

I just wonder what side effects there might be to tweaking LVG


On Fri, Apr 18, 2014 at 11:56 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> +1 false
>

RE: lvg entries

2014-04-18 Thread Finan, Sean

+1 false

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 2:54 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks for tracking that down Andy.

I am making a pass at UimaFit-izing the configuration parameters for all the 
annotators in the default pipeline, before I create the static factory methods 
like we recently discussed. Should I go ahead and change this to make default 
behavior be false?

Tim



Re: lvg entries

2014-04-18 Thread Miller, Timothy

Thanks for tracking that down, Andy.

I am making a pass at UimaFit-izing the configuration parameters for all
the annotators in the default pipeline, before I create the static
factory methods like we recently discussed. Should I go ahead and change
this so that the default behavior is false?

Tim


On 04/18/2014 12:47 AM, andy mcmurry wrote:
> There is a lot of config handling; maybe PostLemmas is being set to true, or
> configInit() is not setting up the NLM wrapper correctly.
>
> ctakes-lvg *README*
> Note: as distributed, PostLemmas is set to false.  This is done to reduce
> the size of the CAS.
> Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma
> annotations added to the CAS.
>
> *LvgAnnotator.xml *
> PostLemmas = True
>
> *LvgAnnotator.java*
> if (postLemmas) {
>  lvgResource.getLvgLex()
> }
>
>
>
>
>
>
>

RE: lvg entries

2014-04-18 Thread Masanz, James J.


You are right, I was thinking of the field called canonicalForm.

normalizedForm is set by ExtractionPrepAnnotator.java, but if I remember right, 
that runs at the end of the pipelines that include it. It is set to either 
the canonicalForm (if there is one) or the coveredText.

Not sure what the intent there was.
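
For reference, the fallback rule described above could be sketched roughly like this. It is only an illustration, not the ExtractionPrepAnnotator code: TokenView and pickNormalizedForm are hypothetical names, and treating an empty canonical form the same as a missing one is an assumption.

```java
// Sketch only: TokenView and pickNormalizedForm are hypothetical names, not cTAKES API.
final class NormalizedFormSketch {

    /** Minimal stand-in for a word token: its covered text plus an optional canonical form. */
    static final class TokenView {
        final String coveredText;
        final String canonicalForm; // may be null or empty when no canonical form was produced

        TokenView(String coveredText, String canonicalForm) {
            this.coveredText = coveredText;
            this.canonicalForm = canonicalForm;
        }
    }

    /** Returns the canonical form when one is present, otherwise the covered text. */
    static String pickNormalizedForm(TokenView token) {
        String canonical = token.canonicalForm;
        // Assumption: an empty string counts as "no canonical form".
        if (canonical != null && !canonical.isEmpty()) {
            return canonical;
        }
        return token.coveredText;
    }
}
```

With this sketch, a token covering "nodes" with canonical form "node" yields "node", while a token with no canonical form falls back to its covered text.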

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Friday, April 18, 2014 11:16 AM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Hmm... I don't see normalizedForm filled in. I see LVG filling in
canonicalForm, is it possible that's what you're thinking of?  (Not that
I know what the difference is or is supposed to be, just going off what
I see in my xmis.)
Tim



Re: lvg entries

2014-04-18 Thread Miller, Timothy

Hmm... I don't see normalizedForm filled in. I see LVG filling in
canonicalForm, is it possible that's what you're thinking of?  (Not that
I know what the difference is or is supposed to be, just going off what
I see in my xmis.)
Tim


On 04/17/2014 06:23 PM, Masanz, James J. wrote:
> The normalizedForm field is filled in. It is used by dictionary lookup.
>
> So, for example, if the dictionary would contain "lymph node" but not "lymph 
> nodes", a document with text of "lymph nodes" would match the dictionary 
> entry "lymph node" because "node", being the normalized form of "nodes", 
> would be used when searching dictionary entries (in addition to searching 
> dictionary entries for "nodes")
>

Re: lvg entries

2014-04-17 Thread andy mcmurry

There is a lot of config handling; maybe PostLemmas is being set to true, or
configInit() is not setting up the NLM wrapper correctly.

ctakes-lvg *README*
Note: as distributed, PostLemmas is set to false.  This is done to reduce
the size of the CAS.
Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma
annotations added to the CAS.

*LvgAnnotator.xml *
PostLemmas = True

*LvgAnnotator.java*
if (postLemmas) {
 lvgResource.getLvgLex()
}
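
To make the effect of the flag concrete, here is a rough, self-contained sketch. It is not the LvgAnnotator code: PostLemmasSketch, withReadmeDefault and lemmasToPost are hypothetical names, and the only thing taken from the README is that the distributed default is false and that the flag controls whether Lemma entries are added to the CAS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: hypothetical names, no UIMA or cTAKES classes involved.
final class PostLemmasSketch {
    private final boolean postLemmas;

    PostLemmasSketch(boolean postLemmas) {
        this.postLemmas = postLemmas;
    }

    /** Mirrors the README default: lemmas are not posted. */
    static PostLemmasSketch withReadmeDefault() {
        return new PostLemmasSketch(false);
    }

    /** Returns the variants that would be stored for one token; empty when the flag is off. */
    List<String> lemmasToPost(List<String> generatedVariants) {
        if (!postLemmas) {
            return Collections.emptyList();
        }
        return new ArrayList<>(generatedVariants);
    }
}
```

With the flag off nothing is stored per token, which is the CAS size saving the README describes; with it on, every generated variant is kept.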







On Thu, Apr 17, 2014 at 3:23 PM, Masanz, James J. wrote:

> The normalizedForm field is filled in. It is used by dictionary lookup.
>
> So, for example, if the dictionary would contain "lymph node" but not
> "lymph nodes", a document with text of "lymph nodes" would match the
> dictionary entry "lymph node" because "node", being the normalized form of
> "nodes", would be used when searching dictionary entries (in addition to
> searching dictionary entries for "nodes")
>

RE: lvg entries

2014-04-17 Thread Masanz, James J.

The normalizedForm field is filled in. It is used by dictionary lookup.

So, for example, if the dictionary contained "lymph node" but not "lymph 
nodes", a document with the text "lymph nodes" would match the dictionary entry 
"lymph node", because "node", being the normalized form of "nodes", would be 
used when searching dictionary entries (in addition to searching dictionary 
entries for "nodes").
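
The matching described above can be sketched roughly as follows. This is an illustration only: the class and method names, the in-memory dictionary, and the toy word-to-normalized-form map are all assumptions, not the cTAKES dictionary lookup code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch only: hypothetical names and data structures.
final class NormalizedLookupSketch {
    private final Set<String> dictionary;         // lower-cased dictionary entries
    private final Map<String, String> normalizer; // word -> normalized form (toy stand-in)

    NormalizedLookupSketch(Set<String> dictionary, Map<String, String> normalizer) {
        this.dictionary = dictionary;
        this.normalizer = normalizer;
    }

    /** Tries the surface text first, then the word-by-word normalized text; null if neither matches. */
    String match(String phrase) {
        String surface = phrase.toLowerCase(Locale.ROOT).trim();
        if (dictionary.contains(surface)) {
            return surface;
        }
        List<String> normalizedWords = new ArrayList<>();
        for (String word : surface.split("\\s+")) {
            // Words without a known normalized form are kept as they are.
            normalizedWords.add(normalizer.getOrDefault(word, word));
        }
        String candidate = String.join(" ", normalizedWords);
        return dictionary.contains(candidate) ? candidate : null;
    }
}
```

Given a dictionary holding only "lymph node" and a normalizer mapping "nodes" to "node", the text "lymph nodes" matches through the normalized candidate even though its surface form is not an entry.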

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Thursday, April 17, 2014 4:33 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Quick follow-up since I was interested. The current dependency parser
does have the option to use ctakes lemmas or do its own lemmatizing, but
that doesn't use the lemma field, it uses the normalizedForm field. I'm
not sure if that field is actually ever filled in -- on my example data
it is always null.

Tim


RE: lvg entries

2014-04-17 Thread Masanz, James J.

Before the switch to OpenNLP (which was done before the first open-source 
release of cTAKES), I believe the Lemma annotations were used by the POS tagger 
and/or phrasal parser.  As far as I know, that was the original intention of 
the Lemmas. I believe they were turned off by default for some releases, until 
someone started to use them (or at least looked at maybe using them).

That's all just from memory. We'd have to look through histories to see when 
things changed.

I don't think the Lemma annotations were ever used for dictionary lookup. That 
used the (single) output of the normalizer function of the LVG component.

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Thursday, April 17, 2014 3:34 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Thanks James. Does it ring a bell to you that the original intention was
something like query expansion for a dictionary lookup?
Tim

On 04/17/2014 01:57 PM, Masanz, James J. wrote:
> Offhand I recall at least one of the dependency parsers used the Lemma 
> annotations at one point.
> Not sure if still does.
>
> There is an option for turning off the posting of the lemmas to the cas.
>
> Hope that helps
>
> -Original Message-
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
> Sent: Thursday, April 17, 2014 11:27 AM
> To: dev@ctakes.apache.org
> Subject: lvg entries
>
> The LVG annotator creates an enormous number of "lemmas" for every
> WordToken in the CAS, and I'm wondering what the original purpose was? I
> think this is probably a minor bottleneck for speed but mostly a pretty
> big space hog (at least 50% of the space of xmi files in my tests).
>
> As of right now I'm not sure if any downstream components are using
> these lemmas, and on a manual inspection the precision seems to be
> pretty abysmal (meaning most of them are nonsensical as lexical
> variants), so as I said, just wondering if we can revisit why cTAKES
> generates so many and whether that component can be optimized.
>
> Thanks
> Tim
>
>

Re: lvg entries

2014-04-17 Thread Miller, Timothy

Quick follow-up since I was interested. The current dependency parser
does have the option to use cTAKES lemmas or do its own lemmatizing, but
that option doesn't use the lemma field; it uses the normalizedForm field.
I'm not sure if that field is ever actually filled in -- on my example data
it is always null.

Tim

On 04/17/2014 01:57 PM, Masanz, James J. wrote:
> Offhand I recall at least one of the dependency parsers used the Lemma 
> annotations at one point.
> Not sure if still does.
>
> There is an option for turning off the posting of the lemmas to the cas.
>
> Hope that helps
>
> -Original Message-
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
> Sent: Thursday, April 17, 2014 11:27 AM
> To: dev@ctakes.apache.org
> Subject: lvg entries
>
> The LVG annotator creates an enormous number of "lemmas" for every
> WordToken in the CAS, and I'm wondering what the original purpose was? I
> think this is probably a minor bottleneck for speed but mostly a pretty
> big space hog (at least 50% of the space of xmi files in my tests).
>
> As of right now I'm not sure if any downstream components are using
> these lemmas, and on a manual inspection the precision seems to be
> pretty abysmal (meaning most of them are nonsensical as lexical
> variants), so as I said, just wondering if we can revisit why cTAKES
> generates so many and whether that component can be optimized.
>
> Thanks
> Tim
>
>

Re: lvg entries

2014-04-17 Thread Miller, Timothy

Thanks James. Does it ring a bell to you that the original intention was
something like query expansion for a dictionary lookup?
Tim


On 04/17/2014 01:57 PM, Masanz, James J. wrote:
> Offhand I recall at least one of the dependency parsers used the Lemma 
> annotations at one point.
> Not sure if still does.
>
> There is an option for turning off the posting of the lemmas to the cas.
>
> Hope that helps
>
> -Original Message-
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
> Sent: Thursday, April 17, 2014 11:27 AM
> To: dev@ctakes.apache.org
> Subject: lvg entries
>
> The LVG annotator creates an enormous number of "lemmas" for every
> WordToken in the CAS, and I'm wondering what the original purpose was? I
> think this is probably a minor bottleneck for speed but mostly a pretty
> big space hog (at least 50% of the space of xmi files in my tests).
>
> As of right now I'm not sure if any downstream components are using
> these lemmas, and on a manual inspection the precision seems to be
> pretty abysmal (meaning most of them are nonsensical as lexical
> variants), so as I said, just wondering if we can revisit why cTAKES
> generates so many and whether that component can be optimized.
>
> Thanks
> Tim
>
>

RE: lvg entries

2014-04-17 Thread Masanz, James J.


Offhand I recall at least one of the dependency parsers used the Lemma 
annotations at one point.
Not sure if it still does.

There is an option for turning off the posting of the lemmas to the cas.

Hope that helps

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Thursday, April 17, 2014 11:27 AM
To: dev@ctakes.apache.org
Subject: lvg entries

The LVG annotator creates an enormous number of "lemmas" for every
WordToken in the CAS, and I'm wondering what the original purpose was? I
think this is probably a minor bottleneck for speed but mostly a pretty
big space hog (at least 50% of the space of xmi files in my tests).

As of right now I'm not sure if any downstream components are using
these lemmas, and on a manual inspection the precision seems to be
pretty abysmal (meaning most of them are nonsensical as lexical
variants), so as I said, just wondering if we can revisit why cTAKES
generates so many and whether that component can be optimized.

Thanks
Tim

RE: lvg entries

2014-04-17 Thread Finan, Sean

Those variants are not used by the dictionary lookup.  I did look at them to 
see if it was worthwhile for the new dictionary, but they are all over the 
place so I passed.  

From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely the speed tradeoff would be worth any
improvements in recall.

Finally, at least in eclipse doing a search on references to the method
to retrieve the lemma entries turns up nothing.

Tim


On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don’t know of any applications within cTAKES that make use of this… The 
> reverse (mapping from these “variants” to the normal form) may be useful 
> though.
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:50, Miller, Timothy 
>  wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few 
>>> examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>>
>>>
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy 
>>>  wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim
>>>>
>

Re: lvg entries

2014-04-17 Thread Miller, Timothy

Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely the speed tradeoff would be worth any
improvements in recall.

Finally, at least in Eclipse, a search for references to the method
that retrieves the lemma entries turns up nothing.

Tim

On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don’t know of any applications within cTAKES that make use of this… The 
> reverse (mapping from these “variants” to the normal form) may be useful 
> though.
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:50, Miller, Timothy 
>  wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few 
>>> examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>>
>>>
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy 
>>>  wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim
>>>>
>

Re: lvg entries

2014-04-17 Thread Dligach, Dmitriy

I don’t know of any applications within cTAKES that make use of this… The 
reverse (mapping from these “variants” to the normal form) may be useful though.

Dima




On Apr 17, 2014, at 11:50, Miller, Timothy 
 wrote:

> Sure, just as an example, I gave it a note with about 1000 words. It
> generates 11500 NonEmptyFSList elements (each is basically one lexical
> variant).
> 
> For the word "symptomatic", these are the first 10 of 20 lexical variants:
> Symptomaticer/JJ
> Symptomaticer/RB
> Symptomaticed/VB
> Symptomaticcing/VB
> Symptomatics/VB
> Symptomatics/NN
> Symptomaticked/VB
> Symptomatic/VB
> Symptomatic/JJ
> Symptomatic/RB
> 
> Tim
> 
> 
> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>> Tim, this is a very interesting observation. Could you please send a few 
>> examples of what LVG generates? Both sensical and non :)
>> 
>> Dima
>> 
>> 
>> 
>> 
>> On Apr 17, 2014, at 11:28, Miller, Timothy 
>>  wrote:
>> 
>>> The LVG annotator creates an enormous number of "lemmas" for every
>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>> big space hog (at least 50% of the space of xmi files in my tests).
>>> 
>>> As of right now I'm not sure if any downstream components are using
>>> these lemmas, and on a manual inspection the precision seems to be
>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>> generates so many and whether that component can be optimized.
>>> 
>>> Thanks
>>> Tim
>>> 
>> 
>
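
The reverse mapping suggested in this message (from a generated variant back to its normal form) could be sketched roughly as below. This is only an illustration in Python, not cTAKES code: the function name `build_reverse_index` and the input dictionary are hypothetical, and the `surface/POS` string format is assumed from the examples quoted in the thread.

```python
from collections import defaultdict


def build_reverse_index(variants_by_base):
    """Map each lowercased variant surface form to the base forms that produced it.

    `variants_by_base` is a hypothetical structure: base form -> list of
    "surface/POS" strings, in the style of the examples quoted in this thread.
    """
    reverse = defaultdict(set)
    for base, variants in variants_by_base.items():
        for variant in variants:
            # Drop the part-of-speech suffix and normalize case.
            surface = variant.split("/", 1)[0].lower()
            reverse[surface].add(base)
    return dict(reverse)


# A few of the variants listed in the thread for "symptomatic".
variants_by_base = {
    "symptomatic": ["Symptomaticer/JJ", "Symptomatics/NN", "Symptomatic/JJ"],
}
index = build_reverse_index(variants_by_base)
print(index["symptomatics"])
```

A set is used for the values because one surface form could, in principle, be generated from more than one base form.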

Re: lvg entries

2014-04-17 Thread Miller, Timothy

Sure, just as an example, I gave it a note with about 1000 words. It
generates 11500 NonEmptyFSList elements (each is basically one lexical
variant).

For the word "symptomatic", these are the first 10 of 20 lexical variants:
Symptomaticer/JJ
Symptomaticer/RB
Symptomaticed/VB
Symptomaticcing/VB
Symptomatics/VB
Symptomatics/NN
Symptomaticked/VB
Symptomatic/VB
Symptomatic/JJ
Symptomatic/RB

Tim


On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
> Tim, this is a very interesting observation. Could you please send a few 
> examples of what LVG generates? Both sensical and non :)
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:28, Miller, Timothy 
>  wrote:
>
>> The LVG annotator creates an enormous number of "lemmas" for every
>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>> think this is probably a minor bottleneck for speed but mostly a pretty
>> big space hog (at least 50% of the space of xmi files in my tests).
>>
>> As of right now I'm not sure if any downstream components are using
>> these lemmas, and on a manual inspection the precision seems to be
>> pretty abysmal (meaning most of them are nonsensical as lexical
>> variants), so as I said, just wondering if we can revisit why cTAKES
>> generates so many and whether that component can be optimized.
>>
>> Thanks
>> Tim
>>
>
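
The figures in this message (about 11500 list elements for a note of about 1000 words, or roughly 11.5 per word) can be checked on other notes by counting elements in the serialized XMI. The following is a minimal sketch in Python, assuming the list nodes appear as XML elements whose local name is `NonEmptyFSList`, as described above. The function name `element_counts` and the tiny sample document are made up for illustration and are not real cTAKES output.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_counts(xmi_text):
    """Count elements in an XMI document by local (namespace-free) tag name."""
    counts = Counter()
    for elem in ET.fromstring(xmi_text).iter():
        # ElementTree tags look like "{namespace-uri}LocalName"; keep the local part.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts


# Hand-written stand-in for a real XMI file (illustrative content only).
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:syntax="http:///org/apache/ctakes/typesystem/type/syntax.ecore">
  <syntax:WordToken xmi:id="1" begin="0" end="11" lemmaEntries="2"/>
  <cas:NonEmptyFSList xmi:id="2" head="10" tail="3"/>
  <cas:NonEmptyFSList xmi:id="3" head="11" tail="4"/>
  <cas:EmptyFSList xmi:id="4"/>
</xmi:XMI>"""

counts = element_counts(SAMPLE)
total = sum(counts.values())
print(counts["NonEmptyFSList"], "of", total, "elements are list nodes")
```

Counting elements only approximates the share of file size, since elements differ in length, but it is enough to see how many list nodes are produced per WordToken.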

Re: lvg entries

2014-04-17 Thread Dligach, Dmitriy

Tim, this is a very interesting observation. Could you please send a few 
examples of what LVG generates? Both sensical and non :)

Dima




On Apr 17, 2014, at 11:28, Miller, Timothy 
 wrote:

> The LVG annotator creates an enormous number of "lemmas" for every
> WordToken in the CAS, and I'm wondering what the original purpose was? I
> think this is probably a minor bottleneck for speed but mostly a pretty
> big space hog (at least 50% of the space of xmi files in my tests).
> 
> As of right now I'm not sure if any downstream components are using
> these lemmas, and on a manual inspection the precision seems to be
> pretty abysmal (meaning most of them are nonsensical as lexical
> variants), so as I said, just wondering if we can revisit why cTAKES
> generates so many and whether that component can be optimized.
> 
> Thanks
> Tim
>
