RE: Evaluate cTAKES performance

2017-03-19 Thread Finan, Sean
Great explanation, 
Thank you Tim!

-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Saturday, March 18, 2017 7:18 AM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES perfomance [SUSPICIOUS]

To save you a little trouble, in ctakes-temporal we rely a lot on an outside 
library called ClearTK that has some evaluation APIs built in that work well 
with UIMA frameworks and typical NLP tasks. We use the following classes:
http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/AnnotationStatistics.html
http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/Evaluation_ImplBase.html
 

The simplest place to start looking in ctakes-temporal is probably the 
EventAnnotator and its evaluation, since events are simple one-word spans. The 
TimeAnnotator is slightly more complicated because it has multi-word spans. If 
you are interested in evaluating relations, I would suggest switching over to 
ctakes-relation-extractor, which is more stable than the ctakes-temporal 
relation code. The temporal relation code is an area of highly active (i.e., 
funded) research, so it has not been cleaned up as much.
Tim


From: Leander Melms <me...@students.uni-marburg.de>
Sent: Friday, March 17, 2017 3:05 PM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES perfomance

Thanks! I'll have a look at it and will try to give something back to the 
community!

Leander


> On 17 Mar 2017, at 19:42, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
>
> Ah - you meant best way to test.  Sorry, I misread your inquiry as a best way 
> to write output.
>
> Yes, that is a great introduction document for ctakes and early tests.  There 
> are a few small test classes in ctakes that read Anafora files, run ctakes 
> and compute agreement numbers.  You can find some in the ctakes-temporal module.  
> I didn't write them, and I think that they are purpose-built classes, but you 
> could try to adapt them to a general-purpose case.  That would be a great 
> thing to have in ctakes!
>
> Sean
>
> -Original Message-
> From: Leander Melms [mailto:me...@students.uni-marburg.de]
> Sent: Friday, March 17, 2017 1:46 PM
> To: dev@ctakes.apache.org
> Subject: Re: Evaluate cTAKES perfomance
>
> Hi Sean,
>
> thank you (again) for your help and feedback! I'll give it a try! Seems like 
> the authors of the publication "Mayo clinical Text analysis and Knowledge 
> Extraction System" 
> (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/) did this as well.
>
> Thank you
> Leander
>
>
>
>> On 17 Mar 2017, at 18:33, Finan, Sean <sean.fi...@childrens.harvard.edu> 
>> wrote:
>>
>> Hi Leander,
>>
>> There is no single correct way to do this, but a couple of similar 
>> classes exist.  Well, one sat in my sandbox for two years until about 5 
>> seconds ago as I only just checked it in.  Anyway, take a look at two 
>> classes in ctakes-core (org.apache.ctakes.core).  They are TextSpanWriter and 
>> CuiCountFileWriter.
>>
>> TextSpanWriter writes annotation name | span | covered text in a file, one 
>> per document.
>>
>> CuiCountFileWriter writes a list of discovered cuis and their counts.
>>
>> It sounds like you are interested in a combination of both - basically 
>> TextSpanWriter with the added output of CUIs.
>>
>> You can also have a look at EntityCollector of 
>> org.apache.ctakes.core.pipeline.  It has an annotation engine that keeps a 
>> running list of "entities" for the whole run, doc ids, spans, text and cuis.
>>
>> Sean
>>
>>
>> -Original Message-
>> From: Leander Melms [mailto:me...@students.uni-marburg.de]
>> Sent: Friday, March 17

Re: Evaluate cTAKES performance

2017-03-18 Thread Miller, Timothy
To save you a little trouble, in ctakes-temporal we rely a lot on an outside 
library called ClearTK that has some evaluation APIs built in that work well 
with UIMA frameworks and typical NLP tasks. We use the following classes:
http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/AnnotationStatistics.html
http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/Evaluation_ImplBase.html

The simplest place to start looking in ctakes-temporal is probably the 
EventAnnotator and its evaluation, since events are simple one-word spans. The 
TimeAnnotator is slightly more complicated because it has multi-word spans. If 
you are interested in evaluating relations, I would suggest switching over to 
ctakes-relation-extractor, which is more stable than the ctakes-temporal 
relation code. The temporal relation code is an area of highly active (i.e., 
funded) research, so it has not been cleaned up as much.
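To make the one-word versus multi-word point concrete, here is a minimal sketch in plain Python. It is not the ClearTK API and not code from ctakes-temporal; the function names and the character offsets are made up for illustration. It scores predicted spans against reference spans twice, once requiring an exact match and once accepting any overlap, which is one way partial matches on multi-word spans can change the numbers.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def ratio(hits: int, total: int) -> float:
    """hits / total, or 0.0 when there is nothing to divide by."""
    return hits / total if total else 0.0


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def exact_scores(reference: Iterable[Span], predicted: Iterable[Span]):
    """A prediction counts only if begin and end both match a reference span."""
    ref, pred = set(reference), set(predicted)
    hits = len(ref & pred)
    p, r = ratio(hits, len(pred)), ratio(hits, len(ref))
    return p, r, f1(p, r)


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_scores(reference: Iterable[Span], predicted: Iterable[Span]):
    """A prediction counts if it overlaps any reference span, and vice versa."""
    ref, pred = set(reference), set(predicted)
    pred_hits = sum(any(overlaps(s, g) for g in ref) for s in pred)
    ref_hits = sum(any(overlaps(g, s) for s in pred) for g in ref)
    p, r = ratio(pred_hits, len(pred)), ratio(ref_hits, len(ref))
    return p, r, f1(p, r)


# Hypothetical offsets: the second system span covers only part of the gold span.
gold = [(10, 17), (30, 45)]
system = [(10, 17), (30, 40)]
print("exact:  ", exact_scores(gold, system))
print("overlap:", overlap_scores(gold, system))
```

With these made-up spans the exact scores are 0.5 across the board while the overlap scores are 1.0, so it is worth deciding up front which matching rule you report.
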
Tim


From: Leander Melms <me...@students.uni-marburg.de>
Sent: Friday, March 17, 2017 3:05 PM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES perfomance

Thanks! I'll have a look at it and will try to give something back to the 
community!

Leander


> On 17 Mar 2017, at 19:42, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
>
> Ah - you meant best way to test.  Sorry, I misread your inquiry as a best way 
> to write output.
>
> Yes, that is a great introduction document for ctakes and early tests.  There 
> are a few small test classes in ctakes that read Anafora files, run ctakes 
> and compute agreement numbers.  You can find some in the ctakes-temporal module.  
> I didn't write them, and I think that they are purpose-built classes, but you 
> could try to adapt them to a general-purpose case.  That would be a great 
> thing to have in ctakes!
>
> Sean
>
> -Original Message-
> From: Leander Melms [mailto:me...@students.uni-marburg.de]
> Sent: Friday, March 17, 2017 1:46 PM
> To: dev@ctakes.apache.org
> Subject: Re: Evaluate cTAKES perfomance
>
> Hi Sean,
>
> thank you (again) for your help and feedback! I'll give it a try! Seems like 
> the authors of the publication "Mayo clinical Text analysis and Knowledge 
> Extraction System" 
> (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/) did this as well.
>
> Thank you
> Leander
>
>
>
>> On 17 Mar 2017, at 18:33, Finan, Sean <sean.fi...@childrens.harvard.edu> 
>> wrote:
>>
>> Hi Leander,
>>
>> There is no single correct way to do this, but a couple of similar
>> classes exist.  Well, one sat in my sandbox for two years until about 5 
>> seconds ago as I only just checked it in.  Anyway, take a look at two 
>> classes in ctakes-core (org.apache.ctakes.core).  They are TextSpanWriter and 
>> CuiCountFileWriter.
>>
>> TextSpanWriter writes annotation name | span | covered text in a file, one 
>> per document.
>>
>> CuiCountFileWriter writes a list of discovered cuis and their counts.
>>
>> It sounds like you are interested in a combination of both - basically 
>> TextSpanWriter with the added output of CUIs.
>>
>> You can also have a look at EntityCollector of 
>> org.apache.ctakes.core.pipeline.  It has an annotation engine that keeps a 
>> running list of "entities" for the whole run, doc ids, spans, text and cuis.
>>
>> Sean
>>
>>
>> -Original Message-
>> From: Leander Melms [mailto:me...@students.uni-marburg.de]
>> Sent: Friday, March 17, 2017 1:09 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Evaluate cTAKES perfomance
>>
>> Sorry for writing again. I just have a quick question: My idea is to parse 
>> the cTAKES output to a text file with a structure like this 
>> DocName|Spans|CUI|CoveredText|ConceptType and do the same with the gold 
>> standard (from Anafora).
>>
>> Is this a correct way to do this?
>>
>> I'm new to the subject and happy about the tiniest information on the topic.
>>
>> Thanks
>> Leander
>>
>>> On 17 Mar 2017, at 12:05, Leander Melms <me...@students.uni-marburg.de> 
>>> wrote:
>>>

Re: Evaluate cTAKES performance

2017-03-17 Thread Leander Melms
Hi Sean,

thank you (again) for your help and feedback! I'll give it a try! Seems like 
the authors of the publication "Mayo clinical Text analysis and Knowledge 
Extraction System" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/) did 
this as well.

Thank you
Leander



> On 17 Mar 2017, at 18:33, Finan, Sean <sean.fi...@childrens.harvard.edu> 
> wrote:
> 
> Hi Leander,
> 
> There is no single correct way to do this, but a couple of similar classes 
> exist.  Well, one sat in my sandbox for two years until about 5 seconds ago 
> as I only just checked it in.  Anyway, take a look at two classes in 
> ctakes-core org.apache.ctakes.core
> They are TextSpanWriter and CuiCountFileWriter.
> 
> TextSpanWriter writes annotation name | span | covered text in a file, one 
> per document.
> 
> CuiCountFileWriter writes a list of discovered cuis and their counts.
> 
> It sounds like you are interested in a combination of both - basically 
> TextSpanWriter with the added output of CUIs.
> 
> You can also have a look at EntityCollector of 
> org.apache.ctakes.core.pipeline.  It has an annotation engine that keeps a 
> running list of "entities" for the whole run, doc ids, spans, text and cuis.
> 
> Sean
> 
> 
> -Original Message-
> From: Leander Melms [mailto:me...@students.uni-marburg.de] 
> Sent: Friday, March 17, 2017 1:09 PM
> To: dev@ctakes.apache.org
> Subject: Re: Evaluate cTAKES perfomance
> 
> Sorry for writing again. I just have a quick question: My idea is to parse 
> the cTAKES output to a text file with a structure like this 
> DocName|Spans|CUI|CoveredText|ConceptType and do the same with the gold 
> standard (from Anafora). 
> 
> Is this a correct way to do this? 
> 
> I'm new to the subject and happy about the tiniest information on the topic.
> 
> Thanks
> Leander
> 
>> On 17 Mar 2017, at 12:05, Leander Melms <me...@students.uni-marburg.de> 
>> wrote:
>> 
>> Hi,
>> 
>> I've integrated a custom dictionary, retrained some of the OpenNLP models 
>> and would like to evaluate the changes on a gold standard. I'd like to 
>> calculate the precision, the recall and the f1-score to compare the results.
>> 
>> My question is: Does cTAKES ship with some evaluation / test scripts? What 
>> is the best strategy for doing this? Has anyone dealt with this topic before? 
>> 
>> I'm happy to share the results afterwards if there is interest in it.
>> 
>> Thanks
>> Leander
>> 
> 
> 



RE: Evaluate cTAKES performance

2017-03-17 Thread Finan, Sean
Hi Leander,

There is no single correct way to do this, but a couple of similar classes 
exist.  Well, one sat in my sandbox for two years until about 5 seconds ago as 
I only just checked it in.  Anyway, take a look at two classes in ctakes-core 
org.apache.ctakes.core
They are TextSpanWriter and CuiCountFileWriter.

TextSpanWriter writes annotation name | span | covered text in a file, one per 
document.

CuiCountFileWriter writes a list of discovered cuis and their counts.

It sounds like you are interested in a combination of both - basically 
TextSpanWriter with the added output of CUIs.

You can also have a look at EntityCollector of org.apache.ctakes.core.pipeline. 
 It has an annotation engine that keeps a running list of "entities" for the 
whole run, doc ids, spans, text and cuis.
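As a rough picture of "TextSpanWriter with the added output of CUIs", here is a small Python sketch of one output record. It does not reproduce the actual TextSpanWriter format; the Entity class, the to_line helper, the "begin,end" span layout and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for one discovered annotation."""
    name: str   # annotation type name
    begin: int  # begin character offset
    end: int    # end character offset, exclusive
    cui: str    # concept unique identifier
    text: str   # covered text


def to_line(entity: Entity) -> str:
    """Render one entity as 'name|begin,end|CUI|covered text'."""
    # Replace characters that would break a one-record-per-line, pipe-delimited file.
    safe_text = entity.text.replace("|", " ").replace("\n", " ")
    return f"{entity.name}|{entity.begin},{entity.end}|{entity.cui}|{safe_text}"


# Illustrative values only.
example = Entity("DiseaseDisorderMention", 27, 33, "C0004096", "asthma")
print(to_line(example))
```

Writing one such line per annotation, into one file per document, gives a format that is easy to diff or load into a script later.
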

Sean


-Original Message-
From: Leander Melms [mailto:me...@students.uni-marburg.de] 
Sent: Friday, March 17, 2017 1:09 PM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES perfomance

Sorry for writing again. I just have a quick question: My idea is to parse the 
cTAKES output to a text file with a structure like this 
DocName|Spans|CUI|CoveredText|ConceptType and do the same with the gold 
standard (from Anafora). 
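A minimal sketch of how two such files could be compared, in Python. This is not an existing cTAKES script. It assumes every line has exactly five pipe-separated fields, that the covered text contains no "|", and that an entry counts as correct only when DocName, Spans and CUI all match the gold standard exactly; the sample lines are invented.

```python
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str, str]  # (DocName, Spans, CUI)


def read_keys(lines: Iterable[str]) -> Set[Key]:
    """Parse DocName|Spans|CUI|CoveredText|ConceptType lines into comparable keys."""
    keys: Set[Key] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Raises ValueError if a line does not have exactly five fields.
        doc, spans, cui, _text, _concept_type = line.split("|")
        keys.add((doc, spans, cui))
    return keys


def evaluate(gold_lines: Iterable[str], system_lines: Iterable[str]) -> Dict[str, float]:
    """Count true/false positives and false negatives, then derive P, R and F1."""
    gold = read_keys(gold_lines)
    system = read_keys(system_lines)
    tp = len(gold & system)   # found by the system and present in the gold standard
    fp = len(system - gold)   # found by the system only
    fn = len(gold - system)   # present in the gold standard only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented example lines; real input would come from the two text files.
gold_example = [
    "doc1|10,16|C0004096|asthma|DiseaseDisorderMention",
    "doc1|40,45|C0010200|cough|SignSymptomMention",
]
system_example = [
    "doc1|10,16|C0004096|asthma|DiseaseDisorderMention",
    "doc1|60,65|C0015967|fever|SignSymptomMention",
]
print(evaluate(gold_example, system_example))
```

Matching on the full (DocName, Spans, CUI) key is the strictest choice; dropping the CUI from the key would score span detection alone.
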

Is this a correct way to do this? 

I'm new to the subject and happy about the tiniest information on the topic.

Thanks
Leander

> On 17 Mar 2017, at 12:05, Leander Melms <me...@students.uni-marburg.de> wrote:
> 
> Hi,
> 
> I've integrated a custom dictionary, retrained some of the OpenNLP models and 
> would like to evaluate the changes on a gold standard. I'd like to calculate 
> the precision, the recall and the f1-score to compare the results.
> 
> My question is: Does cTAKES ship with some evaluation / test scripts? What is 
> the best strategy for doing this? Has anyone dealt with this topic before? 
> 
> I'm happy to share the results afterwards if there is interest in it.
> 
> Thanks
> Leander
>