RE: Evaluate cTAKES performance
Great explanation, thank you Tim!

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Saturday, March 18, 2017 7:18 AM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES performance

To save you a little trouble: in ctakes-temporal we rely heavily on an outside library called ClearTK, which has some evaluation APIs built in that work well with UIMA frameworks and typical NLP tasks. We use the following classes:

http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/AnnotationStatistics.html
http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/Evaluation_ImplBase.html

The simplest place to start looking in ctakes-temporal is probably the EventAnnotator and its evaluation, since events are simple one-word spans. The TimeAnnotator is slightly more complicated, with multi-word spans. If you are interested in evaluating relations, I would suggest switching over to ctakes-relation-extractor, which is more stable than the ctakes-temporal relation code; that code is an area of highly active (i.e., funded) research, so it has not been cleaned up as much.

Tim
Re: Evaluate cTAKES performance
To save you a little trouble: in ctakes-temporal we rely heavily on an outside library called ClearTK, which has some evaluation APIs built in that work well with UIMA frameworks and typical NLP tasks. We use the following classes:

http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/AnnotationStatistics.html
http://cleartk.github.io/cleartk/apidocs/2.0.0/org/cleartk/eval/Evaluation_ImplBase.html

The simplest place to start looking in ctakes-temporal is probably the EventAnnotator and its evaluation, since events are simple one-word spans. The TimeAnnotator is slightly more complicated, with multi-word spans. If you are interested in evaluating relations, I would suggest switching over to ctakes-relation-extractor, which is more stable than the ctakes-temporal relation code; that code is an area of highly active (i.e., funded) research, so it has not been cleaned up as much.

Tim

From: Leander Melms <me...@students.uni-marburg.de>
Sent: Friday, March 17, 2017 3:05 PM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES performance

Thanks! I'll have a look at it and will try to give something back to the community!

Leander

> On 17 Mar 2017, at 19:42, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>
> Ah - you meant the best way to test. Sorry, I misread your inquiry as the best way to write output.
>
> Yes, that is a great introduction document for ctakes and early tests. There are a few small test classes in ctakes that read anafora files, run ctakes and compute agreement numbers. You can find some in the ctakes-temporal module. I didn't write them, and I think that they are built-to-fit, purpose-driven classes, but you could try to adapt them to a general-purpose case. That would be a great thing to have in ctakes!
>
> Sean
>
> -----Original Message-----
> From: Leander Melms [mailto:me...@students.uni-marburg.de]
> Sent: Friday, March 17, 2017 1:46 PM
> To: dev@ctakes.apache.org
> Subject: Re: Evaluate cTAKES performance
>
> Hi Sean,
>
> thank you (again) for your help and feedback! I'll give it a try! It seems the authors of the publication "Mayo clinical Text analysis and Knowledge Extraction System" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/) did this as well.
>
> Thank you
> Leander
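The difference Tim points out between one-word event spans and multi-word time spans matters mostly for how a "match" is defined. The following is a minimal Python sketch of that idea, not the ClearTK AnnotationStatistics API: the function names, the (begin, end) tuple representation and the example offsets are all assumptions made for illustration.

```python
import operator


def overlaps(a, b):
    """True if half-open character spans a=(begin, end) and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def scores(gold, system, match):
    """Precision, recall and F1 for two lists of spans under a given match function.

    A system span counts as correct if it matches any gold span;
    a gold span counts as found if any system span matches it.
    """
    sys_hits = sum(1 for s in system if any(match(g, s) for g in gold))
    gold_hits = sum(1 for g in gold if any(match(g, s) for s in system))
    precision = sys_hits / len(system) if system else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical offsets: a one-word event and a multi-word time expression.
gold_spans = [(10, 17), (30, 45)]
system_spans = [(10, 17), (34, 45)]  # second span is only partially recovered

exact = scores(gold_spans, system_spans, operator.eq)
relaxed = scores(gold_spans, system_spans, overlaps)
print("exact:", exact)
print("overlap:", relaxed)
```

Exact matching is usually sufficient for one-word spans, while reporting an overlap-based score alongside it shows how much of the error on multi-word spans comes from boundary disagreements rather than missed mentions.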
Re: Evaluate cTAKES performance
Hi Sean,

thank you (again) for your help and feedback! I'll give it a try! It seems the authors of the publication "Mayo clinical Text analysis and Knowledge Extraction System" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/) did this as well.

Thank you
Leander

> On 17 Mar 2017, at 18:33, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>
> Hi Leander,
>
> There is no single correct way to do this, but a couple of similar classes exist. Well, one sat in my sandbox for two years until about 5 seconds ago, as I only just checked it in. Anyway, take a look at two classes in ctakes-core (org.apache.ctakes.core). They are TextSpanWriter and CuiCountFileWriter.
>
> TextSpanWriter writes annotation name | span | covered text in a file, one per document.
>
> CuiCountFileWriter writes a list of discovered CUIs and their counts.
>
> It sounds like you are interested in a combination of both - basically TextSpanWriter with the added output of CUIs.
>
> You can also have a look at EntityCollector in org.apache.ctakes.core.pipeline. It has an annotation engine that keeps a running list of "entities" for the whole run: doc ids, spans, text and CUIs.
>
> Sean
RE: Evaluate cTAKES performance
Hi Leander,

There is no single correct way to do this, but a couple of similar classes exist. Well, one sat in my sandbox for two years until about 5 seconds ago, as I only just checked it in. Anyway, take a look at two classes in ctakes-core (org.apache.ctakes.core). They are TextSpanWriter and CuiCountFileWriter.

TextSpanWriter writes annotation name | span | covered text in a file, one per document.

CuiCountFileWriter writes a list of discovered CUIs and their counts.

It sounds like you are interested in a combination of both - basically TextSpanWriter with the added output of CUIs.

You can also have a look at EntityCollector in org.apache.ctakes.core.pipeline. It has an annotation engine that keeps a running list of "entities" for the whole run: doc ids, spans, text and CUIs.

Sean

-----Original Message-----
From: Leander Melms [mailto:me...@students.uni-marburg.de]
Sent: Friday, March 17, 2017 1:09 PM
To: dev@ctakes.apache.org
Subject: Re: Evaluate cTAKES performance

Sorry for writing again. I just have a quick question: My idea is to parse the cTAKES output to a text file with a structure like this

DocName|Spans|CUI|CoveredText|ConceptType

and do the same with the gold standard (from anafora).

Is this a correct way to do this?

I'm new to the subject and happy about the tiniest information on the topic.

Thanks
Leander

> On 17 Mar 2017, at 12:05, Leander Melms <me...@students.uni-marburg.de> wrote:
>
> Hi,
>
> I've integrated a custom dictionary, retrained some of the OpenNLP models and would like to evaluate the changes on a gold standard. I'd like to calculate the precision, the recall and the f1-score to compare the results.
>
> My question is: Does cTAKES ship with some evaluation / test scripts? What is the best strategy to do this? Has anyone dealt with this topic before?
>
> I'm happy to share the results afterwards if there is interest in it.
>
> Thanks
> Leander
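As a rough illustration of the file-based comparison Leander proposes, here is a minimal Python sketch. It assumes both the cTAKES output and the gold standard have already been written as DocName|Spans|CUI|CoveredText|ConceptType rows, that the span field is encoded as "begin,end", and that the covered text contains no "|" character. The Mention tuple, the function names and the example rows are hypothetical and not part of cTAKES.

```python
from collections import namedtuple

# A mention is identified by document, character offsets and CUI;
# covered text and concept type are ignored for matching in this sketch.
Mention = namedtuple("Mention", "doc begin end cui")


def parse_rows(lines):
    """Parse DocName|Spans|CUI|CoveredText|ConceptType rows into a set of Mentions."""
    mentions = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc, span, cui, _text, _concept_type = line.split("|")
        begin, end = span.split(",")  # assumed "begin,end" span encoding
        mentions.add(Mention(doc, int(begin), int(end), cui))
    return mentions


def precision_recall_f1(gold, system):
    """Exact-match precision, recall and F1 between two sets of Mentions."""
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example rows, for illustration only.
gold_rows = [
    "note1|0,8|C0018681|Headache|SignSymptom",
    "note1|20,27|C0004057|aspirin|Medication",
]
system_rows = [
    "note1|0,8|C0018681|Headache|SignSymptom",
    "note1|30,35|C0015967|fever|SignSymptom",
]

result = precision_recall_f1(parse_rows(gold_rows), parse_rows(system_rows))
print(result)
```

Matching on (document, span, CUI) is the strictest choice; dropping the CUI from the key would score span detection alone, which can help separate dictionary errors from boundary errors.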