Hi Yiming,
Re your question about gold standard datasets: in parallel with releasing the 
best-performing methods in cTAKES, we have generated several gold standard 
datasets. Our plan is to start distributing them through a unified effort -- a 
health NLP Center. See the attached executive summary. We hope to have the 
Center running in the very near future.

Cheers,
--Guergana

-----Original Message-----
From: Zuo Yiming [mailto:yiming...@gmail.com] 
Sent: Wednesday, October 19, 2016 12:22 PM
To: dev@ctakes.apache.org
Subject: Re: Best combination of analysis engines to consider negation, family 
history, uncertainty, etc.

Hi Sean and Timothy,

Thanks for your clarification about the ClearTK tools. I'm amazed by the power 
of cTAKES and by the resources and community you have worked to build. I will 
certainly be happy to provide more feedback as my project moves on.

For Timothy,

By rule-based system, do you mean the assertion annotator? How about the old 
negation annotator and the status annotator: are they also rule-based systems? 
I get the feeling that the assertion annotator and the ClearTK system are, for 
some reason, currently favored in cTAKES over the negation annotator and the 
status annotator.

Regarding the ClearTK system on my test files: the negation, history, and 
uncertainty modules work just as well as the assertion annotator. I have only a 
few test files, so it's really hard to tell which one is better. The main 
difference shows up when detecting the subject and generic properties. On my 
limited test files, the ClearTK system doesn't work at all for these. It 
assigns patient as the subject for all detected phrases, even when it's the 
patient's family member who has diabetes. The same problem applies to the 
generic property: the ClearTK system assigns false as the generic property for 
all detected phrases. The paper mentioned by you and Sean seems interesting; I 
will take a look later.

As a further question, can you give me some suggestions on where to find 
public gold standard datasets, so I can conduct an independent evaluation of 
cTAKES using metrics like precision, recall, and F1 score?
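
For reference, the metrics named above could be computed roughly as follows. 
This is only a minimal sketch and not part of cTAKES: it assumes that gold and 
system annotations have already been exported as comparable tuples and that 
only exact matches count as correct. The `score` function, the tuple layout 
(document id, begin offset, end offset, attribute) and the sample data are all 
hypothetical.

```python
from typing import Hashable, Iterable, NamedTuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score(gold: Iterable[Hashable], predicted: Iterable[Hashable]) -> Scores:
    """Compare two collections of annotations by exact match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    # Precision: share of system annotations that are also in the gold set.
    precision = true_pos / len(pred_set) if pred_set else 0.0
    # Recall: share of gold annotations that the system found.
    recall = true_pos / len(gold_set) if gold_set else 0.0
    # F1: harmonic mean of precision and recall (0.0 when both are 0).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical annotations: (document id, begin offset, end offset, attribute).
gold = [
    ("note1", 10, 18, "negated"),
    ("note1", 40, 52, "negated"),
    ("note2", 5, 13, "negated"),
]
predicted = [
    ("note1", 10, 18, "negated"),
    ("note2", 5, 13, "negated"),
    ("note2", 30, 36, "negated"),
]

result = score(gold, predicted)
print(f"P={result.precision:.2f} R={result.recall:.2f} F1={result.f1:.2f}")
```

Exact matching on offsets is the strictest choice; a real evaluation might 
relax it to overlapping spans, which would change the counts.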

Finally, a minor suggestion from the user perspective: add the preferred words 
property to the AggregatePlaintextUMLSProcessor. As I pointed out briefly in my 
first email, using AggregatePlaintextFastUMLSProcessor we can get the preferred 
words for detected phrases, but not with AggregatePlaintextUMLSProcessor. This 
is very helpful when the detected phrases are acronyms such as pt for patient. 
In my experience, AggregatePlaintextUMLSProcessor tends to detect more 
clinically relevant phrases than AggregatePlaintextFastUMLSProcessor. It would 
be really nice to have the same preferred words property in 
AggregatePlaintextUMLSProcessor in a future cTAKES release.

Best,
Yiming

On Wed, Oct 19, 2016 at 11:11 AM, Miller, Timothy < 
timothy.mil...@childrens.harvard.edu> wrote:

> I can second Sean's thank you; it is good to have this feedback. The 
> ClearTK machine learning models were made the default after we ran 
> some experiments that found they performed better across a range of 
> standard datasets than rule-based algorithms or the existing cTAKES 
> module ( 
> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774
>  ).
> Since making them the default, though, we have heard from people whose 
> experience conflicts with those experiments, and our own experience has 
> as well. And certainly the errors in the rule-based system are easier 
> to understand.
>
> Just curious, are you able to characterize the errors you see from the 
> ClearTK system? I did some experiments recently on a new dataset 
> comparing negex with the cleartk negation module and found that there 
> was a precision/recall tradeoff but almost identical F1 scores. But 
> for that dataset the tradeoff negex provided was preferred by our 
> collaborators. (I think negex had better recall of negated terms but worse 
> precision).
>
> Tim
>
>
>
> ________________________________________
> From: Finan, Sean <sean.fi...@childrens.harvard.edu>
> Sent: Wednesday, October 19, 2016 10:53 AM
> To: dev@ctakes.apache.org
> Subject: RE: Best combination of analysis engines to consider 
> negation, family history, uncertainty, etc.
>
> Hi Yiming,
>
>
>
> Thank you very much for letting the community know what has and has 
> not worked for you.  I have also had better results with the Assertion 
> annotators than the ClearTk alternatives, but that could be because of 
> the note types/formats that I am using.
>
>
>
> Regarding the "Clear" in names, it is because ClearTk (Clear ToolKit) 
> is used to train machine learning models for detection of the 
> indicated property.  You can find information on ClearTk starting here:
> http://clear.colorado.edu/compsem/
>
>
>
> If you prefer to read a paper, you can check out 
> http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf
>
>
>
> Others on the dev list can provide much more information than I can, so 
> you could post a question if you like.
>
>
>
> Cheers,
>
> Sean
>
>
>
> -----Original Message-----
>
> From: Zuo Yiming [mailto:yiming...@gmail.com]
>
> Sent: Wednesday, October 19, 2016 10:04 AM
>
> To: u...@ctakes.apache.org; dev@ctakes.apache.org
>
> Subject: Best combination of analysis engines to consider negation, 
> family history, uncertainty, etc.
>
>
>
> Hi everyone,
>
>
>
> I've spent the last few months working on a clinical NLP project 
> using cTAKES. It's a very complex system to me, and every time I dig 
> into it I make new discoveries. Since last week, I have been trying 
> to figure out which analysis engines do a good job of handling cases 
> like negation, family history, uncertainty, etc. I now have some 
> experience that I would like to share with the community.
>
>
>
> The best combination for me is to use assertionMiniPipelineAnalysisEngine 
> for negation, uncertainty, generic and subject detection, and 
> HistoryCleartkAnalysisEngine for history detection. Both engines are 
> in the desc/ctakes-assertion folder. The assertionMiniPipelineAnalysisEngine 
> also claims to be useful for conditional detection, which I haven't 
> verified using my test files yet.
>
>
>
> I'm using the AggregatePlaintextFastUMLSProcessor on the higher level.
> The default analysis engines in AggregatePlaintextFastUMLSProcessor 
> for negation, uncertainty, generic, etc. are StatusAnnotator + 
> NegationAnnotator + PolarityCleartkAnalysisEngine + 
> SubjectCleartkAnalysisEngine + UncertaintyCleartkAnalysisEngine + 
> GenericCleartkAnalysisEngine + HistoryCleartkAnalysisEngine. It looks 
> like in the node part, StatusAnnotator and NegationAnnotator are 
> commented out, so only the remaining five analysis engines are 
> actually used and all of them are in the same desc/ctakes-assertion 
> folder. These five analysis engines were not effective in my test 
> files and I'm still confused by their relationship to the 
> assertionaAnalysisEngine, conceptConverterAnalysisEngine, 
> GenericAttributeAnalysisEngine and SubjectAttributeAnalysisEngine used in 
> assertionMiniPipelineAnalysisEngine.
>
> It looks to me like the Clear in their names indicates something, but I 
> couldn't figure it out without going through the Java code, which I 
> intend not to do at this level.
>
>
>
> That's pretty much all of it for now. Anyone familiar with this topic 
> is welcome to jump in with insights or corrections. 
> Hopefully, we can have a nice discussion that will be useful to other users 
> and developers.
>
>
>
> P.S. The reason for using AggregatePlaintextFastUMLSProcessor rather 
> than AggregatePlaintextProcessor is that I find the preferred words 
> property in the former very useful, while it isn't available with the 
> latter.
>
>
>
> Best,
>
> Yiming
>
> --
>
> Yiming Zuo <https://sites.google.com/site/yimingzuo/>
> Georgetown U. Medical Center:
> Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
> ECE Department of Virginia Tech:
> Computational Bioinformatics & Bio-imaging Laboratory <http://www.cbil.ece.vt.edu/>
>
>


--
Yiming Zuo <https://sites.google.com/site/yimingzuo/>
Georgetown U. Medical Center:
Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
ECE Department of Virginia Tech:
Computational Bioinformatics & Bio-imaging Laboratory <http://www.cbil.ece.vt.edu/>
