Hi Tim,

Here is an untested example, but it should show the concept.

Document 1: Sarah had Induced Abortion Illegally.
Document 2: John had a previous history of Abuse Health Service.

If everything went well, the following CUIs would be the matches:

Illegally Induced Abortion, C000804
Health Service Abuse, C000864

When running in cTAKES with the bug, the following would happen.

Document 1 would be processed, and different permutations would be tested to find a match against the document. For example, the permutation 1, 2, 3 would be tried (Induced Abortion Illegally). For the sake of this discussion, we will say that returned nothing. All the permutations of 1, 2, and 3 would be tried. A match for the permutation 3, 1, 2 (Illegally Induced Abortion) would be found.

When the bug is present, the list of permutations (1, 2, 3), (1, 3, 2), (2, 1, 3), ... (3, 1, 2) would be changed to (1, 2, 3), (1, 3, 2), (2, 1, 3), ... (1, 2, 3), because the matching permutation itself was sorted to find the starting and ending span of the match. As you can see, the permutation (1, 2, 3) now exists twice, and (3, 1, 2) no longer exists.

Document 2 would then be processed. The permutation (3, 1, 2) would no longer be tried, so Abuse Health Service would never be tried, but (1, 2, 3) Health Service Abuse would be attempted twice.

With the small number of documents being processed, I doubt this skewed the tests that were being run between the two sets. I was curious, though, whether this could have been a factor, since the more documents and CUIs that are processed, the more permutations end up sorted to (1, 2, 3). Note that permutations of different lengths also occur.

Using the permutations, cTAKES can find additional CUIs that may not be discovered using exact matching techniques, so having this degrade over time was something that we wanted to fix.

This example is contrived and would not really match in cTAKES, because the first word is different, but I think it adequately displays the concept and the bug that was caused by the permutations being sorted.

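The failure mode above can be sketched in a few lines of Python. This is only an illustration, not the cTAKES code (cTAKES is Java): the names build_permutations, span_buggy, span_fixed and lookup are hypothetical, and the dictionary holds just the one contrived entry from the example. To stay within what the example can actually demonstrate, the sketch feeds Document 1 through twice and assumes the sort happens only when a match is found.

```python
from itertools import permutations


def build_permutations(n):
    """Return every ordering of token positions 1..n as a mutable list."""
    return [list(p) for p in permutations(range(1, n + 1))]


def span_buggy(perm):
    """Find the first and last token position by sorting IN PLACE.

    This mutates the shared permutation list, which is the bug.
    """
    perm.sort()
    return perm[0], perm[-1]


def span_fixed(perm):
    """Same span, computed without touching the shared permutation."""
    return min(perm), max(perm)


def lookup(tokens, cache, dictionary, span_fn):
    """Try every cached permutation of the tokens against the dictionary."""
    hits = []
    for perm in cache:
        candidate = " ".join(tokens[i - 1] for i in perm)
        if candidate in dictionary:
            hits.append((dictionary[candidate], span_fn(perm)))
    return hits


# Hypothetical one-entry dictionary and the tokens of Document 1.
DICTIONARY = {"Illegally Induced Abortion": "C000804"}
DOC1 = ["Induced", "Abortion", "Illegally"]

cache = build_permutations(3)               # shared for the whole run
print(lookup(DOC1, cache, DICTIONARY, span_buggy))  # [('C000804', (1, 3))]
print(lookup(DOC1, cache, DICTIONARY, span_buggy))  # []
print(cache.count([1, 2, 3]))               # 2: (3, 1, 2) was overwritten
```

With span_fixed in place of span_buggy, the second pass returns the same hit as the first, because the cached permutations are never modified.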
Please let me know if you have any questions about my example.

Thanks,

IMAT Solutions <http://imatsolutions.com>
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com <mailto:greg.hub...@imatsolutions.com>

On 12/19/2014 09:54 AM, Miller, Timothy wrote:
> Thanks Kim,
> This sounds interesting though I don't totally understand it. Are you saying that extraction performance for a given note depends on which order the note was in the processing queue? If so that's pretty bad! If you (or anyone else who understands this issue) has a concrete example I think that might help me understand what the problem is/was.
>
> Even though, as Pei mentioned, we are going to try moving the community to the faster dictionary, I would like to understand better just to help myself avoid issues of this type going forward (and verify the new dictionary doesn't use similar logic).
>
> Also, when we finish annotating the sample notes, might we use that as a point of comparison for the two dictionaries? That would get around the issue that not everyone has access to the datasets we used for validation and others are likely not able to share theirs either. And maybe we can replicate the notes if we want to simulate the scenario Kim is talking about with thousands or more notes.
>
> Tim
>
> On 12/19/2014 10:24 AM, Kim Ebert wrote:
> Guergana,
>
> I'm curious to the number of records that are in your gold standard sets, or if your gold standard set was run through a long running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused the permutations to become corrupted over time. Typically this isn't seen in the first few records, but over time as patterns are used the permutations would become corrupted. This caused documents that were fed through cTAKES more than once to have less codes returned than the first time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation would be corrupted to be 1,2,3,4. It would no longer be possible to detect permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after the cTAKES 3.2.0 release. https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine eventually only have a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or so documents.
>
> We discovered this issue when we made cTAKES have consistent output of codes in our system.
>
> [IMAT Solutions]<http://imatsolutions.com>
> Kim Ebert
> Software Engineer
> [Office:] 801.669.7342
> kim.eb...@imatsolutions.com<mailto:greg.hub...@imatsolutions.com>
>
> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>
> We are doing a similar kind of evaluation and will report the results.
>
> Before we released the Fast lookup, we did a systematic evaluation across three gold standard sets. We did not see the trend that Bruce reported below. The P, R and F1 results from the old dictionary look up and the fast one were similar.
>
> Thank you everyone!
> --Guergana
>
> -----Original Message-----
> From: David Kincaid [mailto:kincaid.d...@gmail.com]
> Sent: Friday, December 19, 2014 9:02 AM
> To: dev@ctakes.apache.org<mailto:dev@ctakes.apache.org>
> Subject: Re: cTakes Annotation Comparison
>
> Thanks for this, Bruce! Very interesting work. It confirms what I've seen in my small tests that I've done in a non-systematic way. Did you happen to capture the number of false positives yet (annotations made by cTAKES that are not in the human adjudicated standard)? I've seen a lot of dictionary hits that are not actually entity mentions, but I haven't had a chance to do a systematic analysis (we're working on our annotated gold standard now). One great example is the antibiotic "Today".
> Every time the word today appears in any text it is annotated as a medication mention when it almost never is being used in that sense.
>
> These results by themselves are quite disappointing to me. Both the UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor recall. It seems like the trade off for more speed is a ten-fold (or more) decrease in entity recognition.
>
> Thanks again for sharing your results with us. I think they are very useful to the project.
>
> - Dave
>
> On Thu, Dec 18, 2014 at 5:06 PM, Bruce Tietjen <bruce.tiet...@perfectsearchcorp.com<mailto:bruce.tiet...@perfectsearchcorp.com>> wrote:
>
> Actually, we are working on a similar tool to compare it to the human adjudicated standard for the set we tested against. I didn't mention it before because the tool isn't complete yet, but initial results for the set (excluding those marked as "CUI-less") was as follows:
>
> Human adjudicated annotations: 4591 (excluding CUI-less)
>
> Annotations found matching the human adjudicated standard
>   UMLSProcessor      2245
>   FastUMLSProcessor  215
>
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com<mailto:bruce.tiet...@imatsolutions.com>
>
> On Thu, Dec 18, 2014 at 3:37 PM, Chen, Pei <pei.c...@childrens.harvard.edu<mailto:pei.c...@childrens.harvard.edu>> wrote:
>
> Bruce,
> Thanks for this-- very useful.
> Perhaps Sean Finan comment more-
> but it's also probably worth it to compare to an adjudicated human annotated gold standard.
>
> --Pei
>
> -----Original Message-----
> From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
> Sent: Thursday, December 18, 2014 1:45 PM
> To: dev@ctakes.apache.org<mailto:dev@ctakes.apache.org>
> Subject: cTakes Annotation Comparison
>
> With the recent release of cTakes 3.2.1, we were very interested in checking for any differences in annotations between using the AggregatePlaintextUMLSProcessor pipeline and the AggregatePlaintextFastUMLSProcessor pipeline within this release of cTakes with its associated set of UMLS resources.
>
> We chose to use the SHARE 14-a-b Training data that consists of 199 documents (Discharge 61, ECG 54, Echo 42 and Radiology 42) as the basis for the comparison.
>
> We decided to share a summary of the results with the development community.
>
> Documents Processed: 199
>
> Processing Time:
>   UMLSProcessor      2,439 seconds
>   FastUMLSProcessor  1,837 seconds
>
> Total Annotations Reported:
>   UMLSProcessor      20,365 annotations
>   FastUMLSProcessor  8,284 annotations
>
> Annotation Comparisons:
>   Annotations common to both sets:                     3,940
>   Annotations reported only by the UMLSProcessor:      16,425
>   Annotations reported only by the FastUMLSProcessor:  4,344
>
> If anyone is interested, following was our test procedure:
>
> We used the UIMA CPE to process the document set twice, once using the AggregatePlaintextUMLSProcessor pipeline and once using the AggregatePlaintextFastUMLSProcessor pipeline. We used the WriteCAStoFile CAS consumer to write the results to output files.
>
> We used a tool we recently developed to analyze and compare the annotations generated by the two pipelines. The tool compares the two outputs for each file and reports any differences in the annotations (MedicationMention, SignSymptomMention, ProcedureMention, AnatomicalSiteMention, and DiseaseDisorderMention) between the two output sets. The tool reports the number of 'matches' and 'misses' between each annotation set.
> A 'match' is defined as the presence of an identified source text interval with its associated CUI appearing in both annotation sets. A 'miss' is defined as the presence of an identified source text interval and its associated CUI in one annotation set, but no matching identified source text interval and CUI in the other. The tool also reports the total number of annotations (source text intervals with associated CUIs) reported in each annotation set. The compare tool is in our GitHub repository at https://github.com/perfectsearch/cTAKES-compare
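
The match/miss definition quoted above amounts to set operations over (source text interval, CUI) pairs. Here is a minimal Python sketch under that reading. Annotation and compare are hypothetical names, not taken from the cTAKES-compare tool, and the offsets in the usage lines are made up for illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    begin: int  # start offset of the source text interval
    end: int    # end offset of the source text interval
    cui: str    # concept identifier attached to that interval


def compare(first, second):
    """Split two annotation sets into matches and misses.

    A match is an (interval, CUI) pair present in both sets. A miss is a
    pair present in one set with no identical pair in the other.
    """
    a, b = set(first), set(second)
    return {
        "matches": a & b,
        "only_first": a - b,
        "only_second": b - a,
    }


# Made-up offsets, for illustration only.
umls = [Annotation(10, 36, "C000804"), Annotation(50, 70, "C000864")]
fast = [Annotation(10, 36, "C000804")]
result = compare(umls, fast)
print(len(result["matches"]), len(result["only_first"]), len(result["only_second"]))
# prints: 1 1 0
```

The counts in the summary are consistent with this definition: 3,940 + 16,425 = 20,365 for the UMLSProcessor and 3,940 + 4,344 = 8,284 for the FastUMLSProcessor.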