Re: cTakes output predictability

Kim Ebert Tue, 07 Oct 2014 11:50:52 -0700

Hi Bruce,

Could you send the record over that you are seeing this on?


Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 11:20 AM, Bruce Tietjen wrote:
> I did not intend to step on anyone's toes.
>
> One of the reasons I proposed the changes was to try to make it extremely
> obvious when there are significant differences in output from the cTakes
> pipeline when running the same document again, and once identified, make it
> easier to identify the source of the difference.
>
> Because of the huge number of differences between the output using the
> FileWriterCasConsumer.xml, first detecting that there are significant
> differences and then identifying them for a large set of documents is a daunting
> task.
>
> The following is an example of some significant differences that I have
> detected between two subsequent runs on the same document using the current
> release of cTakes. (There are actually quite a few documents that exhibit
> this kind of behavior. This is only one example.)
>
>
> Snippet from first run:
>
>     <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
> _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
> _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
> _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
> _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>
>
> Snippet from subsequent run:
>
>     <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
> _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
> _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
> _indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>
>
>
> Note that in the first instance, there were two MedicationMentions, but in
> the second, there is only one.
>
> Yes, everyone could write their own custom compare code, but wouldn't it be
> more valuable to the community to make that task easier?
>
> Thanks,
>
> Bruce Tietjen
>
>
>
>  IMAT Solutions <http://imatsolutions.com>
>  Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert <kim.eb...@perfectsearchcorp.com>
> wrote:
>
>> Hi Sean,
>>
>> No, you're not a jerk. These are things worth considering, and I
>> understand your concerns with touching various points of the codebase.
>>
>> I'll talk with our group over here and see where we want to go. We are
>> really interested in cTakes behaving well, so we are usually pretty
>> careful in testing our changes before committing anything.
>>
>> Thanks,
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 10:46 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>>> It concerns me a bit that making the code return consistent results would
>>>> be so concerning.
>>> Could you please clarify what you mean by "consistent results"?  Do you
>>> mean ordering and IDs or are you talking about actual type values not
>>> matching?
>>>> This should be the default mode of operation.
>>> Depending upon what you meant above, I may agree or disagree.
>>>
>>>> Since it doesn't appear that there are any consequences with moving
>>>> forward with changing the code
>>> Why do you say this?
>>>
>>> I think that there may be more required changes than you realize.  Every
>>> insertion into the CAS must be of ordered data.  This means that, for
>>> instance, named entities discovered by dictionary will need to be inserted
>>> in some predictable order, such as by alphabetized cui per every
>>> alphabetized tui (and other code) per ordered text span.  You will need to
>>> check and recheck every point at which the CAS is modified by every
>>> module.  Right now there are at least three or four places in two cTakes
>>> dictionary modules where a change would be required - and that doesn't
>>> include YTEX lookup.
>>>
>>> If you really feel strongly about this and are going to change cTakes
>>> code, then I suggest (at the risk of sounding like a complete jerk) that
>>> you also consider the following:
>>> 1.  Don't check anything into trunk until all is well with your changes
>>>     and tests
>>>       Just in case you abandon the effort
>>> 2.  Write unit tests for every change
>>>       True, Map to LinkedMap shouldn't break anything, but they are good to
>>>       have, and may prevent others in the future from switching back to a
>>>       non-linked map or any unordered collection (set not list, etc.).  It
>>>       also makes a better place for explanation in Javadoc than inlines
>>>       above the code.
>>> 3.  Run memory requirement tests before all of your changes and then
>>>     again after your changes
>>>       I'm actually curious about how much memory might be eaten with
>>>       linkages everywhere
>>> 4.  Run performance (speed) tests before and after
>>>       On a large corpus to ensure that garbage collection is involved
>>> 5.  Do the above with every combination possible in current workflows:
>>>     every combination of available sentence detector, pos tagger, smoking
>>>     status detector, dictionary lookup, cas consumer, etc.
>>>       As soon as somebody says "all output is consistently ordered between
>>>       runs" it had better be so for every possible workflow
>>> 6.  Write system tests to ensure ordered/predicted outputs with each
>>>     combination
>>>       Otherwise somebody may break it
>>> 7.  Document the what, how, and why for future development
>>>       Otherwise somebody won't know to stick to the new rules
>>> 8.  Assist anybody as needed that in the future breaks one of these unit
>>>     or system tests with a fix or new feature
>>>       By mandating such a rule you are assuming responsibility for it
>>> 9.  Assist anybody as needed that in the future adds a new module or
>>>     workflow to cTakes to abide by the ordering requirement
>>>       By mandating such a rule you are assuming responsibility for it
>>> 10. Assist anybody as needed that in the future adds a new module or
>>>     workflow to add system tests to ensure maintenance of the ordering
>>>     requirement
>>>       By mandating such a rule you are assuming responsibility for it
>>>
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 11:57 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> I think we may really prefer the first method. Since it doesn't appear
>>> that there are any consequences with moving forward with changing the code,
>>> we would really like to move forward with this approach.
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>>> The option Sean mentioned of writing your own custom consumer (without
>>>> the UIMA id that is causing your issues) should meet these needs I
>>>> believe.
>>>>
>>>>
>>>>
>>>> Britt Fitch
>>>> Wired Informatics
>>>> 265 Franklin St Ste 1702
>>>> Boston, MA 02110
>>>> http://wiredinformatics.com
>>>> britt.fi...@wiredinformatics.com
>>>>
>>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>>>> <kim.eb...@perfectsearchcorp.com
>>>> <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Well of course that makes plenty of sense. Testing different cTakes
>>>>> configurations you would expect different output. In our testing
>>>>> we've found several cases where running with the same configuration
>>>>> outputs different data under different moons. Having consistent
>>>>> results helps us know if we've made improvements to our quality or
>>>>> not. Having output that is in a predictable order makes checking to
>>>>> see if there are differences much cheaper when you are dealing with
>>>>> larger data sets.
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>>> Hi Kim,
>>>>>>
>>>>>> One might want to compare the Sentence detector that uses end of line
>>>>>> characters as sentence splitters with one that does not.  Such a
>>>>>> change in sentence splitting would not only affect the sentence type
>>>>>> discoveries but also practically every type that follows.
>>>>>>
>>>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>>>> CUI differences might be.  There are changes in two words vs. one,
>>>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>>>> in CUIs.
>>>>>>
>>>>>> Of course, if you are just running notes on a new moon and then
>>>>>> again on a full moon ...
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> Sean,
>>>>>>
>>>>>> "...being different because of a possibly intentional difference."
>>>>>>
>>>>>> I would like you to elaborate a bit on what would be
>>>>>> intentionally different between the processing of the same document
>>>>>> multiple times. It would help my understanding of cTakes.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Kim Ebert
>>>>>> 1.801.669.7342
>>>>>> Perfect Search Corp
>>>>>> http://www.perfectsearchcorp.com/
>>>>>>
>>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>>> Steve Bethard wrote:
>>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>>>> this type of approach.  Comparison of program output is a
>>>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>>>> data and metadata belongs there.  Attempts to force every module
>>>>>>> past, present and future to abide by fixed orderings, enumerations
>>>>>>> etc. is not as simple a task as one might initially think -
>>>>>>> especially if third-party libraries are involved.  I won't get into
>>>>>>> problems associated with why one is comparing output (swapped
>>>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>>>> intentional difference.
>>>>>>>
>>>>>>> In addition to or instead of creating a post-processing script, one
>>>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>>>> format - but this should not require changes to engines.
>>>>>>>
>>>>>>> "If it ain't broke, don't fix it"
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Bethard [mailto:steven.beth...@gmail.com]
>>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>>> To: dev@ctakes.apache.org
>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>
>>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>>>> <bruce.tiet...@perfectsearchcorp.com> wrote:
>>>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>>>> difficult to compare the output between subsequent runs on the
>>>>>>>> same files because annotations are often assigned different IDs,
>>>>>>>> are listed in different order, etc.
>>>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>>>> that intended to address some of these kinds of issues. It's still
>>>>>>> here in cTAKES:
>>>>>>>
>>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>>
>>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>>
>>>>>>> Steve
>>
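
A minimal sketch of the kind of post-process comparison discussed in the
thread above. It is not the CompareFeatureStructures.java script that Steve
mentions, and it is not part of cTAKES. The function names (normalize,
load_annotations, diff_runs) are hypothetical. It assumes the output is
XCAS-style XML with a single root element (written as <CAS> here), and that
attributes beginning with an underscore (_id, _indexed, _ref_sofa,
_ref_ontologyConceptArr) are serializer bookkeeping that may legitimately
differ between runs. The sample data reuses the begin/end offsets from the
snippets above with a trimmed attribute set.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def normalize(elem):
    """Reduce an annotation element to (tag, sorted non-bookkeeping attributes).

    Assumption: attributes starting with "_" are run-specific bookkeeping.
    """
    attrs = tuple(sorted(
        (key, value) for key, value in elem.attrib.items()
        if not key.startswith("_")
    ))
    return (elem.tag, attrs)


def load_annotations(xml_text):
    """Return a multiset of normalized annotations, ignoring document order."""
    root = ET.fromstring(xml_text)
    return Counter(normalize(e) for e in root.iter() if e is not root)


def diff_runs(first_xml, second_xml):
    """Return (annotations only in the first run, only in the second run)."""
    first = load_annotations(first_xml)
    second = load_annotations(second_xml)
    only_first = sorted((first - second).elements())
    only_second = sorted((second - first).elements())
    return only_first, only_second


MED = "org.apache.ctakes.typesystem.type.textsem.MedicationMention"

RUN_1 = f"""<CAS>
  <{MED} _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95" generic="true"/>
  <{MED} _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110" generic="false"/>
</CAS>"""

RUN_2 = f"""<CAS>
  <{MED} _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95" generic="true"/>
</CAS>"""

only_first, only_second = diff_runs(RUN_1, RUN_2)
for tag, attrs in only_first:
    print("only in first run:", tag.rsplit(".", 1)[-1], dict(attrs))
for tag, attrs in only_second:
    print("only in second run:", tag.rsplit(".", 1)[-1], dict(attrs))
```

Because the comparison uses a multiset keyed on type and feature values, a
change in _id numbering or element order alone produces no differences, while
the MedicationMention at 2312-2322 that is missing from the second run is
reported. Features that reference other feature structures are dropped by the
underscore rule, so differences in the referenced ontology concepts would
need separate handling.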
