RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Steve Bethard wrote:
 I spent some time writing a script for diff-ing CASes

I urge anyone interested in comparing cTakes CASes / output to use this type of 
approach.  Comparison of program output is a post-process task, and unless 
absolutely necessary, code to juggle data and metadata belongs there.  Attempting 
to force every module past, present and future to abide by fixed orderings, 
enumerations etc. is not as simple a task as one might initially think - 
especially if third-party libraries are involved.  I won't get into problems 
associated with why one is comparing output (swapped module?) and IDs, orders 
etc. being different because of a possibly intentional difference.
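
To make the post-processing idea concrete, here is a minimal sketch in Python. It is not taken from cTAKES or from CompareFeatureStructures.java; the record fields (id, begin, end, type, cui) and the CUI value are assumptions made up for illustration. It compares two runs while ignoring IDs and listing order:

```python
from collections import Counter

def canonical(annotations):
    """Reduce each annotation to the fields worth comparing, ignoring IDs."""
    return Counter(
        (a["begin"], a["end"], a["type"], a.get("cui", ""))
        for a in annotations
    )

def diff_runs(run_a, run_b):
    """Return (only_in_a, only_in_b) as sorted lists of canonical tuples."""
    ca, cb = canonical(run_a), canonical(run_b)
    return sorted((ca - cb).elements()), sorted((cb - ca).elements())

# Hypothetical records; "C0000001" is a placeholder, not a real CUI lookup.
run_1 = [
    {"id": 17, "begin": 0, "end": 11, "type": "DiseaseDisorderMention", "cui": "C0000001"},
    {"id": 42, "begin": 19, "end": 27, "type": "AnatomicalSiteMention"},
]
# Same content as run_1, but with different IDs and a different listing order.
run_2 = [
    {"id": 903, "begin": 19, "end": 27, "type": "AnatomicalSiteMention"},
    {"id": 5, "begin": 0, "end": 11, "type": "DiseaseDisorderMention", "cui": "C0000001"},
]

print(diff_runs(run_1, run_2))  # ([], [])
```

Using a Counter rather than a set keeps genuinely duplicated annotations visible in the comparison.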

In addition to or instead of creating a post-processing script, one could write 
a new cas-consumer that writes output in a desired format - but this should 
not require changes to engines.
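
A real cas-consumer would be a Java UIMA component, so the following Python is only a sketch of the output format such a consumer might write. The function names and record fields are hypothetical. The idea is that sorting the rows and leaving out run-specific IDs gives identical text for identical content:

```python
import io

def write_stable(annotations, out):
    """Write one tab-separated line per annotation, sorted, without run-specific IDs."""
    rows = sorted(
        (a["begin"], a["end"], a["type"], a["text"]) for a in annotations
    )
    for begin, end, type_name, text in rows:
        out.write(f"{begin}\t{end}\t{type_name}\t{text}\n")

def render(annotations):
    """Convenience wrapper: return the stable rendering as a string."""
    buf = io.StringIO()
    write_stable(annotations, buf)
    return buf.getvalue()

# Hypothetical output of two runs over the same note.
first = [
    {"id": 17, "begin": 0, "end": 11, "type": "DiseaseDisorderMention", "text": "skin cancer"},
    {"id": 42, "begin": 19, "end": 27, "type": "AnatomicalSiteMention", "text": "left arm"},
]
second = [
    {"id": 903, "begin": 19, "end": 27, "type": "AnatomicalSiteMention", "text": "left arm"},
    {"id": 5, "begin": 0, "end": 11, "type": "DiseaseDisorderMention", "text": "skin cancer"},
]

print(render(first) == render(second))  # True
```

Output written this way can be compared with an ordinary text diff, and no engine has to change.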

If it ain't broke, don't fix it

Sean


-Original Message-
From: Steven Bethard [mailto:steven.beth...@gmail.com] 
Sent: Monday, October 06, 2014 11:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
bruce.tiet...@perfectsearchcorp.com wrote:
 Since I started working with cTakes some time ago, I have found it
 difficult to compare the output between subsequent runs on the same files
 because annotations are often assigned different IDs, are listed in
 different order, etc.

At one point, I spent some time writing a script for diff-ing CASes
that was intended to address some of these kinds of issues. It's still
here in cTAKES:

ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

You might see if you could use or adapt that to your needs.

Steve


RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

One might want to compare the Sentence detector that uses end of line characters 
as sentence splitters with one that does not.  Such a change in sentence 
splitting would not only affect the sentence type discoveries but also 
practically every type that follows.

Another might want to compare a note with "skin cancer" vs. one in which you 
replace "skin cancer" with "melanoma" just to see what the CUI differences 
might be.  There are changes in two words vs. one, 11 characters vs. 8, a 
removed adjective(?), and of course changes in CUIs.
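
As a small worked illustration of the character counts above, the Python sketch below uses a made-up sentence (an assumption, not text from any real note). It shows that the replacement also moves the offsets of everything after it, so spans from the two runs will not line up even where the text is otherwise unchanged:

```python
original = "History of skin cancer on the left arm."
edited = original.replace("skin cancer", "melanoma")

def span_of(text, phrase):
    """Return (begin, end) character offsets of the first occurrence of phrase."""
    begin = text.index(phrase)
    return begin, begin + len(phrase)

shift = len("melanoma") - len("skin cancer")  # 8 - 11 = -3

print(span_of(original, "left arm"))  # (30, 38)
print(span_of(edited, "left arm"))    # (27, 35)
print(shift)                          # -3
```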

Of course, if you are just running notes on a new moon and then again on a full 
moon ...

Sean

-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 10:41 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Sean,

...being different because of a possibly intentional difference.

I would like you to elaborate a bit on what would be intentionally 
different between the processing of the same document multiple times. It would 
help my understanding of cTakes.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


Re: cTakes output predictability

2014-10-07 Thread britt fitch
The option Sean mentioned of writing your own custom consumer (without the UIMA 
id that is causing your issues) should meet these needs, I believe.

 
Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com



Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean,

Well of course that makes plenty of sense. When testing different cTakes
configurations, you would expect different output. In our testing we've
found several cases where running with the same configuration outputs
different data under different moons. Having consistent results helps us
know if we've made improvements to our quality or not. Having output
that is in a predictable order makes checking to see if there are
differences much cheaper when you are dealing with larger data sets.
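
One way to keep that check cheap on larger data sets, without depending on the order in which the pipeline emits annotations, is to hash a canonical rendering of each document and compare only the digests. The Python below is a sketch under assumed record fields and made-up file names, not an existing cTAKES facility:

```python
import hashlib

def digest(annotations):
    """Hash a canonical, order-independent rendering of one document's annotations."""
    lines = sorted(f"{a['begin']}\t{a['end']}\t{a['type']}" for a in annotations)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

def changed_documents(run_a, run_b):
    """Names of documents whose digests differ between two runs.

    A document missing from one run is treated as having no annotations.
    """
    names = sorted(set(run_a) | set(run_b))
    return [n for n in names if digest(run_a.get(n, [])) != digest(run_b.get(n, []))]

# Hypothetical results keyed by document name.
baseline = {
    "note1.txt": [
        {"id": 1, "begin": 0, "end": 11, "type": "DiseaseDisorderMention"},
        {"id": 2, "begin": 19, "end": 27, "type": "AnatomicalSiteMention"},
    ],
    "note2.txt": [{"id": 3, "begin": 5, "end": 13, "type": "DiseaseDisorderMention"}],
}
rerun = {
    # Same content as the baseline, different IDs and order.
    "note1.txt": [
        {"id": 77, "begin": 19, "end": 27, "type": "AnatomicalSiteMention"},
        {"id": 78, "begin": 0, "end": 11, "type": "DiseaseDisorderMention"},
    ],
    # A real difference: the span ends one character later.
    "note2.txt": [{"id": 79, "begin": 5, "end": 14, "type": "DiseaseDisorderMention"}],
}

print(changed_documents(baseline, rerun))  # ['note2.txt']
```

Only the documents flagged this way need a full annotation-by-annotation diff.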

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
I think we may really prefer the first method. Since it doesn't appear
that there are any consequences with moving forward with changing the
code, we would really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


RE: cTakes output predictability

2014-10-07 Thread Masanz, James J.
FWIW, I agree with Sean that comparing should be a post-processing step and 
trying to get UIMA internal IDs to match on subsequent runs is not worth 
opening the code for.


Re: cTakes output predictability

2014-10-07 Thread britt fitch
I think changing the code raises at least some concerns of affecting others, 
while adding a custom consumer raises zero. Given how easy it is to write a 
custom consumer, that is my vote. 

 
Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
britt.fi...@wiredinformatics.com



Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
I think it would be helpful actually, as digging deeper into the issue
has highlighted to me a few places in the code that actually cause
inconsistent results to be returned when running the same document
through multiple times. I think having the code base be predictable will
make it easier to debug.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:58 AM, Masanz, James J. wrote:
 FWIW, I agree with Sean that comparing should be a post-processing step and 
 trying to get UIMA internal IDs to match on subsequent runs is not worth 
 opening the code for.


Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
It concerns me a bit that making the code return consistent results would
be so concerning. This should be the default mode of operation.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:59 AM, britt fitch wrote:
 I think changing the code raises at least some concerns of affecting
 others, while adding a custom consumer raises zero. Given how easy it
 is to write a custom consumer, that is my vote. 



 Britt Fitch
 Wired Informatics
 265 Franklin St Ste 1702
 Boston, MA 02110
 http://wiredinformatics.com
 britt.fi...@wiredinformatics.com

 On Oct 7, 2014, at 11:56 AM, Kim Ebert
 kim.eb...@perfectsearchcorp.com
 mailto:kim.eb...@perfectsearchcorp.com wrote:

 I think we may really prefer the first method. Since it doesn't appear
 that there are any consequences with moving forward with changing the
 code, we would really like to move forward with this approach.

 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 09:35 AM, britt fitch wrote:
 The option Sean mentioned of writing your own custom consumer (without
 the UIMA id that is causing your issues) should meet these needs I
 believe. 

   

 Britt Fitch
 Wired Informatics
 265 Franklin St Ste 1702
 Boston, MA 02110
 http://wiredinformatics.com
 britt.fi...@wiredinformatics.com

 On Oct 7, 2014, at 11:29 AM, Kim Ebert
 kim.eb...@perfectsearchcorp.com
 mailto:kim.eb...@perfectsearchcorp.com wrote:

 Hi Sean,

 Well of course that makes plenty of sense. When testing different cTakes
 configurations you would expect different output. In our testing we've
 found several cases where running with the same configuration outputs
 different data under different moons. Having consistent results helps us
 know if we've made improvements to our quality or not. Having output
 that is in a predictable order makes checking to see if there are
 differences much cheaper when you are dealing with larger data sets.

 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 08:50 AM, Finan, Sean wrote:
 Hi Kim,

 One might want to compare the Sentence detector that uses end of line
 characters as sentence splitters with one that does not.  Such a
 change in sentence splitting would not only affect the sentence type
 discoveries but also practically every type that follows.

 Another might want to compare a note with "skin cancer" vs. one in
 which you replace "skin cancer" with "melanoma" just to see what the
 CUI differences might be.  There are changes in two words vs. one,
 11 characters vs. 8, a removed adjective(?), and of course changes
 in CUIs.

 Of course, if you are just running notes on a new moon and then
 again on a full moon ...

 Sean

 -Original Message-
 From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
 Sent: Tuesday, October 07, 2014 10:41 AM
 To: dev@ctakes.apache.org
 Subject: Re: cTakes output predictability

 Sean,

 ...being different because of a possibly intentional difference.

 I would like you to elaborate a bit on what would be
 intentionally different between the processing of the same document
 multiple times. It would help my understanding of cTakes.

 Thanks,

 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/


Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Jay,

I agree. This does lead to reproducible unit tests, which helps us out
in the long term.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/06/2014 05:38 PM, jay vyas wrote:
 I'm not a cTakes expert by any means, but in general I like that idea:
 predictable and deterministic ordering of mapped elements almost always
 leads to less buggy applications, as Groovy has shown (LinkedHashMap is
 its default map data structure, and it's much easier, imo, to get
 reproducible Groovy unit tests etc. because of that).


 On Mon, Oct 6, 2014 at 4:59 PM, Bruce Tietjen 
 bruce.tiet...@perfectsearchcorp.com wrote:

 Since I started working with cTakes some time ago, I have found it
 difficult to compare the output between subsequent runs on the same files
 because annotations are often assigned different IDs, are listed in
 different order, etc.

 One area that seems to be a cause for at least some of these differences is
 the common use of HashMap where enumerating the contents is not guaranteed
 to return items in the same order they were added.

 I would like to work towards addressing this issue by changing those areas
 of the code where it matters to use a LinkedHashMap instead.

 Is this something the community would be interested in and find helpful?

 Thanks,

 Bruce Tietjen
 Perfect Search Corp.
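
Editor's sketch of the HashMap vs. LinkedHashMap point raised above. It is only an illustration, not cTAKES code: the class name, method name and the three key strings are made up for the example. LinkedHashMap guarantees that iteration follows insertion order; HashMap makes no ordering promise at all, and with keys whose hash codes are identity-based (objects that do not override hashCode) its order can plausibly change from run to run.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapOrderSketch {

    // Adds three hypothetical keys in a fixed order and returns the keys
    // in the order the map iterates over them.
    public static List<String> keysAfterInsert(Map<String, Integer> map) {
        map.put("C0027051", 1);
        map.put("C0004057", 2);
        map.put("C0011849", 3);
        return new ArrayList<String>(map.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap: iteration order is the insertion order.
        System.out.println(keysAfterInsert(new LinkedHashMap<String, Integer>()));
        // HashMap: iteration order is unspecified and need not match insertion order.
        System.out.println(keysAfterInsert(new HashMap<String, Integer>()));
    }
}
```

Both maps hold the same entries; only the enumeration order differs, which is why the swap is cheap to try but does not by itself fix a pipeline that produces different annotations.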






RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

 It concerns me a bit that making the code return consistent results would be 
 so concerning. 
Could you please clarify what you mean by consistent results?  Do you mean 
ordering and IDs or are you talking about actual type values not matching?

This should be the default mode of operation.
Depending upon what you meant above, I may agree or disagree.

 Since it doesn't appear that there are any consequences with moving forward 
 with changing the code
Why do you say this?  

I think that there may be more required changes than you realize.  Every 
insertion into the CAS must be of ordered data.  This means that, for instance, 
named entities discovered by dictionary will need to be inserted in some 
predictable order, such as by alphabetized cui per every alphabetized tui (and 
other code) per ordered text span.  You will need to check and recheck every 
point at which the CAS is modified by every module.  Right now there are at 
least three or four places in two cTakes dictionary modules where a change 
would be required - and that doesn't include YTEX lookup.
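
Editor's sketch of the ordering described above (text span first, then TUI, then CUI), assuming the candidates are sorted just before they are added to the CAS. The Hit class is a hypothetical stand-in, not a cTAKES or UIMA type, and the TUI/CUI strings are placeholder values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OrderedInsertSketch {

    // Hypothetical stand-in for one dictionary hit; not a cTAKES type.
    public static final class Hit {
        final int begin;
        final int end;
        final String tui;
        final String cui;

        public Hit(int begin, int end, String tui, String cui) {
            this.begin = begin;
            this.end = end;
            this.tui = tui;
            this.cui = cui;
        }

        @Override
        public String toString() {
            return begin + "-" + end + " " + tui + " " + cui;
        }
    }

    // Order by text span (begin, then end), then TUI, then CUI.
    public static final Comparator<Hit> INSERT_ORDER =
            Comparator.<Hit>comparingInt(h -> h.begin)
                    .thenComparingInt(h -> h.end)
                    .thenComparing(h -> h.tui)
                    .thenComparing(h -> h.cui);

    // Returns a sorted copy; a module would insert the hits in this order.
    public static List<Hit> inInsertOrder(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        sorted.sort(INSERT_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(2312, 2322, "T121", "C0000002"),
                new Hit(2075, 2081, "T121", "C0000003"),
                new Hit(2075, 2081, "T109", "C0000001"));
        for (Hit hit : inInsertOrder(hits)) {
            System.out.println(hit);
        }
    }
}
```

A sort like this would have to be repeated at every point where a module writes to the CAS, which is the cost being pointed out here.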

If you really feel strongly about this and are going to change cTakes code, 
then I suggest (at the risk of sounding like a complete jerk) that you also 
consider the following:
1.  Don't check anything into trunk until all is well with your changes and 
    tests.
    Just in case you abandon the effort.
2.  Write unit tests for every change.
    True, switching from HashMap to LinkedHashMap shouldn't break anything, 
    but the tests are good to have, and may prevent others in the future from 
    switching back to a non-linked map or any unordered collection (set not 
    list, etc.).  A test also makes a better place for explanation in Javadoc 
    than inline comments above the code.
3.  Run memory requirement tests before all of your changes and then again 
    after your changes.
    I'm actually curious about how much memory might be eaten with linkages 
    everywhere.
4.  Run performance (speed) tests before and after.
    On a large corpus, to ensure that garbage collection is involved.
5.  Do the above with every combination possible in current workflows: every 
    combination of available sentence detector, pos tagger, smoking status 
    detector, dictionary lookup, cas consumer, etc.
    As soon as somebody says all output is consistently ordered between runs, 
    it had better be so for every possible workflow.
6.  Write system tests to ensure ordered/predicted outputs with each 
    combination.
    Otherwise somebody may break it.
7.  Document the what, how, and why for future development.
    Otherwise somebody won't know to stick to the new rules.
8.  Assist, as needed, anybody who in the future breaks one of these unit or 
    system tests with a fix or new feature.
    By mandating such a rule you are assuming responsibility for it.
9.  Assist, as needed, anybody who in the future adds a new module or workflow 
    to cTakes to abide by the ordering requirement.
    By mandating such a rule you are assuming responsibility for it.
10. Assist, as needed, anybody who in the future adds a new module or workflow 
    to add system tests to ensure maintenance of the ordering requirement.
    By mandating such a rule you are assuming responsibility for it.



Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean,

No, you're not a jerk. These are things worth considering, and I
understand your concerns with touching various points of the codebase.

I'll talk with our group over here and see where we want to go. We are
really interested in cTakes behaving well, so we are usually pretty
careful in testing our changes before committing anything.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean,

Yes, I mean actual type values not matching.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


Re: cTakes output predictability

2014-10-07 Thread Bruce Tietjen

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
I'm just about sapped on this topic.  What comes below is my final writing.

Kim wrote:
Yes, I mean actual type values not matching.

Ok, this is a very serious problem and should have nothing to do with ordering 
and/or IDs.  I repeat: this should have nothing to do with ordering or ids.  
Reordering or changing ID assignment, while possibly producing repeatable 
output, will not necessarily fix the actual bug.  Please write a Jira for each 
item, and (imo) we should think about withholding any non-bug-fix release until 
they have been dealt with.

Bruce wrote:
 I did not intend to step on anyone's toes.
No worries - I don't think that any toes have been stepped upon. It is good 
that questions and concerns are shared with the group.  

 Note that in the first instance, there were two MedicationMentions, but in 
 the second, there is only one.
Assuming that the second drug mention doesn't appear elsewhere in output2 then 
this needs to be addressed.  Please log a tar.  Relating this to the order/id 
issue, which number of mentions is correct (2)?  If you reorder will that 
consistently output two medications instead of one or one medication instead of 
two?  This is most likely a bug in the identification and/or storage and/or 
retrieval code and needs to be fixed there.

Yes, everyone could write their own custom compare code, but wouldn't it be 
more valuable to the community to make that task easier?

I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations 
could be started and people could add to it as needed.  I would also hope that 
a reusable post-process comparison utility could be started and 
improved/maintained.

Sean
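
Editor's sketch of what a post-process comparison along these lines might look like, in plain Java with no UIMA dependency. It assumes each annotation has been dumped as one line of whitespace-separated key=value tokens, similar to the snippets quoted in this thread; the class and method names are hypothetical, and the sample lines are shortened. Tokens that carry run-specific IDs (_id=..., _ref_...=...) are dropped, the remaining lines are sorted, and whatever appears in one run but not the other is reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OutputDiffSketch {

    // Removes tokens that differ between runs for bookkeeping reasons only.
    public static String normalize(String annotation) {
        StringBuilder kept = new StringBuilder();
        for (String token : annotation.trim().split("\\s+")) {
            if (token.startsWith("_id=") || token.startsWith("_ref_")) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    // Normalizes every line and sorts the result so order no longer matters.
    public static List<String> normalizeAll(List<String> annotations) {
        List<String> result = new ArrayList<String>();
        for (String annotation : annotations) {
            result.add(normalize(annotation));
        }
        Collections.sort(result);
        return result;
    }

    // Normalized entries of 'first' that have no counterpart in 'second'.
    public static List<String> missingFrom(List<String> first, List<String> second) {
        List<String> missing = normalizeAll(first);
        for (String entry : normalizeAll(second)) {
            missing.remove(entry); // removes one matching occurrence, if any
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> run1 = Arrays.asList(
                "MedicationMention _indexed=1 _id=9895 _ref_sofa=3 begin=2075 end=2081 id=95",
                "MedicationMention _indexed=1 _id=9937 _ref_sofa=3 begin=2312 end=2322 id=110");
        List<String> run2 = Arrays.asList(
                "MedicationMention _indexed=1 _id=15928 _ref_sofa=3 begin=2075 end=2081 id=95");
        // Prints the mention at 2312-2322, which run2 lacks.
        System.out.println(missingFrom(run1, run2));
    }
}
```

Dropping every _ref_ token also drops the link to the ontology concepts, so a fuller tool would need to follow those references and compare the referenced content rather than the reference numbers.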


-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 1:21 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I did not intend to step on anyone's toes.

One of the reasons I proposed the changes was to try to make it extremely 
obvious when there are significant differences in output from the cTakes 
pipeline when running the same document again, and once identified, make it 
easier to identify the source of the difference.

Because of the huge number of differences between the outputs when using the 
FileWriterCasConsumer.xml, first detecting that there are significant 
differences and then identifying them for a large set of documents is a 
daunting task.

The following is an example of some significant differences that I have 
detected between two subsequent runs on the same document using the current 
release of cTakes. (There are actually quite a few documents that exhibit this 
kind of behavior. This is only one example.)


Snippet from first run:

org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
_indexed=1 _id=9869 _ref_sofa=3 begin=3039 end=3047/
org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed=1 _id=9895 _ref_sofa=3 begin=2075 end=2081 id=95
_ref_ontologyConceptArr=9891 typeID=1 segmentID=SIMPLE_SEGMENT
discoveryTechnique=1 confidence=1.0 polarity=1 uncertainty=1
conditional=false generic=true subject=patient historyOf=0/
org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed=1 _id=9937 _ref_sofa=3 begin=2312 end=2322 id=110
_ref_ontologyConceptArr=9934 typeID=1 segmentID=SIMPLE_SEGMENT
discoveryTechnique=1 confidence=1.0 polarity=1 uncertainty=1
conditional=false generic=false subject=patient historyOf=0/
org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
_indexed=1 _id=9979 _ref_sofa=3 begin=0 end=4 id=0
_ref_ontologyConceptArr=9976 typeID=2 segmentID=SIMPLE_SEGMENT
discoveryTechnique=1 confidence=1.0 polarity=1 uncertainty=0
conditional=false generic=false subject=patient historyOf=0/


Snippet from subsequent run:

org.apache.ctakes.typesystem.type.textsem.ProcedureMention
_indexed=1 _id=15773 _ref_sofa=3 begin=2929 end=2933 id=125
_ref_ontologyConceptArr=15770 typeID=5 segmentID=SIMPLE_SEGMENT
discoveryTechnique=1 confidence=1.0 polarity=1 uncertainty=0
conditional=false generic=false subject=patient historyOf=0/
org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed=1 _id=15928 _ref_sofa=3 begin=2075 end=2081 id=95
_ref_ontologyConceptArr=15924 typeID=1 segmentID=SIMPLE_SEGMENT
discoveryTechnique=1 confidence=1.0 polarity=1 uncertainty=1
conditional=false generic=true subject=patient historyOf=0/
org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
_indexed=1 _id=15958 _ref_sofa=3 begin=0 end=5 id=0/


Note that in the first instance, there were two MedicationMentions, but in the 
second, there is only one.

Yes, everyone could write their own custom compare code, but wouldn't it be 
more valuable to the community to make that task easier?

Thanks,

Bruce Tietjen



Bruce Tietjen
Senior Software Engineer
IMAT Solutions, http://imatsolutions.com
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Tue, Oct 7, 2014 at 11:01 AM

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert

 -Original Message-
 From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
 Sent: Tuesday, October 07, 2014 11:57 AM
 To: dev@ctakes.apache.org
 Subject: Re: cTakes output predictability

 I think we may really prefer the first method. Since it doesn't appear
 that there are any consequences with moving forward with changing the code,
 we would really like to move forward with this approach.
 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 09:35 AM, britt fitch wrote:
 The option Sean mentioned of writing your own custom consumer (without
 the UIMA id that is causing your issues) should meet these needs I
 believe.



 Britt Fitch
 Wired Informatics
 265 Franklin St Ste 1702
 Boston, MA 02110
 http://wiredinformatics.com
 britt.fi...@wiredinformatics.com

 On Oct 7, 2014, at 11:29 AM, Kim Ebert
 kim.eb...@perfectsearchcorp.com
 mailto:kim.eb...@perfectsearchcorp.com wrote:

 Hi Sean,

 Well of course that makes plenty of sense. Testing different cTakes
 configurations you would expect different output. In our testing
 we've found several cases where running with the same configuration
 outputs different data under different moons. Having consistent
 results helps us know if we've made improvements to our quality or
 not. Having output that is in a predictable order makes checking to
 see if there are differences much cheaper when you are dealing with
 larger data sets.
 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 08:50 AM, Finan, Sean wrote:
 Hi Kim,

 One might want to compare the Sentence detector that uses end of line
 characters as sentence splitters with one that does not.  Such a
 change in sentence splitting would not only affect the sentence type
 discoveries but also practically every type that follows.

 Another might want to compare a note with "skin cancer" vs. one in
 which you replace "skin cancer" with "melanoma" just to see what the
 CUI differences might be.  There are changes in two words vs. one,
 11 characters vs. 8, a removed adjective(?), and of course changes
 in CUIs.

 Of course, if you are just running notes on a new moon and then
 again on a full moon ...

 Sean

 -Original Message-
 From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
 Sent: Tuesday, October 07, 2014 10:41 AM
 To: dev@ctakes.apache.org
 Subject: Re: cTakes output predictability

 Sean,

 ...being different because of a possibly intentional difference.

 I would like you to elaborate a bit on what would be
 intentionally different between the processing of the same document
 multiple times. It would help my understanding of cTakes.

 Thanks,

 Kim Ebert
 1.801.669.7342
 Perfect Search Corp
 http://www.perfectsearchcorp.com/

 On 10/07/2014 07:30 AM, Finan, Sean wrote:
 Steve Bethard wrote:
 I spent some time writing a script for diff-ing CASes
 I urge anyone

RE: cTakes output predictability

2014-10-07 Thread Finan, Sean
Hi Kim,

Great Catch!

I think that by now this thread may be discarded by most as spam.  So, I'm back 
(apologies - I know that you are tired of me by now).

I checked the code that you pointed to ...  I really dislike looking at older 
cTakes code because I'm filled with an overwhelming urge to refactor.

If I understand the code correctly (it could use some doc), it runs negation 
engines and then if any negation exists it creates a single hit signifying 
negation.  Like a heavyweight Boolean.   Unfortunately, as you know, because 
the collection "s" is a Set, the hit is built from whichever token the 
iterator happens to return first ...  

An isolated change here would probably be better than going through the entire 
code base and switching to LinkedHashMaps, Lists, etc. - plus it would fix your 
problem.

You could (for reuse by others, assuming that one doesn't already exist) create 
a singleton BaseTokenComparator implements Comparator<BaseToken> with 
something like:
   // Order by start offset, then by end offset
   public int compare( final BaseToken textSpan1, final BaseToken textSpan2 ) {
      if ( textSpan1.getStartOffset() != textSpan2.getStartOffset() ) {
         return textSpan1.getStartOffset() - textSpan2.getStartOffset();
      }
      return textSpan1.getEndOffset() - textSpan2.getEndOffset();
   }

And in NegationContextAnalyzer line ~48
   final List<NegationIndicator> negatorsList
         = new ArrayList<NegationIndicator>(
               _negIndicatorFSM.execute( fsmTokenList ) );
   if ( !negatorsList.isEmpty() ) {
      // Sort so that the same indicator is chosen on every run
      Collections.sort( negatorsList, BaseTokenComparator.getInstance() );
      return new ContextHit( negatorsList.get( 0 ).getStartOffset(),
                             negatorsList.get( 0 ).getEndOffset() );
   }

Or you could write a (faster) method to use in place of the List and Sort like:
   BaseToken getFirstTextSpan( final Iterable<? extends BaseToken> tokens ) {
      BaseToken firstToken = null;
      for ( BaseToken token : tokens ) {
         // Earliest start offset wins
         if ( firstToken == null
              || token.getStartOffset() < firstToken.getStartOffset() ) {
            firstToken = token;
            continue;
         }
         // Same start offset: the smaller end offset wins
         if ( token.getStartOffset() == firstToken.getStartOffset()
              && token.getEndOffset() < firstToken.getEndOffset() ) {
            firstToken = token;
         }
      }
      return firstToken;
   }
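
For anyone who wants to try the idea outside of cTakes, here is a 
self-contained sketch.  Span is a hypothetical stand-in for BaseToken (only 
the two offset getters are assumed), and SpanPicker / firstSpan are names made 
up for the example.  It picks the earliest-starting span, breaking ties on the 
smaller end offset, no matter what order the Set iterates in.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for BaseToken; only the offsets matter here.
final class Span {
   private final int start;
   private final int end;

   Span( final int start, final int end ) {
      this.start = start;
      this.end = end;
   }

   int getStartOffset() {
      return start;
   }

   int getEndOffset() {
      return end;
   }
}

final class SpanPicker {
   private SpanPicker() {
   }

   // Single pass: smallest start offset wins, ties go to the smaller end
   // offset.  Returns null for an empty collection.
   static Span firstSpan( final Iterable<Span> spans ) {
      Span first = null;
      for ( Span span : spans ) {
         if ( first == null
              || span.getStartOffset() < first.getStartOffset()
              || ( span.getStartOffset() == first.getStartOffset()
                   && span.getEndOffset() < first.getEndOffset() ) ) {
            first = span;
         }
      }
      return first;
   }

   public static void main( final String[] args ) {
      // The three spans reported in this thread for "I do not see any".
      final Set<Span> hits = new HashSet<Span>( Arrays.asList(
            new Span( 13, 16 ), new Span( 5, 16 ), new Span( 5, 8 ) ) );
      final Span first = firstSpan( hits );
      System.out.println( first.getStartOffset() + "-" + first.getEndOffset() );
   }
}
```

With those three spans the result is 5-8 regardless of iteration order, which 
is the property the isolated fix is after.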


Of course, a perfectly reasonable question to pose to the community is 
something like "Is the best stored negation context the first or largest or 
???"  Perhaps the first negator span isn't the most wanted for later use - 
perhaps it is the most-encompassing span so that multiple words can be reused.  
You could throw that out under a new thread title and perhaps the original 
authors or current users would speak up as to what might be best.  Personally I 
have no idea.
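
If the answer turned out to be "largest", the selection might look like the 
sketch below.  This is only my guess at what "largest" would mean (longest 
span, ties broken by the earlier start); TextRange and WidestPicker are 
hypothetical names, not cTakes classes.

```java
// Hypothetical stand-in for a negation indicator span.
final class TextRange {
   final int start;
   final int end;

   TextRange( final int start, final int end ) {
      this.start = start;
      this.end = end;
   }

   int length() {
      return end - start;
   }
}

final class WidestPicker {
   private WidestPicker() {
   }

   // Longest span wins; equal lengths are resolved by the earlier start, so
   // the choice does not depend on iteration order.  Returns null if empty.
   static TextRange widest( final Iterable<TextRange> ranges ) {
      TextRange best = null;
      for ( TextRange range : ranges ) {
         if ( best == null
              || range.length() > best.length()
              || ( range.length() == best.length()
                   && range.start < best.start ) ) {
            best = range;
         }
      }
      return best;
   }
}
```

For the spans 13-16, 5-16 and 5-8 this would keep 5-16.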

Anyway, great catch!

Sean


-Original Message-
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 3:11 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Hi all,

I'm not sure these should be classified as bugs. They look like design 
decisions made at some point, but they do have an impact on the consistency of 
the results. Whether they are right or not might be something to debate later 
down the road, but it would be nice to be consistent in the output.

For example, I have the following text.

I do not see any

Can result in the following ContextAnnotations:

org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed=1 _id=130 _ref_sofa=1 begin=*13* end=*16* id=0
typeID=0 discoveryTechnique=0 confidence=0.0 polarity=0
uncertainty=0 conditional=false generic=false historyOf=0
FocusText=I Scope=RIGHT/

or

org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed=1 _id=130 _ref_sofa=1 begin=*5* end=*16* id=0
typeID=0 discoveryTechnique=0 confidence=0.0 polarity=0
uncertainty=0 conditional=false generic=false historyOf=0
FocusText=I Scope=RIGHT/

or

org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed=1 _id=130 _ref_sofa=1 begin=*5* end=*8* id=0
typeID=0 discoveryTechnique=0 confidence=0.0 polarity=0
uncertainty=0 conditional=false generic=false historyOf=0
FocusText=I Scope=RIGHT/

Well, after doing some digging it turns out that 
org.apache.ctakes.necontexts.negation.NegationContextAnalyzer is to blame.

The code looks like the following:

public ContextHit analyzeContext(List<? extends Annotation> contextTokens,
        int scopeOrientation)
        throws AnalysisEngineProcessException {
    List<TextToken> fsmTokenList = wrapAsFsmTokens(contextTokens);

    try {
        Set<NegationIndicator> s =
                _negIndicatorFSM.execute(fsmTokenList);

        *if (s.size() > 0) {*
            NegationIndicator neg = s.iterator().next();
            *return new ContextHit(neg.getStartOffset(),
                    neg.getEndOffset());*
        } else {
            return null;
        }
    } catch (Exception e) {
        throw new AnalysisEngineProcessException(e

Re: cTakes output predictability

2014-10-07 Thread Kim Ebert
Hi Sean,

Alright, it seems that rather than doing the sorted approach, we want to
manage these individually. I'll create tickets on all of the items we
have found so far. This is just one example. Then maybe we can move our
discussion of how to solve each one to discussions around that ticket
instead of this really long email thread.

I just wanted to check which way we wanted to go on these.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/


Re: cTakes output predictability

2014-10-06 Thread jay vyas
I'm not a cTakes expert by any means, but in general I like that idea:
predictable and deterministic ordering of mapped elements almost always
leads to less buggy applications, as Groovy has shown (LinkedHashMap is its
default map data structure, and it's much easier, IMO, to get reproducible
Groovy unit tests etc. because of that).
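
A small Java sketch of the difference, for illustration only (OrderDemo and
the sample keys are made up): LinkedHashMap iterates in insertion order, while
HashMap makes no promise about order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OrderDemo {
   private OrderDemo() {
   }

   // Copies the keys in whatever order the given map iterates them.
   static List<String> keysInIterationOrder( final Map<String, Integer> map ) {
      return new ArrayList<String>( map.keySet() );
   }

   // Inserts the same three entries, in the same order, into any map.
   static Map<String, Integer> fill( final Map<String, Integer> map ) {
      map.put( "ProcedureMention", 1 );
      map.put( "MedicationMention", 2 );
      map.put( "DiseaseDisorderMention", 3 );
      return map;
   }

   public static void main( final String[] args ) {
      // Insertion order is guaranteed here.
      System.out.println( keysInIterationOrder(
            fill( new LinkedHashMap<String, Integer>() ) ) );
      // No guarantee here; the order depends on hash codes and table size.
      System.out.println( keysInIterationOrder(
            fill( new HashMap<String, Integer>() ) ) );
   }
}
```

With String keys the HashMap order is usually stable between runs on the same
JVM; run-to-run differences tend to show up when keys rely on the default
identity-based Object.hashCode().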


On Mon, Oct 6, 2014 at 4:59 PM, Bruce Tietjen 
bruce.tiet...@perfectsearchcorp.com wrote:

 Since I started working with cTakes some time ago, I have found it
 difficult to compare the output between subsequent runs on the same files
 because annotations are often assigned different IDs, are listed in
 different order, etc.

 One area that seems to be a cause for at least some of these differences is
 the common use of HashMap where enumerating the contents is not guaranteed
 to return items in the same order they were added.

 I would like to work towards addressing this issue by changing those areas
 of the code where it matters to use a LinkedHashMap instead.

 Is this something the community would be interested in and find helpful?

 Thanks,

 Bruce Tietjen
 Perfect Search Corp.




-- 
jay vyas


Re: cTakes output predictability

2014-10-06 Thread Britt Fitch
Before making changes to the data structure I think it would be good to
understand the use case.

Bruce, can you give a high level description of the issue you are
trying to solve?

Cheers,

Britt




Re: cTakes output predictability

2014-10-06 Thread Steven Bethard
On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
bruce.tiet...@perfectsearchcorp.com wrote:
 Since I started working with cTakes some time ago, I have found it
 difficult to compare the output between subsequent runs on the same files
 because annotations are often assigned different IDs, are listed in
 different order, etc.

At one point, I spent some time writing a script for diff-ing CASes
that intended to address some of these kinds of issues. It's still
here in cTAKES:

ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

You might see if you could use or adapt that to your needs.

Steve