[jira] [Commented] (MATH-1197) Incorrect Kolmogorov–Smirnov Statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284589#comment-14284589 ] Phil Steitz commented on MATH-1197: --- +1 on the patch > Incorrect Kolmogorov–Smirnov Statistic for two samples > --- > > Key: MATH-1197 > URL: https://issues.apache.org/jira/browse/MATH-1197 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.4.1 > Environment: Ubuntu 14.04 >Reporter: Danaja Thiyunuwan Maldeniya > Attachments: MATH-1197.patch > > > kolmogorovSmirnovTest(double[],double[]) against the samples given below > gives 5.699107852308316E-12 instead of 0.9793 (approx.) Traced the issue to > kolmogorovSmirnovStatistic(double[],double[]) which gives 0.49507389162561577 > instead of 0.064 (verified with ks.test in R and JDistlib) > double[] x = > {0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.202653,2.202653,2.202653 > > ,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653 > > ,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653 > > ,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,3.181199,3.181199,3.181199,3.181199,3.181199,3.181199,3.723539 > > 
,3.723539,3.723539,3.723539,4.383482,4.383482,4.383482,4.383482,5.320671,5.320671,5.320671,5.717284,6.964001,7.352165 > > ,8.710510,8.710510,8.710510,8.710510,8.710510,8.710510,9.539004,9.539004, > 10.720619, 17.726077, 17.726077, 17.726077, 17.726077 > ,22.053875 ,23.799144 ,27.355308 ,30.584960 ,30.584960 > ,30.584960, 30.584960, 30.751808}; > double[] y = > {0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00 > > ,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.202653 > > ,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,2.202653,3.061758,3.723539,5.628420,5.628420,5.628420,5.628420 > ,5.628420,6.916982,6.916982,6.916982, 10.178538, 10.178538, > 10.178538, 10.178538, 10.178538 }; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1197) Incorrect Kolmogorov–Smirnov Statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284576#comment-14284576 ]

Phil Steitz commented on MATH-1197:
---

Assuming whatever bugs in the D computation have been fixed, our exactP should actually be "exact." I could not make sense of, or find documentation for, what R does for small samples. Our code computes the exact distribution of the associated D statistic. I suspect that R does some kind of approximation. As you said, I think R also disallows ties.
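To make "exact distribution of the D statistic" concrete, here is a minimal sketch of one standard way to get an exact two-sample p-value when there are no ties: count the monotone lattice paths from (0,0) to (n,m) that never reach the critical distance. This is not the Commons Math exactP code; the class name ExactKsP, the method signature, and the requirement that d be an attainable value k/(n*m) are all assumptions made for illustration.

```java
final class ExactKsP {
    /**
     * Hypothetical helper: P(D >= d) under the null hypothesis for sample
     * sizes n and m, assuming no ties. Each ordering of the pooled sample is
     * a lattice path; a step right is an x observation, a step up is a y.
     * Assumes d is an attainable statistic value, i.e. a multiple of 1/(n*m).
     */
    static double exactP(double d, int n, int m) {
        // Work with the integer |i*m - j*n| instead of |i/n - j/m| to avoid rounding.
        final long k = Math.round(d * n * m);
        // all[i][j]: number of paths reaching (i,j).
        // below[i][j]: number of those whose distance stayed strictly below d.
        // Doubles hold these counts exactly only for small samples.
        final double[][] all = new double[n + 1][m + 1];
        final double[][] below = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                double a;
                double b;
                if (i == 0 && j == 0) {
                    a = 1;
                    b = 1;
                } else {
                    a = (i > 0 ? all[i - 1][j] : 0) + (j > 0 ? all[i][j - 1] : 0);
                    b = (i > 0 ? below[i - 1][j] : 0) + (j > 0 ? below[i][j - 1] : 0);
                }
                // A path touching a point at distance >= d no longer counts as "below".
                if (Math.abs((long) i * m - (long) j * n) >= k) {
                    b = 0;
                }
                all[i][j] = a;
                below[i][j] = b;
            }
        }
        return 1.0 - below[n][m] / all[n][m];
    }

    public static void main(String[] args) {
        // With n = m = 2 there are 6 orderings; only xxyy and yyxx give D = 1.
        System.out.println(exactP(1.0, 2, 2));
    }
}
```

With ties in the data the orderings are no longer equally informative, so this counting argument does not apply as written.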
[jira] [Commented] (MATH-1197) Incorrect Kolmogorov–Smirnov Statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284500#comment-14284500 ]

Thomas Neidhart commented on MATH-1197:
---

The exactP method also seems to have a problem when comparing it with the results from R. Take this example:

{code}
double[] x = new double[] { 0, 0, 0, 0, 1 };
double[] y = new double[] { 0, 0, 1, 1, 2, 3 };
final KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
System.out.println("p=" + test.kolmogorovSmirnovTest(x, y, true));
System.out.println("D=" + test.kolmogorovSmirnovStatistic(x, y));
System.out.println("approximateP=" + test.approximateP(test.kolmogorovSmirnovStatistic(x, y), x.length, y.length));
System.out.println("exactP=" + test.exactP(test.kolmogorovSmirnovStatistic(x, y), x.length, y.length, false));
{code}

returns:

{noformat}
p=0.35714285714285715
D=0.46673
approximateP=0.5925028311389975
exactP=0.4155844155844156
{noformat}

R computes the following:

{noformat}
data: x and y
D = 0.4667, p-value = 0.5925
alternative hypothesis: two-sided
{noformat}
[jira] [Commented] (MATH-1197) Incorrect Kolmogorov–Smirnov Statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284003#comment-14284003 ]

Phil Steitz commented on MATH-1197:
---

Yes, this is a bug. Arrays.binarySearch should not have been used here.
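For illustration only, a tie-safe way to compute the two-sample statistic without Arrays.binarySearch is to sort both samples and walk them together, advancing past every copy of the current value in both arrays before comparing the two empirical CDFs. This is a sketch under stated assumptions, not the attached patch: the class name TwoSampleKs and the method statistic are hypothetical, and the input checks are a choice made here.

```java
import java.util.Arrays;

final class TwoSampleKs {
    /**
     * Hypothetical helper: sup over v of |F_x(v) - F_y(v)|, where F is the
     * empirical CDF (fraction of values <= v). Ties within and across the
     * samples are handled by consuming all equal values before comparing.
     */
    static double statistic(double[] x, double[] y) {
        if (x.length == 0 || y.length == 0) {
            throw new IllegalArgumentException("samples must be non-empty");
        }
        final double[] sx = x.clone();
        final double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        final int n = sx.length;
        final int m = sy.length;
        // Arrays.sort places NaN last, so checking the final element is enough.
        if (Double.isNaN(sx[n - 1]) || Double.isNaN(sy[m - 1])) {
            throw new IllegalArgumentException("NaN is not supported");
        }
        int i = 0;          // number of x values <= current v
        int j = 0;          // number of y values <= current v
        long maxScaled = 0; // max of |i*m - j*n|, i.e. n*m times the CDF gap
        while (i < n || j < m) {
            // Next distinct value in the pooled sample.
            final double v = (j >= m || (i < n && sx[i] <= sy[j])) ? sx[i] : sy[j];
            while (i < n && sx[i] == v) {
                i++;
            }
            while (j < m && sy[j] == v) {
                j++;
            }
            maxScaled = Math.max(maxScaled, Math.abs((long) i * m - (long) j * n));
        }
        return maxScaled / ((double) n * m);
    }

    public static void main(String[] args) {
        double[] x = {0, 0, 0, 0, 1};
        double[] y = {0, 0, 1, 1, 2, 3};
        // At v = 0 the CDFs are 4/5 and 2/6, the largest gap for these samples.
        System.out.println("D=" + statistic(x, y));
    }
}
```

On the small example from this thread, x = {0, 0, 0, 0, 1} and y = {0, 0, 1, 1, 2, 3}, the largest gap is at v = 0 (4/5 against 2/6), which agrees with the D = 0.4667 that R reports.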
[jira] [Commented] (MATH-1197) Incorrect Kolmogorov–Smirnov Statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283756#comment-14283756 ]

Thomas Neidhart commented on MATH-1197:
---

One observation: the samples contain a lot of equal values. The KS test statistic is implemented using Arrays.binarySearch, but its contract does not specify which index is returned when the sorted array contains several elements equal to the key. E.g. if you have samples [0, 0, 0, 0, 0, 1] and you search for 0, you might get any index in the range [0, 4].

As far as I understand it, the KS statistic is based on the empirical distribution function, which gives the cumulative probability at an observation as the fraction of values less than or equal to it. That is not the same as the index returned by Arrays.binarySearch.
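The difference can be seen on the [0, 0, 0, 0, 0, 1] example. The sketch below contrasts Arrays.binarySearch with an upper-bound search that always returns the number of elements less than or equal to the key, which is the count the empirical distribution function needs. The class EcdfTies and its methods are hypothetical names used only for this illustration, and NaN inputs are assumed not to occur.

```java
import java.util.Arrays;

final class EcdfTies {
    /**
     * Hypothetical helper: number of entries in the sorted array that are <= v.
     * Unlike Arrays.binarySearch, the result is well defined with duplicates.
     */
    static int countLessOrEqual(double[] sorted, double v) {
        int lo = 0;
        int hi = sorted.length;
        // Invariant: everything before lo is <= v, everything from hi on is > v.
        while (lo < hi) {
            final int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= v) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Empirical CDF at v: fraction of sample values <= v. */
    static double ecdf(double[] sorted, double v) {
        return countLessOrEqual(sorted, v) / (double) sorted.length;
    }

    public static void main(String[] args) {
        final double[] s = {0, 0, 0, 0, 0, 1};
        // Any index from 0 to 4 is a legal answer here, so it cannot serve as a count.
        System.out.println("binarySearch index = " + Arrays.binarySearch(s, 0.0));
        System.out.println("count <= 0 = " + countLessOrEqual(s, 0.0));
        System.out.println("ecdf(0) = " + ecdf(s, 0.0));
    }
}
```

Whatever index binarySearch happens to return for 0, the count of values less than or equal to 0 is 5, so the empirical CDF at 0 is 5/6.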