Hello and thanks for your patience. As far as I understand, the paper by Marsaglia and colleagues refers to CDF samples (i.e. samples from a hypothetical distribution, e.g. a Poisson), while I have an ECDF sample (i.e. (pseudo-)observed data, e.g. rpois(1000, 500)). In my study, I am actually comparing the statistical distribution of people per hectare in a region (~50,000 observations) with samples drawn from that distribution.
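To make this concrete, below is a rough sketch of the comparison I have in mind. The vector dens is just a placeholder for my ~50,000 observed people-per-hectare values (simulated here with rpois() for lack of the real data), and the names dens, n, prob, idx and samp are mine:

library(spatstat)   # for ewcdf()

set.seed(1)
## placeholder for the ~50,000 observed people-per-hectare values
dens <- rpois(50000, 500)

## sample of size n drawn with probability proportional to size
n    <- 500
prob <- dens / sum(dens)
idx  <- sample(length(dens), size = n, prob = prob)
samp <- dens[idx]

## ECDF of the full data vs. inverse-probability-weighted ECDF of the sample
F_all  <- ecdf(dens)
F_samp <- ewcdf(samp, weights = 1 / prob[idx])

## Kolmogorov-Smirnov distance between the two curves
grid <- sort(unique(dens))
max(abs(F_all(grid) - F_samp(grid)))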
I agree that we cannot rely on the p-values at low sample sizes. Still, somewhere above a sample size of 100 I would expect to see p-values between 0 and 0.05 (i.e. rejecting the null hypothesis that the sample comes from the reference distribution). Ideally, I would implement the procedure suggested by Marsaglia and colleagues to compute the p-values, but that is far beyond my coding skills. (I have appended a rough Monte Carlo sketch, and a quick check of your remark about uniformity, below the quoted messages.)

> On 5 Sep 2019, at 21:21, Rui Barradas <ruipbarra...@sapo.pt> wrote:
>
> Hello,
>
> Inline.
>
> Às 20:09 de 05/09/19, Boo G. escreveu:
>> Hello again.
>> I have tried this before but I see two problems:
>> 1) According to the documentation I could read (including the ks.test code), the KS statistic would be max(abs(x - y)), and if you plot this for very low sample sizes you can actually see that this makes sense. ks.test(x, y) yields very different values.
>
> The problem is that the distribution of Dn is very difficult to compute. From the reference [1] in the help page ?ks.test:
>
>   Kolmogorov's goodness-of-fit measure, Dn, for a sample CDF has consistently
>   been set aside for methods such as the D+n or D-n of Smirnov, primarily, it
>   seems, because of the difficulty of computing the distribution of Dn. As far
>   as we know, no easy way to compute that distribution has ever been provided
>   in the 70+ years since Kolmogorov's fundamental paper. We provide one here,
>   a C procedure that provides Pr(Dn < d) with 13-15 digit accuracy for n
>   ranging from 2 to at least 16000.
>
> That is why I used ks.test and its Dn and p-values. Note that n >= 2; size = 1 is not covered (p-value == 1).
>
> Also, the distribution of the p-values seems to become closer to uniform with increasing sample sizes. Try
>
> hist(d_stat[801:1000, 3])
>
> [1] https://www.jstatsoft.org/article/view/v008i18
>
> Hope this helps,
>
> Rui Barradas
>
>> 2) Also in this case the p-values don't make much sense, according to my previous interpretation.
>> Again, I could be wrong in my interpretation.
>>
>>> On 5 Sep 2019, at 20:46, Rui Barradas <ruipbarra...@sapo.pt> wrote:
>>>
>>> Hello,
>>>
>>> I'm sorry, but apparently I missed the point of your problem.
>>> Please do not take my previous answer seriously.
>>>
>>> But you can use ks.test, just in a different way than what I wrote previously.
>>>
>>> Corrected code:
>>>
>>> #simulation
>>> for (i in 1:1000) {
>>>   #sample from the reference distribution
>>>   m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1, replace=F), ]
>>>   m_2 <- m_2[order(m_2$d_1), ]
>>>   d_2 <- m_2$d_1
>>>   p_2 <- m_2$p_1
>>>
>>>   #weighted ecdf for the reference distribution and the sample
>>>   f_d_1 <- ewcdf(d_1, normalise=F)
>>>   f_d_2 <- ewcdf(d_2, 1/p_2, normalise=F, adjust=1/length(d_2))
>>>
>>>   #kolmogorov-smirnov distance
>>>   x <- f_d_1(d_2)
>>>   y <- f_d_2(d_2)
>>>   ht <- ks.test(x, y)
>>>   d_stat[i, 2] <- ht$statistic
>>>   d_stat[i, 3] <- ht$p.value
>>> }
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>>
>>> Às 19:29 de 05/09/19, Rui Barradas escreveu:
>>>> Hello,
>>>> I don't have the algorithms at hand, but the KS statistic calculation is more complicated than your max/abs difference.
>>>> Anyway, why not use ks.test?
>>>> It's not that difficult:
>>>>
>>>> set.seed(1234)
>>>>
>>>> #reference distribution
>>>> d_1 <- sort(rpois(1000, 500))
>>>> p_1 <- d_1/sum(d_1)
>>>> m_1 <- data.frame(d_1, p_1)
>>>>
>>>> #data frame to store the values of the simulation
>>>> d_stat <- data.frame(1:1000, NA, NA)
>>>> names(d_stat) <- c("sample_size", "ks_distance", "p_value")
>>>>
>>>> #simulation
>>>> for (i in 1:1000) {
>>>>   #sample from the reference distribution
>>>>   m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1, replace=F), ]
>>>>   d_2 <- m_2$d_1
>>>>   ht <- ks.test(d_1, d_2)
>>>>   #kolmogorov-smirnov distance
>>>>   d_stat[i, 2] <- ht$statistic
>>>>   d_stat[i, 3] <- ht$p.value
>>>> }
>>>>
>>>> hist(d_stat[, 2])
>>>> hist(d_stat[, 3])
>>>>
>>>> Note that d_2 is not sorted, but the results are equal in the sense of function identical(), meaning they are *exactly* the same. Why shouldn't they be?
>>>>
>>>> Hope this helps,
>>>>
>>>> Rui Barradas
>>>>
>>>> Às 17:06 de 05/09/19, Boo G. escreveu:
>>>>> Hello,
>>>>>
>>>>> I am trying to perform a Kolmogorov–Smirnov test to assess the difference between a distribution and samples of different sizes drawn with probability proportional to size. I managed to compute the Kolmogorov–Smirnov distance, but I am lost with the p-value. I have looked into the ks.test function unsuccessfully. Can anyone help me with computing p-values for a two-tailed test?
>>>>>
>>>>> Below is a simplified version of my code.
>>>>>
>>>>> Thanks in advance.
>>>>> Gianluca
>>>>>
>>>>> library(spatstat)
>>>>>
>>>>> #reference distribution
>>>>> d_1 <- sort(rpois(1000, 500))
>>>>> p_1 <- d_1/sum(d_1)
>>>>> m_1 <- data.frame(d_1, p_1)
>>>>>
>>>>> #data frame to store the values of the simulation
>>>>> d_stat <- data.frame(1:1000, NA, NA)
>>>>> names(d_stat) <- c("sample_size", "ks_distance", "p_value")
>>>>>
>>>>> #simulation
>>>>> for (i in 1:1000) {
>>>>>   #sample from the reference distribution
>>>>>   m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1, replace=F), ]
>>>>>   m_2 <- m_2[order(m_2$d_1), ]
>>>>>   d_2 <- m_2$d_1
>>>>>   p_2 <- m_2$p_1
>>>>>
>>>>>   #weighted ecdf for the reference distribution and the sample
>>>>>   f_d_1 <- ewcdf(d_1, normalise=F)
>>>>>   f_d_2 <- ewcdf(d_2, 1/p_2, normalise=F, adjust=1/length(d_2))
>>>>>
>>>>>   #kolmogorov-smirnov distance
>>>>>   d_stat[i, 2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
>>>>> }
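As mentioned at the top, computing the exact distribution of Dn as in Marsaglia et al. is beyond me, but a brute-force Monte Carlo approximation might get me part of the way. The sketch below (the names ks_weighted, mc_pvalue, B and d_obs are mine, and I am not sure the approach is sound) reuses the ewcdf() construction from the corrected code and estimates the p-value as the share of genuine PPS samples of the same size whose weighted KS distance is at least as large as the observed one:

library(spatstat)   # for ewcdf()

## weighted KS distance between the reference ECDF and a PPS sample,
## using the same ewcdf() construction as in the corrected code above
ks_weighted <- function(d_1, d_2, p_2) {
  f_d_1 <- ewcdf(d_1, normalise = FALSE)
  f_d_2 <- ewcdf(d_2, 1/p_2, normalise = FALSE, adjust = 1/length(d_2))
  max(abs(f_d_1(d_2) - f_d_2(d_2)))
}

## Monte Carlo p-value: how often does a genuine PPS sample of size n
## give a distance at least as large as the observed one?
mc_pvalue <- function(d_obs, m_1, n, B = 999) {
  null_d <- replicate(B, {
    idx <- sample(nrow(m_1), size = n, prob = m_1$p_1)
    ks_weighted(m_1$d_1, m_1$d_1[idx], m_1$p_1[idx])
  })
  (1 + sum(null_d >= d_obs)) / (B + 1)
}

## example with the reference distribution used in the thread
set.seed(1234)
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1 / sum(d_1)
m_1 <- data.frame(d_1, p_1)

n     <- 200
idx   <- sample(nrow(m_1), size = n, prob = p_1)
d_obs <- ks_weighted(d_1, m_1$d_1[idx], m_1$p_1[idx])
mc_pvalue(d_obs, m_1, n, B = 199)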
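And to check your point that the p-values get closer to uniform as the sample size grows, I suppose I could do something like the following once d_stat has been filled in by the corrected simulation (the ks.test() against punif is only an informal check and will warn about ties):

## p-values for sample sizes 801-1000; under the null they should be
## roughly Uniform(0, 1)
pv <- d_stat[801:1000, 3]
hist(pv, breaks = 20, main = "p-values, sample sizes 801-1000")
ks.test(pv, "punif")   # informal check of uniformity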