On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
> Petr Savicky wrote:
> > For this, we get
> >
> > > convert(0.3)
> > [1] "0.3"
> > > convert(1/3)
> > [1] "0.3333333333333333"  # 16 digits suffice
> > > convert(0.12345)
> > [1] "0.12345"
> > > convert(0.12345678901234567)
> > [1] "0.12345678901234566"
> > > 0.12345678901234567 == as.numeric("0.12345678901234566")
> > [1] TRUE
> >
> > This algorithm is slower than a single call to sprintf("%.17g", x),
> > but it produces nicer numbers, if possible, and guarantees that the
> > value is always preserved.
>
> Yes, but....
>
> > convert(0.2+0.1)
> [1] "0.30000000000000004"
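The convert() function itself is not shown in this excerpt; the idea can be sketched roughly as follows (a minimal illustration, not the actual implementation under discussion): print with successively more significant digits and return the first string that parses back to exactly the same double.

```r
## Rough sketch of the idea behind convert() -- not the code discussed
## in this thread.  Try 1..17 significant digits and return the first
## decimal representation that round-trips to the same double.
convert <- function(x) {
  for (d in 1:17) {
    s <- sprintf("%.*g", d, x)
    if (as.numeric(s) == x) return(s)
  }
  s  # 17 significant digits always suffice for a double
}

convert(0.3)        # "0.3"
convert(1/3)        # 16 digits suffice here
convert(0.2 + 0.1)  # "0.30000000000000004"
```

This reproduces the behaviour quoted above: short decimals stay short, and the value is always preserved, at the cost of up to 17 sprintf() calls per number.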
I am not sure whether this should be considered an error. Computing with
decimal numbers requires some care. If we want the result closest to 0.3,
it is better to use round(0.1+0.2, digits=1).

> I think that the real issue is that we actually do want almost-equal
> numbers to be folded together.

Yes, this is frequently needed. A related question of approximate matching
in a numeric sequence was discussed in the R-devel thread "Match .3 in a
sequence" from March. In order to fold almost-equal numbers together, we
need to specify a tolerance, and the tolerance depends on the application.
In my opinion, it is hard to choose a tolerance that would be a good
default.

> The most relevant case I can conjure up
> is this (permutation testing):
>
> > zz <- replicate(20000,sum(sample(sleep$extra,10)))
> > length(table(zz))
> [1] 427
> > length(table(signif(zz,7)))
> [1] 281

In order to obtain the correct result in this example, it is possible to
use zz <- signif(zz,7) as you suggest, or zz <- round(zz, digits=1), and
use the resulting zz for all further calculations.

> Notice that the discrepancy comes from sums that really are identical
> values (in decimal arithmetic), but where the binary FP inaccuracy makes
> them slightly different.
>
> [for a nice picture, continue the example with
>
> > tt <- table(signif(zz,7))
> > plot(as.numeric(names(tt)),tt, type="h")

The form of this picture is not due to rounding errors. The picture may be
obtained even within integer arithmetic as follows:

  ss <- round(10*sleep$extra)
  zz <- replicate(20000,sum(sample(ss,10)))
  tt <- table(zz)
  plot(as.numeric(names(tt)),tt, type="h")

The variation of the frequencies is due to two effects. First, each
individual value of the sum occurs with low probability, so 20000
replications do not suffice to get a low variance of the individual
frequencies. Using 1000 repetitions of the code above, I obtained
estimates of some of the probabilities.
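The first effect can be quantified directly: the count of one particular sum value across n replications is binomial, with mean n*p and standard deviation sqrt(n*p*(1-p)). A quick check with the estimated probability of the most frequent sums (p = 0.0089):

```r
## The frequency of one particular sum value in n replications is
## Binomial(n, p), so its spread alone accounts for much of the
## visible variation in the histogram.
p <- 0.0089   # estimated probability of one of the most frequent sums
n <- 20000    # number of replications
c(mean = n * p, sd = sqrt(n * p * (1 - p)))
##  mean = 178, sd approx. 13.28
```

So even for the most frequent sums, a deviation of 10-15 counts between neighbouring bars is entirely expected from sampling noise.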
The most frequent sums have probability approximately p = 0.0089 for a
single sample. With n = 20000 replications, we get the mean frequency
p*n = 178 and standard deviation sqrt(p*(1-p)*n) = 13.28216.

The other cause of the variation of the frequencies is that even the true
distribution of the sums has many local minima and maxima. The mean over
1000 repetitions of the above table, restricted to values of the sum in
the interval 140:168, produced the estimates

  value  mean frequency (over 1000 tables)
  140    172.411
  141    172.090
  142    174.297
  143    166.039
  144    159.260
  145    163.891
  146    162.317
  147    165.460
  148    177.870
  149    177.971
  150    177.754
  151    178.525   local maximum
  152    169.851
  153    164.443   local minimum
  154    168.488   the mean value of the sum
  155    164.816   local minimum
  156    169.297
  157    179.248   local maximum
  158    177.799
  159    176.743
  160    177.777
  161    164.173
  162    162.585
  163    164.641
  164    159.913
  165    165.932
  166    173.014
  167    172.276
  168    171.612

The local minima and maxima are visible here. The mean value 154 is
approximately the center of the histogram.

Petr.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel