Petr Savicky wrote: > On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: > [snip] >> Let me quickly expand the tasks we have wanted to address, when >> I started changing factor() for R-devel. >> >> 1) R-core had unanimously decided that R 2.10.0 should not allow >> duplicated levels in factors anymore. >> >> When working on that, I had realized that quite a few bits of code >> were implicitly relying on duplicated levels (or something >> related), see below, so the current version of R-devel only >> *warns* in some cases where duplicated levels are produced >> instead of giving an error. >> >> What I had also found was that basically, even our own (!) code >> and quite a bit of user code has more or less relied on other >> things that were not true (even though "almost always" fulfilled): >> >> 2) if x contains no duplicated values, then factor(x) should neither >> >> 3) factor(x) constructs a factor object with *unique* levels >> >> {This is what our decision "1)" implies and now enforces} >> >> 4) as.numeric(names(table(x))) should be identical to unique(x) >> >> where "4)" is basically ensured by "3)" as table() calls >> factor() for non-factor args. >> >> As mentioned the bad thing is that "2) - 4)" are typically >> fulfilled in all tests package writers would use. >> >> Concerning '3)' [and '1)'], as you know, inside R-core we have >> proposed to at least ensure that `levels<-` >> should not allow duplicated levels, >> and I had concluded that >> a) factor() really should use `levels<-` instead of the low-level >> attr(., "levels") <- .... >> b) factor() itself must make sure that the default levels became unique. >> >> --- >> >> Given Petr's (and more) examples and the strong requirement of >> "user convenience" and back-compatibility, >> I now tend to agree (with Peter) that we cannot ensure all of 2) >> and 4) still allow factor() to behave as it did for "rounded >> decimal numbers", >> and consequently would have to (continue to) not ensuring >> properties (2) and (4). >> Something quite unfortunate, since, as I said, much useR code >> implicitly relies on these, and so that code is buggy even >> though the bug will only show in exceptional cases. > > Let me suggest to consider also the following algorithm: determine > the number of digits needed to preserve the double value exactly for > each number separately. An R code prototype demonstrating the > algorithm could be as follows > > convert <- function(x) # x should be a single number > { > for (d in 1:16) { > y <- sprintf(paste("%.", d, "g", sep=""), x) > if (x == as.numeric(y)) { > return(y) > } > } > return(sprintf("%.17g", x)) > } > > For this, we get > > > convert(0.3) > [1] "0.3" > > convert(1/3) > [1] "0.3333333333333333" # 16 digits suffice > > convert(0.12345) > [1] "0.12345" > > convert(0.12345678901234567) > [1] "0.12345678901234566" > > 0.12345678901234567 == as.numeric("0.12345678901234566") > [1] TRUE > > This algorithm is slower than a single call to sprintf("%.17g", x), but it > produces nicer numbers, if possible, and guarantees that the value is > always preserved.
Yes, but.... > convert(0.2+0.1) [1] "0.30000000000000004" I think that the real issue is that we actually do want almost-equal numbers to be folded together. The most relevant case I can conjure up is this (permutation testing): > zz <- replicate(20000,sum(sample(sleep$extra,10))) > length(table(zz)) [1] 427 > length(table(signif(zz,7))) [1] 281 Notice that the discrepancy comes from sums that really are identical values (in decimal arithmetic), but where the binary FP inaccuracy makes them slightly different. [for a nice picture, continue the example with > tt <- table(signif(zz,7)) > plot(as.numeric(names(tt)),tt, type="h") ] -- O__ ---- Peter Dalgaard Ă˜ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel