>>>>> "MM" == Martin Maechler <maech...@stat.math.ethz.ch>
>>>>>     on Tue, 5 May 2009 10:35:42 +0200 writes:

>>>>> "PD" == Peter Dalgaard <p.dalga...@biostat.ku.dk>
>>>>>     on Mon, 04 May 2009 19:28:06 +0200 writes:

PD> Petr Savicky wrote:
>>> On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
>>> [snip]
>>>> Let me quickly expand the tasks we have wanted to address, when
>>>> I started changing factor() for R-devel.
>>>>
>>>> 1) R-core had unanimously decided that R 2.10.0 should not allow
>>>>    duplicated levels in factors anymore.
>>>>
>>>> When working on that, I had realized that quite a few bits of code
>>>> were implicitly relying on duplicated levels (or something
>>>> related), see below, so the current version of R-devel only
>>>> *warns* in some cases where duplicated levels are produced
>>>> instead of giving an error.
>>>>
>>>> What I had also found was that basically, even our own (!) code
>>>> and quite a bit of user code has more or less relied on other
>>>> things that were not true (even though "almost always" fulfilled):
>>>>
>>>> 2) if x contains no duplicated values, then factor(x) should neither
>>>>
>>>> 3) factor(x) constructs a factor object with *unique* levels
>>>>
>>>>    {This is what our decision "1)" implies and now enforces}
>>>>
>>>> 4) as.numeric(names(table(x))) should be identical to unique(x)
>>>>
>>>>    where "4)" is basically ensured by "3)" as table() calls
>>>>    factor() for non-factor args.
>>>>
>>>> As mentioned, the bad thing is that "2) - 4)" are typically
>>>> fulfilled in all tests package writers would use.
>>>>
>>>> Concerning '3)' [and '1)'], as you know, inside R-core we have
>>>> proposed to at least ensure that `levels<-`
>>>> should not allow duplicated levels,
>>>> and I had concluded that
>>>> a) factor() really should use `levels<-` instead of the low-level
>>>>    attr(., "levels") <- ....
>>>> b) factor() itself must make sure that the default levels become unique.
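
Properties 3) and 4) from the list above can be checked on a small example. The following is only a sketch using base R; the vector `x` is hypothetical and deliberately holds values that are exactly representable, so no rounding issue arises. Since table() sorts its levels, the comparison for 4) is made against sort(unique(x)), which is an assumption about what "identical to unique(x)" is meant to cover.

```r
# Hypothetical data: small whole numbers stored as doubles (exactly representable).
x <- c(3, 1, 2, 1, 3)

# Property 3): the default levels produced by factor() contain no duplicates.
f <- factor(x)
no_dup_levels <- anyDuplicated(levels(f)) == 0

# Property 4): table() calls factor() on non-factor input, so the names of the
# table are the levels; converted back, they should recover the distinct values.
tab_values <- as.numeric(names(table(x)))
matches_unique <- identical(tab_values, sort(unique(x)))
```

For "nice" input like this both checks hold, which is exactly why tests written by package authors tend not to expose the exceptional cases.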
>>>>
>>>> ---
>>>>
>>>> Given Petr's (and more) examples and the strong requirement of
>>>> "user convenience" and back-compatibility,
>>>> I now tend to agree (with Peter) that we cannot ensure all of 2)
>>>> and 4) and still allow factor() to behave as it did for "rounded
>>>> decimal numbers",
>>>> and consequently would have to (continue to) not ensure
>>>> properties (2) and (4).
>>>> Something quite unfortunate, since, as I said, much useR code
>>>> implicitly relies on these, and so that code is buggy even
>>>> though the bug will only show in exceptional cases.

[................................]

PD> I think that the real issue is that we actually do want almost-equal
PD> numbers to be folded together.

Yes, this now (revision 48469) will happen by default, using
signif(x, 15), where '15' is the default for the new optional argument
'digitsLabels' {better argument name? (but it must not start with 'label')}.

Why '15': because this is most back-compatible and sufficient to solve
simple arithmetic (0.1 + 0.2) issues.

MM> in most cases, but not all {*when* levels is not specified},
MM> but useR's code sometimes *does* rely on factor() / table()
MM> using exact values.

MM> Also, what should happen when the user explicitly calls
MM>     factor(x, levels = sort(unique(x)))
MM> at least in that case we really should *not* fold almost equals.
MM> and the "old" code (<= R 2.9.0) did fold them in border cases,
MM> and led to non-unique levels.

MM> Can we agree that any rounding etc - if needed - will only
MM> happen when
MM>   1) missing(levels)
MM>   2) is.numeric(x) || is.complex(x)

The code I've committed (revision 48469) now does that.

MM> I'm also thinking of at least keeping the current behavior as an
MM> option, e.g. by factor(x, ...., keepUniqueness = TRUE, ....)
MM> where the default would be keepUniqueness = FALSE.

The current argument name is 'keepUnique'.
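
To make the "(0.1 + 0.2)" issue and the two rounding conditions concrete, here is a minimal sketch in base R. It only uses signif() and unique(); the helper `fold_if_default()` is hypothetical, written for illustration, and is not the code committed in revision 48469 nor an argument of factor().

```r
# 0.1 + 0.2 and 0.3 print alike but are different doubles.
x <- c(0.3, 0.1 + 0.2)

exactly_equal <- x[1] == x[2]               # FALSE: exact comparison sees two values
n_exact  <- length(unique(x))               # distinct values without rounding
n_folded <- length(unique(signif(x, 15)))   # distinct values after 15-digit rounding

# Hypothetical helper: round only when the caller gave no 'levels' and
# x is numeric or complex, i.e. the two conditions listed above.
fold_if_default <- function(x, levels, digits = 15) {
  if (missing(levels) && (is.numeric(x) || is.complex(x))) {
    signif(x, digits)
  } else {
    x   # explicit levels, or non-numeric input: leave the values untouched
  }
}

folded    <- fold_if_default(x)              # rounding applied
untouched <- fold_if_default(x, levels = x)  # explicit levels: exact values kept
```

With 15 significant digits the two values collapse into one, whereas passing levels explicitly leaves them distinct, which is the behaviour argued for in the factor(x, levels = sort(unique(x))) case.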

PD> The most relevant case I can conjure up is this (permutation testing):

>>> zz <- replicate(20000,sum(sample(sleep$extra,10)))
>>> length(table(zz))
PD> [1] 427
>>> length(table(signif(zz,7)))
PD> [1] 281

PD> Notice that the discrepancy comes from sums that really are identical
PD> values (in decimal arithmetic), but where the binary FP inaccuracy makes
PD> them slightly different.

MM> Yes, that's a good example.

MM> However, I now think it would be helpful to slightly separate
MM> the issue of what factor() should do from
MM> how table() should call factor() in those cases it does.

I still believe that.
Currently, table() calls
    factor(a, exclude = exclude)
when 'a' is not a factor, e.g., when it is numeric.

I propose that table() should also gain some of the new optional
factor() arguments, maybe even with a different default than 15.

Note that the new R-devel now gives

  > set.seed(7); zz <- replicate(20000,sum(sample(sleep$extra,10)))
  > length(tz <- table(zz))
  [1] 283

whereas R <= 2.9.0 gives

  ....
  [1] 422

so that at least for this example, '15' is good enough, i.e., '7' is
not needed. As mentioned above, the advantage of '15' is that it is
much closer to previous R (and S+ !) behavior than a smaller value.

Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel