Re: [Rd] suggestion for extending ?as.factor
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote: The version I have committed a few hours ago is indeed a much re-simplified version, using as.character(.) explicitly The current development version (2009-05-11 r48528) contains in ?factor a description of levels parametr levels: an optional vector of the values that 'x' might have taken. The default is the unique set of values taken by 'character(x)', sorted into increasing order _of 'x'_. Note that this set can be smaller than 'sort(unique(x))'. I think that 'character(x)' should be 'as.character(x)'. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PS == Petr Savicky savi...@cs.cas.cz on Tue, 12 May 2009 13:17:15 +0200 writes: PS On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote: The version I have committed a few hours ago is indeed a much re-simplified version, using as.character(.) explicitly PS The current development version (2009-05-11 r48528) contains PS in ?factor a description of levels parametr PS levels: an optional vector of the values that 'x' might have taken. PS The default is the unique set of values taken by PS 'character(x)', sorted into increasing order _of 'x'_. Note PS that this set can be smaller than 'sort(unique(x))'. PS I think that 'character(x)' should be 'as.character(x)'. Definitely; thank you. Thanks as well to suggestions from yesterday, of which I have at least added the approximately. Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PS == Petr Savicky savi...@cs.cas.cz on Sun, 10 May 2009 13:52:53 +0200 writes: PS On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote: PS [...] If'd revert to such a solution, we'd have to get back to Peter's point about the issue that he'd think table(.) should be more tolerant than as.character() about almost equality. For compatibility reasons, we could also return back to the reasoning that useR should use {something like} table(signif(x, 14)) instead of table(x) for numeric x in typical cases. PS In the released versions 2.8.1 and 2.9.0, function factor() satisfies PS identical(as.character(factor(x)), as.character(x))(*) PS for all numeric x. This follows from the code (levels are computed by PS as.character() from unmodified input values) and may be verified PS even for the problematic cases, for example PS x - (0.3 + 2e-16 * c(-2,-1,1,2)) PS factor(x) PS # [1] 0.300 0.3 0.3 0.300 PS # Levels: 0.300 0.3 0.3 0.300 PS as.character(x) PS # [1] 0.300 0.3 0.3 PS # [4] 0.300 PS identical(as.character(factor(x)), as.character(x)) PS # [1] TRUE PS In my opinion, it is reasonable to require that (*) be PS preserved also in future versions of R. PS Function as.character(x) has disadvantages. Besides of PS the platform dependence, it also does not always perform PS rounding needed to eliminate FP errors. Usually, PS as.character(x) rounds to at most 15 digits, so, we get, PS for example PS as.character(0.1 + 0.2) # [1] 0.3 PS as required. However, there are also exceptions, for example PS as.character(1e19 + 1e5) # [1] 10100352 PS Here, the number is printed exactly, so the resulting PS string contains the FP error caused by the fact that PS 1e19 + 1e5 has more than 53 significant digits in binary PS representation, namely 59. PS binary representation of 1e19 + 1e5 is PS 100010101100011100100011010010001000100111101010 PS binary representation of 10100352 is PS 100010101100011100100011010010001000100110001000 PS However, as.character(x) seems to do enough rounding for PS most purposes, otherwise it would not be suitable as the PS basic numeric to character conversion. If table() needs PS factor() with a different conversion than PS as.character(x), it may be done explicitly as discussed PS by Martin above. PS So, i suggest to use as.character() as the default PS conversion in factor(), so that PS identical(as.character(factor(x)), as.character(x)) is PS satisfied for the default usage of factor(). PS Of course, i appreciate, if factor() has parameters, PS which allow better control of the underlying conversion, PS as it is done in the current development versions. The version I have committed a few hours ago is indeed a much re-simplified version, using as.character(.) explicitly and consequently no longer providing the extra optional arguments that we have had for a couple of days. Keeping such a basic function factor() as simple as possible seems a good strategy to me. Martin Maechler __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote: [...] The version I have committed a few hours ago is indeed a much re-simplified version, using as.character(.) explicitly and consequently no longer providing the extra optional arguments that we have had for a couple of days. Keeping such a basic function factor() as simple as possible seems a good strategy to me. OK. I understand the argument of simplicity. So, factor(x) is just a compressed encoding of as.character(x), where each value is stored only once. This sounds good to me. Let me go back to the original purpose of this thread: suggestion for extending ?as.factor I think that somewhere in the help page, we could have something like Using factor() to a numeric vector should be done with caution. The information in x is preserved to the extent to which it is preserved in as.character(x). If this leads to too many different levels due to minor differences among the input numbers, it is suggested to use something like factor(signif(x, digits)) or factor(round(x, digits)), where the number of decimal digits appropriate for a given application should be used. Let me point out that the following sentence from Warning is not exactly correct as it is in svn at the moment. So, i suggest to add the word approximately to the place marked with square brackets and add one more sentence of explanation marked also by square brackets. To transform a factor \code{f} to [approximately] its original numeric values, \code{as.numeric(levels(f))[f]} is recommended and slightly more efficient than \code{as.numeric(as.character(f))}. [Note that the original values may be extracted only to the precision used in as.character(x), which is typically 15 decimal digits.] Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote: [...] If'd revert to such a solution, we'd have to get back to Peter's point about the issue that he'd think table(.) should be more tolerant than as.character() about almost equality. For compatibility reasons, we could also return back to the reasoning that useR should use {something like} table(signif(x, 14)) instead of table(x) for numeric x in typical cases. In the released versions 2.8.1 and 2.9.0, function factor() satisfies identical(as.character(factor(x)), as.character(x))(*) for all numeric x. This follows from the code (levels are computed by as.character() from unmodified input values) and may be verified even for the problematic cases, for example x - (0.3 + 2e-16 * c(-2,-1,1,2)) factor(x) # [1] 0.300 0.3 0.3 0.300 # Levels: 0.300 0.3 0.3 0.300 as.character(x) # [1] 0.300 0.3 0.3 # [4] 0.300 identical(as.character(factor(x)), as.character(x)) # [1] TRUE In my opinion, it is reasonable to require that (*) be preserved also in future versions of R. Function as.character(x) has disadvantages. Besides of the platform dependence, it also does not always perform rounding needed to eliminate FP errors. Usually, as.character(x) rounds to at most 15 digits, so, we get, for example as.character(0.1 + 0.2) # [1] 0.3 as required. However, there are also exceptions, for example as.character(1e19 + 1e5) # [1] 10100352 Here, the number is printed exactly, so the resulting string contains the FP error caused by the fact that 1e19 + 1e5 has more than 53 significant digits in binary representation, namely 59. binary representation of 1e19 + 1e5 is 100010101100011100100011010010001000100111101010 binary representation of 10100352 is 100010101100011100100011010010001000100110001000 However, as.character(x) seems to do enough rounding for most purposes, otherwise it would not be suitable as the basic numeric to character conversion. If table() needs factor() with a different conversion than as.character(x), it may be done explicitly as discussed by Martin above. So, i suggest to use as.character() as the default conversion in factor(), so that identical(as.character(factor(x)), as.character(x)) is satisfied for the default usage of factor(). Of course, i appreciate, if factor() has parameters, which allow better control of the underlying conversion, as it is done in the current development versions. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
At 14:18 08/05/2009, Martin Maechler wrote: PS == Petr Savicky savi...@cs.cas.cz on Fri, 8 May 2009 11:01:55 +0200 writes: Somewhere below Martin asks for alternatives from list readers. I do not have alternatives, but I do have two comments, one immediately below this, the other embedded in-line. This whole thread reminds me just why I have spent the best part of a decade climbing the virtual Matterhorn called 'Learning R' and why it is such a pleasure to use. It is the fact that somebody, somewhere cares enough about consistency, usability and accuracy to devote hours to getting even obscure details just right. PS On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote: PD I think that the real issue is that we actually do want almost-equal PD numbers to be folded together. yes, this now (revision 48469) will happen by default, using signif(x, 15) where '15' is the default for the new optional argument 'digitsLabels' {better argument name? (but must nost start with 'label')} PS Let me analyze the current behavior of factor(x) for numeric x with missing(levels) PS and missing(labels). In this situation, levels are computed as sort(unique(x)) PS from possibly transformed x. Then, labels are constructed by a conversion of the PS levels to strings. PS I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows. PS If keepUnique is FALSE (the default), then PS - values x are transformed by signif(x, digitsLabels) PS - labels are computed using as.character(levels) PS - digitsLabels defaults to 15, but may be set to any integer value PS If keepUnique is TRUE, then PS - values x are preserved PS - labels are computed using sprintf(%.*g, digitsLabels, levels) PS - digitsLabels defaults to 17, but may be set to any integer value (in theory; in practice, I think I've suggested somewhere that it should be = 17; but see below.) Your summary seems correct to me. PS There are several situations, when this approach produces duplicated levels. PS Besides the one described in my previous email, there are also others PS factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15) yes, but this is not much sensical; I've already contemplated to produce a warning in such cases, something like if(keepUnique digitsLabels 17) warning(gettextf( 'digitsLabels = %d' is typically too small when 'keepUnique' is true, digitsLabels)) PS factor(1 + 0:5 * 1e-16, digitsLabels=17) again, this does not make much sense; but why disallow the useR to shoot into his foot? I agree. As a useR I do not want to be stopped from doing anything. I would appreciate a warning just before I shoot myself in the foot and I definitely want one if it looks like I am going to aim for my head. PS I would like to suggest a modification. It eliminates most of the cases, where PS we get duplicated levels. It would eliminate all such cases, if the function PS signif() works as expected. Unfortunately, if signif() works as it does in the PS current versions of R, we still get duplicated levels. PS The suggested modification is as follows. PS If keepUnique is FALSE (the default), then PS - values x are transformed by signif(x, digitsLabels) PS - labels are computed using sprintf(%.*g, digitsLabels, levels) PS - digitsLabels defaults to 15, but may be set to any integer value I tend like this change, given -- as you found yesterday -- that as.character() is not even preserving 15 digits. OTOH, as.character() has been in use for a very long history of S (and R), whereas using sprintf() is not back compatible with it and actually depends on the LIBC implementation of the system-sprintf. For that reason as.character() would be preferable. Hmm PS If keepUnique is TRUE, then PS - values x are preserved PS - labels are computed using sprintf(%.*g, 17, levels) PS - digitsLabels is ignored I had originally planned to do exactly the above. However, e.g., digitsLabels = 18 may be desired in some cases, and that's why I also left the possibility to apply it in the keepUnique case. PS Arguments for the modification are the following. PS 1. If keepUnique is FALSE, then computing labels using as.character() leads PS to duplicated labels as demonstrated in my previous email. So, i suggest to PS use sprintf(%.*g, digitsLabels, levels) instead of as.character(). {as said above, that seems sensible, though unfurtunately quite a bit less back-compatible!} PS 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get PS duplicated labels. So, i suggest to force digitsLabels=17, if keepUnique=TRUE. PS If signif(,digitsLabels) works as expected, than the above approach should not PS produce duplicated labels. Unfortunately, this is not the case. PS There are
Re: [Rd] suggestion for extending ?as.factor
PS == Petr Savicky savi...@cs.cas.cz on Fri, 8 May 2009 18:10:56 +0200 writes: PS On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote: Let me suggest to consider the following modification, where match() is done on the strings, not on the original values. levels - unique(as.character(sort(unique(x x - as.character(x) f - match(x, levels) PS An alternative solution is PS ind - order(x) PS x - as.character(x) # or any other conversion to character PS levels - unique(x[ind]) # get unique levels ordered by the original values PS f - match(x, levels) (slightly but not much more complicated though). Yes, indeed that brings us back to (something like) the original use factor(format(x)) ... suggestion which would have been fine if there hadn't been the issue of ordering, exactly what you've addressed before. PS The advantage of this over the suggestion from my previous email is that PS the string conversion is applied only once. The conversion need not be only PS as.character(). There may be other choices specified by a parametr. I have PS strong objections against the existing implementation of as.character(), PS but still i think that as.character() should be the default for factor() PS for the sake of consistency of the R language. The biggest advantage to reverting to something simple like that, would be that it is really simple. My first tests with (a variation of) the above indicate favorable results. More on this on Monday. If'd revert to such a solution, we'd have to get back to Peter's point about the issue that he'd think table(.) should be more tolerant than as.character() about almost equality. For compatibility reasons, we could also return back to the reasoning that useR should use {something like} table(signif(x, 14)) instead of table(x) for numeric x in typical cases. Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote: PD I think that the real issue is that we actually do want almost-equal PD numbers to be folded together. yes, this now (revision 48469) will happen by default, using signif(x, 15) where '15' is the default for the new optional argument 'digitsLabels' {better argument name? (but must nost start with 'label')} Let me analyze the current behavior of factor(x) for numeric x with missing(levels) and missing(labels). In this situation, levels are computed as sort(unique(x)) from possibly transformed x. Then, labels are constructed by a conversion of the levels to strings. I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows. If keepUnique is FALSE (the default), then - values x are transformed by signif(x, digitsLabels) - labels are computed using as.character(levels) - digitsLabels defaults to 15, but may be set to any integer value If keepUnique is TRUE, then - values x are preserved - labels are computed using sprintf(%.*g, digitsLabels, levels) - digitsLabels defaults to 17, but may be set to any integer value There are several situations, when this approach produces duplicated levels. Besides the one described in my previous email, there are also others factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15) factor(1 + 0:5 * 1e-16, digitsLabels=17) I would like to suggest a modification. It eliminates most of the cases, where we get duplicated levels. It would eliminate all such cases, if the function signif() works as expected. Unfortunately, if signif() works as it does in the current versions of R, we still get duplicated levels. The suggested modification is as follows. If keepUnique is FALSE (the default), then - values x are transformed by signif(x, digitsLabels) - labels are computed using sprintf(%.*g, digitsLabels, levels) - digitsLabels defaults to 15, but may be set to any integer value If keepUnique is TRUE, then - values x are preserved - labels are computed using sprintf(%.*g, 17, levels) - digitsLabels is ignored Arguments for the modification are the following. 1. If keepUnique is FALSE, then computing labels using as.character() leads to duplicated labels as demonstrated in my previous email. So, i suggest to use sprintf(%.*g, digitsLabels, levels) instead of as.character(). 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get duplicated labels. So, i suggest to force digitsLabels=17, if keepUnique=TRUE. If signif(,digitsLabels) works as expected, than the above approach should not produce duplicated labels. Unfortunately, this is not the case. There are numbers, which remain different in signif(x, 16), but are mapped to the same string in sprintf(%.*g, 16, x). Examples of this kind may be found using the script for (i in 1:50) { x - 10^runif(1, 38, 50) y - x * (1 + 0:500 * 1e-16) y - unique(signif(y, 16)) z - unique(sprintf(%.16g, y)) stopifnot(length(y) == length(z)) } This script is tested on Intel default arithmetic and on Intel with SSE. Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE. Unfortunately, a similar problem occurs even for digitsLabels = 15, although for much larger numbers. for (i in 1:200) { x - 10^runif(1, 250, 300) y - x * (1 + 0:500 * 1e-16) y - unique(signif(y, 15)) z - unique(sprintf(%.15g, y)) stopifnot(length(y) == length(z)) } This script finds collisions, if SSE is enabled, on two Intel computers, where i did the test. Without SSE, it finds collisions only on one of them. May be, it depends also on the compiler, which is different. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PS == Petr Savicky savi...@cs.cas.cz on Fri, 8 May 2009 11:01:55 +0200 writes: PS On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote: PD I think that the real issue is that we actually do want almost-equal PD numbers to be folded together. yes, this now (revision 48469) will happen by default, using signif(x, 15) where '15' is the default for the new optional argument 'digitsLabels' {better argument name? (but must nost start with 'label')} PS Let me analyze the current behavior of factor(x) for numeric x with missing(levels) PS and missing(labels). In this situation, levels are computed as sort(unique(x)) PS from possibly transformed x. Then, labels are constructed by a conversion of the PS levels to strings. PS I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows. PS If keepUnique is FALSE (the default), then PS - values x are transformed by signif(x, digitsLabels) PS - labels are computed using as.character(levels) PS - digitsLabels defaults to 15, but may be set to any integer value PS If keepUnique is TRUE, then PS - values x are preserved PS - labels are computed using sprintf(%.*g, digitsLabels, levels) PS - digitsLabels defaults to 17, but may be set to any integer value (in theory; in practice, I think I've suggested somewhere that it should be = 17; but see below.) Your summary seems correct to me. PS There are several situations, when this approach produces duplicated levels. PS Besides the one described in my previous email, there are also others PS factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15) yes, but this is not much sensical; I've already contemplated to produce a warning in such cases, something like if(keepUnique digitsLabels 17) warning(gettextf( 'digitsLabels = %d' is typically too small when 'keepUnique' is true, digitsLabels)) PS factor(1 + 0:5 * 1e-16, digitsLabels=17) again, this does not make much sense; but why disallow the useR to shoot into his foot? PS I would like to suggest a modification. It eliminates most of the cases, where PS we get duplicated levels. It would eliminate all such cases, if the function PS signif() works as expected. Unfortunately, if signif() works as it does in the PS current versions of R, we still get duplicated levels. PS The suggested modification is as follows. PS If keepUnique is FALSE (the default), then PS - values x are transformed by signif(x, digitsLabels) PS - labels are computed using sprintf(%.*g, digitsLabels, levels) PS - digitsLabels defaults to 15, but may be set to any integer value I tend like this change, given -- as you found yesterday -- that as.character() is not even preserving 15 digits. OTOH, as.character() has been in use for a very long history of S (and R), whereas using sprintf() is not back compatible with it and actually depends on the LIBC implementation of the system-sprintf. For that reason as.character() would be preferable. Hmm PS If keepUnique is TRUE, then PS - values x are preserved PS - labels are computed using sprintf(%.*g, 17, levels) PS - digitsLabels is ignored I had originally planned to do exactly the above. However, e.g., digitsLabels = 18 may be desired in some cases, and that's why I also left the possibility to apply it in the keepUnique case. PS Arguments for the modification are the following. PS 1. If keepUnique is FALSE, then computing labels using as.character() leads PS to duplicated labels as demonstrated in my previous email. So, i suggest to PS use sprintf(%.*g, digitsLabels, levels) instead of as.character(). {as said above, that seems sensible, though unfurtunately quite a bit less back-compatible!} PS 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get PS duplicated labels. So, i suggest to force digitsLabels=17, if keepUnique=TRUE. PS If signif(,digitsLabels) works as expected, than the above approach should not PS produce duplicated labels. Unfortunately, this is not the case. PS There are numbers, which remain different in signif(x, 16), but are mapped PS to the same string in sprintf(%.*g, 16, x). Examples of this kind may be PS found using the script PS for (i in 1:50) { PS x - 10^runif(1, 38, 50) PS y - x * (1 + 0:500 * 1e-16) PS y - unique(signif(y, 16)) PS z - unique(sprintf(%.16g, y)) PS stopifnot(length(y) == length(z)) PS } PS This script is tested on Intel default arithmetic and on Intel with SSE. PS Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE. PS Unfortunately, a similar problem occurs even for digitsLabels = 15, although for PS much larger numbers. PS for (i in 1:200) { PS x - 10^runif(1, 250, 300) PS y - x * (1 + 0:500 * 1e-16) PS y -
Re: [Rd] suggestion for extending ?as.factor
On Fri, May 08, 2009 at 03:18:01PM +0200, Martin Maechler wrote: As long as we don't want to allow factor(numeric) to fail --rarely -- I think (and that actually has been a recurring daunting thought for quite a few days) that we probably need an extra step of checking for duplicate levels, and if we find some, recode everything. This will blow up the body of the factor() function even more. What alternatives do you (all R-devel readers!) see? The command f - match(x, levels) in factor() uses the original values and not their string representations. I think that the main reason to do so is that we loose the ordering, if the conversion to character is done before levels are sorted. Let me suggest to consider the following modification, where match() is done on the strings, not on the original values. levels - unique(as.character(sort(unique(x x - as.character(x) f - match(x, levels) Since unique() preserves the order, we will get the levels correctly ordered. Due to using unique() twice, we will not have duplicated levels. Is it correct? Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote: Let me suggest to consider the following modification, where match() is done on the strings, not on the original values. levels - unique(as.character(sort(unique(x x - as.character(x) f - match(x, levels) An alternative solution is ind - order(x) x - as.character(x) # or any other conversion to character levels - unique(x[ind]) # get unique levels ordered by the original values f - match(x, levels) The advantage of this over the suggestion from my previous email is that the string conversion is applied only once. The conversion need not be only as.character(). There may be other choices specified by a parametr. I have strong objections against the existing implementation of as.character(), but still i think that as.character() should be the default for factor() for the sake of consistency of the R language. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PS == Petr Savicky savi...@cs.cas.cz on Fri, 8 May 2009 18:10:56 +0200 writes: PS On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote: Let me suggest to consider the following modification, where match() is done on the strings, not on the original values. levels - unique(as.character(sort(unique(x x - as.character(x) f - match(x, levels) PS An alternative solution is ind - order(x) x - as.character(x) # or any other conversion to character levels - unique(x[ind]) # get unique levels ordered by the original values f - match(x, levels) Yes, that's an interesting quite different and simple approach. PS The advantage of this over the suggestion from my previous email is that PS the string conversion is applied only once. The conversion need not be only PS as.character(). There may be other choices specified by a parametr. I have PS strong objections against the existing implementation of as.character(), {(because it is not *accurate* enough, right ?)} PS but still i think that as.character() should be the default for factor() PS for the sake of consistency of the R language. Hmm... Peter Dalgaard very early in this thread remarked that at least in the use of table(..), factor() should not be extremely accurate, and that's what R-devel's factor has been doing recently. But then, table(.) could be changed to explicitly call factor(signif(x, 15), ...) for the case of numeric x. BTW: I found that practically all the remaining border cases you had, are solved by using 14 instead of 15. I'm currently testing a version of factor() that uses 14, *and* adds an extra final level test, removing duplicated ones. Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Fri, May 08, 2009 at 06:48:40PM +0200, Martin Maechler wrote: PS == Petr Savicky savi...@cs.cas.cz on Fri, 8 May 2009 18:10:56 +0200 writes: [...] PS ... I have PS strong objections against the existing implementation of as.character(), {(because it is not *accurate* enough, right ?)} The problem is not exactly low accuracy. The problem is unpredictable accuracy. If the accuracy is systematically 15 or 14 digits, it would be fine and suitable for most purposes. However the accuracy ranges between 14 and 20 digits and may be different on different platforms. For example, on my old Xeon comupter, the same numbers may be converted to strings representing different values: with SSE without SSE 8459184.47742229 8459184.4774223 84307700406756081664 8.4307700406756e+19 9262815.27852281 9262815.2785228 2.1006742758024e+19 21006742758023974912 7.07078598983389e+25 7.0707859898339e+25 8.0808066145628e+28 8.08080661456281e+28 9180932974.85929 9180932974.8593 72.4923408890729 72.492340889073 Sometimes there are differences in trailing zeros. with SSE without SSE 1.97765325859480e+25 1.9776532585948e+25 21762633836.0360 21762633836.036 2018960238339.80 2018960238339.8 239567.78053486 239567.780534860 2571116684765.50 2571116684765.5 3989945.2102949 3989945.21029490 1.1259245205867e+23 1.12592452058670e+23 3.2867033904477e+29 3.28670339044770e+29 2.8271117654895e+29 2.82711176548950e+29 26854166.6173020 26854166.617302 4.85247217360750 4.8524721736075 345123.247838540 345123.24783854 For random numbers in the sample generated as 10^runif(10, 0, 30), from which i selected the first 20 examples above, the probability of different results was almost 0.01 (978 differences among 10 numbers). I think that the platform dependence even limits the advantage of backward compatibility. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote: PD I think that the real issue is that we actually do want almost-equal PD numbers to be folded together. yes, this now (revision 48469) will happen by default, using signif(x, 15) where '15' is the default for the new optional argument 'digitsLabels' On some platforms, the function factor() in the current R 2.10.0 (2009-05-06 r48478) may produce duplicated levels. The examples are in general platform dependent. The following one produces duplicated (in fact triplicated) levels on both Intel default arithmetic and on Intel with SSE. x - 9.7738826945424 + c(-1, 0, 1) * 1e-14 x - signif(x, 15) factor(x) # [1] 9.7738826945424 9.7738826945424 9.7738826945424 # Levels: 9.7738826945424 9.7738826945424 9.7738826945424 # Warning message: # In `levels-`(`*tmp*`, value = c(9.7738826945424, 9.7738826945424, : # duplicated levels will not be allowed in factors anymore The reason is that the three numbers remain different in signif(x, 15), but are mapped to the same string in as.character(x). length(unique(x)) # [1] 3 length(unique(as.character(x))) # 1 Further examples may be found using x - as.character(9 + runif(5000)) x - as.numeric(x[nchar(x)==15]) # select numbers with 14 digits x - signif(cbind(x - 1e-14, x, x + 1e-14), 15) y - array(as.character(x), dim=dim(x)) x - x[which(y[,1] == y[,3]),] factor(x[1,]) Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
MM == Martin Maechler maech...@stat.math.ethz.ch on Tue, 5 May 2009 10:35:42 +0200 writes: PD == Peter Dalgaard p.dalga...@biostat.ku.dk on Mon, 04 May 2009 19:28:06 +0200 writes: PD Petr Savicky wrote: On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: [snip] Let me quickly expand the tasks we have wanted to address, when I started changing factor() for R-devel. 1) R-core had unanimously decided that R 2.10.0 should not allow duplicated levels in factors anymore. When working on that, I had realized that quite a few bits of code were implicitly relying on duplicated levels (or something related), see below, so the current version of R-devel only *warns* in some cases where duplicated levels are produced instead of giving an error. What I had also found was that basically, even our own (!) code and quite a bit of user code has more or less relied on other things that were not true (even though almost always fulfilled): 2) if x contains no duplicated values, then factor(x) should neither 3) factor(x) constructs a factor object with *unique* levels {This is what our decision 1) implies and now enforces} 4) as.numeric(names(table(x))) should be identical to unique(x) where 4) is basically ensured by 3) as table() calls factor() for non-factor args. As mentioned the bad thing is that 2) - 4) are typically fulfilled in all tests package writers would use. Concerning '3)' [and '1)'], as you know, inside R-core we have proposed to at least ensure that `levels-` should not allow duplicated levels, and I had concluded that a) factor() really should use `levels-` instead of the low-level attr(., levels) - b) factor() itself must make sure that the default levels became unique. --- Given Petr's (and more) examples and the strong requirement of user convenience and back-compatibility, I now tend to agree (with Peter) that we cannot ensure all of 2) and 4) still allow factor() to behave as it did for rounded decimal numbers, and consequently would have to (continue to) not ensuring properties (2) and (4). Something quite unfortunate, since, as I said, much useR code implicitly relies on these, and so that code is buggy even though the bug will only show in exceptional cases. [] PD I think that the real issue is that we actually do want almost-equal PD numbers to be folded together. yes, this now (revision 48469) will happen by default, using signif(x, 15) where '15' is the default for the new optional argument 'digitsLabels' {better argument name? (but must nost start with 'label')} Why '15': Because this is most back-compatible and sufficient to solve simple arithmetic (0.1 + 0.2) issues. MM in most cases, but not all {*when* levels is not specified}, MM but useR's code sometimes *does* rely on factor() / table() MM using exact values. MM Also, what should happen when the user explicitly calls MM factor(x, levels = sort(unique(x))) MM at least in that case we really should *not* fold almost equals. MM and the old code (= R 2.9.0) did fold them in border cases, MM and lead non-unique levels. MM Can we agree that any rounding etc - if needed - will only MM happen when MM 1) missing(levels) MM 2) is.numeric(x) || is.complex(x) The code I've committed (revision 48469) now does that.. MM I'm also thinking of at least keeping the current behavior as an MM option, e.g. by factor(x, , keepUniqueness = TRUE, ) MM where the default would be keepUniqueness = FALSE. current argument name is 'keepUnique'. PD The most relevant case I can conjure up is this (permutation testing): zz - replicate(2,sum(sample(sleep$extra,10))) length(table(zz)) PD [1] 427 length(table(signif(zz,7))) PD [1] 281 PD Notice that the discrepancy comes from sums that really are identical PD values (in decimal arithmetic), but where the binary FP inaccuracy makes PD them slightly different. MM Yes, that's a good example. MM However, I now think it would be helpful to slightly separate MM the issue from what factor() should do from MM how table() should call factor() in those cases it does. I still believe that. Currently, table() calls factor(a, exclude = exclude) when 'a' is not a factor, e.g., when it is numeric. I propose that table() should also gain some of the new optional factor() arguments, and maybe even using a different default than 15 Note that the new R-devel now gives set.seed(7); zz - replicate(2,sum(sample(sleep$extra,10))) length(tz - table(zz)) [1] 283 whereas R = 2.9.0 gives [1] 422 so that at
Re: [Rd] suggestion for extending ?as.factor
On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote: Petr Savicky wrote: For this, we get convert(0.3) [1] 0.3 convert(1/3) [1] 0. # 16 digits suffice convert(0.12345) [1] 0.12345 convert(0.12345678901234567) [1] 0.12345678901234566 0.12345678901234567 == as.numeric(0.12345678901234566) [1] TRUE This algorithm is slower than a single call to sprintf(%.17g, x), but it produces nicer numbers, if possible, and guarantees that the value is always preserved. Yes, but convert(0.2+0.1) [1] 0.30004 I am not sure whether this should be considered an error. Computing with decimal numbers requires some care. If we want to get the result, which is the closest to 0.3, it is better to use round(0.1+0.2, digits=1). I think that the real issue is that we actually do want almost-equal numbers to be folded together. Yes, this is frequently needed. A related question of an approximate match in a numeric sequence was discussed in R-devel thread Match .3 in a sequence from March. In order to fold almost-equal numbers together, we need to specify a tolerance to use. The tolerance depends on application. In my opinion, it is hard to choose a tolerance, which could be a good default. The most relevant case I can conjure up is this (permutation testing): zz - replicate(2,sum(sample(sleep$extra,10))) length(table(zz)) [1] 427 length(table(signif(zz,7))) [1] 281 In order to obtain the correct result in this example, it is possible to use zz - signif(zz,7) as you suggest or zz - round(zz, digits=1) and use the resulting zz for all further calculations. Notice that the discrepancy comes from sums that really are identical values (in decimal arithmetic), but where the binary FP inaccuracy makes them slightly different. [for a nice picture, continue the example with tt - table(signif(zz,7)) plot(as.numeric(names(tt)),tt, type=h) The form of this picture is not due to rounding errors. The picture may be obtained even within an integer arithmetic as follows. ss - round(10*sleep$extra) zz - replicate(2,sum(sample(ss,10))) tt - table(zz) plot(as.numeric(names(tt)),tt, type=h) The variation of the frequencies is due to two effects. First, each individual value of the sum occurs with low probability, so 2 replications do not suffice to get low variance of individual frequencies. Using 1000 repetitions of the code above, i obtained estimate of some of the probabilities. The most frequent sums have probability approximately p=0.0089 for a single sample. With n=2 replications, we get the mean frequency p*n = 178 and standard deviation sqrt(p*(1-p)*n) = 13.28216. The other cause of variation of the frequencies is that even the true distribution of the sums has a lot of local minima and maxima. The mean of 1000 repetitions of the above table restricted to values of the sum in the interval 140:168 produced the estimates value mean frequency (over 1000 tables) 140 172.411 141 172.090 142 174.297 143 166.039 144 159.260 145 163.891 146 162.317 147 165.460 148 177.870 149 177.971 150 177.754 151 178.525 local maximum 152 169.851 153 164.443 local minimum 154 168.488 the mean value of the sum 155 164.816 local minimum 156 169.297 157 179.248 local maximum 158 177.799 159 176.743 160 177.777 161 164.173 162 162.585 163 164.641 164 159.913 165 165.932 166 173.014 167 172.276 168 171.612 The local minima and maxima are visible here. The mean value 154 is approximately the center of the histogram. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PD == Peter Dalgaard p.dalga...@biostat.ku.dk on Mon, 04 May 2009 19:28:06 +0200 writes: PD Petr Savicky wrote: On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: [snip] Let me quickly expand the tasks we have wanted to address, when I started changing factor() for R-devel. 1) R-core had unanimously decided that R 2.10.0 should not allow duplicated levels in factors anymore. When working on that, I had realized that quite a few bits of code were implicitly relying on duplicated levels (or something related), see below, so the current version of R-devel only *warns* in some cases where duplicated levels are produced instead of giving an error. What I had also found was that basically, even our own (!) code and quite a bit of user code has more or less relied on other things that were not true (even though almost always fulfilled): 2) if x contains no duplicated values, then factor(x) should neither 3) factor(x) constructs a factor object with *unique* levels {This is what our decision 1) implies and now enforces} 4) as.numeric(names(table(x))) should be identical to unique(x) where 4) is basically ensured by 3) as table() calls factor() for non-factor args. As mentioned the bad thing is that 2) - 4) are typically fulfilled in all tests package writers would use. Concerning '3)' [and '1)'], as you know, inside R-core we have proposed to at least ensure that `levels-` should not allow duplicated levels, and I had concluded that a) factor() really should use `levels-` instead of the low-level attr(., levels) - b) factor() itself must make sure that the default levels became unique. --- Given Petr's (and more) examples and the strong requirement of user convenience and back-compatibility, I now tend to agree (with Peter) that we cannot ensure all of 2) and 4) still allow factor() to behave as it did for rounded decimal numbers, and consequently would have to (continue to) not ensuring properties (2) and (4). Something quite unfortunate, since, as I said, much useR code implicitly relies on these, and so that code is buggy even though the bug will only show in exceptional cases. Let me suggest to consider also the following algorithm: determine the number of digits needed to preserve the double value exactly for each number separately. An R code prototype demonstrating the algorithm could be as follows convert - function(x) # x should be a single number { for (d in 1:16) { y - sprintf(paste(%., d, g, sep=), x) if (x == as.numeric(y)) { return(y) } } return(sprintf(%.17g, x)) } For this, we get convert(0.3) [1] 0.3 convert(1/3) [1] 0. # 16 digits suffice convert(0.12345) [1] 0.12345 convert(0.12345678901234567) [1] 0.12345678901234566 0.12345678901234567 == as.numeric(0.12345678901234566) [1] TRUE This algorithm is slower than a single call to sprintf(%.17g, x), but it produces nicer numbers, if possible, and guarantees that the value is always preserved. PD Yes, but convert(0.2+0.1) PD [1] 0.30004 PD I think that the real issue is that we actually do want almost-equal PD numbers to be folded together. in most cases, but not all {*when* levels is not specified}, but useR's code sometimes *does* rely on factor() / table() using exact values. Also, what should happen when the user explicitly calls factor(x, levels = sort(unique(x))) at least in that case we really should *not* fold almost equals. and the old code (= R 2.9.0) did fold them in border cases, and lead non-unique levels. Can we agree that any rounding etc - if needed - will only happen when 1) missing(levels) 2) is.numeric(x) || is.complex(x) I'm also thinking of at least keeping the current behavior as an option, e.g. by factor(x, , keepUniqueness = TRUE, ) where the default would be keepUniqueness = FALSE. PD The most relevant case I can conjure up is this (permutation testing): zz - replicate(2,sum(sample(sleep$extra,10))) length(table(zz)) PD [1] 427 length(table(signif(zz,7))) PD [1] 281 PD Notice that the discrepancy comes from sums that really are identical PD values (in decimal arithmetic), but where the binary FP inaccuracy makes PD them slightly different. Yes, that's a good example. However, I now think it would be helpful to slightly separate the issue from what factor() should do from how table() should call factor() in those cases it does. Also, you originally said We might as well have abolished conversion of numerics
Re: [Rd] suggestion for extending ?as.factor
Petr Savicky wrote: Notice that the discrepancy comes from sums that really are identical values (in decimal arithmetic), but where the binary FP inaccuracy makes them slightly different. [for a nice picture, continue the example with tt - table(signif(zz,7)) plot(as.numeric(names(tt)),tt, type=h) The form of this picture is not due to rounding errors. The picture may be obtained even within an integer arithmetic as follows. ss - round(10*sleep$extra) zz - replicate(2,sum(sample(ss,10))) tt - table(zz) plot(as.numeric(names(tt)),tt, type=h) I know. The point was rather that if you are not careful with rounding, you get the some of the bars wrong (you get 2 or 3 small bars very close to each other instead of one longer one). Computed p values from permutation tests (as in mean(sim=obs)) also need care for the same reason. The variation of the frequencies is due to two effects. First, each individual value of the sum occurs with low probability, so 2 The other cause of variation of the frequencies is that even the true distribution of the sums has a lot of local minima and maxima. Yes. You can actually generate the exact distribution easily using d - combn(sleep$extra, 10, sum) d - signif(d,7) tt - table(d) plot(as.numeric(names(tt)),tt, type=h) and if you omit the signif() bit (not with R-devel): table(table(names(table(d 1 2 3 137 161 17 i.e. 315 distinct values but over half occur in duplicate or triplicate versions. -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Tue, May 05, 2009 at 11:27:36AM +0200, Peter Dalgaard wrote: I know. The point was rather that if you are not careful with rounding, you get the some of the bars wrong (you get 2 or 3 small bars very close to each other instead of one longer one). Computed p values from permutation tests (as in mean(sim=obs)) also need care for the same reason. OK. Now, i understand the point of the example. I think that it is the responsibility of the user to find the right way to eliminate the influence of the rounding errors, since this may require a different approach in different situations. However, i can also accept the point of view that as.factor() should do this to some extent by default. For example, we may require that as.factor() is consistent with as.character() in the way how to map different numbers to the same string. At the first glance, one could expect that to implement this, it is sufficient if as.factor(x) performs x - as.numeric(as.character(x)) levels - as.character(sort(unique(x))) Unfortunately, on some platforms (tested on Intel with SSE, R-2.10.0, 2009-05-02 r48453), this may produce repeated levels. x - c(0.6807853176681814000304, 0.6807853176681809559412) x - as.numeric(as.character(x)) levels - as.character(sort(unique(x))) levels # 0.68078531766818 0.68078531766818 levels[1] == levels[2] # TRUE Using the default Intel arithmetic, we get a single level, namely 0.680785317668181. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PS == Petr Savicky savi...@cs.cas.cz on Sun, 3 May 2009 22:32:04 +0200 writes: PS == Petr Savicky savi...@cs.cas.cz on Sun, 3 May 2009 22:32:04 +0200 writes: PS In R-2.10.0, the development version, function as.factor() uses 17 digit PS precision for conversion of numeric values to character type. This PS is very good for the consistency of the resulting factor, however, PS i expect that people will complain about, for example, as.factor(0.3) PS being PS [1] 0.2 PS Levels: 0.2 PS I suggest to extend the Warning section of ?as.factor by the following PS paragraph. PS If as.factor() is used for a numeric vector, then the numbers are PS converted to character strings with 17 digit precision using their PS machine representation. This guarantees that different numbers are PS converted to different levels, but may produce unwanted results, if PS the numbers are expected to have limited number of decimal positions. PS For example, as.factor(c(0.1, 0.2, 0.3)) produces PS [1] 0.10001 0.20001 0.2 PS Levels: 0.10001 0.20001 0.2 PS In order to avoid this, convert the numbers to a character vector PS using formatC() or a similar function before using as.factor(). PS Petr. Thank you, Petr, for the good suggestion. I have added a (shorter) paragraph, though to the 'Details' not the 'Warning' section, and also one to the 'Examples' : ## Converting (non-integer) numbers: as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use factor(format(c(0.1, 0.2, 0.3))) Martin Maechler, ETH Zurich __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
PD == Peter Dalgaard p.dalga...@biostat.ku.dk on Mon, 04 May 2009 15:34:09 +0200 writes: PD Martin Maechler wrote: PS == Petr Savicky savi...@cs.cas.cz on Sun, 3 May 2009 22:32:04 +0200 writes: PS == Petr Savicky savi...@cs.cas.cz on Sun, 3 May 2009 22:32:04 +0200 writes: PS In R-2.10.0, the development version, function as.factor() uses 17 digit PS precision for conversion of numeric values to character type. This PS is very good for the consistency of the resulting factor, however, PS i expect that people will complain about, for example, as.factor(0.3) PS being PS [1] 0.2 PS Levels: 0.2 PS I suggest to extend the Warning section of ?as.factor by the following PS paragraph. PS If as.factor() is used for a numeric vector, then the numbers are PS converted to character strings with 17 digit precision using their PS machine representation. This guarantees that different numbers are PS converted to different levels, but may produce unwanted results, if PS the numbers are expected to have limited number of decimal positions. PS For example, as.factor(c(0.1, 0.2, 0.3)) produces PS [1] 0.10001 0.20001 0.2 PS Levels: 0.10001 0.20001 0.2 PS In order to avoid this, convert the numbers to a character vector PS using formatC() or a similar function before using as.factor(). PS Petr. Thank you, Petr, for the good suggestion. I have added a (shorter) paragraph, though to the 'Details' not the 'Warning' section, and also one to the 'Examples' : ## Converting (non-integer) numbers: as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use factor(format(c(0.1, 0.2, 0.3))) PD Martin, PD I tend to consider this a bug, plain and simple. We might as well have PD abolished conversion of numerics to factor altogether. hmm, that comparison *is* an exaggeration ((Much code would have stopped working had we implemented the latter !!)) PD (Notice, BTW, that conversions to mode character PD changes the sort order so format() is not a universal PD fix. IIRC, we did consider the 1 10 2 3 4 5 6 7 8 9 PD issue when designing R's version factor().) yes, but I did not understand why this is relevant here; as factor(c(10,1:9, 8:2,1:5)) [1] 10 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 Levels: 1 2 3 4 5 6 7 8 9 10 also in earlier versions of R. PD The current R-devel behaviour is silly and we should just get rid of it PD before a final release. It should be the other way around: If people PD rely on whether numerical factor levels differ with 17 digits precision, PD THEN they should use format with suitable arguments. Hmm, I know tend to agree that we must further change some of the R-devel behavior. PD If we have issues with numeric values that are very slightly different PD but round to get the same level name, how about putting something like PD if (is.numeric(x)) x - zapsmall(x) well, rather if (is.numeric(x)) x - signif(x, 15) where '15' could be replaced by 7 or other values in 5:20 PD somewhere at the start of the factor() function? Let me quickly expand the tasks we have wanted to address, when I started changing factor() for R-devel. 1) R-core had unanimously decided that R 2.10.0 should not allow duplicated levels in factors anymore. When working on that, I had realized that quite a few bits of code were implicitly relying on duplicated levels (or something related), see below, so the current version of R-devel only *warns* in some cases where duplicated levels are produced instead of giving an error. What I had also found was that basically, even our own (!) code and quite a bit of user code has more or less relied on other things that were not true (even though almost always fulfilled): 2) if x contains no duplicated values, then factor(x) should neither 3) factor(x) constructs a factor object with *unique* levels {This is what our decision 1) implies and now enforces} 4) as.numeric(names(table(x))) should be identical to unique(x) where 4) is basically ensured by 3) as table() calls factor() for non-factor args. As mentioned the bad thing is that 2) - 4) are typically fulfilled in all tests package writers would use. Concerning '3)' [and '1)'], as you know, inside R-core we have proposed to at least ensure that `levels-` should not allow duplicated levels, and I had concluded that a) factor() really should use `levels-` instead of the low-level attr(., levels) - b) factor() itself must make sure that the default levels became unique. --- Given Petr's (and more) examples and the strong requirement of user convenience and back-compatibility, I now tend to agree (with Peter)
Re: [Rd] suggestion for extending ?as.factor
On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: [snip] Let me quickly expand the tasks we have wanted to address, when I started changing factor() for R-devel. 1) R-core had unanimously decided that R 2.10.0 should not allow duplicated levels in factors anymore. When working on that, I had realized that quite a few bits of code were implicitly relying on duplicated levels (or something related), see below, so the current version of R-devel only *warns* in some cases where duplicated levels are produced instead of giving an error. What I had also found was that basically, even our own (!) code and quite a bit of user code has more or less relied on other things that were not true (even though almost always fulfilled): 2) if x contains no duplicated values, then factor(x) should neither 3) factor(x) constructs a factor object with *unique* levels {This is what our decision 1) implies and now enforces} 4) as.numeric(names(table(x))) should be identical to unique(x) where 4) is basically ensured by 3) as table() calls factor() for non-factor args. As mentioned the bad thing is that 2) - 4) are typically fulfilled in all tests package writers would use. Concerning '3)' [and '1)'], as you know, inside R-core we have proposed to at least ensure that `levels-` should not allow duplicated levels, and I had concluded that a) factor() really should use `levels-` instead of the low-level attr(., levels) - b) factor() itself must make sure that the default levels became unique. --- Given Petr's (and more) examples and the strong requirement of user convenience and back-compatibility, I now tend to agree (with Peter) that we cannot ensure all of 2) and 4) still allow factor() to behave as it did for rounded decimal numbers, and consequently would have to (continue to) not ensuring properties (2) and (4). Something quite unfortunate, since, as I said, much useR code implicitly relies on these, and so that code is buggy even though the bug will only show in exceptional cases. Let me suggest to consider also the following algorithm: determine the number of digits needed to preserve the double value exactly for each number separately. An R code prototype demonstrating the algorithm could be as follows convert - function(x) # x should be a single number { for (d in 1:16) { y - sprintf(paste(%., d, g, sep=), x) if (x == as.numeric(y)) { return(y) } } return(sprintf(%.17g, x)) } For this, we get convert(0.3) [1] 0.3 convert(1/3) [1] 0. # 16 digits suffice convert(0.12345) [1] 0.12345 convert(0.12345678901234567) [1] 0.12345678901234566 0.12345678901234567 == as.numeric(0.12345678901234566) [1] TRUE This algorithm is slower than a single call to sprintf(%.17g, x), but it produces nicer numbers, if possible, and guarantees that the value is always preserved. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
Petr Savicky wrote: On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: [snip] Let me quickly expand the tasks we have wanted to address, when I started changing factor() for R-devel. 1) R-core had unanimously decided that R 2.10.0 should not allow duplicated levels in factors anymore. When working on that, I had realized that quite a few bits of code were implicitly relying on duplicated levels (or something related), see below, so the current version of R-devel only *warns* in some cases where duplicated levels are produced instead of giving an error. What I had also found was that basically, even our own (!) code and quite a bit of user code has more or less relied on other things that were not true (even though almost always fulfilled): 2) if x contains no duplicated values, then factor(x) should neither 3) factor(x) constructs a factor object with *unique* levels {This is what our decision 1) implies and now enforces} 4) as.numeric(names(table(x))) should be identical to unique(x) where 4) is basically ensured by 3) as table() calls factor() for non-factor args. As mentioned the bad thing is that 2) - 4) are typically fulfilled in all tests package writers would use. Concerning '3)' [and '1)'], as you know, inside R-core we have proposed to at least ensure that `levels-` should not allow duplicated levels, and I had concluded that a) factor() really should use `levels-` instead of the low-level attr(., levels) - b) factor() itself must make sure that the default levels became unique. --- Given Petr's (and more) examples and the strong requirement of user convenience and back-compatibility, I now tend to agree (with Peter) that we cannot ensure all of 2) and 4) still allow factor() to behave as it did for rounded decimal numbers, and consequently would have to (continue to) not ensuring properties (2) and (4). Something quite unfortunate, since, as I said, much useR code implicitly relies on these, and so that code is buggy even though the bug will only show in exceptional cases. Let me suggest to consider also the following algorithm: determine the number of digits needed to preserve the double value exactly for each number separately. An R code prototype demonstrating the algorithm could be as follows convert - function(x) # x should be a single number { for (d in 1:16) { y - sprintf(paste(%., d, g, sep=), x) if (x == as.numeric(y)) { return(y) } } return(sprintf(%.17g, x)) } For this, we get convert(0.3) [1] 0.3 convert(1/3) [1] 0. # 16 digits suffice convert(0.12345) [1] 0.12345 convert(0.12345678901234567) [1] 0.12345678901234566 0.12345678901234567 == as.numeric(0.12345678901234566) [1] TRUE This algorithm is slower than a single call to sprintf(%.17g, x), but it produces nicer numbers, if possible, and guarantees that the value is always preserved. Yes, but convert(0.2+0.1) [1] 0.30004 I think that the real issue is that we actually do want almost-equal numbers to be folded together. The most relevant case I can conjure up is this (permutation testing): zz - replicate(2,sum(sample(sleep$extra,10))) length(table(zz)) [1] 427 length(table(signif(zz,7))) [1] 281 Notice that the discrepancy comes from sums that really are identical values (in decimal arithmetic), but where the binary FP inaccuracy makes them slightly different. [for a nice picture, continue the example with tt - table(signif(zz,7)) plot(as.numeric(names(tt)),tt, type=h) ] -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
Martin Maechler wrote: PS == Petr Savicky savi...@cs.cas.cz on Sun, 3 May 2009 22:32:04 +0200 writes: PS == Petr Savicky savi...@cs.cas.cz on Sun, 3 May 2009 22:32:04 +0200 writes: PS In R-2.10.0, the development version, function as.factor() uses 17 digit PS precision for conversion of numeric values to character type. This PS is very good for the consistency of the resulting factor, however, PS i expect that people will complain about, for example, as.factor(0.3) PS being PS [1] 0.2 PS Levels: 0.2 PS I suggest to extend the Warning section of ?as.factor by the following PS paragraph. PS If as.factor() is used for a numeric vector, then the numbers are PS converted to character strings with 17 digit precision using their PS machine representation. This guarantees that different numbers are PS converted to different levels, but may produce unwanted results, if PS the numbers are expected to have limited number of decimal positions. PS For example, as.factor(c(0.1, 0.2, 0.3)) produces PS [1] 0.10001 0.20001 0.2 PS Levels: 0.10001 0.20001 0.2 PS In order to avoid this, convert the numbers to a character vector PS using formatC() or a similar function before using as.factor(). PS Petr. Thank you, Petr, for the good suggestion. I have added a (shorter) paragraph, though to the 'Details' not the 'Warning' section, and also one to the 'Examples' : ## Converting (non-integer) numbers: as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use factor(format(c(0.1, 0.2, 0.3))) Martin, I tend to consider this a bug, plain and simple. We might as well have abolished conversion of numerics to factor altogether. (Notice, BTW, that conversions to mode character changes the sort order so format() is not a universal fix. IIRC, we did consider the 1 10 2 3 4 5 6 7 8 9 issue when designing R's version factor().) The current R-devel behaviour is silly and we should just get rid of it before a final release. It should be the other way around: If people rely on whether numerical factor levels differ with 17 digits precision, THEN they should use format with suitable arguments. If we have issues with numeric values that are very slightly different but round to get the same level name, how about putting something like if (is.numeric(x)) x - zapsmall(x) somewhere at the start of the factor() function? -p -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel