Re: [Rd] suggestion for extending ?as.factor

2009-05-12 Thread Petr Savicky
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote:
 The version I have committed a few hours ago is indeed a much
 re-simplified version, using  as.character(.) explicitly

The current development version (2009-05-11 r48528) contains
in ?factor a description of levels parametr

  levels: an optional vector of the values that 'x' might have taken. 
  The default is the unique set of values taken by
  'character(x)', sorted into increasing order _of 'x'_.  Note
  that this set can be smaller than 'sort(unique(x))'.

I think that 'character(x)' should be 'as.character(x)'.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-12 Thread Martin Maechler
 PS == Petr Savicky savi...@cs.cas.cz
 on Tue, 12 May 2009 13:17:15 +0200 writes:

PS On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote:
 The version I have committed a few hours ago is indeed a much
 re-simplified version, using  as.character(.) explicitly

PS The current development version (2009-05-11 r48528) contains
PS in ?factor a description of levels parametr

PS levels: an optional vector of the values that 'x' might have taken. 
PS The default is the unique set of values taken by
PS 'character(x)', sorted into increasing order _of 'x'_.  Note
PS that this set can be smaller than 'sort(unique(x))'.

PS I think that 'character(x)' should be 'as.character(x)'.

Definitely; thank you.
Thanks as well to suggestions from yesterday, of which I have at
least added the approximately.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-11 Thread Martin Maechler
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 10 May 2009 13:52:53 +0200 writes:

PS On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote:
PS [...]
 If'd revert to such a solution,
 we'd have to get back to Peter's point about the issue that
 he'd think  table(.) should be more tolerant than as.character()
 about almost equality.
 For compatibility reasons, we could also return back to the
 reasoning that useR should use {something like}
 table(signif(x, 14)) 
 instead of
 table(x) 
 for numeric x in typical cases.

PS In the released versions 2.8.1 and 2.9.0, function factor() satisfies
PS identical(as.character(factor(x)), as.character(x))(*)
PS for all numeric x. This follows from the code (levels are computed by
PS as.character() from unmodified input values) and may be verified
PS even for the problematic cases, for example
PS x - (0.3 + 2e-16 * c(-2,-1,1,2))
PS factor(x)
PS # [1] 0.300 0.3  0.3  0.300
PS # Levels: 0.300 0.3 0.3 0.300
PS as.character(x)
PS # [1] 0.300 0.3   0.3  
PS # [4] 0.300
PS identical(as.character(factor(x)), as.character(x))
PS # [1] TRUE

PS In my opinion, it is reasonable to require that (*) be
PS preserved also in future versions of R.

PS Function as.character(x) has disadvantages. Besides of
PS the platform dependence, it also does not always perform
PS rounding needed to eliminate FP errors. Usually,
PS as.character(x) rounds to at most 15 digits, so, we get,
PS for example

PS as.character(0.1 + 0.2) # [1] 0.3
PS as required. However, there are also exceptions, for example
PS as.character(1e19 + 1e5) # [1] 10100352

PS Here, the number is printed exactly, so the resulting
PS string contains the FP error caused by the fact that
PS 1e19 + 1e5 has more than 53 significant digits in binary
PS representation, namely 59.

PS binary representation of 1e19 + 1e5 is
PS 100010101100011100100011010010001000100111101010

PS binary representation of 10100352 is
PS 100010101100011100100011010010001000100110001000

PS However, as.character(x) seems to do enough rounding for
PS most purposes, otherwise it would not be suitable as the
PS basic numeric to character conversion. If table() needs
PS factor() with a different conversion than
PS as.character(x), it may be done explicitly as discussed
PS by Martin above.

PS So, i suggest to use as.character() as the default
PS conversion in factor(), so that
PS identical(as.character(factor(x)), as.character(x)) is
PS satisfied for the default usage of factor().

PS Of course, i appreciate, if factor() has parameters,
PS which allow better control of the underlying conversion,
PS as it is done in the current development versions.

The version I have committed a few hours ago is indeed a much
re-simplified version, using  as.character(.) explicitly
and consequently no longer providing the extra optional
arguments that we have had for a couple of days.

Keeping such a basic function   factor()  as simple as possible 
seems a good strategy to me.

Martin Maechler

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-11 Thread Petr Savicky
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote:
[...]
 The version I have committed a few hours ago is indeed a much
 re-simplified version, using  as.character(.) explicitly
 and consequently no longer providing the extra optional
 arguments that we have had for a couple of days.
 
 Keeping such a basic function   factor()  as simple as possible 
 seems a good strategy to me.

OK. I understand the argument of simplicity. So, factor(x) is just
a compressed encoding of as.character(x), where each value is stored
only once. This sounds good to me.

Let me go back to the original purpose of this thread: suggestion for
extending ?as.factor

I think that somewhere in the help page, we could have something like

  Using factor() to a numeric vector should be done with caution. The
  information in x is preserved to the extent to which it is preserved
  in as.character(x). If this leads to too many different levels due to minor
  differences among the input numbers, it is suggested to use something like 
  factor(signif(x, digits)) or factor(round(x, digits)), where the number of 
  decimal digits appropriate for a given application should be used.

Let me point out that the following sentence from Warning is not exactly correct
as it is in svn at the moment. So, i suggest to add the word approximately to
the place marked with square brackets and add one more sentence of explanation
marked also by square brackets.

  To transform a factor \code{f} to [approximately]
  its original numeric values, \code{as.numeric(levels(f))[f]} is
  recommended and slightly more efficient than
  \code{as.numeric(as.character(f))}.
  [Note that the original values may be extracted only to the precision
  used in as.character(x), which is typically 15 decimal digits.]

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-10 Thread Petr Savicky
On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote:
[...]
 If'd revert to such a solution,
 we'd have to get back to Peter's point about the issue that
 he'd think  table(.) should be more tolerant than as.character()
 about almost equality.
 For compatibility reasons, we could also return back to the
 reasoning that useR should use {something like}
 table(signif(x, 14)) 
 instead of
 table(x) 
 for numeric x in typical cases.

In the released versions 2.8.1 and 2.9.0, function factor() satisfies
  identical(as.character(factor(x)), as.character(x))(*)
for all numeric x. This follows from the code (levels are computed by
as.character() from unmodified input values) and may be verified
even for the problematic cases, for example
  x - (0.3 + 2e-16 * c(-2,-1,1,2))
  factor(x)
  # [1] 0.300 0.3   0.3   0.300
  # Levels: 0.300 0.3 0.3 0.300
  as.character(x)
  # [1] 0.300 0.3   0.3  
  # [4] 0.300
  identical(as.character(factor(x)), as.character(x))
  # [1] TRUE

In my opinion, it is reasonable to require that (*) be preserved also in future
versions of R.

Function as.character(x) has disadvantages. Besides of the platform dependence,
it also does not always perform rounding needed to eliminate FP errors. Usually,
as.character(x) rounds to at most 15 digits, so, we get, for example
  as.character(0.1 + 0.2) # [1] 0.3
as required. However, there are also exceptions, for example
  as.character(1e19 + 1e5) # [1] 10100352

Here, the number is printed exactly, so the resulting string contains the FP 
error
caused by the fact that 1e19 + 1e5 has more than 53 significant digits in binary
representation, namely 59.

  binary representation of 1e19 + 1e5 is
  100010101100011100100011010010001000100111101010

  binary representation of 10100352 is
  100010101100011100100011010010001000100110001000

However, as.character(x) seems to do enough rounding for most purposes, 
otherwise
it would not be suitable as the basic numeric to character conversion. If 
table() needs
factor() with a different conversion than as.character(x), it may be done 
explicitly
as discussed by Martin above.

So, i suggest to use as.character() as the default conversion in factor(), so 
that 
  identical(as.character(factor(x)), as.character(x))
is satisfied for the default usage of factor().

Of course, i appreciate, if factor() has parameters, which allow better control
of the underlying conversion, as it is done in the current development versions.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-09 Thread Michael Dewey

At 14:18 08/05/2009, Martin Maechler wrote:


 PS == Petr Savicky savi...@cs.cas.cz
 on Fri, 8 May 2009 11:01:55 +0200 writes:


Somewhere below Martin asks for alternatives from list readers. I do 
not have alternatives, but I do have two comments, one immediately 
below this, the other embedded in-line.


This whole thread reminds me just why I have spent the best part of a 
decade climbing the virtual Matterhorn called 'Learning R' and why it 
is such a pleasure to use. It is the fact that somebody, somewhere 
cares enough about consistency, usability and accuracy to devote 
hours to getting even obscure details just right.




PS On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
PD I think that the real issue is that we actually do want almost-equal
PD numbers to be folded together.

 yes, this now (revision 48469) will happen by default, 
using  signif(x, 15)

 where '15' is the default for the new optional argument 'digitsLabels'
 {better argument name? (but must nost start with 'label')}

PS Let me analyze the current behavior of factor(x) for 
numeric x with missing(levels)
PS and missing(labels). In this situation, levels are computed 
as sort(unique(x))
PS from possibly transformed x. Then, labels are constructed 
by a conversion of the

PS levels to strings.

PS I understand the current (R 2.10.0, 2009-05-07 r48492) 
behavior as follows.


PS If keepUnique is FALSE (the default), then
PS - values x are transformed by signif(x, digitsLabels)
PS - labels are computed using as.character(levels)
PS - digitsLabels defaults to 15, but may be set to any integer value

PS If keepUnique is TRUE, then
PS - values x are preserved
PS - labels are computed using sprintf(%.*g, digitsLabels, levels)
PS - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that
 it should be  = 17;  but see below.)

Your summary seems correct to me.

PS There are several situations, when this approach produces 
duplicated levels.

PS Besides the one described in my previous email, there are also others
PS factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this is not much sensical; I've already contemplated
to produce a warning in such cases, something like

   if(keepUnique  digitsLabels  17)
 warning(gettextf(
 'digitsLabels = %d' is typically too small when 'keepUnique' is true,
 digitsLabels))


PS factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why disallow the useR
to shoot into his foot?


I agree. As a useR I do not want to be stopped from doing anything. I 
would appreciate a warning just before I shoot myself in the foot and 
I definitely want one if it looks like I am going to aim for my head.


PS I would like to suggest a modification. It eliminates most 
of the cases, where
PS we get duplicated levels. It would eliminate all such 
cases, if the function
PS signif() works as expected. Unfortunately, if signif() 
works as it does in the

PS current versions of R, we still get duplicated levels.

PS The suggested modification is as follows.

PS If keepUnique is FALSE (the default), then
PS - values x are transformed by signif(x, digitsLabels)
PS - labels are computed using sprintf(%.*g, digitsLabels, levels)
PS - digitsLabels defaults to 15, but may be set to any integer value

I tend like this change, given -- as you found yesterday -- that
as.character() is not even preserving 15 digits.
OTOH,  as.character() has been in use for a very long history of
S (and R), whereas using sprintf() is not back compatible with
it and actually depends on the LIBC implementation of the system-sprintf.
For that reason as.character() would be preferable.
Hmm

PS If keepUnique is TRUE, then
PS - values x are preserved
PS - labels are computed using sprintf(%.*g, 17, levels)
PS - digitsLabels is ignored

I had originally planned to do exactly the above.
However, e.g.,  digitsLabels = 18  may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.


PS Arguments for the modification are the following.

PS 1. If keepUnique is FALSE, then computing labels using 
as.character() leads
PS to duplicated labels as demonstrated in my previous email. 
So, i suggest to

PS use sprintf(%.*g, digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfurtunately quite
 a bit less back-compatible!}

PS 2. If keepUnique is TRUE and we allow digitsLabels less 
than 17, then we get
PS duplicated labels. So, i suggest to force digitsLabels=17, 
if keepUnique=TRUE.


PS If signif(,digitsLabels) works as expected, than the above 
approach should not

PS produce duplicated labels. Unfortunately, this is not the case.
PS There are 

Re: [Rd] suggestion for extending ?as.factor

2009-05-09 Thread Martin Maechler
 PS == Petr Savicky savi...@cs.cas.cz
 on Fri, 8 May 2009 18:10:56 +0200 writes:

PS On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote:
 Let me suggest to consider the following modification, where match() is 
done
 on the strings, not on the original values.
 levels - unique(as.character(sort(unique(x
 x - as.character(x)
 f - match(x, levels)

PS An alternative solution is

PS ind - order(x)
PS x - as.character(x) # or any other conversion to character
PS levels - unique(x[ind]) # get unique levels ordered by the original 
values
PS f - match(x, levels)

(slightly but not much more complicated though).

Yes, indeed that brings us back to (something like) the original
use  factor(format(x))  ...  suggestion which would have been
fine if there hadn't been the issue of ordering,
exactly what you've addressed before.


PS The advantage of this over the suggestion from my previous email is that
PS the string conversion is applied only once. The conversion need not be 
only
PS as.character(). There may be other choices specified by a parametr. I 
have
PS strong objections against the existing implementation of as.character(),
PS but still i think that as.character() should be the default for factor()
PS for the sake of consistency of the R language.

The biggest advantage to reverting to something simple like
that, would be that it is really simple.

My first tests with (a variation of) the above indicate
favorable results.  More on this on Monday.
If'd revert to such a solution,
we'd have to get back to Peter's point about the issue that
he'd think  table(.) should be more tolerant than as.character()
about almost equality.
For compatibility reasons, we could also return back to the
reasoning that useR should use {something like}
table(signif(x, 14)) 
instead of
table(x) 
for numeric x in typical cases.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
  PD I think that the real issue is that we actually do want almost-equal
  PD numbers to be folded together. 
 
 yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
 where '15' is the default for the new optional argument 'digitsLabels'
 {better argument name? (but must nost start with 'label')}

Let me analyze the current behavior of factor(x) for numeric x with 
missing(levels)
and missing(labels). In this situation, levels are computed as sort(unique(x))
from possibly transformed x. Then, labels are constructed by a conversion of the
levels to strings.

I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.

  If keepUnique is FALSE (the default), then
- values x are transformed by signif(x, digitsLabels)
- labels are computed using as.character(levels)
- digitsLabels defaults to 15, but may be set to any integer value

  If keepUnique is TRUE, then
- values x are preserved
- labels are computed using sprintf(%.*g, digitsLabels, levels)
- digitsLabels defaults to 17, but may be set to any integer value

There are several situations, when this approach produces duplicated levels.
Besides the one described in my previous email, there are also others
 factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)
 factor(1 + 0:5 * 1e-16, digitsLabels=17)

I would like to suggest a modification. It eliminates most of the cases, where
we get duplicated levels. It would eliminate all such cases, if the function
signif() works as expected. Unfortunately, if signif() works as it does in the
current versions of R, we still get duplicated levels.

The suggested modification is as follows.

  If keepUnique is FALSE (the default), then
- values x are transformed by signif(x, digitsLabels)
- labels are computed using sprintf(%.*g, digitsLabels, levels)
- digitsLabels defaults to 15, but may be set to any integer value

  If keepUnique is TRUE, then
- values x are preserved
- labels are computed using sprintf(%.*g, 17, levels)
- digitsLabels is ignored

Arguments for the modification are the following.

1. If keepUnique is FALSE, then computing labels using as.character() leads
   to duplicated labels as demonstrated in my previous email. So, i suggest to
   use sprintf(%.*g, digitsLabels, levels) instead of as.character().

2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
   duplicated labels. So, i suggest to force digitsLabels=17, if 
keepUnique=TRUE.

If signif(,digitsLabels) works as expected, than the above approach should not
produce duplicated labels. Unfortunately, this is not the case.
There are numbers, which remain different in signif(x, 16), but are mapped
to the same string in sprintf(%.*g, 16, x). Examples of this kind may be
found using the script

   for (i in 1:50) {
   x - 10^runif(1, 38, 50)
   y - x * (1 + 0:500 * 1e-16)
   y - unique(signif(y, 16))
   z - unique(sprintf(%.16g, y))
   stopifnot(length(y) == length(z))
   }

This script is tested on Intel default arithmetic and on Intel with SSE.

Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE.

Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
much larger numbers.

   for (i in 1:200) {
   x - 10^runif(1, 250, 300)
   y - x * (1 + 0:500 * 1e-16)
   y - unique(signif(y, 15))
   z - unique(sprintf(%.15g, y))
   stopifnot(length(y) == length(z))
   }

This script finds collisions, if SSE is enabled, on two Intel computers, where 
i did
the test. Without SSE, it finds collisions only on one of them. May be, it 
depends
also on the compiler, which is different.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Martin Maechler
 PS == Petr Savicky savi...@cs.cas.cz
 on Fri, 8 May 2009 11:01:55 +0200 writes:

PS On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
PD I think that the real issue is that we actually do want almost-equal
PD numbers to be folded together. 
 
 yes, this now (revision 48469) will happen by default, using  signif(x, 
15) 
 where '15' is the default for the new optional argument 'digitsLabels'
 {better argument name? (but must nost start with 'label')}

PS Let me analyze the current behavior of factor(x) for numeric x with 
missing(levels)
PS and missing(labels). In this situation, levels are computed as 
sort(unique(x))
PS from possibly transformed x. Then, labels are constructed by a 
conversion of the
PS levels to strings.

PS I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as 
follows.

PS If keepUnique is FALSE (the default), then
PS - values x are transformed by signif(x, digitsLabels)
PS - labels are computed using as.character(levels)
PS - digitsLabels defaults to 15, but may be set to any integer value

PS If keepUnique is TRUE, then
PS - values x are preserved
PS - labels are computed using sprintf(%.*g, digitsLabels, levels)
PS - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that
 it should be  = 17;  but see below.)

Your summary seems correct to me.

PS There are several situations, when this approach produces duplicated 
levels.
PS Besides the one described in my previous email, there are also others
PS factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this is not much sensical; I've already contemplated
to produce a warning in such cases, something like
   
   if(keepUnique  digitsLabels  17)
 warning(gettextf(
 'digitsLabels = %d' is typically too small when 'keepUnique' is true,
 digitsLabels))


PS factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why disallow the useR
to shoot into his foot?

PS I would like to suggest a modification. It eliminates most of the 
cases, where
PS we get duplicated levels. It would eliminate all such cases, if the 
function
PS signif() works as expected. Unfortunately, if signif() works as it does 
in the
PS current versions of R, we still get duplicated levels.

PS The suggested modification is as follows.

PS If keepUnique is FALSE (the default), then
PS - values x are transformed by signif(x, digitsLabels)
PS - labels are computed using sprintf(%.*g, digitsLabels, levels)
PS - digitsLabels defaults to 15, but may be set to any integer value

I tend like this change, given -- as you found yesterday -- that
as.character() is not even preserving 15 digits.
OTOH,  as.character() has been in use for a very long history of
S (and R), whereas using sprintf() is not back compatible with
it and actually depends on the LIBC implementation of the system-sprintf.
For that reason as.character() would be preferable.
Hmm

PS If keepUnique is TRUE, then
PS - values x are preserved
PS - labels are computed using sprintf(%.*g, 17, levels)
PS - digitsLabels is ignored

I had originally planned to do exactly the above.
However, e.g.,  digitsLabels = 18  may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.


PS Arguments for the modification are the following.

PS 1. If keepUnique is FALSE, then computing labels using as.character() 
leads
PS to duplicated labels as demonstrated in my previous email. So, i 
suggest to
PS use sprintf(%.*g, digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfurtunately quite
 a bit less back-compatible!}

PS 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then 
we get
PS duplicated labels. So, i suggest to force digitsLabels=17, if 
keepUnique=TRUE.

PS If signif(,digitsLabels) works as expected, than the above approach 
should not
PS produce duplicated labels. Unfortunately, this is not the case.
PS There are numbers, which remain different in signif(x, 16), but are 
mapped
PS to the same string in sprintf(%.*g, 16, x). Examples of this kind may 
be
PS found using the script

PS for (i in 1:50) {
PS x - 10^runif(1, 38, 50)
PS y - x * (1 + 0:500 * 1e-16)
PS y - unique(signif(y, 16))
PS z - unique(sprintf(%.16g, y))
PS stopifnot(length(y) == length(z))
PS }

PS This script is tested on Intel default arithmetic and on Intel with SSE.

PS Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE.

PS Unfortunately, a similar problem occurs even for digitsLabels = 15, 
although for
PS much larger numbers.

PS for (i in 1:200) {
PS x - 10^runif(1, 250, 300)
PS y - x * (1 + 0:500 * 1e-16)
PS y - 

Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Fri, May 08, 2009 at 03:18:01PM +0200, Martin Maechler wrote:
 As long as we don't want to allow  factor(numeric) to fail --rarely -- 
 I think (and that actually has been a recurring daunting thought
 for quite a few days) that we probably need an
 extra step of checking for duplicate levels, and if we find
 some, recode everything. This will blow up the body of the
 factor() function even more.
 
 What alternatives do you (all R-devel readers!) see?

The command 
  f - match(x, levels)
in factor() uses the original values and not their string representations.
I think that the main reason to do so is that we loose the ordering, if the
conversion to character is done before levels are sorted.

Let me suggest to consider the following modification, where match() is done
on the strings, not on the original values.
  levels - unique(as.character(sort(unique(x
  x - as.character(x)
  f - match(x, levels)

Since unique() preserves the order, we will get the levels correctly
ordered. Due to using unique() twice, we will not have duplicated levels.

Is it correct?

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote:
 Let me suggest to consider the following modification, where match() is done
 on the strings, not on the original values.
   levels - unique(as.character(sort(unique(x
   x - as.character(x)
   f - match(x, levels)

An alternative solution is
  ind - order(x)
  x - as.character(x) # or any other conversion to character
  levels - unique(x[ind]) # get unique levels ordered by the original values
  f - match(x, levels)

The advantage of this over the suggestion from my previous email is that
the string conversion is applied only once. The conversion need not be only
as.character(). There may be other choices specified by a parametr. I have
strong objections against the existing implementation of as.character(),
but still i think that as.character() should be the default for factor()
for the sake of consistency of the R language.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Martin Maechler
 PS == Petr Savicky savi...@cs.cas.cz
 on Fri, 8 May 2009 18:10:56 +0200 writes:

PS On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote:
 Let me suggest to consider the following modification, where match() is 
done
 on the strings, not on the original values.
 levels - unique(as.character(sort(unique(x
 x - as.character(x)
 f - match(x, levels)

PS An alternative solution is

 ind - order(x)
 x - as.character(x) # or any other conversion to character
 levels - unique(x[ind]) # get unique levels ordered by the original 
values
 f - match(x, levels)

Yes, that's an interesting quite different and simple approach.

PS The advantage of this over the suggestion from my previous email is that
PS the string conversion is applied only once. The conversion need not be 
only
PS as.character(). There may be other choices specified by a parametr. I 
have
PS strong objections against the existing implementation of as.character(),

{(because it is not *accurate* enough, right ?)}

PS but still i think that as.character() should be the default for factor()
PS for the sake of consistency of the R language.

Hmm...  Peter Dalgaard very early in this thread
remarked that at least in the use of  table(..),
factor() should not be extremely accurate, and that's what
R-devel's factor has been doing recently.

But then, table(.) could be changed to explicitly call
factor(signif(x, 15), ...)
for the case of numeric x.

BTW: I found that practically all the remaining border cases you
 had, are solved by using  14 instead of 15.

I'm currently testing a version of factor() that uses 14, 
*and* adds an extra final level test, removing duplicated ones.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Fri, May 08, 2009 at 06:48:40PM +0200, Martin Maechler wrote:
  PS == Petr Savicky savi...@cs.cas.cz
  on Fri, 8 May 2009 18:10:56 +0200 writes:
[...]
 PS ... I have
 PS strong objections against the existing implementation of 
 as.character(),
 
 {(because it is not *accurate* enough, right ?)}

The problem is not exactly low accuracy. The problem is unpredictable
accuracy. If the accuracy is systematically 15 or 14 digits, it would be
fine and suitable for most purposes.

However the accuracy ranges between 14 and 20 digits and may be different
on different platforms. For example, on my old Xeon comupter, the same
numbers may be converted to strings representing different values:

  with SSE   without SSE

  8459184.47742229 8459184.4774223 
  84307700406756081664 8.4307700406756e+19 
  9262815.27852281 9262815.2785228 
  2.1006742758024e+19  21006742758023974912
  7.07078598983389e+25 7.0707859898339e+25 
  8.0808066145628e+28  8.08080661456281e+28
  9180932974.85929 9180932974.8593 
  72.4923408890729 72.492340889073 

Sometimes there are differences in trailing zeros.

  with SSE   without SSE

  1.97765325859480e+25 1.9776532585948e+25 
  21762633836.0360 21762633836.036 
  2018960238339.80 2018960238339.8 
  239567.78053486  239567.780534860
  2571116684765.50 2571116684765.5 
  3989945.2102949  3989945.21029490
  1.1259245205867e+23  1.12592452058670e+23
  3.2867033904477e+29  3.28670339044770e+29
  2.8271117654895e+29  2.82711176548950e+29
  26854166.6173020 26854166.617302 
  4.85247217360750 4.8524721736075 
  345123.247838540 345123.24783854 

For random numbers in the sample generated as 10^runif(10, 0, 30),
from which i selected the first 20 examples above, the probability of
different results was almost 0.01 (978 differences among 10 numbers).

I think that the platform dependence even limits the advantage
of backward compatibility.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-07 Thread Petr Savicky
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
  PD I think that the real issue is that we actually do want almost-equal
  PD numbers to be folded together. 
 
 yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
 where '15' is the default for the new optional argument 'digitsLabels'

On some platforms, the function factor() in the current R 2.10.0
(2009-05-06 r48478) may produce duplicated levels. The examples are
in general platform dependent. The following one produces duplicated
(in fact triplicated) levels on both Intel default arithmetic and
on Intel with SSE.

  x - 9.7738826945424 + c(-1, 0, 1) * 1e-14
  x - signif(x, 15)
  factor(x)
  # [1] 9.7738826945424 9.7738826945424 9.7738826945424
  # Levels: 9.7738826945424 9.7738826945424 9.7738826945424
  # Warning message:
  # In `levels-`(`*tmp*`, value = c(9.7738826945424, 9.7738826945424,  :
  #   duplicated levels will not be allowed in factors anymore

The reason is that the three numbers remain different in signif(x, 15),
but are mapped to the same string in as.character(x).

  length(unique(x)) # [1] 3
  length(unique(as.character(x))) # 1

Further examples may be found using

  x - as.character(9 + runif(5000))
  x - as.numeric(x[nchar(x)==15]) # select numbers with 14 digits
  x - signif(cbind(x - 1e-14, x, x + 1e-14), 15)
  y - array(as.character(x), dim=dim(x))
  x - x[which(y[,1] == y[,3]),]
  factor(x[1,])

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-06 Thread Martin Maechler
 MM == Martin Maechler maech...@stat.math.ethz.ch
 on Tue, 5 May 2009 10:35:42 +0200 writes:

 PD == Peter Dalgaard p.dalga...@biostat.ku.dk
 on Mon, 04 May 2009 19:28:06 +0200 writes:

PD Petr Savicky wrote:
 On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
 [snip]
 Let me quickly expand the tasks we have wanted to address, when
 I started changing factor() for R-devel.
 
 1) R-core had unanimously decided that R 2.10.0 should not allow
 duplicated levels in factors anymore.
 
 When working on that, I had realized that quite a few bits of code
 were implicitly relying on duplicated levels (or something
 related), see below, so the current version of R-devel only
 *warns* in some cases where duplicated levels are produced
 instead of giving an error.
 
 What I had also found was that basically, even our own (!) code
 and quite a bit of user code has more or less relied on other
 things that were not true (even though almost always fulfilled):
 
 2) if x contains no duplicated values, then  factor(x) should neither
 
 3) factor(x) constructs a factor object with *unique* levels
 
 {This is what our decision 1) implies and now enforces}
 
 4) as.numeric(names(table(x))) should be  identical to unique(x)
 
 where 4) is basically ensured by 3) as table() calls
 factor() for non-factor args.
 
 As mentioned the bad thing is that 2) - 4) are typically
 fulfilled in all tests package writers would use.
 
 Concerning '3)' [and '1)'], as you know, inside R-core we have
 proposed to at least ensure that  `levels-` 
 should not allow duplicated levels, 
 and I had concluded that
 a) factor() really should use  `levels-` instead of the low-level 
 attr(., levels) - 
 b) factor() itself must make sure that the default levels became 
unique.
 
 ---
 
 Given Petr's (and more) examples and the strong requirement of
 user convenience and back-compatibility,
 I now tend to agree (with Peter) that we cannot ensure all of 2)
 and 4) still allow factor() to behave as it did for rounded
 decimal numbers,
 and consequently would have to (continue to) not ensuring
 properties (2) and (4).
 Something quite unfortunate, since, as I said, much useR code
 implicitly relies on these, and so that code is buggy even
 though the bug will only show in exceptional cases.

[]

 PD I think that the real issue is that we actually do want almost-equal
 PD numbers to be folded together. 

yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
where '15' is the default for the new optional argument 'digitsLabels'
{better argument name? (but must nost start with 'label')}

Why '15': Because this is most back-compatible and sufficient to
  solve simple arithmetic (0.1 + 0.2) issues.

MM in most cases, but not all {*when*  levels is not specified},
MM but useR's code sometimes *does* rely on  factor()  /  table()
MM using exact values.

MM Also, what should happen when the user explicitly calls

MM factor(x, levels = sort(unique(x)))

MM at least in that case we really should *not* fold almost equals.
MM and the old code (= R 2.9.0) did fold them in border cases,
MM and lead non-unique levels.

MM Can we agree that any rounding etc - if needed - will only
MM happen when
MM 1) missing(levels)
MM 2) is.numeric(x) || is.complex(x)

The code I've committed (revision 48469) now does that..

MM I'm also thinking of at least keeping the current behavior as an
MM option, e.g. by  factor(x, , keepUniqueness = TRUE, )
MM where the default would be keepUniqueness = FALSE.

current argument name is 'keepUnique'.

PD The most relevant case I can conjure up is this (permutation testing):

 zz - replicate(2,sum(sample(sleep$extra,10)))
 length(table(zz))
PD [1] 427
 length(table(signif(zz,7)))
PD [1] 281

PD Notice that the discrepancy comes from sums that really are identical
PD values (in decimal arithmetic), but where the binary FP inaccuracy makes
PD them slightly different.

MM Yes, that's a good example.

MM However, I now think it would be helpful to slightly separate
MM the issue from what factor() should do from 
MM how table() should call factor() in those cases it does.

I still believe that.
Currently,  table()  calls factor(a, exclude = exclude) 
when 'a' is not a factor, e.g., when it is numeric.
I propose that  table() should also gain some of the new
optional factor() arguments, and maybe even using a different
default than 15

Note that the new R-devel now gives

 set.seed(7); zz - replicate(2,sum(sample(sleep$extra,10)))
 length(tz - table(zz))
[1] 283

whereas R = 2.9.0 gives

[1] 422

so that at 

Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Petr Savicky
On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
 Petr Savicky wrote:
  For this, we get
  
 convert(0.3)
[1] 0.3
 convert(1/3)
[1] 0. # 16 digits suffice
 convert(0.12345)
[1] 0.12345
 convert(0.12345678901234567)
[1] 0.12345678901234566
 0.12345678901234567 == as.numeric(0.12345678901234566)
[1] TRUE
  
  This algorithm is slower than a single call to sprintf(%.17g, x), but it
  produces nicer numbers, if possible, and guarantees that the value is
  always preserved.
 
 Yes, but
 
  convert(0.2+0.1)
 [1] 0.30004

I am not sure whether this should be considered an error. Computing with decimal
numbers requires some care. If we want to get the result, which is the closest
to 0.3, it is better to use round(0.1+0.2, digits=1).
 
 I think that the real issue is that we actually do want almost-equal
 numbers to be folded together.

Yes, this is frequently needed. A related question of an approximate match
in a numeric sequence was discussed in R-devel thread Match .3 in a sequence
from March. In order to fold almost-equal numbers together, we need to specify
a tolerance to use. The tolerance depends on application. In my opinion, it is
hard to choose a tolerance, which could be a good default.

 The most relevant case I can conjure up
 is this (permutation testing):
 
  zz - replicate(2,sum(sample(sleep$extra,10)))
  length(table(zz))
 [1] 427
  length(table(signif(zz,7)))
 [1] 281

In order to obtain the correct result in this example, it is possible to use
  zz - signif(zz,7)
as you suggest or
  zz - round(zz, digits=1)
and use the resulting zz for all further calculations.

 Notice that the discrepancy comes from sums that really are identical
 values (in decimal arithmetic), but where the binary FP inaccuracy makes
 them slightly different.
 
 [for a nice picture, continue the example with
 
  tt - table(signif(zz,7))
  plot(as.numeric(names(tt)),tt, type=h)

The form of this picture is not due to rounding errors. The picture may be
obtained even within an integer arithmetic as follows.

  ss - round(10*sleep$extra)
  zz - replicate(2,sum(sample(ss,10)))
  tt - table(zz)
  plot(as.numeric(names(tt)),tt, type=h)

The variation of the frequencies is due to two effects.

First, each individual value of the sum occurs with low probability, so 2
replications do not suffice to get low variance of individual frequencies. Using
1000 repetitions of the code above, i obtained estimate of some of the 
probabilities.
The most frequent sums have probability approximately p=0.0089 for a single 
sample.
With n=2 replications, we get the mean frequency p*n = 178 and standard 
deviation
sqrt(p*(1-p)*n) = 13.28216.

The other cause of variation of the frequencies is that even the true 
distribution of
the sums has a lot of local minima and maxima. The mean of 1000 repetitions of 
the above
table restricted to values of the sum in the interval 140:168 produced the 
estimates

  value mean frequency (over 1000 tables)
  140   172.411
  141   172.090
  142   174.297
  143   166.039
  144   159.260
  145   163.891
  146   162.317
  147   165.460
  148   177.870
  149   177.971
  150   177.754
  151   178.525 local maximum
  152   169.851
  153   164.443 local minimum
  154   168.488 the mean value of the sum
  155   164.816 local minimum
  156   169.297
  157   179.248 local maximum
  158   177.799
  159   176.743
  160   177.777
  161   164.173
  162   162.585
  163   164.641
  164   159.913
  165   165.932
  166   173.014
  167   172.276
  168   171.612
The local minima and maxima are visible here. The mean value 154 is 
approximately
the center of the histogram.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Martin Maechler
 PD == Peter Dalgaard p.dalga...@biostat.ku.dk
 on Mon, 04 May 2009 19:28:06 +0200 writes:

PD Petr Savicky wrote:
 On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
 [snip]
 Let me quickly expand the tasks we have wanted to address, when
 I started changing factor() for R-devel.
 
 1) R-core had unanimously decided that R 2.10.0 should not allow
 duplicated levels in factors anymore.
 
 When working on that, I had realized that quite a few bits of code
 were implicitly relying on duplicated levels (or something
 related), see below, so the current version of R-devel only
 *warns* in some cases where duplicated levels are produced
 instead of giving an error.
 
 What I had also found was that basically, even our own (!) code
 and quite a bit of user code has more or less relied on other
 things that were not true (even though almost always fulfilled):
 
 2) if x contains no duplicated values, then  factor(x) should neither
 
 3) factor(x) constructs a factor object with *unique* levels
 
 {This is what our decision 1) implies and now enforces}
 
 4) as.numeric(names(table(x))) should be  identical to unique(x)
 
 where 4) is basically ensured by 3) as table() calls
 factor() for non-factor args.
 
 As mentioned the bad thing is that 2) - 4) are typically
 fulfilled in all tests package writers would use.
 
 Concerning '3)' [and '1)'], as you know, inside R-core we have
 proposed to at least ensure that  `levels-` 
 should not allow duplicated levels, 
 and I had concluded that
 a) factor() really should use  `levels-` instead of the low-level  
 attr(., levels) - 
 b) factor() itself must make sure that the default levels became unique.
 
 ---
 
 Given Petr's (and more) examples and the strong requirement of
 user convenience and back-compatibility,
 I now tend to agree (with Peter) that we cannot ensure all of 2)
 and 4) still allow factor() to behave as it did for rounded
 decimal numbers,
 and consequently would have to (continue to) not ensuring
 properties (2) and (4).
 Something quite unfortunate, since, as I said, much useR code
 implicitly relies on these, and so that code is buggy even
 though the bug will only show in exceptional cases.
 
 Let me suggest to consider also the following algorithm: determine
 the number of digits needed to preserve the double value exactly for
 each number separately. An R code prototype demonstrating the 
 algorithm could be as follows
 
 convert - function(x) # x should be a single number
 {
 for (d in 1:16) {
 y - sprintf(paste(%., d, g, sep=), x)
 if (x == as.numeric(y)) {
 return(y)
 }
 }
 return(sprintf(%.17g, x))
 }
 
 For this, we get
 
  convert(0.3)
 [1] 0.3
  convert(1/3)
 [1] 0. # 16 digits suffice
  convert(0.12345)
 [1] 0.12345
  convert(0.12345678901234567)
 [1] 0.12345678901234566
  0.12345678901234567 == as.numeric(0.12345678901234566)
 [1] TRUE
 
 This algorithm is slower than a single call to sprintf(%.17g, x), but 
it
 produces nicer numbers, if possible, and guarantees that the value is
 always preserved.

PD Yes, but

 convert(0.2+0.1)
PD [1] 0.30004


PD I think that the real issue is that we actually do want almost-equal
PD numbers to be folded together. 

in most cases, but not all {*when*  levels is not specified},
but useR's code sometimes *does* rely on  factor()  /  table()
using exact values.

Also, what should happen when the user explicitly calls

  factor(x, levels = sort(unique(x)))

at least in that case we really should *not* fold almost equals.
and the old code (= R 2.9.0) did fold them in border cases,
and lead non-unique levels.

Can we agree that any rounding etc - if needed - will only
happen when
  1) missing(levels)
  2) is.numeric(x) || is.complex(x)

I'm also thinking of at least keeping the current behavior as an
option, e.g. by  factor(x, , keepUniqueness = TRUE, )
where the default would be keepUniqueness = FALSE.

PD The most relevant case I can conjure up is this (permutation testing):

 zz - replicate(2,sum(sample(sleep$extra,10)))
 length(table(zz))
PD [1] 427
 length(table(signif(zz,7)))
PD [1] 281

PD Notice that the discrepancy comes from sums that really are identical
PD values (in decimal arithmetic), but where the binary FP inaccuracy makes
PD them slightly different.

Yes, that's a good example.

However, I now think it would be helpful to slightly separate
the issue from what factor() should do from 
how table() should call factor() in those cases it does.

Also, you originally said

  We might as well have abolished conversion of numerics 

Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Peter Dalgaard
Petr Savicky wrote:


 
 Notice that the discrepancy comes from sums that really are identical
 values (in decimal arithmetic), but where the binary FP inaccuracy makes
 them slightly different.

 [for a nice picture, continue the example with

 tt - table(signif(zz,7))
 plot(as.numeric(names(tt)),tt, type=h)
 
 The form of this picture is not due to rounding errors. The picture may be
 obtained even within an integer arithmetic as follows.
 
   ss - round(10*sleep$extra)
   zz - replicate(2,sum(sample(ss,10)))
   tt - table(zz)
   plot(as.numeric(names(tt)),tt, type=h)

I know. The point was rather that if you are not careful with rounding,
you get the some of the bars wrong (you get 2 or 3 small bars very close
to each other instead of one longer one). Computed p values from
permutation tests (as in mean(sim=obs)) also need care for the same reason.

 
 The variation of the frequencies is due to two effects.
 
 First, each individual value of the sum occurs with low probability, so 2


 
 The other cause of variation of the frequencies is that even the true 
 distribution of
 the sums has a lot of local minima and maxima. 

Yes. You can actually generate the exact distribution easily using

d - combn(sleep$extra, 10, sum)
d - signif(d,7)
tt - table(d)
plot(as.numeric(names(tt)),tt, type=h)

and if you omit the signif() bit (not with R-devel):

 table(table(names(table(d

  1   2   3
137 161  17

i.e. 315 distinct values but over half occur in duplicate or triplicate
versions.


-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Petr Savicky
On Tue, May 05, 2009 at 11:27:36AM +0200, Peter Dalgaard wrote:
 I know. The point was rather that if you are not careful with rounding,
 you get the some of the bars wrong (you get 2 or 3 small bars very close
 to each other instead of one longer one). Computed p values from
 permutation tests (as in mean(sim=obs)) also need care for the same reason.

OK. Now, i understand the point of the example. I think that it is
the responsibility of the user to find the right way to eliminate the
influence of the rounding errors, since this may require a different
approach in different situations. However, i can also accept the
point of view that as.factor() should do this to some extent by default.

For example, we may require that as.factor() is consistent with 
as.character() in the way how to map different numbers to the same
string.

At the first glance, one could expect that to implement this, it is 
sufficient if as.factor(x) performs
  x - as.numeric(as.character(x))
  levels - as.character(sort(unique(x)))

Unfortunately, on some platforms (tested on Intel with SSE, R-2.10.0,
2009-05-02 r48453), this may produce repeated levels.

  x - c(0.6807853176681814000304, 0.6807853176681809559412)
  x - as.numeric(as.character(x))
  levels - as.character(sort(unique(x)))
  levels # 0.68078531766818 0.68078531766818
  levels[1] == levels[2] # TRUE

Using the default Intel arithmetic, we get a single level, namely 
0.680785317668181.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-04 Thread Martin Maechler
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 3 May 2009 22:32:04 +0200 writes:
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 3 May 2009 22:32:04 +0200 writes:

PS In R-2.10.0, the development version, function as.factor() uses 17 digit
PS precision for conversion of numeric values to character type. This
PS is very good for the consistency of the resulting factor, however,
PS i expect that people will complain about, for example, as.factor(0.3)
PS being
PS [1] 0.2
PS Levels: 0.2

PS I suggest to extend the Warning section of ?as.factor by the following
PS paragraph.

PS If as.factor() is used for a numeric vector, then the numbers are
PS converted to character strings with 17 digit precision using their
PS machine representation. This guarantees that different numbers are
PS converted to different levels, but may produce unwanted results, if
PS the numbers are expected to have limited number of decimal positions.
PS For example, as.factor(c(0.1, 0.2, 0.3)) produces
PS [1] 0.10001 0.20001 0.2
PS Levels: 0.10001 0.20001 0.2
PS In order to avoid this, convert the numbers to a character vector
PS using formatC() or a similar function before using as.factor().

PS Petr.

Thank you, Petr, for the good suggestion.

I have added a (shorter) paragraph, though to the 'Details' not the
'Warning' section, and also one to the 'Examples' :

## Converting (non-integer) numbers:
as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use
factor(format(c(0.1, 0.2, 0.3)))

Martin Maechler, ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-04 Thread Martin Maechler
 PD == Peter Dalgaard p.dalga...@biostat.ku.dk
 on Mon, 04 May 2009 15:34:09 +0200 writes:

PD Martin Maechler wrote:
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 3 May 2009 22:32:04 +0200 writes:
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 3 May 2009 22:32:04 +0200 writes:
 
PS In R-2.10.0, the development version, function as.factor() uses 17 digit
PS precision for conversion of numeric values to character type. This
PS is very good for the consistency of the resulting factor, however,
PS i expect that people will complain about, for example, as.factor(0.3)
PS being
PS [1] 0.2
PS Levels: 0.2
 
PS I suggest to extend the Warning section of ?as.factor by the following
PS paragraph.
 
PS If as.factor() is used for a numeric vector, then the numbers are
PS converted to character strings with 17 digit precision using their
PS machine representation. This guarantees that different numbers are
PS converted to different levels, but may produce unwanted results, if
PS the numbers are expected to have limited number of decimal positions.
PS For example, as.factor(c(0.1, 0.2, 0.3)) produces
PS [1] 0.10001 0.20001 0.2
PS Levels: 0.10001 0.20001 0.2
PS In order to avoid this, convert the numbers to a character vector
PS using formatC() or a similar function before using as.factor().
 
PS Petr.
 
 Thank you, Petr, for the good suggestion.
 
 I have added a (shorter) paragraph, though to the 'Details' not the
 'Warning' section, and also one to the 'Examples' :
 
 ## Converting (non-integer) numbers:
 as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use
 factor(format(c(0.1, 0.2, 0.3)))

PD Martin,

PD I tend to consider this a bug, plain and simple. We might as well have
PD abolished conversion of numerics to factor altogether.

hmm, that comparison *is* an exaggeration ((Much code would have
stopped working had we implemented the latter !!))

PD (Notice, BTW, that conversions to mode character
PD changes the sort order so format() is not a universal
PD fix. IIRC, we did consider the 1 10 2 3 4 5 6 7 8 9
PD issue when designing R's version factor().)

yes, but I did not understand why this is relevant here;
as
   factor(c(10,1:9, 8:2,1:5))
   [1] 10 1  2  3  4  5  6  7  8  9  8  7  6  5  4  3  2  1  2  3  4  5 
  Levels: 1 2 3 4 5 6 7 8 9 10

also in earlier versions of R.

PD The current R-devel behaviour is silly and we should just get rid of it
PD before a final release. It should be the other way around: If people
PD rely on whether numerical factor levels differ with 17 digits precision,
PD THEN they should use format with suitable arguments.

Hmm, I know tend to agree that we must further change some of
the R-devel behavior.

PD If we have issues with numeric values that are very slightly different
PD but round to get the same level name, how about putting something like

PD if (is.numeric(x)) x - zapsmall(x)

well, rather  

  if (is.numeric(x)) x - signif(x, 15)

where '15' could be replaced by 7 or other values in 5:20

PD somewhere at the start of the factor() function?

Let me quickly expand the tasks we have wanted to address, when
I started changing factor() for R-devel.

1) R-core had unanimously decided that R 2.10.0 should not allow
   duplicated levels in factors anymore.

When working on that, I had realized that quite a few bits of code
were implicitly relying on duplicated levels (or something
related), see below, so the current version of R-devel only
*warns* in some cases where duplicated levels are produced
instead of giving an error.

What I had also found was that basically, even our own (!) code
and quite a bit of user code has more or less relied on other
things that were not true (even though almost always fulfilled):

2) if x contains no duplicated values, then  factor(x) should neither

3) factor(x) constructs a factor object with *unique* levels

  {This is what our decision 1) implies and now enforces}

4) as.numeric(names(table(x))) should be  identical to unique(x)

  where 4) is basically ensured by 3) as table() calls
  factor() for non-factor args.

As mentioned the bad thing is that 2) - 4) are typically
fulfilled in all tests package writers would use.

Concerning '3)' [and '1)'], as you know, inside R-core we have
proposed to at least ensure that  `levels-` 
should not allow duplicated levels, 
and I had concluded that
a) factor() really should use  `levels-` instead of the low-level  
   attr(., levels) - 
b) factor() itself must make sure that the default levels became unique.

---

Given Petr's (and more) examples and the strong requirement of
user convenience and back-compatibility,
I now tend to agree (with Peter) 

Re: [Rd] suggestion for extending ?as.factor

2009-05-04 Thread Petr Savicky
On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
[snip]
 Let me quickly expand the tasks we have wanted to address, when
 I started changing factor() for R-devel.
 
 1) R-core had unanimously decided that R 2.10.0 should not allow
duplicated levels in factors anymore.
 
 When working on that, I had realized that quite a few bits of code
 were implicitly relying on duplicated levels (or something
 related), see below, so the current version of R-devel only
 *warns* in some cases where duplicated levels are produced
 instead of giving an error.
 
 What I had also found was that basically, even our own (!) code
 and quite a bit of user code has more or less relied on other
 things that were not true (even though almost always fulfilled):
 
 2) if x contains no duplicated values, then  factor(x) should neither
 
 3) factor(x) constructs a factor object with *unique* levels
 
   {This is what our decision 1) implies and now enforces}
 
 4) as.numeric(names(table(x))) should be  identical to unique(x)
 
   where 4) is basically ensured by 3) as table() calls
   factor() for non-factor args.
 
 As mentioned the bad thing is that 2) - 4) are typically
 fulfilled in all tests package writers would use.
 
 Concerning '3)' [and '1)'], as you know, inside R-core we have
 proposed to at least ensure that  `levels-` 
 should not allow duplicated levels, 
 and I had concluded that
 a) factor() really should use  `levels-` instead of the low-level
attr(., levels) - 
 b) factor() itself must make sure that the default levels became unique.
 
 ---
 
 Given Petr's (and more) examples and the strong requirement of
 user convenience and back-compatibility,
 I now tend to agree (with Peter) that we cannot ensure all of 2)
 and 4) still allow factor() to behave as it did for rounded
 decimal numbers,
 and consequently would have to (continue to) not ensuring
 properties (2) and (4).
 Something quite unfortunate, since, as I said, much useR code
 implicitly relies on these, and so that code is buggy even
 though the bug will only show in exceptional cases.

Let me suggest to consider also the following algorithm: determine
the number of digits needed to preserve the double value exactly for
each number separately. An R code prototype demonstrating the 
algorithm could be as follows

  convert - function(x) # x should be a single number
  {
  for (d in 1:16) {
  y - sprintf(paste(%., d, g, sep=), x)
  if (x == as.numeric(y)) {
  return(y)
  }
  }
  return(sprintf(%.17g, x))
  }

For this, we get

   convert(0.3)
  [1] 0.3
   convert(1/3)
  [1] 0. # 16 digits suffice
   convert(0.12345)
  [1] 0.12345
   convert(0.12345678901234567)
  [1] 0.12345678901234566
   0.12345678901234567 == as.numeric(0.12345678901234566)
  [1] TRUE

This algorithm is slower than a single call to sprintf(%.17g, x), but it
produces nicer numbers, if possible, and guarantees that the value is
always preserved.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-04 Thread Peter Dalgaard
Petr Savicky wrote:
 On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
 [snip]
 Let me quickly expand the tasks we have wanted to address, when
 I started changing factor() for R-devel.

 1) R-core had unanimously decided that R 2.10.0 should not allow
duplicated levels in factors anymore.

 When working on that, I had realized that quite a few bits of code
 were implicitly relying on duplicated levels (or something
 related), see below, so the current version of R-devel only
 *warns* in some cases where duplicated levels are produced
 instead of giving an error.

 What I had also found was that basically, even our own (!) code
 and quite a bit of user code has more or less relied on other
 things that were not true (even though almost always fulfilled):

 2) if x contains no duplicated values, then  factor(x) should neither

 3) factor(x) constructs a factor object with *unique* levels

   {This is what our decision 1) implies and now enforces}

 4) as.numeric(names(table(x))) should be  identical to unique(x)

   where 4) is basically ensured by 3) as table() calls
   factor() for non-factor args.

 As mentioned the bad thing is that 2) - 4) are typically
 fulfilled in all tests package writers would use.

 Concerning '3)' [and '1)'], as you know, inside R-core we have
 proposed to at least ensure that  `levels-` 
 should not allow duplicated levels, 
 and I had concluded that
 a) factor() really should use  `levels-` instead of the low-level   
attr(., levels) - 
 b) factor() itself must make sure that the default levels became unique.

 ---

 Given Petr's (and more) examples and the strong requirement of
 user convenience and back-compatibility,
 I now tend to agree (with Peter) that we cannot ensure all of 2)
 and 4) still allow factor() to behave as it did for rounded
 decimal numbers,
 and consequently would have to (continue to) not ensuring
 properties (2) and (4).
 Something quite unfortunate, since, as I said, much useR code
 implicitly relies on these, and so that code is buggy even
 though the bug will only show in exceptional cases.
 
 Let me suggest to consider also the following algorithm: determine
 the number of digits needed to preserve the double value exactly for
 each number separately. An R code prototype demonstrating the 
 algorithm could be as follows
 
   convert - function(x) # x should be a single number
   {
   for (d in 1:16) {
   y - sprintf(paste(%., d, g, sep=), x)
   if (x == as.numeric(y)) {
   return(y)
   }
   }
   return(sprintf(%.17g, x))
   }
 
 For this, we get
 
convert(0.3)
   [1] 0.3
convert(1/3)
   [1] 0. # 16 digits suffice
convert(0.12345)
   [1] 0.12345
convert(0.12345678901234567)
   [1] 0.12345678901234566
0.12345678901234567 == as.numeric(0.12345678901234566)
   [1] TRUE
 
 This algorithm is slower than a single call to sprintf(%.17g, x), but it
 produces nicer numbers, if possible, and guarantees that the value is
 always preserved.

Yes, but

 convert(0.2+0.1)
[1] 0.30004


I think that the real issue is that we actually do want almost-equal
numbers to be folded together. The most relevant case I can conjure up
is this (permutation testing):

 zz - replicate(2,sum(sample(sleep$extra,10)))
 length(table(zz))
[1] 427
 length(table(signif(zz,7)))
[1] 281

Notice that the discrepancy comes from sums that really are identical
values (in decimal arithmetic), but where the binary FP inaccuracy makes
them slightly different.

[for a nice picture, continue the example with

 tt - table(signif(zz,7))
 plot(as.numeric(names(tt)),tt, type=h)

]

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-04 Thread Peter Dalgaard
Martin Maechler wrote:
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 3 May 2009 22:32:04 +0200 writes:
 PS == Petr Savicky savi...@cs.cas.cz
 on Sun, 3 May 2009 22:32:04 +0200 writes:
 
 PS In R-2.10.0, the development version, function as.factor() uses 17 
 digit
 PS precision for conversion of numeric values to character type. This
 PS is very good for the consistency of the resulting factor, however,
 PS i expect that people will complain about, for example, as.factor(0.3)
 PS being
 PS [1] 0.2
 PS Levels: 0.2
 
 PS I suggest to extend the Warning section of ?as.factor by the 
 following
 PS paragraph.
 
 PS If as.factor() is used for a numeric vector, then the numbers are
 PS converted to character strings with 17 digit precision using their
 PS machine representation. This guarantees that different numbers are
 PS converted to different levels, but may produce unwanted results, if
 PS the numbers are expected to have limited number of decimal positions.
 PS For example, as.factor(c(0.1, 0.2, 0.3)) produces
 PS [1] 0.10001 0.20001 0.2
 PS Levels: 0.10001 0.20001 0.2
 PS In order to avoid this, convert the numbers to a character vector
 PS using formatC() or a similar function before using as.factor().
 
 PS Petr.
 
 Thank you, Petr, for the good suggestion.
 
 I have added a (shorter) paragraph, though to the 'Details' not the
 'Warning' section, and also one to the 'Examples' :
 
 ## Converting (non-integer) numbers:
 as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use
 factor(format(c(0.1, 0.2, 0.3)))

Martin,

I tend to consider this a bug, plain and simple. We might as well have
abolished conversion of numerics to factor altogether. (Notice, BTW,
that conversions to mode character changes the sort order so format()
is not a universal fix. IIRC, we did consider the 1 10 2 3 4 5 6 7 8 9
issue when designing R's version factor().)

The current R-devel behaviour is silly and we should just get rid of it
before a final release. It should be the other way around: If people
rely on whether numerical factor levels differ with 17 digits precision,
THEN they should use format with suitable arguments.

If we have issues with numeric values that are very slightly different
but round to get the same level name, how about putting something like

if (is.numeric(x)) x - zapsmall(x)

somewhere at the start of the factor() function?

-p


-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel