Re: [Rd] suggestion for extending ?as.factor

2009-05-09 Thread Martin Maechler
> "PS" == Petr Savicky 
> on Fri, 8 May 2009 18:10:56 +0200 writes:

PS> On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote:
>> Let me suggest to consider the following modification, where match() is done
>> on the strings, not on the original values.
>> levels <- unique(as.character(sort(unique(x))))
>> x <- as.character(x)
>> f <- match(x, levels)

PS> An alternative solution is

PS> ind <- order(x)
PS> x <- as.character(x) # or any other conversion to character
PS> levels <- unique(x[ind]) # get unique levels ordered by the original values
PS> f <- match(x, levels)

(slightly but not much more complicated though).

Yes, indeed that brings us back to (something like) the original
"use  factor(format(x))  ..."  suggestion which would have been
fine if there hadn't been the issue of ordering,
exactly what you've addressed before.
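
A minimal sketch of the order-then-convert idea quoted above, for illustration only. The helper name factor_by_string is hypothetical and not part of R, and NA handling and the 'exclude' argument are ignored here. It shows that the levels end up ordered by the original numeric values rather than by their strings:

```r
# Hypothetical helper (not base R): build a factor by converting to strings
# once, while ordering the levels by the original values.
factor_by_string <- function(x, to_string = as.character) {
  ind <- order(x)            # ordering of the original values
  s   <- to_string(x)        # the string conversion is applied only once
  lev <- unique(s[ind])      # unique strings, ordered by the original values
  f   <- match(s, lev)       # integer codes into the levels
  structure(f, levels = lev, class = "factor")
}

x <- c(10, 2, 3, 2)
f <- factor_by_string(x)
levels(f)       # "2" "3" "10": numeric order, not the alphabetical "10" "2" "3"
as.integer(f)   # 3 1 2 1
```

Passing another conversion as to_string (again an assumed parameter name) would correspond to the "other choices specified by a parameter" mentioned below.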


PS> The advantage of this over the suggestion from my previous email is that
PS> the string conversion is applied only once. The conversion need not be only
PS> as.character(). There may be other choices specified by a parameter. I have
PS> strong objections against the existing implementation of as.character(),
PS> but still I think that as.character() should be the default for factor()
PS> for the sake of consistency of the R language.

The biggest advantage to reverting to something simple like
that, would be that it is really simple.

My first tests with (a variation of) the above indicate
favorable results.  More on this on Monday.
If we'd revert to such a solution,
we'd have to get back to Peter's point about the issue that
he'd think  table(.) should be more tolerant than as.character()
about "almost equality".
For compatibility reasons, we could also return to the
reasoning that useR should use {something like}
table(signif(x, 14)) 
instead of
table(x) 
for numeric x in "typical" cases.
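
To illustrate what folding "almost equal" values with signif() means, here is a small sketch. The example values are chosen for illustration and are not from the thread:

```r
x <- c(0.3, 0.1 + 0.2, 0.3)    # 0.1 + 0.2 differs from 0.3 in the last bits

n_raw    <- length(unique(x))              # 2: the raw doubles are not all equal
n_folded <- length(unique(signif(x, 14)))  # 1: 14 significant digits fold them

tab <- table(signif(x, 14))    # a single cell holding all three observations
```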

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] unsplit list of data.frames with one column

2009-05-09 Thread Wacek Kusnierczyk
Peter Dalgaard wrote:
> Will Gray wrote:
>>
>> Perhaps this is the intended behavior, but I discovered that unsplit
>> throws an error when it tries to set rownames of a variable that has
>> no dimension.  This occurs when unsplit is passed a list of
>> data.frames that have only a single column.
>>
>> An example:
>>
>> df <- data.frame(letters[seq(25)])
>> fac <- rep(seq(5), 5)
>> unsplit(split(df, fac), fac)
>>
>> For reference, I'm using R version 2.9.0 (2009-04-17), subversion
>> revision 48333, on Ubuntu 8.10.
>>
>
> That's a bug. The line
>
> x <- value[[1L]][rep(NA, len), ]
>
> should be
>
> x <- value[[1L]][rep(NA, len), , drop=FALSE]
>

looks like someone got caught by the drop=TRUE design...?

vQ



Re: [Rd] anyDuplicated(incomp=NA) fails

2009-05-09 Thread Martin Maechler
> William Dunlap 
> on Fri, 8 May 2009 16:16:56 -0700 writes:

> With today's R 2.10.0(devel) I get:
>> anyDuplicated(c(1,NA,3,NA,5), incomp=NA) # expect 0
> Warning: stack imbalance in 'anyDuplicated', 20 then 21
> Warning: stack imbalance in '.Internal', 19 then 20
> Warning: stack imbalance in '{', 17 then 18
> [1] 0
>> anyDuplicated(c(1,NA,3,NA,3), incomp=NA) # expect 5
> Warning: stack imbalance in 'anyDuplicated', 20 then 21
> Warning: stack imbalance in '.Internal', 19 then 20
> Warning: stack imbalance in '{', 17 then 18
> [1] 0
>> anyDuplicated(c(1,NA,3,NA,3), incomp=3) # expect 4
> Warning: stack imbalance in 'anyDuplicated', 20 then 21
> Warning: stack imbalance in '.Internal', 19 then 20
> Warning: stack imbalance in '{', 17 then 18
> [1] 0
>> anyDuplicated(c(1,NA,3,NA,3), incomp=c(3,NA)) # expect 0
> Warning: stack imbalance in 'anyDuplicated', 20 then 21
> Warning: stack imbalance in '.Internal', 19 then 20
> Warning: stack imbalance in '{', 17 then 18
> [1] 0
>> version$svn
> [1] "48493"

> After applying the attached patch I get

>> anyDuplicated(c(1,NA,3,NA,5), incomp=NA)
> [1] 0
>> anyDuplicated(c(1,NA,3,NA,3), incomp=NA)
> [1] 5
>> anyDuplicated(c(1,NA,3,NA,3), incomp=3)
> [1] 4
>> anyDuplicated(c(1,NA,3,NA,3), incomp=c(3,NA))
> [1] 0

> Calls to UNPROTECT() were missing and a macro definition
> did nothing because there were no backslashes at the ends
> of lines.  I didn't check the results very carefully.

Thank you very much, Bill!   Somewhat embarrassing...
Note that the patch "in theory" needs to be modified to only
UNPROTECT() when PROTECT() was called, which "in practice" is
always ;-), but in any case, I've slightly modified your patch
and also applied it to R-patched.
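
For reference, the expected values from Bill's report could be kept as a small regression check, sketched below. It assumes a build in which the fix is present; 'incomp' in the report is a partial match for the 'incomparables' argument, which is spelled out here:

```r
# Values listed in `incomparables` are never treated as duplicates of
# one another; anyDuplicated() returns the index of the first duplicate, or 0.
r1 <- anyDuplicated(c(1, NA, 3, NA, 5), incomparables = NA)        # expect 0
r2 <- anyDuplicated(c(1, NA, 3, NA, 3), incomparables = NA)        # expect 5
r3 <- anyDuplicated(c(1, NA, 3, NA, 3), incomparables = 3)         # expect 4
r4 <- anyDuplicated(c(1, NA, 3, NA, 3), incomparables = c(3, NA))  # expect 0

stopifnot(r1 == 0, r2 == 5, r3 == 4, r4 == 0)
```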

Thanks once more,
Martin

> Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap
> tibco.com



Re: [Rd] suggestion for extending ?as.factor

2009-05-09 Thread Michael Dewey

At 14:18 08/05/2009, Martin Maechler wrote:


> "PS" == Petr Savicky 
> on Fri, 8 May 2009 11:01:55 +0200 writes:


Somewhere below Martin asks for alternatives from list readers. I do 
not have alternatives, but I do have two comments, one immediately 
below this, the other embedded in-line.


This whole thread reminds me just why I have spent the best part of a 
decade climbing the virtual Matterhorn called 'Learning R' and why it 
is such a pleasure to use. It is the fact that somebody, somewhere 
cares enough about consistency, usability and accuracy to devote 
hours to getting even obscure details just right.




PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
PD> I think that the real issue is that we actually do want almost-equal
PD> numbers to be folded together.
>>
>> yes, this now (revision 48469) will happen by default, using  signif(x, 15)
>> where '15' is the default for the new optional argument 'digitsLabels'
>> {better argument name? (but must not start with 'label')}

PS> Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
PS> and missing(labels). In this situation, levels are computed as sort(unique(x))
PS> from possibly transformed x. Then, labels are constructed by a conversion of the
PS> levels to strings.

PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.


PS> If keepUnique is FALSE (the default), then
PS> - values x are transformed by signif(x, digitsLabels)
PS> - labels are computed using as.character(levels)
PS> - digitsLabels defaults to 15, but may be set to any integer value

PS> If keepUnique is TRUE, then
PS> - values x are preserved
PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
PS> - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that
 it should be  >= 17;  but see below.)

Your summary seems correct to me.

PS> There are several situations, when this approach produces duplicated levels.

PS> Besides the one described in my previous email, there are also others
PS> factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this does not make much sense; I've already contemplated
to produce a warning in such cases, something like

   if(keepUnique && digitsLabels < 17)
       warning(gettextf(
           "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
           digitsLabels))
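
As a small illustration of why fewer than 17 digits can be too few to keep distinct doubles apart, consider the values from the example above. This is only a sketch and, as noted elsewhere in the thread, the exact strings depend on the platform's sprintf:

```r
x <- c(0.3, 0.1 + 0.2)             # two distinct doubles
distinct <- x[1] != x[2]           # TRUE

lab15 <- sprintf("%.*g", 15L, x)   # labels with 15 significant digits
lab17 <- sprintf("%.*g", 17L, x)   # labels with 17 significant digits

dup15 <- anyDuplicated(lab15)      # 2: both values format as "0.3"
dup17 <- anyDuplicated(lab17)      # 0: 17 digits keep the labels apart
```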


PS> factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why disallow the useR
to shoot into his foot?


I agree. As a useR I do not want to be stopped from doing anything. I 
would appreciate a warning just before I shoot myself in the foot and 
I definitely want one if it looks like I am going to aim for my head.


PS> I would like to suggest a modification. It eliminates most of the cases, where
PS> we get duplicated levels. It would eliminate all such cases, if the function
PS> signif() works as expected. Unfortunately, if signif() works as it does in the
PS> current versions of R, we still get duplicated levels.

PS> The suggested modification is as follows.

PS> If keepUnique is FALSE (the default), then
PS> - values x are transformed by signif(x, digitsLabels)
PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
PS> - digitsLabels defaults to 15, but may be set to any integer value

I tend to like this change, given -- as you found yesterday -- that
as.character() is not even preserving 15 digits.
OTOH,  as.character() has been in use for a very long history of
S (and R), whereas using sprintf() is not back compatible with
it and actually depends on the LIBC implementation of the system-sprintf.
For that reason as.character() would be preferable.
Hmm

PS> If keepUnique is TRUE, then
PS> - values x are preserved
PS> - labels are computed using sprintf("%.*g", 17, levels)
PS> - digitsLabels is ignored

I had originally planned to do exactly the above.
However, e.g.,  digitsLabels = 18  may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.


PS> Arguments for the modification are the following.

PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
PS> to duplicated labels as demonstrated in my previous email. So, I suggest to
PS> use sprintf("%.*g", digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfortunately quite
 a bit less back-compatible!}

PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
PS> duplicated labels. So, I suggest to force digitsLabels=17, if keepUnique=TRUE.

PS> If signif(,digitsLabels) works as expected, then the above approach should not
PS> produce duplicated labels

Re: [Rd] Improve aggregate.default ...?

2009-05-09 Thread Gavin Simpson
On Sat, 2009-05-09 at 08:23 -0400, Gabor Grothendieck wrote:
> Try this:
> 
> > aggregate(dat["A"], dat["Group"], mean)
>   Group         A
> 1     1 0.4944810
> 2     2 0.4765412
> 3     3 0.4521068
> 4     4 0.4989000

Thanks Gabor. Ideally, aggregate.default should "work" whatever indexing
one uses - here you are using the fact that a data.frame is a special
case of a list, which is not the way most help resources introduce
subsetting for data frames.
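
A short sketch of the difference, using data shaped like the example in this thread (the seed is arbitrary and only there for reproducibility):

```r
set.seed(42)
dat <- data.frame(A = runif(100), B = rnorm(100), Group = gl(4, 25))

cls_df  <- class(dat["A"])   # "data.frame": list-style indexing keeps the name
cls_vec <- class(dat$A)      # "numeric": the bare vector carries no name

# The data frame method sees the column names; the default method does not.
by_df  <- aggregate(dat["A"], dat["Group"], mean)
by_vec <- with(dat, aggregate(A, by = list(Group = Group), FUN = mean))

names(by_df)    # "Group" "A"
names(by_vec)   # "Group" "x"
```

Both calls compute the same group means; only the name of the result column differs.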

For personal use, I can use my own version of aggregate.default and as I
dislike using `$`, preferring with(), I don't run the risk of
non-syntactic names being produced.

I was really looking for ideas for improving aggregate.default in
general. The solution I posted has its own infelicities...

Cheers,

G

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



Re: [Rd] Improve aggregate.default ...?

2009-05-09 Thread Gabor Grothendieck
Try this:

> aggregate(dat["A"], dat["Group"], mean)
  Group         A
1     1 0.4944810
2     2 0.4765412
3     3 0.4521068
4     4 0.4989000

On Sat, May 9, 2009 at 8:14 AM, Gavin Simpson  wrote:
> Hi,
>
> I find it a bit annoying that aggregate.default forces the returned
> object to lose the 'name' of the variable aggregated, replacing it with
> 'x'.
>
> A brief example:
>
>> dat <- data.frame(A = runif(100), B = rnorm(100),
> +                   Group = gl(4, 25))
>> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
>  Group         x
> 1     1 0.6523228
> 2     2 0.4544317
> 3     3 0.4619624
> 4     4 0.4703156
>
> This arises because aggregate.default has:
>
> function (x, ...)
> {
>    if (is.ts(x))
>        aggregate.ts(as.ts(x), ...)
>    else aggregate.data.frame(as.data.frame(x), ...)
> }
>
> which recasts x as a data frame, but doesn't make any effort to supply a
> name. Can we do a better job of supplying a useful name?
>
> My first attempt is:
>
> aggregate.default <- function(x, ...) {
>    if (is.ts(x))
>        aggregate.ts(as.ts(x), ...)
>    else {
>        nam <- deparse(substitute(x))
>        x <- as.data.frame(x)
>        names(x) <- nam
>        aggregate.data.frame(x, ...)
>    }
> }
>
> Which works for the brief example above:
>
>> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
>  Group         A
> 1     1 0.4269715
> 2     2 0.5479352
> 3     3 0.5091543
> 4     4 0.4926412
>
> However, it fails make check-all because examples have relied on
> returned object having 'x'. I also note that this might have the
> annoying side effect of producing odd names if we use the following
> incantation:
>
>> res <- aggregate(dat$A, by = list(Group = dat$Group), FUN = mean)
>> str(res)
> 'data.frame':   4 obs. of  2 variables:
>  $ Group: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
>  $ dat$A: num  0.427 0.548 0.509 0.493
>> res$dat$A
> Error in res$dat$A : $ operator is invalid for atomic vectors
>> res$`dat$A`
> [1] 0.4269715 0.5479352 0.5091543 0.4926412
>
> Is there a way of coming up with a better way to name the aggregated
> variable? Would a change of this kind be something R Core would consider
> making to aggregate.default if a good solution is found?
>
> Thanks in advance,
>
> G



[Rd] Improve aggregate.default ...?

2009-05-09 Thread Gavin Simpson
Hi,

I find it a bit annoying that aggregate.default forces the returned
object to lose the 'name' of the variable aggregated, replacing it with
'x'.

A brief example:

> dat <- data.frame(A = runif(100), B = rnorm(100), 
+                   Group = gl(4, 25))
> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
  Group         x
1     1 0.6523228
2     2 0.4544317
3     3 0.4619624
4     4 0.4703156

This arises because aggregate.default has:

function (x, ...) 
{
    if (is.ts(x)) 
        aggregate.ts(as.ts(x), ...)
    else aggregate.data.frame(as.data.frame(x), ...)
}

which recasts x as a data frame, but doesn't make any effort to supply a
name. Can we do a better job of supplying a useful name?

My first attempt is:

aggregate.default <- function(x, ...) {
    if (is.ts(x))
        aggregate.ts(as.ts(x), ...)
    else {
        nam <- deparse(substitute(x))
        x <- as.data.frame(x)
        names(x) <- nam
        aggregate.data.frame(x, ...)
    }
}

Which works for the brief example above:

> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
  Group         A
1     1 0.4269715
2     2 0.5479352
3     3 0.5091543
4     4 0.4926412

However, it fails make check-all because examples have relied on the
returned object having 'x'. I also note that this might have the
annoying side effect of producing odd names if we use the following
incantation:

> res <- aggregate(dat$A, by = list(Group = dat$Group), FUN = mean)
> str(res)
'data.frame':   4 obs. of  2 variables:
 $ Group: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
 $ dat$A: num  0.427 0.548 0.509 0.493
> res$dat$A
Error in res$dat$A : $ operator is invalid for atomic vectors
> res$`dat$A`
[1] 0.4269715 0.5479352 0.5091543 0.4926412
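
The naming behaviour can be reproduced without overriding aggregate.default, as in the sketch below. Both function names are hypothetical. The second variant, which falls back to 'x' unless the argument is a plain symbol, is only one possible refinement and an assumption on my part, not something proposed in this thread:

```r
# Variant 1: the deparse(substitute(x)) idea from above, under a
# hypothetical name so that aggregate.default itself is left untouched.
agg_deparse <- function(x, ...) {
  nam <- deparse(substitute(x))
  x <- as.data.frame(x)
  names(x) <- nam
  aggregate(x, ...)
}

# Variant 2 (assumption): keep the argument's name only when it is a
# plain symbol; otherwise fall back to the traditional "x".
agg_symbol <- function(x, ...) {
  expr <- substitute(x)
  nam <- if (is.name(expr)) as.character(expr) else "x"
  x <- as.data.frame(x)
  names(x) <- nam
  aggregate(x, ...)
}

set.seed(1)  # arbitrary seed for reproducibility
dat <- data.frame(A = runif(100), Group = gl(4, 25))

n1 <- names(with(dat, agg_deparse(A, by = list(Group = Group), FUN = mean)))
n2 <- names(agg_deparse(dat$A, by = list(Group = dat$Group), FUN = mean))
n3 <- names(agg_symbol(dat$A, by = list(Group = dat$Group), FUN = mean))

n1   # "Group" "A"
n2   # "Group" "dat$A"  (the non-syntactic name shown above)
n3   # "Group" "x"
```

Variant 2 would still change the name in the with() case, so examples that rely on 'x' there would be affected in the same way.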

Is there a way of coming up with a better way to name the aggregated
variable? Would a change of this kind be something R Core would consider
making to aggregate.default if a good solution is found?

Thanks in advance,

G
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



Re: [Rd] unsplit list of data.frames with one column

2009-05-09 Thread Peter Dalgaard

Will Gray wrote:


Perhaps this is the intended behavior, but I discovered that unsplit 
throws an error when it tries to set rownames of a variable that has no 
dimension.  This occurs when unsplit is passed a list of data.frames 
that have only a single column.


An example:

df <- data.frame(letters[seq(25)])
fac <- rep(seq(5), 5)
unsplit(split(df, fac), fac)

For reference, I'm using R version 2.9.0 (2009-04-17), subversion 
revision 48333, on Ubuntu 8.10.




That's a bug. The line

x <- value[[1L]][rep(NA, len), ]

should be

x <- value[[1L]][rep(NA, len), , drop=FALSE]
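
The effect of the missing drop=FALSE can be seen in isolation with the data from the report. This is a sketch of the indexing step only, not of the full unsplit() code, and the variable names x_dropped and x_kept are made up for the illustration:

```r
df  <- data.frame(letters[seq(25)])
fac <- rep(seq(5), 5)
value <- split(df, fac)      # list of five one-column data frames
len <- length(fac)

# With the default drop behaviour a single column collapses to a plain vector.
x_dropped <- value[[1L]][rep(NA, len), ]
# With drop = FALSE the result stays a one-column data frame.
x_kept <- value[[1L]][rep(NA, len), , drop = FALSE]

dim(x_dropped)   # NULL, so there are no row names to set
dim(x_kept)      # 25 1
```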


--
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907
