Re: [Rd] median and data frames

2011-10-08 Thread Martin Maechler

> Martin Maechler 
> on Fri, 29 Apr 2011 16:25:09 +0200 writes:

> Paul Johnson 
> on Thu, 28 Apr 2011 00:20:27 -0500 writes:

>> On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
>>  wrote:
>>> Here are some data frames:
>>> 
>>> df3.2 <- data.frame(1:3, 7:9)
>>> df4.2 <- data.frame(1:4, 7:10)
>>> df3.3 <- data.frame(1:3, 7:9, 10:12)
>>> df4.3 <- data.frame(1:4, 7:10, 10:13)
>>> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
>>> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>>> 
>>> Now here are some commands and their answers:

>>>> median(df4.4)
>>> [1]  8.5 11.5
>>>> median(df3.2[c(1,2,3),])
>>> [1] 2 8
>>>> median(df3.2[c(1,3,2),])
>>> [1]  2 NA
>>> Warning message:
>>> In mean.default(X[[2L]], ...) :
>>>   argument is not numeric or logical: returning NA
>>> 
>>> 
>>> 
>>> The sessionInfo is below, but it looks to me like the
>>> present behavior started in 2.10.0.
>>> 
>>> Sometimes it gets the right answer.  I'd be grateful to
>>> hear how it does that -- I can't figure it out.
>>> 

> Hello, Pat.

>> Nice poetry there!  I think I have an actual answer, as
>> opposed to the usual crap I spew.

>> I would agree if you said median.data.frame ought to be
>> written to work columnwise, similar to mean.data.frame.

>> apply and sapply always give the correct answer

>>> apply(df3.3, 2, median)
>>   X1.3   X7.9 X10.12
>>      2      8     11

> [...]

> exactly

>> mean.data.frame is now implemented as

>> mean.data.frame <- function(x, ...) sapply(x, mean, ...)

> exactly.

> My personal opinion is that mean.data.frame() should
> never have been written.  People should know, or learn, to
> use apply functions for such a task.

> The unfortunate fact that mean.data.frame() exists makes
> people think that median.data.frame() should too, and then

>   var.data.frame()
>    sd.data.frame()
>   mad.data.frame()
>   min.data.frame()
>   max.data.frame()
>   ...

> all just in order *not* to have to know sapply()

> No, rather not.

> My vote is for deprecating mean.data.frame().
> Martin

This has now happened -- for R 2.14.0 and later.
As raised in this thread in April, there's a similar
"extra helpful" behavior within the sd() function,
and we've also deprecated that.

In addition -- getting back to Pat Burns' original post,
I'm also proposing to change  median()  
such that it produces an error instead of the current "sometimes
correct" (but mostly not!) results. 
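
With the data frame methods deprecated, the column-wise computation has to be written out explicitly. The following is a minimal sketch, assuming a current R in which median() on a whole data frame signals an error rather than guessing; the data frame `df` and the variable names are made up for illustration.

```r
# Illustrative data; the column names are arbitrary.
df <- data.frame(a = 1:4, b = 7:10)

# Explicit column-wise medians. vapply() also checks that every
# column yields exactly one numeric value.
col_medians <- vapply(df, median, numeric(1))
print(col_medians)

# Calling median() on the whole data frame is wrapped in tryCatch() so
# the script keeps running whether or not the installed R signals an error.
whole <- tryCatch(median(df), error = function(e) conditionMessage(e))
print(whole)
```

Using vapply() rather than sapply() is a design choice here: it fails loudly if a column does not produce a single number.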

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] median and data frames

2011-05-05 Thread Joshua Ulrich

On Fri, Apr 29, 2011 at 9:25 AM, Martin Maechler
 wrote:
>> Paul Johnson 
>>     on Thu, 28 Apr 2011 00:20:27 -0500 writes:
>
>    > On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
>    >  wrote:
>    >> Here are some data frames:
>    >>
>    >> df3.2 <- data.frame(1:3, 7:9)
>    >> df4.2 <- data.frame(1:4, 7:10)
>    >> df3.3 <- data.frame(1:3, 7:9, 10:12)
>    >> df4.3 <- data.frame(1:4, 7:10, 10:13)
>    >> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
>    >> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>    >>
>    >> Now here are some commands and their answers:
>
>    >>> median(df4.4)
>    >> [1]  8.5 11.5
>    >>> median(df3.2[c(1,2,3),])
>    >> [1] 2 8
>    >>> median(df3.2[c(1,3,2),])
>    >> [1]  2 NA
>    >> Warning message:
>    >> In mean.default(X[[2L]], ...) :
>    >>  argument is not numeric or logical: returning NA
>    >>
>    >>
>    >>
>    >> The sessionInfo is below, but it looks
>    >> to me like the present behavior started
>    >> in 2.10.0.
>    >>
>    >> Sometimes it gets the right answer.  I'd
>    >> be grateful to hear how it does that -- I
>    >> can't figure it out.
>    >>
>
>    > Hello, Pat.
>
>    > Nice poetry there!  I think I have an actual answer, as opposed to the
>    > usual crap I spew.
>
>    > I would agree if you said median.data.frame ought to be written to
>    > work columnwise, similar to mean.data.frame.
>
>    > apply and sapply  always give the correct answer
>
>    >> apply(df3.3, 2, median)
>    > X1.3   X7.9 X10.12
>    > 2      8     11
>
>    [...]
>
> exactly
>
>    > mean.data.frame is now implemented as
>
>    > mean.data.frame <- function(x, ...) sapply(x, mean, ...)
>
> exactly.
>
> My personal opinion is that  mean.data.frame() should never have
> been written.
> People should know, or learn, to use apply functions for such a
> task.
>
> The unfortunate fact that mean.data.frame() exists makes people
> think that median.data.frame() should too,
> and then
>
>  var.data.frame()
>   sd.data.frame()
>  mad.data.frame()
>  min.data.frame()
>  max.data.frame()
>  ...
>  ...
>
> all just in order *not* to have to know  sapply()
> 
>
> No, rather not.
>
> My vote is for deprecating  mean.data.frame().
>
> Martin
>

I agree.  However, sd() isn't currently (as of R-2.13.0) generic and
it operates by column for matrix and data.frame objects, so it behaves
a bit more like mean() and is similarly inconsistent with the other
listed functions.  I have no input on how this should be handled, but
thought it might be worth addressing.
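
For code that should not depend on sd() doing anything special for data frames or matrices, the column-wise standard deviation can be requested explicitly. This is only a sketch; `df` and the result names are illustrative.

```r
# Illustrative data frame with two integer columns.
df <- data.frame(a = 1:3, b = 7:9)

# One standard deviation per column of a data frame.
col_sd <- sapply(df, sd)
print(col_sd)

# The same for a matrix: apply() over the column margin.
mat_sd <- apply(as.matrix(df), 2, sd)
print(mat_sd)
```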

Best,
--
Joshua Ulrich  |  FOSS Trading: www.fosstrading.com


Re: [Rd] median and data frames

2011-04-30 Thread Tim Hesterberg

I also favor deprecating mean.data.frame.

One possible exception would be for a single-column data frame.
But even here I'd say no, lest people expect the same behavior for
median, var, ...

Pat's suggestion of using stop() would work nicely for mean
(but omit paste(); stop() concatenates its arguments itself).
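
A sketch of that variant follows. The function name `median_df_stub` is hypothetical and it is deliberately not registered as an S3 method; it only shows that stop() joins its arguments without a separator, so paste() is not needed.

```r
# Hypothetical stub: always fails, pointing the caller at sapply().
median_df_stub <- function(x, ...) {
  stop("you probably mean to use the command: sapply(",
       deparse(substitute(x)), ", median)")
}

df3.2 <- data.frame(1:3, 7:9)

# Capture the error message instead of aborting the script.
msg <- tryCatch(median_df_stub(df3.2),
                error = function(e) conditionMessage(e))
print(msg)
```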

Tim Hesterberg

>If Martin's proposal is accepted, does
>that mean that the median method for
>data frames would be something like:
>
>function (x, ...)
>{
> stop(paste("you probably mean to use the command: sapply(",
> deparse(substitute(x)), ", median)", sep=""))
>}
>
>Pat
>
>
>On 29/04/2011 15:25, Martin Maechler wrote:
>>> Paul Johnson
>>>  on Thu, 28 Apr 2011 00:20:27 -0500 writes:
>>
>>  >  On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
>>  >wrote:
>>  >>  Here are some data frames:
>>  >>
>>  >>  df3.2<- data.frame(1:3, 7:9)
>>  >>  df4.2<- data.frame(1:4, 7:10)
>>  >>  df3.3<- data.frame(1:3, 7:9, 10:12)
>>  >>  df4.3<- data.frame(1:4, 7:10, 10:13)
>>  >>  df3.4<- data.frame(1:3, 7:9, 10:12, 15:17)
>>  >>  df4.4<- data.frame(1:4, 7:10, 10:13, 15:18)
>>  >>
>>  >>  Now here are some commands and their answers:
>>
>>  >>>  median(df4.4)
>>  >>  [1]  8.5 11.5
>>  >>>  median(df3.2[c(1,2,3),])
>>  >>  [1] 2 8
>>  >>>  median(df3.2[c(1,3,2),])
>>  >>  [1]  2 NA
>>  >>  Warning message:
>>  >>  In mean.default(X[[2L]], ...) :
>>  >>argument is not numeric or logical: returning NA
>>  >>
>>  >>
>>  >>
>>  >>  The sessionInfo is below, but it looks
>>  >>  to me like the present behavior started
>>  >>  in 2.10.0.
>>  >>
>>  >>  Sometimes it gets the right answer.  I'd
>>  >>  be grateful to hear how it does that -- I
>>  >>  can't figure it out.
>>  >>
>>
>>  >  Hello, Pat.
>>
>>  >  Nice poetry there!  I think I have an actual answer, as opposed to the
>>  >  usual crap I spew.
>>
>>  >  I would agree if you said median.data.frame ought to be written to
>>  >  work columnwise, similar to mean.data.frame.
>>
>>  >  apply and sapply  always give the correct answer
>>
>>  >>  apply(df3.3, 2, median)
>>  >    X1.3   X7.9 X10.12
>>  >       2      8     11
>>
>>  [...]
>>
>> exactly
>>
>>  >  mean.data.frame is now implemented as
>>
>>  >  mean.data.frame<- function(x, ...) sapply(x, mean, ...)
>>
>> exactly.
>>
>> My personal opinion is that  mean.data.frame() should never have
>> been written.
>> People should know, or learn, to use apply functions for such a
>> task.
>>
>> The unfortunate fact that mean.data.frame() exists makes people
>> think that median.data.frame() should too,
>> and then
>>
>>var.data.frame()
>> sd.data.frame()
>>mad.data.frame()
>>min.data.frame()
>>max.data.frame()
>>...
>>...
>>
>> all just in order *not* to have to know  sapply()
>> 
>>
>> No, rather not.
>>
>> My vote is for deprecating  mean.data.frame().
>>
>> Martin
>>
>
>--
>Patrick Burns
>pbu...@pburns.seanet.com
>twitter: @portfolioprobe
>http://www.portfolioprobe.com/blog
>http://www.burns-stat.com
>(home of 'Some hints for the R beginner'
>and 'The R Inferno')


Re: [Rd] median and data frames

2011-04-29 Thread Patrick Burns


If Martin's proposal is accepted, does
that mean that the median method for
data frames would be something like:

function (x, ...)
{
stop(paste("you probably mean to use the command: sapply(",
deparse(substitute(x)), ", median)", sep=""))
}

Pat


On 29/04/2011 15:25, Martin Maechler wrote:

Paul Johnson
 on Thu, 28 Apr 2011 00:20:27 -0500 writes:


 >  On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
 >wrote:
 >>  Here are some data frames:
 >>
 >>  df3.2<- data.frame(1:3, 7:9)
 >>  df4.2<- data.frame(1:4, 7:10)
 >>  df3.3<- data.frame(1:3, 7:9, 10:12)
 >>  df4.3<- data.frame(1:4, 7:10, 10:13)
 >>  df3.4<- data.frame(1:3, 7:9, 10:12, 15:17)
 >>  df4.4<- data.frame(1:4, 7:10, 10:13, 15:18)
 >>
 >>  Now here are some commands and their answers:

 >>>  median(df4.4)
 >>  [1]  8.5 11.5
 >>>  median(df3.2[c(1,2,3),])
 >>  [1] 2 8
 >>>  median(df3.2[c(1,3,2),])
 >>  [1]  2 NA
 >>  Warning message:
 >>  In mean.default(X[[2L]], ...) :
 >>argument is not numeric or logical: returning NA
 >>
 >>
 >>
 >>  The sessionInfo is below, but it looks
 >>  to me like the present behavior started
 >>  in 2.10.0.
 >>
 >>  Sometimes it gets the right answer.  I'd
 >>  be grateful to hear how it does that -- I
 >>  can't figure it out.
 >>

 >  Hello, Pat.

 >  Nice poetry there!  I think I have an actual answer, as opposed to the
 >  usual crap I spew.

 >  I would agree if you said median.data.frame ought to be written to
 >  work columnwise, similar to mean.data.frame.

 >  apply and sapply  always give the correct answer

 >>  apply(df3.3, 2, median)
 >    X1.3   X7.9 X10.12
 >       2      8     11

 [...]

exactly

 >  mean.data.frame is now implemented as

 >  mean.data.frame<- function(x, ...) sapply(x, mean, ...)

exactly.

My personal opinion is that  mean.data.frame() should never have
been written.
People should know, or learn, to use apply functions for such a
task.

The unfortunate fact that mean.data.frame() exists makes people
think that median.data.frame() should too,
and then

   var.data.frame()
sd.data.frame()
   mad.data.frame()
   min.data.frame()
   max.data.frame()
   ...
   ...

all just in order *not* to have to know  sapply()


No, rather not.

My vote is for deprecating  mean.data.frame().

Martin



--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')


Re: [Rd] median and data frames

2011-04-29 Thread Hadley Wickham

> My personal opinion is that  mean.data.frame() should never have
> been written.
> People should know, or learn, to use apply functions for such a
> task.
>
> The unfortunate fact that mean.data.frame() exists makes people
> think that median.data.frame() should too,
> and then
>
>  var.data.frame()
>   sd.data.frame()
>  mad.data.frame()
>  min.data.frame()
>  max.data.frame()
>  ...
>  ...
>
> all just in order *not* to have to know  sapply()
> 
>
> No, rather not.
>
> My vote is for deprecating  mean.data.frame().

I totally agree!

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/


Re: [Rd] median and data frames

2011-04-29 Thread William Dunlap

> From: r-devel-boun...@r-project.org 
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Martin Maechler
> Sent: Friday, April 29, 2011 7:25 AM
> To: Paul Johnson
> Cc: r-devel
> Subject: Re: [Rd] median and data frames
> [ ... lots of lines elided ... ] 
> My vote is for deprecating  mean.data.frame().

While R's data.frame method for mean(x) returns
the same thing as colMeans(x), Splus's (since 2005)
returns the same thing as mean(as.matrix(x)).  (Really,
it calls numerical.matrix(x), which turns non-numeric
columns into columns of numeric NA's).  I usually favor
making data.frames act more like matrices when possible
(since users often conflate the two classes) and I
like having all the methods of a generic function return
the same sort of thing (a single value in this case).

It is often nonsensical to ask for the mean of an
entire data.frame, as the columns may have different
units even when they are all numeric.  It does make
sense when you use a tool like read.table() or S+'s
importData() to import a matrix and you don't notice
it is stored as a data.frame.  It does make sense when
you have a single-column data.frame or matrix, perhaps
arising from the use of drop=FALSE when subscripting.  
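
The two conventions can be compared directly in R. This is a sketch only: as.matrix() stands in for the Splus numerical.matrix() step, which is an assumption (the two treat non-numeric columns differently), and the data frame is made up for illustration.

```r
# All-numeric illustrative data frame.
df <- data.frame(a = 1:3, b = 7:9)

# Column-wise convention: one mean per column.
by_column <- colMeans(df)
print(by_column)

# Matrix-like convention: a single mean over every cell.
overall <- mean(as.matrix(df))
print(overall)

# drop = FALSE keeps a single column as a data frame, the case
# where asking for "the" mean is unambiguous.
one_col <- df[, "a", drop = FALSE]
print(is.data.frame(one_col))
single <- mean(as.matrix(one_col))
print(single)
```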

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> Martin
> 

Re: [Rd] median and data frames

2011-04-29 Thread Martin Maechler

> Paul Johnson 
> on Thu, 28 Apr 2011 00:20:27 -0500 writes:

> On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
>  wrote:
>> Here are some data frames:
>> 
>> df3.2 <- data.frame(1:3, 7:9)
>> df4.2 <- data.frame(1:4, 7:10)
>> df3.3 <- data.frame(1:3, 7:9, 10:12)
>> df4.3 <- data.frame(1:4, 7:10, 10:13)
>> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
>> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>> 
>> Now here are some commands and their answers:

>>> median(df4.4)
>> [1]  8.5 11.5
>>> median(df3.2[c(1,2,3),])
>> [1] 2 8
>>> median(df3.2[c(1,3,2),])
>> [1]  2 NA
>> Warning message:
>> In mean.default(X[[2L]], ...) :
>>  argument is not numeric or logical: returning NA
>> 
>> 
>> 
>> The sessionInfo is below, but it looks
>> to me like the present behavior started
>> in 2.10.0.
>> 
>> Sometimes it gets the right answer.  I'd
>> be grateful to hear how it does that -- I
>> can't figure it out.
>> 

> Hello, Pat.

> Nice poetry there!  I think I have an actual answer, as opposed to the
> usual crap I spew.

> I would agree if you said median.data.frame ought to be written to
> work columnwise, similar to mean.data.frame.

> apply and sapply  always give the correct answer

>> apply(df3.3, 2, median)
>   X1.3   X7.9 X10.12
>      2      8     11

[...]

exactly

> mean.data.frame is now implemented as

> mean.data.frame <- function(x, ...) sapply(x, mean, ...)

exactly.

My personal opinion is that  mean.data.frame() should never have
been written.
People should know, or learn, to use apply functions for such a
task.

The unfortunate fact that mean.data.frame() exists makes people
think that median.data.frame() should too,
and then  

  var.data.frame()
   sd.data.frame()
  mad.data.frame()
  min.data.frame()
  max.data.frame()
  ...
  ...

all just in order *not* to have to know  sapply()


No, rather not.

My vote is for deprecating  mean.data.frame().
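
As an illustration of the point about sapply(), a single call can cover every summary in the list above without any data frame methods. This is a sketch with made-up data; `summaries` is an arbitrary name.

```r
# Illustrative data frame.
df <- data.frame(a = 1:4, b = c(7, 8, 9, 10))

# Each column is reduced to a named vector of summaries; sapply()
# binds the per-column vectors into a matrix (rows = statistics).
summaries <- sapply(df, function(col) {
  c(var = var(col), sd = sd(col), mad = mad(col),
    min = min(col), max = max(col))
})
print(summaries)
```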

Martin


Re: [Rd] median and data frames

2011-04-28 Thread S Ellison

This seems trivially fixable using something like

median.data.frame <- function(x, na.rm=FALSE) {
   # Column-wise median; factor columns yield NA instead of an error.
   sapply(x, function(y, na.rm=FALSE)
             if(is.factor(y)) NA else median(y, na.rm=na.rm),
          na.rm=na.rm)
}
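
A variant of the same idea can be tried without defining a method on the median generic. This sketch makes two changes that are assumptions rather than part of the suggestion above: it tests is.numeric() instead of is.factor(), so character and logical columns also give NA, and it uses vapply() so the result is always a numeric vector. The name `median_by_column` is hypothetical.

```r
# Hypothetical helper: NA for every non-numeric column.
median_by_column <- function(x, na.rm = FALSE) {
  vapply(x, function(col) {
    if (is.numeric(col)) as.numeric(median(col, na.rm = na.rm)) else NA_real_
  }, numeric(1))
}

# One numeric column with a missing value, one factor column.
df <- data.frame(x = c(1, 5, NA), f = factor(c("u", "v", "w")))

res_keep <- median_by_column(df)                # missing value propagates
res_drop <- median_by_column(df, na.rm = TRUE)  # missing value removed
print(res_keep)
print(res_drop)
```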


>>> Paul Johnson  28/04/2011 06:20 >>>
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
 wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:

>> median(df4.4)
> [1]  8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>  argument is not numeric or logical: returning NA
>
>
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer.  I'd
> be grateful to hear how it does that -- I
> can't figure it out.
>

Hello, Pat.

Nice poetry there!  I think I have an actual answer, as opposed to the
usual crap I spew.

I would agree if you said median.data.frame ought to be written to
work columnwise, similar to mean.data.frame.

apply and sapply  always give the correct answer

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11

> apply(df3.2, 2, median)
X1.3 X7.9
   2    8

> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

??

median.data.frame <- function(x,...) sapply(x, median, ...)

?

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not
joking--I'm that naive).

I think I can explain why the current median works intermittently in
those cases you mention.  Give it a small set of pre-sorted data, all
is well.  median.default uses a sort function, and it is confused when
it is given a data.frame object rather than just a vector.


I put a browser() at the top of median.default

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at #4: NULL
Browse[2]> n
debug at #6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at #6: names(x) <- NULL
Browse[2]> n
debug at #8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x)))
return(x[FALSE][NA])
Browse[2]> n
debug at #8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: NULL
Browse[2]> n
debug at #12: n <- length(x)
Browse[2]> n
debug at #13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at #13: NULL
Browse[2]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA


Note the sort there in step 16. I think that's what is killing us.

If you are lucky, give it a small  data frame that is in order, like
df3.2, the sort doesn't produce gibberish. When I get to that point, I
will show you the sort's effect.

First, the case that "works". I moved the browser() down, because I
got tired of looking at the same old not-yet-erroneous output.


> median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA   NA   NA   NA   NA
1  1  7 NULL NULL NULL NULL
2  2  8
3  3  9
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8


But if  you give it data out of order, the second column turns to NA,
and that causes doom.


> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA   NA NA   NA   NA   NA
1  1 NULL  7 NULL NULL NULL
3  3   9   
2  2   8   
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NA

Re: [Rd] median and data frames

2011-04-27 Thread Paul Johnson

On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns
 wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:

>> median(df4.4)
> [1]  8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>  argument is not numeric or logical: returning NA
>
>
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer.  I'd
> be grateful to hear how it does that -- I
> can't figure it out.
>

Hello, Pat.

Nice poetry there!  I think I have an actual answer, as opposed to the
usual crap I spew.

I would agree if you said median.data.frame ought to be written to
work columnwise, similar to mean.data.frame.

apply and sapply  always give the correct answer

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11

> apply(df3.2, 2, median)
X1.3 X7.9
   2    8

> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

??

median.data.frame <- function(x,...) sapply(x, median, ...)

?

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8
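
The row-order independence of the column-wise approach can be checked without defining any method. This is a small sketch; the variable names other than df3.2 are illustrative.

```r
df3.2 <- data.frame(1:3, 7:9)
shuffled <- df3.2[c(1, 3, 2), ]   # same rows, different order

# Column-wise medians of the original and the reordered data frame.
in_order  <- sapply(df3.2, median)
reordered <- sapply(shuffled, median)
print(in_order)
print(reordered)

# apply() converts the data frame to a matrix first, then works by column.
via_matrix <- apply(shuffled, 2, median)
print(via_matrix)
```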

Would our next step be to enter that somewhere in R bugzilla? (I'm not
joking--I'm that naive).

I think I can explain why the current median works intermittently in
those cases you mention.  Give it a small set of pre-sorted data, all
is well.  median.default uses a sort function, and it is confused when
it is given a data.frame object rather than just a vector.
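
One piece of arithmetic helps when reading the trace below: for a data frame, length() counts columns, not rows, so the "middle" position that median.default computes is a column index. The sketch replays only that arithmetic; `middle_position` is a hypothetical helper, not part of R.

```r
df3.2 <- data.frame(1:3, 7:9)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

# Replays the first steps of the trace: n, half, and the odd/even branch.
middle_position <- function(x) {
  n <- length(x)               # number of columns for a data frame
  half <- (n + 1L) %/% 2L
  list(n = n, half = half, odd = (n %% 2L == 1L))
}

pos3.2 <- middle_position(df3.2)   # n = 2: even, positions 1 and 2
pos3.3 <- middle_position(df3.3)   # n = 3: odd, position 2 only
pos4.4 <- middle_position(df4.4)   # n = 4: even, positions 2 and 3
str(list(df3.2 = pos3.2, df3.3 = pos3.3, df4.4 = pos4.4))

# Means of columns 2 and 3 of df4.4.
middle_cols <- colMeans(df4.4[, 2:3])
print(middle_cols)
```

This is consistent with the reported results: with three columns the odd branch picks out the second column (the 7:9 column printed for df3.3), and with four columns the even branch averages columns 2 and 3, giving 8.5 and 11.5 for df4.4.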


I put a browser() at the top of median.default

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at #4: NULL
Browse[2]> n
debug at #6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at #6: names(x) <- NULL
Browse[2]> n
debug at #8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x)))
return(x[FALSE][NA])
Browse[2]> n
debug at #8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: NULL
Browse[2]> n
debug at #12: n <- length(x)
Browse[2]> n
debug at #13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at #13: NULL
Browse[2]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA


Note the sort there in step 16. I think that's what is killing us.

If you are lucky and give it a small data frame that is already in
order, like df3.2, the sort doesn't produce gibberish. When I get to
that point, I will show you the sort's effect.

First, the case that "works". I moved the browser() down, because I
got tired of looking at the same old not-yet-erroneous output.


> median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA   NA   NA   NA   NA
1  1  7 NULL NULL NULL NULL
2  2  8
3  3  9
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8


But if  you give it data out of order, the second column turns to NA,
and that causes doom.


> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else
mean(sort(x,
partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA   NA NA   NA   NA   NA
1  1 NULL  7 NULL NULL NULL
3  3   9   
2  2   8   
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA


Here's a larger test case. Note columns 1 and 3 turn to NULL

> df8.8 <- data.frame(a=2:8, b=1:7)

median(df8.8)

Re: [Rd] median and data frames

2011-04-27 Thread peter dalgaard


On Apr 27, 2011, at 19:44 , Patrick Burns wrote:

> I would think a method in analogy to
> 'mean.data.frame' would be a logical choice.
> But I'm presuming there might be an argument
> against that or 'median.data.frame' would already
> exist.

Only if someone had a better plan. As you are probably well aware, what you are 
currently seeing is a rather exquisite mashup of methods getting applied to 
objects they shouldn't be applied to. Some curious effects are revealed, e.g. 
this little beauty:

> sort(df3.3)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = 
decreasing)) : 
  undefined columns selected
> names(df3.3)<-NULL
> sort(df3.3)
  NA NA NA   NA   NA   NA   NA   NA   NA
1  1  7 10 NULL NULL NULL NULL NULL NULL
2  2  8 11  
3  3  9 12  
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs


-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com


[Rd] median and data frames

2011-04-27 Thread Patrick Burns


Here are some data frames:

df3.2 <- data.frame(1:3, 7:9)
df4.2 <- data.frame(1:4, 7:10)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.3 <- data.frame(1:4, 7:10, 10:13)
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

Now here are some commands and their answers:

> median(df3.2)
[1] 2 8
> median(df4.2)
[1] 2.5 8.5
> median(df3.3)
  NA
1  7
2  8
3  9
> median(df4.3)
  NA
1  7
2  8
3  9
4 10
> median(df3.4)
[1]  8 11
> median(df4.4)
[1]  8.5 11.5
> median(df3.2[c(1,2,3),])
[1] 2 8
> median(df3.2[c(1,3,2),])
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA



The sessionInfo is below, but it looks
to me like the present behavior started
in 2.10.0.

Sometimes it gets the right answer.  I'd
be grateful to hear how it does that -- I
can't figure it out.

Under the current regime we can get numbers
that are correct, partially correct, or sort
of random (given the intention).

I claim that much better behavior would be
to always get exactly one of the following:

* a numeric answer (that is consistently correct)
* an error

I would think a method in analogy to
'mean.data.frame' would be a logical choice.
But I'm presuming there might be an argument
against that or 'median.data.frame' would already
exist.
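
The "numeric answer or error" contract described above could look roughly like this. It is only a sketch: `strict_col_median` is a hypothetical name, and treating any non-numeric column as an error is one possible reading of the requirement.

```r
# Hypothetical helper: column medians for all-numeric data frames,
# an error for anything else.
strict_col_median <- function(x, na.rm = FALSE) {
  stopifnot(is.data.frame(x))
  numeric_cols <- vapply(x, is.numeric, logical(1))
  if (!all(numeric_cols)) {
    stop("non-numeric column(s): ",
         paste(names(x)[!numeric_cols], collapse = ", "))
  }
  vapply(x, function(col) as.numeric(median(col, na.rm = na.rm)), numeric(1))
}

df3.2 <- data.frame(1:3, 7:9)

# Row order does not matter for a column-wise median.
ok <- strict_col_median(df3.2[c(1, 3, 2), ])
print(ok)

# A factor column triggers the error branch; the message is captured.
bad <- tryCatch(
  strict_col_median(data.frame(a = 1:3, f = factor(c("u", "v", "w")))),
  error = function(e) conditionMessage(e)
)
print(bad)
```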


> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] graphics  grDevices utils datasets  stats methods   base

other attached packages:
[1] xts_0.8-0 zoo_1.6-5

loaded via a namespace (and not attached):
[1] grid_2.13.0 lattice_0.19-23 tools_2.13.0

--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')

