Re: [Rd] median and data frames
> Martin Maechler
>     on Fri, 29 Apr 2011 16:25:09 +0200 writes:

>> Paul Johnson
>>     on Thu, 28 Apr 2011 00:20:27 -0500 writes:

>> On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns wrote:
>>> Here are some data frames:
>>>
>>> df3.2 <- data.frame(1:3, 7:9)
>>> df4.2 <- data.frame(1:4, 7:10)
>>> df3.3 <- data.frame(1:3, 7:9, 10:12)
>>> df4.3 <- data.frame(1:4, 7:10, 10:13)
>>> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
>>> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>>>
>>> Now here are some commands and their answers:
>>>
>>> > median(df4.4)
>>> [1]  8.5 11.5
>>> > median(df3.2[c(1,2,3),])
>>> [1] 2 8
>>> > median(df3.2[c(1,3,2),])
>>> [1]  2 NA
>>> Warning message:
>>> In mean.default(X[[2L]], ...) :
>>>   argument is not numeric or logical: returning NA
>>>
>>> The sessionInfo is below, but it looks to me like the
>>> present behavior started in 2.10.0.
>>>
>>> Sometimes it gets the right answer. I'd be grateful to
>>> hear how it does that -- I can't figure it out.

>> Hello, Pat.

>> Nice poetry there! I think I have an actual answer, as
>> opposed to the usual crap I spew.

>> I would agree if you said median.data.frame ought to be
>> written to work columnwise, similar to mean.data.frame.

>> apply and sapply always give the correct answer

>> > apply(df3.3, 2, median)
>>   X1.3   X7.9 X10.12
>>      2      8     11

> [...]

> exactly

>> mean.data.frame is now implemented as

>> mean.data.frame <- function(x, ...) sapply(x, mean, ...)

> exactly.

> My personal opinion is that mean.data.frame() should
> never have been written. People should know, or learn, to
> use apply functions for such a task.

> The unfortunate fact that mean.data.frame() exists makes
> people think that median.data.frame() should too, and then

>     var.data.frame()
>     sd.data.frame()
>     mad.data.frame()
>     min.data.frame()
>     max.data.frame()
>     ...
>     ...

> all just in order *not* to have to know sapply()

> No, rather not.

> My vote is for deprecating mean.data.frame().

> Martin

This has now happened -- for R 2.14.0 and later.

As raised in this thread in April, there's a similar "extra helpful"
behavior within the sd() function, and we've also deprecated that.

In addition -- getting back to Pat Burns' original post, I'm also
proposing to change median() such that it produces an error instead
of the current "sometimes correct" (but mostly not!) results.

Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
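For readers who relied on the deprecated data frame methods, the explicit column-wise calls recommended in this thread can be sketched as follows. This is a minimal illustration using only base R; the data frame and the variable names are made up for the example.

```r
# Example data, with the same values as df3.2 from the original post.
df <- data.frame(a = 1:3, b = 7:9)

# Column-wise summaries, spelled out explicitly with sapply().
col_means   <- sapply(df, mean)
col_medians <- sapply(df, median)

# vapply() is a stricter variant: it fails unless every column
# yields exactly one numeric value.
col_medians_strict <- vapply(df, median, numeric(1))
```

Using vapply() trades a little typing for an early failure if a column turns out not to be numeric.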
Re: [Rd] median and data frames
On Fri, Apr 29, 2011 at 9:25 AM, Martin Maechler wrote:
>> Paul Johnson
>>     on Thu, 28 Apr 2011 00:20:27 -0500 writes:
>
> [...]
>
>   > mean.data.frame is now implemented as
>
>   > mean.data.frame <- function(x, ...) sapply(x, mean, ...)
>
> exactly.
>
> My personal opinion is that mean.data.frame() should never have
> been written.
> People should know, or learn, to use apply functions for such a
> task.
>
> The unfortunate fact that mean.data.frame() exists makes people
> think that median.data.frame() should too,
> and then
>
>     var.data.frame()
>     sd.data.frame()
>     mad.data.frame()
>     min.data.frame()
>     max.data.frame()
>     ...
>     ...
>
> all just in order *not* to have to know sapply()
>
> No, rather not.
>
> My vote is for deprecating mean.data.frame().
>
> Martin
>

I agree. However, sd() isn't currently (as of R-2.13.0) generic and
it operates by column for matrix and data.frame objects, so it behaves
a bit more like mean() and is similarly inconsistent with the other
listed functions. I have no input on how this should be handled, but
thought it may be worth addressing.

Best,
--
Joshua Ulrich  |  FOSS Trading: www.fosstrading.com
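The column-wise sd() behavior described above can be requested explicitly, which avoids depending on what sd() itself does with a matrix or data frame in a given R version. A small sketch, with made-up data and variable names:

```r
# A numeric matrix and the equivalent data frame.
m  <- cbind(a = c(1, 2, 3), b = c(7, 9, 11))
df <- as.data.frame(m)

# Standard deviation of each column, stated explicitly.
sd_by_col_matrix <- apply(m, 2, sd)
sd_by_col_df     <- sapply(df, sd)

# One pooled standard deviation over all cells, if that is what is wanted.
sd_all <- sd(as.vector(m))
```

Writing apply() or sapply() makes the intent (per column versus pooled) visible at the call site.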
Re: [Rd] median and data frames
I also favor deprecating mean.data.frame.

One possible exception would be for a single-column data frame.
But even here I'd say no, lest people expect the same behavior for
median, var, ...

Pat's suggestion of using stop() would work nicely for mean.
(but omit paste - stop handles that).

Tim Hesterberg

>If Martin's proposal is accepted, does
>that mean that the median method for
>data frames would be something like:
>
>function (x, ...)
>{
>    stop(paste("you probably mean to use the command: sapply(",
>        deparse(substitute(x)), ", median)", sep=""))
>}
>
>Pat
>
>
>On 29/04/2011 15:25, Martin Maechler wrote:
>> [...]
>>
>> My personal opinion is that mean.data.frame() should never have
>> been written.
>> People should know, or learn, to use apply functions for such a
>> task.
>>
>> The unfortunate fact that mean.data.frame() exists makes people
>> think that median.data.frame() should too,
>> and then
>>
>>     var.data.frame()
>>     sd.data.frame()
>>     mad.data.frame()
>>     min.data.frame()
>>     max.data.frame()
>>     ...
>>     ...
>>
>> all just in order *not* to have to know sapply()
>>
>> No, rather not.
>>
>> My vote is for deprecating mean.data.frame().
>>
>> Martin
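The point about omitting paste() can be illustrated with a short sketch: stop() concatenates its arguments without a separator, so the message can be passed in pieces. The function name median_df_stop is hypothetical and chosen for this example; it is not the method shipped with R.

```r
# Hypothetical stand-in for a data frame median method that only
# points the user at sapply(); stop() pastes its arguments together.
median_df_stop <- function(x, ...) {
  stop("you probably mean to use the command: sapply(",
       deparse(substitute(x)), ", median)")
}

df3.2 <- data.frame(1:3, 7:9)

# Capture the error message instead of aborting.
msg <- tryCatch(median_df_stop(df3.2),
                error = function(e) conditionMessage(e))
print(msg)
```

The captured message names the object that was passed in, which is the purpose of deparse(substitute(x)).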
Re: [Rd] median and data frames
If Martin's proposal is accepted, does
that mean that the median method for
data frames would be something like:

function (x, ...)
{
    stop(paste("you probably mean to use the command: sapply(",
        deparse(substitute(x)), ", median)", sep=""))
}

Pat


On 29/04/2011 15:25, Martin Maechler wrote:
> [...]
>
> My personal opinion is that mean.data.frame() should never have
> been written.
> People should know, or learn, to use apply functions for such a
> task.
>
> The unfortunate fact that mean.data.frame() exists makes people
> think that median.data.frame() should too,
> and then
>
>     var.data.frame()
>     sd.data.frame()
>     mad.data.frame()
>     min.data.frame()
>     max.data.frame()
>     ...
>     ...
>
> all just in order *not* to have to know sapply()
>
> No, rather not.
>
> My vote is for deprecating mean.data.frame().
>
> Martin
>

--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')
Re: [Rd] median and data frames
> My personal opinion is that mean.data.frame() should never have
> been written.
> People should know, or learn, to use apply functions for such a
> task.
>
> The unfortunate fact that mean.data.frame() exists makes people
> think that median.data.frame() should too,
> and then
>
>     var.data.frame()
>     sd.data.frame()
>     mad.data.frame()
>     min.data.frame()
>     max.data.frame()
>     ...
>     ...
>
> all just in order *not* to have to know sapply()
>
> No, rather not.
>
> My vote is for deprecating mean.data.frame().

I totally agree!

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
Re: [Rd] median and data frames
> From: r-devel-boun...@r-project.org
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Martin Maechler
> Sent: Friday, April 29, 2011 7:25 AM
> To: Paul Johnson
> Cc: r-devel
> Subject: Re: [Rd] median and data frames
>
> [ ... lots of lines elided ... ]
>
> My vote is for deprecating mean.data.frame().

While R's data.frame method for mean(x) returns the same thing as
colMeans(x), Splus's (since 2005) returns the same thing as
mean(as.matrix(x)). (Really, it calls numerical.matrix(x), which
turns non-numeric columns into columns of numeric NA's).

I usually favor making data.frames act more like matrices when
possible (since users often conflate the two classes) and I like
having all the methods of a generic function return the same sort
of thing (a single value in this case).

It is often nonsensical to ask for the mean of an entire data.frame,
as the columns may have different units even when they are all
numeric. It does make sense when you use a tool like read.table() or
S+'s importData() to import a matrix and you don't notice it is
stored as a data.frame. It does make sense when you have a
single-column data.frame or matrix, perhaps arising from the use of
drop=FALSE when subscripting.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> Martin
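The two conventions contrasted above, one value per column versus a single value for the whole table, can be written out explicitly so that neither depends on a data frame method. A sketch with made-up data; the variable names are for illustration only.

```r
df <- data.frame(a = 1:3, b = 7:9)

# One mean per column (what the R data frame method returned).
by_column <- colMeans(df)

# One mean over every cell (the S-PLUS convention described above).
overall <- mean(as.matrix(df))

# A single-column data frame, as produced by drop = FALSE subscripting.
one_col      <- df[, "a", drop = FALSE]
one_col_mean <- mean(as.matrix(one_col))
```

For the single-column case the two conventions agree, which is the situation where asking for "the mean of a data frame" is least ambiguous.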
Re: [Rd] median and data frames
> Paul Johnson
>     on Thu, 28 Apr 2011 00:20:27 -0500 writes:

> On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns wrote:
>> [...]
>>
>> Sometimes it gets the right answer. I'd
>> be grateful to hear how it does that -- I
>> can't figure it out.

> Hello, Pat.

> Nice poetry there! I think I have an actual answer, as opposed to the
> usual crap I spew.

> I would agree if you said median.data.frame ought to be written to
> work columnwise, similar to mean.data.frame.

> apply and sapply always give the correct answer

> > apply(df3.3, 2, median)
>   X1.3   X7.9 X10.12
>      2      8     11

[...]

exactly

> mean.data.frame is now implemented as

> mean.data.frame <- function(x, ...) sapply(x, mean, ...)

exactly.

My personal opinion is that mean.data.frame() should never have
been written.
People should know, or learn, to use apply functions for such a
task.

The unfortunate fact that mean.data.frame() exists makes people
think that median.data.frame() should too,
and then

    var.data.frame()
    sd.data.frame()
    mad.data.frame()
    min.data.frame()
    max.data.frame()
    ...
    ...

all just in order *not* to have to know sapply()

No, rather not.

My vote is for deprecating mean.data.frame().

Martin
Re: [Rd] median and data frames
This seems trivially fixable using something like

median.data.frame <- function(x, na.rm=FALSE) {
    sapply(x, function(y, na.rm=FALSE)
                  if(is.factor(y)) NA else median(y, na.rm=na.rm),
           na.rm=na.rm)
}

>>> Paul Johnson 28/04/2011 06:20 >>>
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:
>
>> median(df4.4)
> [1]  8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>   argument is not numeric or logical: returning NA
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer. I'd
> be grateful to hear how it does that -- I
> can't figure it out.
>

Hello, Pat.

Nice poetry there! I think I have an actual answer, as opposed to the
usual crap I spew.

I would agree if you said median.data.frame ought to be written to
work columnwise, similar to mean.data.frame.

apply and sapply always give the correct answer

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11

> apply(df3.2, 2, median)
X1.3 X7.9
   2    8

> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

median.data.frame <- function(x,...) sapply(x, median, ...)

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not
joking--I'm that naive).

I think I can explain why the current median works intermittently in
those cases you mention. Give it a small set of pre-sorted data, all
is well.
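A variation on the factor-aware method proposed at the top of this message is sketched below. It is an illustration only: the name median_by_column is hypothetical, it treats every non-numeric column (not just factors) as NA, and it uses vapply() so the result is always a plain numeric vector.

```r
# Column-wise median that returns NA for any non-numeric column.
# Hypothetical helper, not part of base R.
median_by_column <- function(x, na.rm = FALSE) {
  vapply(x, function(col) {
    if (is.numeric(col)) as.numeric(median(col, na.rm = na.rm)) else NA_real_
  }, numeric(1))
}

# Made-up data: a numeric column, a factor, and a column with a missing value.
df <- data.frame(n = c(3, 1, 2),
                 f = factor(c("u", "v", "u")),
                 m = c(10, NA, 30))

res       <- median_by_column(df)
res_no_na <- median_by_column(df, na.rm = TRUE)
```

Whether a non-numeric column should yield NA or an error is a design choice; the thread leans toward an error, so this sketch shows only the more permissive option.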
Re: [Rd] median and data frames
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:
>
>> median(df4.4)
> [1]  8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>   argument is not numeric or logical: returning NA
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer. I'd
> be grateful to hear how it does that -- I
> can't figure it out.
>

Hello, Pat.

Nice poetry there! I think I have an actual answer, as opposed to the
usual crap I spew.

I would agree if you said median.data.frame ought to be written to
work columnwise, similar to mean.data.frame.

apply and sapply always give the correct answer

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11

> apply(df3.2, 2, median)
X1.3 X7.9
   2    8

> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

median.data.frame <- function(x,...) sapply(x, median, ...)

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not
joking--I'm that naive).

I think I can explain why the current median works intermittently in
those cases you mention. Give it a small set of pre-sorted data, all
is well. median.default uses a sort function, and it is confused when
it is given a data.frame object rather than just a vector.

I put a browser() at the top of median.default

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at #4: NULL
Browse[2]> n
debug at #6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at #6: names(x) <- NULL
Browse[2]> n
debug at #8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: NULL
Browse[2]> n
debug at #12: n <- length(x)
Browse[2]> n
debug at #13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at #13: NULL
Browse[2]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Note the sort there in step 16. I think that's what is killing us.

If you are lucky, give it a small data frame that is in order, like
df3.2, the sort doesn't produce gibberish. When I get to that point, I
will show you the sort's effect.

First, the case that "works". I moved the browser() down, because I
got tired of looking at the same old not-yet-erroneous output.

> median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA   NA   NA   NA   NA
1  1  7 NULL NULL NULL NULL
2  2  8
3  3  9
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8

But if you give it data out of order, the second column turns to NA,
and that causes doom.

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA   NA NA   NA   NA   NA
1  1 NULL  7 NULL NULL NULL
3  3    9
2  2    8
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Here's a larger test case. Note columns 1 and 3 turn to NULL

> df8.8 <- data.frame(a=2:8, b=1:7)
> median(df8.8)
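One detail in the trace above may be worth spelling out: at step #12, n <- length(x) counts columns rather than rows when x is a data frame, because a data frame is a list of columns. The sketch below only reproduces that arithmetic for df3.2; the variable names are made up, and it does not call median() on a data frame.

```r
df3.2 <- data.frame(1:3, 7:9)

# What median.default sees as "n": the number of columns.
n_seen <- length(df3.2)
# What a column-wise median would care about: the number of rows.
n_rows <- nrow(df3.2)

# The same arithmetic as steps #15 and #16 of the trace.
half    <- (n_seen + 1L) %/% 2L
is_even <- n_seen %% 2L == 0L
```

With two columns, n is even, which would be consistent with the trace taking the mean(sort(...)) branch for df3.2 even though it has three rows.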
Re: [Rd] median and data frames
On Apr 27, 2011, at 19:44 , Patrick Burns wrote:

> I would think a method in analogy to
> 'mean.data.frame' would be a logical choice.
> But I'm presuming there might be an argument
> against that or 'median.data.frame' would already
> exist.

Only if someone had a better plan. As you are probably well aware, what
you are currently seeing is a rather exquisite mashup of methods getting
applied to objects they shouldn't be applied to. Some curious effects
are revealed, e.g. this little beauty:

> sort(df3.3)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
  undefined columns selected
> names(df3.3)<-NULL
> sort(df3.3)
  NA NA NA   NA   NA   NA   NA   NA   NA
1  1  7 10 NULL NULL NULL NULL NULL NULL
2  2  8 11
3  3  9 12
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
[Rd] median and data frames
Here are some data frames:

df3.2 <- data.frame(1:3, 7:9)
df4.2 <- data.frame(1:4, 7:10)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.3 <- data.frame(1:4, 7:10, 10:13)
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

Now here are some commands and their answers:

> median(df3.2)
[1] 2 8
> median(df4.2)
[1] 2.5 8.5
> median(df3.3)
  NA
1  7
2  8
3  9
> median(df4.3)
  NA
1  7
2  8
3  9
4 10
> median(df3.4)
[1]  8 11
> median(df4.4)
[1]  8.5 11.5
> median(df3.2[c(1,2,3),])
[1] 2 8
> median(df3.2[c(1,3,2),])
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

The sessionInfo is below, but it looks
to me like the present behavior started
in 2.10.0.

Sometimes it gets the right answer. I'd
be grateful to hear how it does that -- I
can't figure it out.

Under the current regime we can get numbers
that are correct, partially correct, or sort
of random (given the intention).

I claim that much better behavior would be
to always get exactly one of the following:

* a numeric answer (that is consistently correct)

* an error

I would think a method in analogy to
'mean.data.frame' would be a logical choice.
But I'm presuming there might be an argument
against that or 'median.data.frame' would already
exist.

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base

other attached packages:
[1] xts_0.8-0 zoo_1.6-5

loaded via a namespace (and not attached):
[1] grid_2.13.0     lattice_0.19-23 tools_2.13.0

--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')