On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns <pbu...@pburns.seanet.com> wrote: > Here are some data frames: > > df3.2 <- data.frame(1:3, 7:9) > df4.2 <- data.frame(1:4, 7:10) > df3.3 <- data.frame(1:3, 7:9, 10:12) > df4.3 <- data.frame(1:4, 7:10, 10:13) > df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17) > df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18) > > Now here are some commands and their answers:
>> median(df4.4) > [1] 8.5 11.5 >> median(df3.2[c(1,2,3),]) > [1] 2 8 >> median(df3.2[c(1,3,2),]) > [1] 2 NA > Warning message: > In mean.default(X[[2L]], ...) : > argument is not numeric or logical: returning NA > > > > The sessionInfo is below, but it looks > to me like the present behavior started > in 2.10.0. > > Sometimes it gets the right answer. I'd > be grateful to hear how it does that -- I > can't figure it out. > Hello, Pat. Nice poetry there! I think I have an actual answer, as opposed to the usual crap I spew. I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame. apply and sapply always give the correct answer > apply(df3.3, 2, median) X1.3 X7.9 X10.12 2 8 11 > apply(df3.2, 2, median) X1.3 X7.9 2 8 > apply(df3.2[c(1,3,2),], 2, median) X1.3 X7.9 2 8 mean.data.frame is now implemented as mean.data.frame <- function(x, ...) sapply(x, mean, ...) I think we would suggest this for medians: ?????????????????????? median.data.frame <- function(x,...) sapply(x, median, ...) ????????????????????? It works, see: > median.data.frame(df3.2[c(1,3,2),]) X1.3 X7.9 2 8 Would our next step be to enter that somewhere in R bugzilla? (I'm not joking--I'm that naive). I think I can explain why the current median works intermittently in those cases you mention. Give it a small set of pre-sorted data, all is well. median.default uses a sort function, and it is confused when it is given a data.frame object rather than just a vector. I put a browser() at the top of median.default > median(df3.2[c(1,3,2),]) Called from: median.default(df3.2[c(1, 3, 2), ]) Browse[1]> n debug at <tmp>#4: if (is.factor(x)) stop("need numeric data") Browse[2]> n debug at <tmp>#4: NULL Browse[2]> n debug at <tmp>#6: if (length(names(x))) names(x) <- NULL Browse[2]> n debug at <tmp>#6: names(x) <- NULL Browse[2]> n debug at <tmp>#8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[FALSE][NA]) Browse[2]> n debug at <tmp>#8: if (any(is.na(x))) return(x[FALSE][NA]) Browse[2]> n debug at <tmp>#8: NULL Browse[2]> n debug at <tmp>#12: n <- length(x) Browse[2]> n debug at <tmp>#13: if (n == 0L) return(x[FALSE][NA]) Browse[2]> n debug at <tmp>#13: NULL Browse[2]> n debug at <tmp>#15: half <- (n + 1L)%/%2L Browse[2]> n debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n [1] 2 NA Warning message: In mean.default(X[[2L]], ...) : argument is not numeric or logical: returning NA Note the sort there in step 16. I think that's what is killing us. If you are lucky, give it a small data frame that is in order, like df3.2, the sort doesn't produce gibberish. When I get to that point, I will show you the sort's effect. First, the case that "works". I moved the browser() down, because I got tired of looking at the same old not-yet-erroneous output. > median(df3.2) Called from: median.default(df3.2) Browse[1]> n debug at <tmp>#15: half <- (n + 1L)%/%2L Browse[2]> n debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Interactively, type Browse[2]> sort(x, partial = half + 0L:1L) NA NA NA NA NA NA 1 1 7 NULL NULL NULL NULL 2 2 8 <NA> <NA> <NA> <NA> 3 3 9 <NA> <NA> <NA> <NA> Warning message: In format.data.frame(x, digits = digits, na.encode = FALSE) : corrupt data frame: columns will be truncated or padded with NAs But it still gives you a "right" answer: Browse[2]> n [1] 2 8 But if you give it data out of order, the second column turns to NA, and that causes doom. > median(df3.2[c(1,3,2),]) Called from: median.default(df3.2[c(1, 3, 2), ]) Browse[1]> n debug at <tmp>#15: half <- (n + 1L)%/%2L Browse[2]> n debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Interactively: Browse[2]> sort(x, partial = half + 0L:1L) NA NA NA NA NA NA 1 1 NULL 7 NULL NULL NULL 3 3 <NA> 9 <NA> <NA> <NA> 2 2 <NA> 8 <NA> <NA> <NA> Warning message: In format.data.frame(x, digits = digits, na.encode = FALSE) : corrupt data frame: columns will be truncated or padded with NAs Browse[2]> n [1] 2 NA Warning message: In mean.default(X[[2L]], ...) : argument is not numeric or logical: returning NA Here's a larger test case. Note columns 1 and 3 turn to NULL > df8.8 <- data.frame(a=2:8, b=1:7) median(df8.8) debug at <tmp>#16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> n debug at <tmp>#16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) Browse[2]> sort(x, partial = half + 0L:1L) NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NULL 2 NULL 1 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL 2 <NA> 3 <NA> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 3 <NA> 4 <NA> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 4 <NA> 5 <NA> 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 5 <NA> 6 <NA> 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 6 <NA> 7 <NA> 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 7 <NA> 8 <NA> 7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Warning message: In format.data.frame(x, digits = digits, na.encode = FALSE) : corrupt data frame: columns will be truncated or padded with NAs Run ?sort and you see it was not intended for data.frames. In conclusion, I think median applied to a data.frame causes undefined behavior because median is not intending to deal with several columns at once. I don't see any changes in median.default that would explain the changes you see. Compare: >From R-2.13, src/library/stats/R/median.R median.default <- function(x, na.rm = FALSE) { if(is.factor(x)) stop("need numeric data") ## all other objects only need sort() & mean() to be working if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_) if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x))) return(x[FALSE][NA]) n <- length(x) if (n == 0L) return(x[FALSE][NA]) half <- (n + 1L) %/% 2L if(n %% 2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) } >From R-2.9.2 median.default <- function(x, na.rm = FALSE) { if(is.factor(x)) stop("need numeric data") ## all other objects only need sort() & mean() to be working if(length(names(x))) names(x) <- NULL # for e.g., c(x = NA_real_) if(na.rm) x <- x[!is.na(x)] else if(any(is.na(x))) return(x[FALSE][NA]) n <- length(x) if (n == 0L) return(x[FALSE][NA]) half <- (n + 1L) %/% 2L if(n %% 2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) } pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel