Re: [Rd] median and data frames
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns wrote:
> Here are some data frames:
>
> df3.2 <- data.frame(1:3, 7:9)
> df4.2 <- data.frame(1:4, 7:10)
> df3.3 <- data.frame(1:3, 7:9, 10:12)
> df4.3 <- data.frame(1:4, 7:10, 10:13)
> df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
> df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)
>
> Now here are some commands and their answers:
>> median(df4.4)
> [1] 8.5 11.5
>> median(df3.2[c(1,2,3),])
> [1] 2 8
>> median(df3.2[c(1,3,2),])
> [1]  2 NA
> Warning message:
> In mean.default(X[[2L]], ...) :
>   argument is not numeric or logical: returning NA
>
> The sessionInfo is below, but it looks
> to me like the present behavior started
> in 2.10.0.
>
> Sometimes it gets the right answer. I'd
> be grateful to hear how it does that -- I
> can't figure it out.

Hello, Pat. Nice poetry there! I think I have an actual answer, as opposed to the usual crap I spew.

I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame. apply and sapply always give the correct answer:

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11

> apply(df3.2, 2, median)
X1.3 X7.9
   2    8

> apply(df3.2[c(1,3,2),], 2, median)
X1.3 X7.9
   2    8

mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

I think we would suggest this for medians:

median.data.frame <- function(x, ...) sapply(x, median, ...)

It works, see:

> median.data.frame(df3.2[c(1,3,2),])
X1.3 X7.9
   2    8

Would our next step be to enter that somewhere in R bugzilla? (I'm not joking -- I'm that naive.)

I think I can explain why the current median works intermittently in those cases you mention. Give it a small set of pre-sorted data and all is well. median.default uses a sort function, and it is confused when it is given a data.frame object rather than just a vector.
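To make the suggestion above concrete, here is the proposed method as a self-contained sketch (median.data.frame is not part of base R; defining it in the workspace is enough for S3 dispatch to pick it up, since median() is an S3 generic):

```r
## Columnwise median for data frames, by direct analogy with
## mean.data.frame, which is just sapply(x, mean, ...).
median.data.frame <- function(x, ...) sapply(x, median, ...)

df3.2 <- data.frame(1:3, 7:9)

## Dispatches to the method above regardless of row order:
median(df3.2[c(1,3,2), ])    # X1.3 = 2, X7.9 = 8
median(df3.2, na.rm = TRUE)  # ... is passed through to median.default
```

The method also inherits median.default's error for factor columns, which matches the "numeric answer or an error" behavior asked for later in the thread.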
I put a browser() at the top of median.default:

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #4: if (is.factor(x)) stop("need numeric data")
Browse[2]> n
debug at #4: NULL
Browse[2]> n
debug at #6: if (length(names(x))) names(x) <- NULL
Browse[2]> n
debug at #6: names(x) <- NULL
Browse[2]> n
debug at #8: if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: if (any(is.na(x))) return(x[FALSE][NA])
Browse[2]> n
debug at #8: NULL
Browse[2]> n
debug at #12: n <- length(x)
Browse[2]> n
debug at #13: if (n == 0L) return(x[FALSE][NA])
Browse[2]> n
debug at #13: NULL
Browse[2]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Note the sort there in step 16. I think that's what is killing us. If you are lucky and give it a small data frame that is in order, like df3.2, the sort doesn't produce gibberish. When I get to that point, I will show you the sort's effect. First, the case that "works". I moved the browser() down, because I got tired of looking at the same old not-yet-erroneous output.
> median(df3.2)
Called from: median.default(df3.2)
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively, type:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA NA NA NA NA 1 1 7 NULL NULL NULL NULL 2 2 8 3 3 9
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

But it still gives you a "right" answer:

Browse[2]> n
[1] 2 8

But if you give it data out of order, the second column turns to NA, and that causes doom.

> median(df3.2[c(1,3,2),])
Called from: median.default(df3.2[c(1, 3, 2), ])
Browse[1]> n
debug at #15: half <- (n + 1L)%/%2L
Browse[2]> n
debug at #16: if (n%%2L == 1L) sort(x, partial = half)[half] else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
Browse[2]> n
debug at #16: mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])

Interactively:

Browse[2]> sort(x, partial = half + 0L:1L)
  NA NA NA NA NA NA 1 1 NULL 7 NULL NULL NULL 3 3 9 2 2 8
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Browse[2]> n
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

Here's a larger test case. Note columns 1 and 3 turn to NULL

> df8.8 <- data.frame(a=2:8, b=1:7)
> median(df8.8)
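My reading of why the sorted case looks right (an inference from the trace above, so treat it as a sketch, and note it relies on mean() still working columnwise on data frames, as it did in the R of 2011): median.default takes n from length(x), which for a data frame is the number of *columns*, and when sort() happens to hand the columns back intact, the final mean() averages each column, and for symmetric columns like 1:3 the column mean coincides with the median.

```r
df3.2 <- data.frame(1:3, 7:9)

length(df3.2)  # 2 -- columns, not rows, so median.default sees n = 2
## With n = 2 (even), median.default effectively evaluates
##   mean(sort(x, partial = 1:2)[1:2])
## and [1:2] on a data frame selects its first two columns. If those
## columns survive the sort intact, mean() returns the columnwise means:
mean(df3.2[1:2])  # X1.3 = 2, X7.9 = 8 -- equal to the medians only
                  # because 1:3 and 7:9 are symmetric sequences
```

As soon as sort() scrambles a column into NULL/NA debris, the same mean() call emits the "argument is not numeric or logical" warning seen in the trace.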
Re: [Rd] How to create vignette.pdf for R-2.13.0?
Dear Uwe,

As I have already mentioned, R CMD check gives the following output:

* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... WARNING
Package vignette(s) without corresponding PDF:
   APTvsXPS.Rnw
   xps.Rnw
   xpsClasses.Rnw
   xpsPreprocess.Rnw
* checking running R code from vignettes ... OK
* checking re-building of vignettes ... OK
* checking PDF version of manual ... OK

WARNING: There was 1 warning, see
'/Volumes/CoreData/CRAN/xps.Rcheck/00check.log' for details

Although the output says "checking PDF version of manual ... OK" I cannot find any result in "xps.Rcheck". When I run "R64 CMD build xps" I get:

* checking for file 'xps/DESCRIPTION' ... OK
* preparing 'xps':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to re-build vignettes
* creating vignettes ... OK
* cleaning src
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building 'xps_1.13.1.tar.gz'

However, the resulting file "xps_1.13.1.tgz" also does not contain any vignettes.

Best regards,
Christian

On 4/27/11 10:16 AM, Uwe Ligges wrote:

On 26.04.2011 21:58, cstrato wrote:

Dear Duncan, dear Uwe,

Just now I have re-run everything, and today xps.Rnw can be converted to a vignette w/o any problems using:

a, buildVignettes("xps", dir="/Volumes/CoreData/CRAN/xps", quiet=F)
b, R CMD Sweave xps.Rnw

In both cases the vignette xps.pdf is created (maybe my Mac did not like to work during Easter holidays). However, one issue remains: "R64 CMD check xps_1.13.1.tar.gz" no longer creates any pdf files for the vignettes.

Does it give an error or warning? It should check the code. R CMD build creates the pdf files.

Uwe Ligges

Best regards
Christian

On 4/25/11 9:31 PM, Duncan Murdoch wrote:

On 25/04/2011 3:16 PM, cstrato wrote:

Thank you. My problem seems to be that at the moment the problem can be seen only on my Mac, since e.g.
the Bioconductor servers have no problems creating the vignettes. Then you are definitely the one in the best position to diagnose the problem. Use the usual approach: simplify it by cutting out everything that looks unrelated. Verify that the problem still exists, then cut some more. Eventually you'll have isolated the error to a particular small snippet of code, and then you can add print() statements, or use trace(), or do whatever is necessary to see what's so special about your system. I suspect it will turn out to be an assumption in the code that is not true on your system. If the assumption is being made by code you wrote, then fix it. If it's being made by R, let us know. Duncan Murdoch Best regards Christian On 4/25/11 8:55 PM, Duncan Murdoch wrote: cstrato wrote: Dear Duncan, Thank you for your example, however it is different since it does not use x and y. What about print(x+y)? Try it. Sorry, I do not believe that there is a bug in my code, since: 1, it worked in all versions from R starting with R-2.6.0 till R-2.12.2. 2, the identical code works in the examples 3, this code (or a similar code) is the starting code which all users of xps have to use, and there was never a problem. This might be a problem in R, or might be a problem in your code. As far as I know, it has only shown up in your code, so I'd guess that's where the problem is. In any case, you're the one in the best position to isolate it and debug it. If it turns out to be a problem in R, put together an example illustrating the problem that doesn't involve your code, and I'll take a look. Duncan Murdoch Maybe the reason could be that my code has to import - the CEL-files from the package dir - the file SchemeTest3.root from the package dir ?? Best regards Christian On 4/25/11 8:00 PM, Duncan Murdoch wrote: cstrato wrote: Dear Uwe, Your suggestion to look at the Sweave manual helped me to solve the problem. 
It seems that in R-2.13.0 every chunk can use the code from the chunk before, but not from an earlier chunk.

I'm either misreading what you wrote, or it's wrong. If I have this in a Sweave file:

<<>>=
x <- 1
@

<<>>=
y <- 2
@

<<>>=
print(x)
@

I will see the value of x getting printed, even though it came from two chunks earlier. I think Uwe is right: there is some bug in the code you're running. Sweave isn't the problem.

Duncan Murdoch

Concretely, the following does not work since chunk 5 needs the code from chunks 3 and 4:

###
### chunk number 3:
###
#line 126 "xps.Rnw"
celdir <- file.path(.path.package("xps"), "raw")

###
### chunk number 4:
###
#line 132 "xps.Rnw"
scheme.test3 <- root.scheme(file.path(.path.package("xps"),
                            "schemes", "SchemeTest3.root"))

###
### chunk number 5: #
Re: [Rd] Thread synchronization [Was: Interrupting C++ code execution]
Sean,

On Apr 27, 2011, at 3:21 PM, Sean Robert McGuffee wrote:

> Hi Simon,
> That makes a lot of sense to me. I'll start reading about R's event loop
> signaling. I'm not sure what the best method will be for me to flag the
> completeness of a threaded process in my case. In abstract it seems that I
> could get R's event loop to look for any type of flag. I think key for me in
> this case will be identifying whether a particular file has been completely
> produced or not. In principle I could put that type of info into the file
> itself, but I think I could also make a temp file somewhere with its full
> path and flag info about it. Then the event loop could look for a particular
> pattern of temp file names. On the other hand, if I pass in that info when I
> start the event loop, that might work too.

Usually, the easiest on unix is to register a file handle as input handler (addInputHandler) - in practice a pipe - one end is owned by the thread and the other is owned by R. Then all you need is to write anything on the thread's end and it will wake up R's event loop and let you handle the read on that end so you can do anything. You could even have multiple threads share this one pipe, since you could distinguish by payload which thread is calling. One example of this is the integrated HTTP server in R - see the Rhttpd sources (it also has a variant that works on Windows using synchronization via the OS event loop).

> Regarding the external pointer idea, I was thinking about passing an object
> to R as a return value after launching the thread, and then I might be able
> to access a pointer inside that object to reference it from my thread. That
> could be a binary vector or any type of object if I can figure out how to
> get to it from my thread. Honestly, I don't know much about dynamic
> referencing of objects from separate threads, but in principle memory is
> shared in this case. I'll let you know if I come up with anything generic...
> Please keep me posted on your package. Are any versions of it available yet?

Yes, it is not released yet since it's not quite complete, but here we go, at your own risk ;):

http://rforge.net/threads

It will work on all platforms, eventually, but currently only unix is supported. The idea is sort of taking the multicore paradigm (parallel + collect) but using threads (threadEval + yield). The documentation is currently non-existent, but I plan to write a vignette for it ... maybe later this week ...

Cheers,
Simon

> It didn't happen to come up on my list of R packages. I haven't necessarily
> been maintaining an up-to-date version of R though. I don't know if that
> influences the package list it shows me.
> Sean
>
> On 4/26/11 8:51 PM, "Simon Urbanek" wrote:
>
>> Sean,
>>
>> On Apr 26, 2011, at 5:06 PM, Sean Robert McGuffee wrote:
>>
>>> I've been thinking about how to handle c++ threads that were started via
>>> Rcpp calls to some of my c++ libraries from R. My main obstacle is trying
>>> to make sure that users don't try to process files that are being
>>> generated by a thread before the thread finishes. One thing I am
>>> considering is having my threaded code return a class to R that contains
>>> a pointer that it remembers. Then maybe I could just change the value at
>>> that pointer when my thread finishes. Does that seem like a reasonable
>>> approach? I'm not completely sure if this is related to your issue or
>>> not, but it might be similar enough to be worth asking...
>>
>> It depends. For a simple flag it's actually much more simple than that -
>> you can create a boolean vector (make sure you preserve it) and just
>> update its value when it's done - you don't even need an external pointer
>> for that (if you're careful).
>>
>> But the slight problem with that approach is rather that you don't have a
>> way to tell R about the status change, so essentially you can only poll
>> on the R side.
>> A more proper way to deal with this is to use the event loop signaling
>> to signal in R that the flag has changed. I'm working on a "threads"
>> package that should help with that, but it's not complete yet (you can
>> spawn threads from R and you can actually even synchronize them with R
>> [so if the result is all you want it's there], but semaphores are not
>> implemented yet --- your inquiry should shift it further up on my todo
>> stack ;)).
>>
>> Cheers,
>> Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
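Until the event-loop signaling Simon describes is available, the poll-on-the-R-side fallback can be sketched in plain R, using the sentinel-file idea from Sean's message (done_path is a hypothetical name; the worker thread is assumed to create that file when it finishes writing its output):

```r
## Poll for a sentinel file that the worker thread creates on completion.
## This is the "R side can only poll" approach, not the event-loop one.
wait_for_worker <- function(done_path, interval = 0.5, timeout = 60) {
  deadline <- Sys.time() + timeout
  while (!file.exists(done_path)) {
    if (Sys.time() >= deadline)
      stop("worker did not signal completion within ", timeout, " seconds")
    Sys.sleep(interval)  # sleeps the main thread; blocks the R prompt
  }
  invisible(TRUE)
}
```

The weakness Simon points out is visible here: the R prompt is blocked for the whole wait, whereas a pipe registered with the event loop would let R stay responsive and react the moment the thread writes to its end.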
Re: [Rd] Thread synchronization [Was: Interrupting C++ code execution]
Hi Simon, That makes a lot of sense to me. I'll start reading about R's event loop signaling. I'm not sure what the best method will be for me to flag the completeness of a threaded process in my case. In abstract it seems that I could get R's event loop to look for any type of flag. I think key for me in this case will be identifying whether a particular file has been completely produced or not. In principle I could put that type of info into the file itself, but I think I could also make a temp file somewhere with it's full path and flag info about it. Then the event loop could look for a particular pattern of temp file names. On the other hand, if I pass in that info when I start the event loop, that might work too. Regarding the external pointer idea, I was thinking about passing an object to R as a return value after launching the thread, and then I might be able to access a pointer inside that object to reference it from my thread. That could be a binary vector or any type of object if I can figure out how to get to it from my thread. Honestly, I don't know much about dynamic referencing of objects from separate threads, but in principle memory is shared in this case. I'll let you know if I come up with anything generic... Please keep me posted on your package. Are any versions of it available yet? It didn't happen to come up on my list of R packages. I haven't necessarily been maintaining an up-to-date version of R though. I don't know if that influences the package list it shows me. Sean On 4/26/11 8:51 PM, "Simon Urbanek" wrote: > Sean, > > On Apr 26, 2011, at 5:06 PM, Sean Robert McGuffee wrote: > >> I've been thinking about how to handle c++ threads that were started via Rcpp >> calls to some of my c++ libraries from R. My main obstacle is trying to make >> sure that users don't try to process files that are being generated by a >> thread before the thread finishes. 
One thing I am considering is having my >> threaded code return a class to R that contains a pointer that it remembers. >> Then maybe I could just change the value at that pointer when my thread >> finishes. Does that seem like a reasonable approach? I'm not completely sure >> if this is related to your issue or not, but it might be similar enough to be >> worth asking... > > It depends. For a simple flag it's actually much more simple than that - you > can create a boolean vector (make sure you preserve it) and just update its > value when it's done - you don't even need an external pointer for that (if > your'e careful). > > But the slight problem with that approach is rather that you don't have a way > to tell R about the status change, so essentially you can only poll on the R > side. A more proper way to deal with this is to use the event loop signaling > to signal in R that the flag has changed. I'm working on a "threads" package > that should help with that, but it's not complete yet (you can spawn threads > from R and you can actually even synchronize them with R [so if the result is > all you want it's there], but semaphores are not implemented yet --- your > inquiry should shift it further up on my todo stack ;)). > > Cheers, > Simon __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] median and data frames
On Apr 27, 2011, at 19:44 , Patrick Burns wrote:

> I would think a method in analogy to
> 'mean.data.frame' would be a logical choice.
> But I'm presuming there might be an argument
> against that or 'median.data.frame' would already
> exist.

Only if someone had a better plan. As you are probably well aware, what you are currently seeing is a rather exquisite mashup of methods getting applied to objects they shouldn't be applied to. Some curious effects are revealed, e.g. this little beauty:

> sort(df3.3)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
  undefined columns selected

> names(df3.3) <- NULL
> sort(df3.3)
  NA NA NA NA NA NA NA NA NA 1 1 7 10 NULL NULL NULL NULL NULL NULL 2 2 8 11 3 3 9 12
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
[Rd] median and data frames
Here are some data frames:

df3.2 <- data.frame(1:3, 7:9)
df4.2 <- data.frame(1:4, 7:10)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.3 <- data.frame(1:4, 7:10, 10:13)
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

Now here are some commands and their answers:

> median(df3.2)
[1] 2 8
> median(df4.2)
[1] 2.5 8.5
> median(df3.3)
  NA
1  7
2  8
3  9
> median(df4.3)
  NA
1  7
2  8
3  9
4 10
> median(df3.4)
[1]  8 11
> median(df4.4)
[1]  8.5 11.5
> median(df3.2[c(1,2,3),])
[1] 2 8
> median(df3.2[c(1,3,2),])
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

The sessionInfo is below, but it looks to me like the present behavior started in 2.10.0.

Sometimes it gets the right answer. I'd be grateful to hear how it does that -- I can't figure it out.

Under the current regime we can get numbers that are correct, partially correct, or sort of random (given the intention). I claim that much better behavior would be to always get exactly one of the following:

* a numeric answer (that is consistently correct)
* an error

I would think a method in analogy to 'mean.data.frame' would be a logical choice. But I'm presuming there might be an argument against that or 'median.data.frame' would already exist.
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base

other attached packages:
[1] xts_0.8-0 zoo_1.6-5

loaded via a namespace (and not attached):
[1] grid_2.13.0     lattice_0.19-23 tools_2.13.0

--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner' and 'The R Inferno')
Re: [Rd] Wish R Core had a standard format (or generic function) for "newdata" objects
Among many solutions, I generally use the following code, which avoids the "ideal average individual" by taking the mean of the predicted values instead:

averagingpredict <- function(model, varname, varseq, type, subset = NULL)
{
    if (is.null(subset))
        mydata <- model$data
    else
        mydata <- model$data[subset, ]

    f <- function(x)
    {
        mydata[, varname] <- x
        mean(predict(model, newdata = mydata, type = type), na.rm = TRUE)
    }

    sapply(varseq, f)
}

It is time consuming, but it deals with non-numeric variables.

Christophe

2011/4/26 Paul Johnson

> Is anybody working on a way to standardize the creation of "newdata"
> objects for predict methods?
>
> When using predict, I find it difficult/tedious to create newdata data
> frames when there are many variables. It is necessary to set all
> variables at the mean/mode/median, and then for some variables of
> interest, one has to insert values for which predictions are desired.
> I was at a presentation by Scott Long last week and he was discussing
> the increasing emphasis in Stata on calculations of marginal
> predictions and "Spost" and several other packages, and,
> co-incidentally, I had a student visit who is learning to use R MASS's
> polr (W. Venables and B. Ripley) and we wrestled for quite a while to
> try to make the same calculations that Stata makes automatically. It
> spits out predicted probabilities for each independent variable, keeping
> other variables at a reference level.
>
> I've found R packages that aim to do essentially the same thing.
>
> In Frank Harrell's Design/rms framework, he uses a "data.dist"
> function that generates an object that the user has to put into the R
> options. I think many users trip over the use of "options" there. If
> I don't use that for a month or two, I completely forget the fine
> points and have to fight with it. But it does "work" to give plots
> and predict functions the information they require.
> In Zelig (by Kosuke Imai, Gary King, and Olivia Lau), a function
> "setx" does the work of creating "newdata" objects. That appears to be
> about right as a candidate for a generic "newdata" function. Perhaps
> it could directly generalize to all R regression functions, but right
> now it is tailored to the models in Zelig. It has separate methods for
> the different types of models, and that is a bit confusing to me, since
> the "newdata" in one model should be the same as the newdata in
> another, I'm guessing. But his code is all there, I'll keep looking.
>
> In Effects (by John Fox), there are internal functions to create
> newdata and plot the marginal effects. If you load effects and run,
> for example, "effects:::effect.lm" you see Prof Fox has his own way of
> grabbing information from model columns and calculating predictions.
>
> I think it is time the R Core Team would look at this and tell "us" what
> is the right way to do this. I think the interface to setx in Zelig is
> pretty easy to understand, at least for numeric variables.
>
> In R's termplot function, such a thing could be put to use. As far as
> I can tell now, termplot is doing most of the work of creating a
> newdata object, but not exactly.
>
> It seems like it would be a shame to proliferate more functions that
> do the same function, when it is such a common thing.
>
> --
> Paul E. Johnson
> Professor, Political Science
> 1541 Lilac Lane, Room 504
> University of Kansas

--
Christophe DUTANG
Ph. D. student at ISFA, Lyon, France
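For concreteness, here is a hypothetical call of the averagingpredict() defined above (my example, not from the thread), relying on the fact that a glm fit stores its data argument in the $data component:

```r
## averagingpredict() (defined in the message above) averages predictions
## over the observed rows while sweeping one variable, e.g. with a
## logistic regression on the built-in mtcars data:
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

## average predicted P(am = 1) as mpg ranges over 10..30, with every
## other covariate held at its observed value for each row:
averagingpredict(fit, "mpg", seq(10, 30, by = 5), type = "response")
```

Note that for lm fits there is no $data component unless you store one, so the function as written is tied to model objects that keep their data.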
Re: [Rd] Wish R Core had a standard format (or generic function) for "newdata" objects
On Tue, Apr 26, 2011 at 7:39 PM, Duncan Murdoch wrote:
> If you don't like the way this was done in my three lines above, or by
> Frank Harrell, or the Zelig group, or John Fox, why don't you do it
> yourself, and get it right this time? It's pretty rude to complain about
> things that others have given you for free, and demand they do it better.
>
> Duncan Murdoch

I offer a sincere apology for sounding that way. I'm not attacking anybody. I'm just talking, asking whether you agree this should be standardized. And you disagree, and I respect that, since you are actually doing the work.

From a "lowly user's point of view", I wish "you experts" out there would tell us one way to do this, and we could follow your example. When there's a regression model fitted with 20 variables in it, and half of them are numeric, 4 are unordered factors, 3 are ordinal factors, and what not, then this is a hard problem for many of us ordinary users. Or it is tedious. They want to "keep everything fixed", except one variable that takes on different specified values. And they want to do that for every variable, one at a time. Stata has made this easy for many models, and R could as well, if we coalesced on a more-or-less standard way to create newdata objects for predict.

But, in the end, I agree with your sentiment. I just have to do this and show you it is handy. I think Zelig's setx has it about right; I'll pursue that strategy.

pj

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
[Rd] on the Distribution task view
Hi useRs,

As the maintainer of the Distributions task view (http://cran.r-project.org/web/views/Distributions.html) for more than two years, I should have run this feedback exercise earlier. But better late than never!

I start this discussion to get your feedback/suggestions on the task view, as well as to list the new packages that could be added to it.

Kind regards
Christophe

--
Christophe DUTANG
Ph. D. student at ISFA, Lyon, France
Re: [Rd] standard format for newdata objects
On Wed, 2011-04-27 at 12:00 +0200, Peter Dalgaard wrote: > Er... No, I don't think Paul is being particularly rude here (and he > has been doing us some substantial favors in the past, notably his > useful Rtips page). I know the kind of functionality he is looking > for; e.g., SAS JMP has some rather nice interactive displays of > regression effects for which you'll need to fill in "something" for > the other variables. > > However, that being said, I agree with Duncan that we probably do not > want to canonicalize any particular method of filling in "average" > values for data frame variables. Whatever you do will be statistically > dubious (in particular, using the mode of a factor variable gives me > the creeps: Do a subgroup analysis and your "average person" switches > from male to female?), so I think it is one of those cases where it is > best to provide mechanism, not policy. > I agree with Peter. There are two tasks in newdata: deciding what the default reference levels should be, and building the data frame with those levels. It's the first part that is hard. For survival curves from a Cox model the historical default has been to use the mean of each covariate, which can be awful (sex coded as 0/1 leads to prediction for a hermaphrodite?). Nevertheless, I've not been able to think of a strategy that would give sensible answers for most of the data I use and coxph retains the flawed default for lack of a better idea. When teaching a class on this, I tell listeners "bite the bullet" and build the newdata that makes clinical sense, because package defaults are always unwise for some of the variables. How can a package possibly know that it should use bilirubin=1.0 (upper limit of normal) and AST = 45 when the data set is one of my liver transplant studies? Frank Harrell would argue that his "sometimes misguided" default in cph is better than the "almost always wrong" one in coxph though, and there is certainly some strength in that position. 
Terry Therneau
Re: [Rd] Wish R Core had a standard format (or generic function) for "newdata" objects
On Wed, Apr 27, 2011 at 3:55 AM, peter dalgaard wrote: > > On Apr 27, 2011, at 02:39 , Duncan Murdoch wrote: > >> On 26/04/2011 11:13 AM, Paul Johnson wrote: >>> Is anybody working on a way to standardize the creation of "newdata" >>> objects for predict methods? > > [snip] > >>> I think it is time the R Core Team would look at this tell "us" what >>> is the right way to do this. I think the interface to setx in Zelig is >>> pretty easy to understand, at least for numeric variables. >> >> If you don't like the way this was done in my three lines above, or by Frank >> Harrell, or the Zelig group, or John Fox, why don't you do it yourself, and >> get it right this time? It's pretty rude to complain about things that >> others have given you for free, and demand they do it better. > > Er... No, I don't think Paul is being particularly rude here (and he has been > doing us some substantial favors in the past, notably his useful Rtips page). > I know the kind of functionality he is looking for; e.g., SAS JMP has some > rather nice interactive displays of regression effects for which you'll need > to fill in "something" for the other variables. > > However, that being said, I agree with Duncan that we probably do not want to > canonicalize any particular method of filling in "average" values for data > frame variables. Whatever you do will be statistically dubious (in > particular, using the mode of a factor variable gives me the creeps: Do a > subgroup analysis and your "average person" switches from male to female?), > so I think it is one of those cases where it is best to provide mechanism, > not policy. > That could be satisfied by defining a generic in the core of R without any methods. Then individual packages or analyses could provide those in the way they see fit. As long as the packages or analyses are working with objects of different classes they would not conflict. -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
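A minimal sketch of Gabor's suggestion above (all names here are hypothetical; no such generic exists in base R): the core would define only the generic, and each modeling package would register a method embodying its own policy.

```r
## The generic alone carries no policy:
reference_grid <- function(object, ...) UseMethod("reference_grid")

## A package could then supply a method for its own class, e.g. one
## (deliberately debatable) policy for lm fits: numeric covariates at
## their means, factors at their first level.
reference_grid.lm <- function(object, ...) {
  mf <- model.frame(object)[-1]  # drop the response column
  as.data.frame(lapply(mf, function(v)
    if (is.factor(v)) factor(levels(v)[1L], levels = levels(v))
    else mean(v, na.rm = TRUE)))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
reference_grid(fit)  # one-row data frame of reference covariate values
```

Because no method ships with the generic, packages with conflicting ideas about "average" values (cph vs. coxph, say) would not collide; each documents its own defaults.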
Re: [Rd] How to create vignette.pdf for R-2.13.0?
On 26.04.2011 21:58, cstrato wrote:

Dear Duncan, dear Uwe,

Just now I have re-run everything, and today xps.Rnw can be converted to a vignette w/o any problems using:

a, buildVignettes("xps", dir="/Volumes/CoreData/CRAN/xps", quiet=F)
b, R CMD Sweave xps.Rnw

In both cases the vignette xps.pdf is created (maybe my Mac did not like to work during Easter holidays). However, one issue remains: "R64 CMD check xps_1.13.1.tar.gz" no longer creates any pdf files for the vignettes.

Does it give an error or warning? It should check the code. R CMD build creates the pdf files.

Uwe Ligges

Best regards
Christian

On 4/25/11 9:31 PM, Duncan Murdoch wrote:

On 25/04/2011 3:16 PM, cstrato wrote:

Thank you. My problem seems to be that at the moment the problem can be seen only on my Mac, since e.g. the Bioconductor servers have no problems creating the vignettes.

Then you are definitely the one in the best position to diagnose the problem. Use the usual approach: simplify it by cutting out everything that looks unrelated. Verify that the problem still exists, then cut some more. Eventually you'll have isolated the error to a particular small snippet of code, and then you can add print() statements, or use trace(), or do whatever is necessary to see what's so special about your system. I suspect it will turn out to be an assumption in the code that is not true on your system. If the assumption is being made by code you wrote, then fix it. If it's being made by R, let us know.

Duncan Murdoch

Best regards
Christian

On 4/25/11 8:55 PM, Duncan Murdoch wrote:

cstrato wrote:

Dear Duncan,

Thank you for your example, however it is different since it does not use x and y.

What about print(x+y)? Try it.

Sorry, I do not believe that there is a bug in my code, since:

1, it worked in all versions of R starting with R-2.6.0 till R-2.12.2.
2, the identical code works in the examples
3, this code (or similar code) is the starting code which all users of xps have to use, and there has never been a problem.

This might be a problem in R, or it might be a problem in your code. As far as I know, it has only shown up in your code, so I'd guess that's where the problem is. In any case, you're the one in the best position to isolate it and debug it. If it turns out to be a problem in R, put together an example illustrating the problem that doesn't involve your code, and I'll take a look.

Duncan Murdoch

Maybe the reason could be that my code has to import
- the CEL-files from the package dir
- the file SchemeTest3.root from the package dir
??

Best regards
Christian

On 4/25/11 8:00 PM, Duncan Murdoch wrote: cstrato wrote: Dear Uwe, Your suggestion to look at the Sweave manual helped me to solve the problem. It seems that in R-2.13.0 every chunk can use the code from the chunk before, but not from an earlier chunk.

I'm either misreading what you wrote, or it's wrong. If I have this in a Sweave file:

<<>>=
x <- 1
@
<<>>=
y <- 2
@
<<>>=
print(x)
@

I will see the value of x getting printed, even though it came from two chunks earlier. I think Uwe is right: there is some bug in the code you're running. Sweave isn't the problem.
Duncan Murdoch

Concretely, the following does not work since chunk 5 needs the code from chunks 3 and 4:

###
### chunk number 3:
###
#line 126 "xps.Rnw"
celdir <- file.path(.path.package("xps"), "raw")

###
### chunk number 4:
###
#line 132 "xps.Rnw"
scheme.test3 <- root.scheme(file.path(.path.package("xps"), "schemes", "SchemeTest3.root"))

###
### chunk number 5:
###
#line 137 "xps.Rnw"
celfiles <- c("TestA1.CEL", "TestA2.CEL")
data.test3 <- import.data(scheme.test3, "tmpdt_DataTest3", celdir=celdir, celfiles=celfiles, verbose=FALSE)

However, when I add "celdir" to chunk 5, then everything works since now chunk 5 needs only the code from chunk 4 but not from chunk 3:

###
### chunk number 5:
###
#line 137 "xps.Rnw"
celdir <- file.path(.path.package("xps"), "raw")
celfiles <- c("TestA1.CEL", "TestA2.CEL")
data.test3 <- import.data(scheme.test3, "tmpdt_DataTest3", celdir=celdir, celfiles=celfiles, verbose=FALSE)

Now buildVignettes() is able to create the vignettes; however, R CMD check still does not build the vignettes.

Yes, I get a warning in both cases:

* checking package vignettes in 'inst/doc' ... WARNING
Package vignettes without corresponding PDF:

However, with R-2.12.2 the following lines are added:

/Volumes/CoreData/CRAN/xps.Rcheck/00_pkg_src/xps/inst/doc/APTvsXPS.Rnw
/Volumes/CoreData/CRAN/xps.Rcheck/00_pkg_src/xps/inst/doc/xps.Rnw
/Volumes/CoreData/CRAN/xps.Rcheck/00_pkg_src/xps/inst/
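[Editor's note] A side remark on Duncan's three-chunk example above: Sweave evaluates all chunks of a document sequentially in a single shared environment, so a chunk can use objects created by any earlier chunk, not only the immediately preceding one. A minimal way to check this from an R session (a sketch; the temporary file name is arbitrary):

```r
## Write a three-chunk Sweave file and weave it: the last chunk uses
## objects from both earlier chunks, so it succeeds only if all chunks
## are evaluated in one shared environment.
rnw <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<>>=", "x <- 1", "@",
  "<<>>=", "y <- 2", "@",
  "<<>>=", "print(x + y)", "@",
  "\\end{document}"), rnw)
Sweave(rnw)   # the generated .tex contains the output of print(x + y)
```

If chunks did not share an environment, the third chunk would fail with an "object 'x' not found" error instead.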
Re: [Rd] Wish R Core had a standard format (or generic function) for "newdata" objects
On Apr 27, 2011, at 02:39 , Duncan Murdoch wrote:

> On 26/04/2011 11:13 AM, Paul Johnson wrote:
>> Is anybody working on a way to standardize the creation of "newdata"
>> objects for predict methods? [snip]
>> I think it is time the R Core Team would look at this and tell "us" what
>> is the right way to do this. I think the interface to setx in Zelig is
>> pretty easy to understand, at least for numeric variables.
>
> If you don't like the way this was done in my three lines above, or by Frank
> Harrell, or the Zelig group, or John Fox, why don't you do it yourself, and
> get it right this time? It's pretty rude to complain about things that
> others have given you for free, and demand they do it better.

Er... No, I don't think Paul is being particularly rude here (and he has done us some substantial favors in the past, notably his useful Rtips page). I know the kind of functionality he is looking for; e.g., SAS JMP has some rather nice interactive displays of regression effects, for which you need to fill in "something" for the other variables.

That being said, I agree with Duncan that we probably do not want to canonicalize any particular method of filling in "average" values for data frame variables. Whatever you do will be statistically dubious (in particular, using the mode of a factor variable gives me the creeps: do a subgroup analysis and your "average person" switches from male to female?), so I think this is one of those cases where it is best to provide mechanism, not policy.

-- Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
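[Editor's note] Peter's "mechanism, not policy" point can be illustrated with a small helper that builds a one-row newdata frame from typical values while leaving every choice overridable by the caller. This is a hypothetical sketch, not an existing R function: the name make_newdata and its defaults (column medians for numeric variables, the first level for factors) are assumptions for illustration only.

```r
## Hypothetical sketch: build a one-row "newdata" data frame for predict(),
## filling numeric columns with their median and factor columns with their
## first level, while letting the caller override any column explicitly.
make_newdata <- function(data, ...) {
  overrides <- list(...)
  filled <- lapply(data, function(col) {
    if (is.numeric(col)) median(col, na.rm = TRUE)
    else if (is.factor(col)) factor(levels(col)[1L], levels = levels(col))
    else col[1L]                    # fall back to the first observed value
  })
  filled[names(overrides)] <- overrides
  as.data.frame(filled)
}

## Example: predict mpg for a "typical" car, but forcing cyl to 6
fit <- lm(mpg ~ cyl + wt, data = mtcars)
predict(fit, newdata = make_newdata(mtcars, cyl = 6))
```

The design point is that the policy (median vs. mean, which factor level counts as "typical") stays in user code where it can be audited and overridden, while base R would only need to supply the mechanism.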