Re: [R] Random Forest: OOB performance = test set performance?
I think the only thing you are doing wrong is not setting the random seed (set.seed()), so your results are not reproducible. Depending on the random sample used to select the training and test sets, you get slightly varying accuracy for both; sometimes one is better and sometimes the other.

HTH,
Peter

On Sat, Apr 10, 2021 at 8:49 PM wrote:
>
> Hi ML,
>
> For random forest, I thought that the out-of-bag performance should be
> the same (or at least very similar) to the performance calculated on a
> separated test set.
>
> But this does not seem to be the case.
>
> In the following code, the accuracy computed on out-of-bag sample is
> 77.81%, while the one computed on a separated test set is 81%.
>
> Can you please check what I am doing wrong?
>
> Thanks in advance and best regards.
>
> library(randomForest)
> library(ISLR)
>
> Carseats$High <- ifelse(Carseats$Sales<=8,"No","Yes")
> Carseats$High <- as.factor(Carseats$High)
>
> train = sample(1:nrow(Carseats), 200)
>
> rf = randomForest(High~.-Sales,
>                   data=Carseats,
>                   subset=train,
>                   mtry=6,
>                   importance=T)
>
> acc <- (rf$confusion[1,1] + rf$confusion[2,2]) / sum(rf$confusion)
> print(paste0("Accuracy OOB: ", round(acc*100,2), "%"))
>
> yhat <- predict(rf, newdata=Carseats[-train,])
> y <- Carseats[-train,]$High
> conftest <- table(y, yhat)
> acctest <- (conftest[1,1] + conftest[2,2]) / sum(conftest)
> print(paste0("Accuracy test set: ", round(acctest*100,2), "%"))

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
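As a rough illustration of the reply above, here is a minimal base-R sketch. It shows that fixing the seed makes the split repeatable, and it uses a back-of-the-envelope binomial standard error to suggest how much an accuracy estimate can move from one random split to the next. The seed value, the assumed accuracy of 0.80 and the figure of 400 rows are assumptions chosen for illustration, not values taken from a run of the posted code.

```r
# Fixing the seed makes sample() return the same indices on every run.
set.seed(42)                       # arbitrary seed, for illustration only
train_a <- sample(1:400, 200)      # 400 rows assumed for the data set
set.seed(42)
train_b <- sample(1:400, 200)
same_split <- identical(train_a, train_b)
same_split                         # TRUE: the split is reproducible

# Rough binomial standard error of an accuracy estimate.
# p = assumed true accuracy, n = number of evaluated cases.
p  <- 0.80
n  <- 200
se <- sqrt(p * (1 - p) / n)
round(se, 4)                       # about 0.0283, i.e. roughly 2.8 percentage points
```

Under these assumptions, a gap of about three percentage points between two estimates that are each based on 200 cases is about the size of one standard error, which is consistent with ordinary sampling variation.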
[R] Random Forest: OOB performance = test set performance?
Hi ML,

For random forest, I thought that the out-of-bag performance should be the same (or at least very similar) to the performance calculated on a separated test set.

But this does not seem to be the case.

In the following code, the accuracy computed on out-of-bag sample is 77.81%, while the one computed on a separated test set is 81%.

Can you please check what I am doing wrong?

Thanks in advance and best regards.

library(randomForest)
library(ISLR)

Carseats$High <- ifelse(Carseats$Sales<=8,"No","Yes")
Carseats$High <- as.factor(Carseats$High)

train = sample(1:nrow(Carseats), 200)

rf = randomForest(High~.-Sales,
                  data=Carseats,
                  subset=train,
                  mtry=6,
                  importance=T)

acc <- (rf$confusion[1,1] + rf$confusion[2,2]) / sum(rf$confusion)
print(paste0("Accuracy OOB: ", round(acc*100,2), "%"))

yhat <- predict(rf, newdata=Carseats[-train,])
y <- Carseats[-train,]$High
conftest <- table(y, yhat)
acctest <- (conftest[1,1] + conftest[2,2]) / sum(conftest)
print(paste0("Accuracy test set: ", round(acctest*100,2), "%"))
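One detail in the code above may be worth checking. For classification forests, rf$confusion carries an extra class.error column next to the counts, so sum(rf$confusion) is slightly larger than the number of out-of-bag cases and the computed accuracy comes out slightly too low. The sketch below uses a hypothetical helper, accuracy_from_confusion(), and made-up counts shaped like rf$confusion. It is not output from the posted code.

```r
# Hypothetical helper: accuracy from a confusion matrix that may carry
# an extra "class.error" column, as randomForest's $confusion does.
accuracy_from_confusion <- function(conf) {
  counts <- conf[, colnames(conf) != "class.error", drop = FALSE]
  sum(diag(counts)) / sum(counts)
}

# Made-up counts for 200 cases (rows = observed, columns = predicted).
conf <- cbind(
  No  = c(No = 100, Yes = 26),
  Yes = c(No = 18,  Yes = 56)
)
conf <- cbind(conf, class.error = c(18 / 118, 26 / 82))

acc_counts_only <- accuracy_from_confusion(conf)           # 156 / 200
acc_with_error  <- (conf[1, 1] + conf[2, 2]) / sum(conf)   # denominator includes class.error
c(acc_counts_only, acc_with_error)
```

The difference is small, so it does not by itself explain a gap of several percentage points, but it does explain why an accuracy based on 200 cases need not be a multiple of 0.5%.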
Re: [R] Stata/Rstudio evil attributes
Hi Roger,

You could look at the attributes() function in base R. See:

> ?attributes

From the help page:

> ## strip an object's attributes:
> attributes(x) <- NULL

HTH, Bill.

W. Michels, Ph.D.

On Sat, Apr 10, 2021 at 4:20 AM Koenker, Roger W wrote:
>
> Wolfgang,
>
> Thanks, this is _extremely_ helpful.
>
> Roger
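A minimal sketch of how the suggestion above could be applied to every column of a data frame. The data frame and the attribute names ("label", "format.stata") are stand-ins assumed for illustration; they are not taken from the D.Rda file discussed in this thread.

```r
# Stand-in for a data frame whose columns carry extra attributes.
D <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))
attr(D$x, "label")        <- "mean nets per household"   # assumed attribute
attr(D$x, "format.stata") <- "%9.0g"                     # assumed attribute

before <- is.vector(D$x)   # FALSE: attributes other than names are present

# Strip attributes column by column; assigning into D[] keeps D a data.frame.
D[] <- lapply(D, function(col) {
  attributes(col) <- NULL
  col
})

after <- is.vector(D$x)    # TRUE
c(before = before, after = after)
```

Note that attributes(col) <- NULL also removes class information, so a factor column would be reduced to its integer codes. For purely numeric columns, as in this sketch, that is not an issue.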
Re: [R] Comparing dates in two large data frames
Hello,

The following solution seems to work and is fast, as findInterval is. It first determines where in df2$start each value of df1$Time falls. It then uses that index to check whether those Times are not greater than the corresponding df2$end. I checked against a small subset of df1 and the results were right.

result <- logical(nrow(df1))
inx <- findInterval(df1$Time, df2$start)
not_zero <- inx != 0
result[not_zero] <- df1$Time[not_zero] <= df2$end[ inx[not_zero] ]

Hope this helps,

Rui Barradas

At 12:06 on 10/04/21, Kulupp wrote:
> Dear all,
>
> I have two data frames (df1 and df2) and for each timepoint in df1 I
> want to know: is it within any of the timespans in df2? The result
> (e.g. "no" or "yes" or 0 and 1) should be shown in a new column of df1
>
> Here is the code to create the two data frames (the size of the two data
> frames is approx. the same as in my original data frames):
>
> # create data frame df1
> ti1 <- seq.POSIXt(from=as.POSIXct("2020/01/01", tz="UTC"), to=as.POSIXct("2020/06/01", tz="UTC"), by="10 min")
> df1 <- data.frame(Time=ti1)
>
> # create data frame df2 with random timespans, i.e. start and end dates
> start <- sort(sample(seq(as.POSIXct("2020/01/01", tz="UTC"), as.POSIXct("2020/06/01", tz="UTC"), by="1 mins"), 5000))
> end <- start + 120
> df2 <- data.frame(start=start, end=end)
>
> Everything I tried (ifelse combined with sapply or for loops) has been
> very very very slow. Thus, I am looking for a reasonably fast solution.
>
> Thanks a lot for any hint in advance !
>
> Cheers,
>
> Thomas
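The findInterval() approach above relies on df2$start being sorted and, implicitly, on a later span never ending before an earlier one (true in the posted data, where every span lasts 120 seconds). The sketch below wraps the same idea in a hypothetical helper, in_any_span(), and adds cummax() over the end times so that spans of different lengths are also handled. The toy times are made up, and the result is compared against a slow brute-force check.

```r
# Hypothetical helper: TRUE where a time lies in at least one [start, end] span.
in_any_span <- function(times, starts, ends) {
  tn  <- as.numeric(times)
  ord <- order(as.numeric(starts))
  sn  <- as.numeric(starts)[ord]
  en  <- cummax(as.numeric(ends)[ord])   # latest end among spans started so far
  idx <- findInterval(tn, sn)            # index of the last start <= time (0 if none)
  hit <- logical(length(tn))
  ok  <- idx > 0
  hit[ok] <- tn[ok] <= en[idx[ok]]
  hit
}

# Toy data (made up): the first span is long and overlaps the second one.
origin <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
times  <- origin + c(0, 60, 500, 700, 1000)
starts <- origin + c(30, 300, 900)
ends   <- starts + c(600, 120, 120)

fast <- in_any_span(times, starts, ends)

# Brute-force reference: compare every time against every span.
tn <- as.numeric(times); sn <- as.numeric(starts); en <- as.numeric(ends)
brute <- vapply(tn, function(t) any(t >= sn & t <= en), logical(1))

identical(fast, brute)
```

With spans of constant length, as in the original question, the cummax() step changes nothing and the helper reduces to the code in the reply.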
Re: [R] Stata/Rstudio evil attributes
Wolfgang,

Thanks, this is _extremely_ helpful.

Roger

> On Apr 10, 2021, at 11:59 AM, Viechtbauer, Wolfgang (SP) wrote:
>
> [...]
>
> So, with 'D', qss() returns the qss function (c.f., qss(B$x) and qss(D$x))
> which makes no sense. So, the internal logic in qss() needs to be fixed.
[R] Comparing dates in two large data frames
Dear all,

I have two data frames (df1 and df2) and for each timepoint in df1 I want to know: is it within any of the timespans in df2? The result (e.g. "no" or "yes" or 0 and 1) should be shown in a new column of df1.

Here is the code to create the two data frames (the size of the two data frames is approx. the same as in my original data frames):

# create data frame df1
ti1 <- seq.POSIXt(from=as.POSIXct("2020/01/01", tz="UTC"), to=as.POSIXct("2020/06/01", tz="UTC"), by="10 min")
df1 <- data.frame(Time=ti1)

# create data frame df2 with random timespans, i.e. start and end dates
start <- sort(sample(seq(as.POSIXct("2020/01/01", tz="UTC"), as.POSIXct("2020/06/01", tz="UTC"), by="1 mins"), 5000))
end <- start + 120
df2 <- data.frame(start=start, end=end)

Everything I tried (ifelse combined with sapply or for loops) has been very very very slow. Thus, I am looking for a reasonably fast solution.

Thanks a lot for any hint in advance !

Cheers,

Thomas
Re: [R] Stata/Rstudio evil attributes
Dear Roger,

The problem is this. qss() looks like this:

if (is.matrix(x)) {
   [...]
}
if (is.vector(x)) {
   [...]
}
qss

Now let's check these if() statements:

is.vector(B$x) # TRUE
is.vector(D$x) # FALSE
is.matrix(B$x) # FALSE
is.matrix(D$x) # FALSE

is.vector(D$x) being FALSE may be surprising, but see ?is.vector: "is.vector returns TRUE if x is a vector of the specified mode having no attributes other than names. It returns FALSE otherwise." And as D$x shows, this vector has additional attributes.

So, with 'D', qss() returns the qss function (cf. qss(B$x) and qss(D$x)), which makes no sense. So, the internal logic in qss() needs to be fixed.

>In accordance with the usual R-help etiquette I first tried to contact the
>maintainer of the haven package, i.e. RStudio, which elicited the response:
>"since the error is occurring outside RStudio we’re not responsible, so try
>Stack Overflow". This is pretty much what I would have expected from the
>capitalist running dogs they are. Admittedly, the error is probably due to
>some unforeseen

This kind of bashing is really silly. Can you tell us again how much you paid for the use of the haven package?

Best,
Wolfgang

>-----Original Message-----
>From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Koenker, Roger W
>Sent: Saturday, 10 April, 2021 11:26
>To: r-help
>Subject: [R] Stata/Rstudio evil attributes
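The is.vector() behaviour described in the reply above can be reproduced in base R without any Stata file. The sketch below uses a made-up attribute, and describe_input() is a hypothetical function that only illustrates one way to avoid falling through both if() branches silently. It is not the actual code of qss() and not a proposed patch for quantreg.

```r
x_plain <- c(1.5, 2.5, 3.5)
x_attr  <- structure(x_plain, label = "some label")   # made-up attribute

is.vector(x_plain)   # TRUE
is.vector(x_attr)    # FALSE: an attribute other than names is present
is.numeric(x_attr) && is.null(dim(x_attr))   # TRUE: this check ignores extra attributes

# Hypothetical dispatch that stops with an error instead of
# silently skipping every branch.
describe_input <- function(x) {
  if (is.matrix(x)) {
    out <- "matrix"
  } else if (is.null(dim(x))) {
    out <- "vector"
  } else {
    stop("unsupported input")
  }
  out
}

describe_input(x_attr)             # "vector", despite the extra attribute
describe_input(matrix(1:4, 2))     # "matrix"
```

Testing is.null(dim(x)) rather than is.vector(x) is one design choice among several. It treats any dimensionless object as a vector, which may or may not be what a given function wants.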
[R] Stata/Rstudio evil attributes
As shown in the reproducible example below, I used the RStudio package haven to read a Stata .dta file, and then tried to do some fitting with the resulting data.frame. This produced an error from my fitting function rqss() in the package quantreg. After a bit of frustrated cursing, I converted the data.frame, D, to a matrix A, and thence back to a data.frame B, and tried again, which worked as expected. The conversion removed the attributes of D. My question is: why were the attributes inhibiting the fitting?

In accordance with the usual R-help etiquette I first tried to contact the maintainer of the haven package, i.e. RStudio, which elicited the response: "since the error is occurring outside RStudio we’re not responsible, so try Stack Overflow". This is pretty much what I would have expected from the capitalist running dogs they are. Admittedly, the error is probably due to some unforeseen infelicity in my rqss() coding, but it does seem odd that attributes could have such a drastic effect. I would be most grateful for any insight the R commune might offer.

#require(haven) # for reading dta file
#Ddta <- read_dta("foo.dta")
#D <- with(Ddta, data.frame(y = access_merg, x = meannets_allhh, z = meanhh))
#save(D, file = "D.Rda")
con <- url("http://www.econ.uiuc.edu/~roger/research/data/D.Rda")
load(con)

# If I purge the Stata attributes in D:
A <- as.matrix(D)
B <- as.data.frame(A)

# This works:
with(D,plot(x, y, cex = .5, col = "grey"))
taus <- 1:4/5
require(quantreg)
for(i in 1:length(taus)){
    f <- rqss(y ~ qss(x, constraint = "I", lambda = 1), tau = taus[i], data = B)
    plot(f, add = TRUE, col = i)
}
# However, the same code with data = D, does not. Why?
Re: [R] Assigning several lists to variables whose names are contained in other variables
Hello,

I believe that the point we are missing is that datatable$column stores the *names* of the graphs, not the graph objects themselves. So in the loop the objects must be retrieved with mget() or get().

First create a reproducible example.

library(tidygraph)

my_function <- function(g){
  stopifnot(inherits(g, "igraph"))
  g %>% mutate(centrality = centrality_pagerank())
}

MYSUBNET1 <- create_notable('bull')
iris_clust <- hclust(dist(iris[1:4]))
MYSUBNET2 <- as_tbl_graph(iris_clust)
MYSUBNET3 <- play_smallworld(1, 100, 3, 0.05)

datatable <- data.frame(column = paste0("MYSUBNET", 1:3))

Now apply the function above to each of the "tbl_graph" objects.

1. Get all objects and apply the function in one instruction. Then assign the new names.

result <- lapply(mget(datatable$column, envir = .GlobalEnv), my_function)
names(result) <- paste("subnet", datatable$column, sep = "_")

2. Loop through the column with lapply, getting one object and applying the function one at a time. Then assign the new names.

result2 <- lapply(datatable$column, function(net_name){
  NET <- get(net_name, envir = .GlobalEnv)
  my_function(NET)
})
names(result2) <- paste("subnet", datatable$column, sep = "_")

3. Loop through the column with a for loop, getting one object and applying the function one at a time. Then assign the new names.

result3 <- vector("list", length = nrow(datatable))
for(i in seq_along(datatable$column)){
  net_name <- datatable$column[i]
  NET <- get(net_name, envir = .GlobalEnv)
  result3[[i]] <- my_function(NET)
}
names(result3) <- paste("subnet", datatable$column, sep = "_")

4. Now check that all 3 solutions give the same result.

identical(result, result2)
#[1] TRUE
identical(result, result3)
#[1] TRUE

Is this it?

Rui Barradas

At 17:23 on 09/04/21, Wolfgang Grond wrote:

As I wrote before, I calculate tbl_graph objects, which will be joined afterwards. Not too many; the number of graphs to calculate is in the range of 5 to 20.

Further steps are not automated, because they depend on what the single graphs look like, and which of them will be joined.

For this reason I thought it would be nice to have the single tbl_graph objects stored in variables having the name of the graph.

For this reason I tried to find a better solution instead of assigning each graph by hand:

subnet_MYSUBNET <- my_function(MYSUBNET)

To my understanding it is therefore necessary to assign the result of the function to a variable whose name consists of a fixed string and the content of a further variable.

That was the intention for me to ask.

On 9 April 2021 at 17:22:05 CEST, David Winsemius wrote:

On 4/9/21 5:21 AM, Wolfgang Grond wrote:

Greg, here I get the error message:

Error my_function(val) : cannot find function my_function.

I'm guessing that you are following someone else's blog and have failed one of two things:

- understand that what was meant by the author was that you were assumed to have a function in mind to use for a programming strategy being illustrated

- or you were copying and pasting only part of a blog and failed to paste in the code from above where there was earlier code defining `my_function`

On 9 April 2021 at 12:35:40 CEST, Greg Minshall wrote:

Wolfgang,

result <- assign(paste("subnet_", val, sep = "")
result <- my_function(val)

i don't understand why you are twice assigning to =result=. also, the first assignment doesn't seem well formatted (it's missing a value?).

did you mean something like :

assign(paste("subnet_", val, sep = ""), my_function(val))

(which i would think should work)?

cheers, Greg

- Numberland -
Dr. Wolfgang Grond
Diplomphysiker, TQM-Assessor (EFQM)
Six Sigma Green Belt
Ingenieurbüro / Engineering Consultancy
Lohfeld 20, DE-95326 Kulmbach, Germany
Phone: +49 9221 6919131
Fax: +49 9221 6919156
Mail: gr...@numberland.com
URL: http://www.numberland.com
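For readers without tidygraph installed, the name-to-object pattern discussed in this thread can be sketched in base R alone. In the sketch below the integer vectors stand in for the graph objects and my_function() is a hypothetical placeholder that just sums its input. Only mget(), lapply() and list2env() are the actual mechanism being shown.

```r
# Stand-ins for the graph objects (made up for illustration).
MYSUBNET1 <- 1:3
MYSUBNET2 <- 4:6

# Hypothetical placeholder for the real processing function.
my_function <- function(x) sum(x)

net_names <- c("MYSUBNET1", "MYSUBNET2")

# Look the objects up by name, apply the function, and name the results.
results <- lapply(mget(net_names, envir = environment()), my_function)
names(results) <- paste0("subnet_", net_names)
results

# If separate variables are really wanted, create them from the named list.
list2env(results, envir = environment())
subnet_MYSUBNET1
```

Keeping the results in one named list is usually easier to work with afterwards than a set of generated variables, but list2env() (or assign() in a loop, as suggested earlier in the thread) covers the case where individual variables are preferred.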