Dear All, I am using a Bootstrapping approach with the Cubist model. It is possible to get the variable importance (percentages in variable usage in the models) after running the models. For each run of the model based on the random sampling, the variable importance is different. A robust estimate may be determined by taking the average of all the percentages of usage for each specific variable involved in the models. For this purpose, I have (1) saved each model to disk, (2) read back in R each model, (2) got the variable importance for each model, (3) got a final table with all the models and the percentages allocated by each model for the variables. the final table I am expecting should like the table below. It is possible to do it manually by calling in each model but the workload is too high when you have 100 number of bootstraps.
Here below a reproducible example with dataset from the ithir package. I have limited the bootstrap to 3 for quick run. Also below I have shown how I could get the table focusing on one model at a time. Is there anyway to have all the models called in a loop, and extract the variable importance for each model and arrange them finally to get the table instead of doing it for each model separately ? I am actually running the models 100 times. I will appreciate any help. Expected table------------------------------------------------------- Variables VImp cubist.type MRVBF 62.5 Vimp.cubist1 AACN 72.5 Vimp.cubist1 NDVI 41 Vimp.cubist1 Mid_Slope_Positon 24.5 Vimp.cubist1 Landsat_Band1 24 Vimp.cubist1 Terrain_Ruggedness_Index 44.5 Vimp.cubist1 TWI 42 Vimp.cubist1 Hillshading 41.5 Vimp.cubist1 Slope 41.5 Vimp.cubist1 Light_insolation 16 Vimp.cubist1 Elevation 11.5 Vimp.cubist1 MRVBF 81.5 Vimp.cubist2 Elevation 48.5 Vimp.cubist2 NDVI 62.5 Vimp.cubist2 Mid_Slope_Positon 43 Vimp.cubist2 TWI 54.5 Vimp.cubist2 AACN 41.5 Vimp.cubist2 Landsat_Band1 38.5 Vimp.cubist2 Hillshading 37.5 Vimp.cubist2 Terrain_Ruggedness_Index 27 Vimp.cubist2 Slope 21 Vimp.cubist2 Light_insolation 20.5 Vimp.cubist2 MRVBF 78 Vimp.cubist3 Hillshading 62 Vimp.cubist3 TWI 57 Vimp.cubist3 Terrain_Ruggedness_Index 55.5 Vimp.cubist3 AACN 55 Vimp.cubist3 Light_insolation 38.5 Vimp.cubist3 NDVI 50.5 Vimp.cubist3 Slope 49 Vimp.cubist3 Landsat_Band1 47.5 Vimp.cubist3 Mid_Slope_Positon 45.5 Vimp.cubist3 Elevation 31.5 Vimp.cubist3 ## REPRODUCIBLE EXAMPLE rm(list = ls()) library(Cubist) library(ithir) # install.packages("ithir", repos=" http://R-Forge.R-project.org <http://r-forge.r-project.org/>") library(caret) #Set a working directory setwd("") # Point data data(HV_subsoilpH) # subset data for modeling set.seed(123) training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH)) cDat <- HV_subsoilpH[training, ] vDat <- HV_subsoilpH[-training, ] #Create folder to store models dir.create("models", recursive = TRUE) # Number of bootstraps nbag <- 3 # Fit cubist models for each bootstrap for (i in 1:nbag) { trainingREP <- sample.int(nrow(cDat), 1.0 * nrow(cDat),replace = TRUE) fit_cubist <- cubist(x = cDat[trainingREP, c("Terrain_Ruggedness_Index", "AACN", "Landsat_Band1", "Elevation", "Hillshading", "Light_insolation", "Mid_Slope_Positon", "MRVBF", "NDVI", "TWI", "Slope")], y = cDat$pH60_100cm[trainingREP], cubistControl(rules = 5, extrapolation = 5), committees = 3) modelFile <-paste(getwd(), "./models/",sep = "", "bootMod_", i,".rds") saveRDS(object = fit_cubist, file = modelFile) } # List all files in directory c.models <- list.files(path = paste(getwd(),"./models", sep = ""), pattern = "\\.rds$", full.names = TRUE) #Reads the models fit_cubist<- readRDS(c.models[i]) varImp(fit_cubist) # But it only give variable importance for the final model #Alternatively, I can call the model one by one and apply varImp #Call models one by one fit_cubist1<- readRDS("./models/bootMod_1.rds") fit_cubist2<- readRDS("./models/bootMod_2.rds") fit_cubist3<- readRDS("./models/bootMod_3.rds") #Apply varImp on each model VImp1<-varImp(fit_cubist1) VImp2<-varImp(fit_cubist2) VImp3<-varImp(fit_cubist3) #Get a data frame for each variable importance cub.VImp1<-as.data.frame(VImp1[1]) names(cub.VImp1)[1]<-"VImp" cub.VImp2<-as.data.frame(VImp2[1]) names(cub.VImp2)[1]<-"VImp" cub.VImp3<-as.data.frame(VImp3[1]) names(cub.VImp3)[1]<-"VImp" #Create a column for each variable importance related to each model cub.VImp1$cubist.type<-"Vimp.cubist1" cub.VImp2$cubist.type<-"Vimp.cubist2" cub.VImp3$cubist.type<-"Vimp.cubist3" #Join the row names to each dataframe cub.VImp1<-setNames(cbind(rownames(cub.VImp1), cub.VImp1),c("Variables","VImp","cubist.model")) cub.VImp2<-setNames(cbind(rownames(cub.VImp2), cub.VImp2),c("Variables","VImp","cubist.model")) cub.VImp3<-setNames(cbind(rownames(cub.VImp3), cub.VImp3),c("Variables","VImp","cubist.model")) #Bind all the models df together by rows All.VarImp<-bind_rows(cub.VImp1,cub.VImp2,cub.VImp3) #Get the average of variable importance for each variable mean.VarImp <- ddply(All.VarImp, .(Variables), summarise, mean = mean(VImp), sd = sd( VImp)) mean.VarImp Ozias Hounkpatin Post doc Sveriges lantbruksuniversitet Swedish University of Agricultural Sciences Dept. of Soil and Environment PO Box 1234, SE-123 45 Uppsala Visiting address: Lennart Hjems våg 9 Phone: +46 18 67 12 51, Mobile: +46 72 207 85 62 ozias.hounkpa...@slu.se , www.slu.se Hounkpatin Ozias <oziashounkpa...@gmail.com> 17:41 (10 minutes ago) to R-sig-geo ReplyForward <https://drive.google.com/u/0/settings/storage?hl=en-GB> <https://www.google.com/intl/en-GB/policies/terms/> <https://www.google.com/intl/en-GB/policies/privacy/> <https://www.google.com/gmail/about/policy/> [[alternative HTML version deleted]] _______________________________________________ R-sig-Geo mailing list R-sig-Geo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-geo