Hello everybody.

I am using the GA package [1] to optimize the hyperparameters of an
SVM, as is done in this example:
http://stackoverflow.com/questions/32026436/how-to-optimize-parameters-using-genetic-algorithms

However, when I adapt the example to random forest, the optimization
takes a very long time. This might be because the hyperparameters of
random forest (ntree, mtry, nodesize) are integers, but I don't know
whether there is a way to specify that in the algorithm. Any suggestions
would be very much appreciated. Thank you!

The code:

library(GA)
library("randomForest")

data(Ozone, package="mlbench")
Data <- na.omit(Ozone)

# Setup the data for cross-validation
K = 5 # 5-fold cross-validation
fold_inds <- sample(1:K, nrow(Data), replace = TRUE)
lst_CV_data <- lapply(1:K, function(i) list(
  train_data = Data[fold_inds != i, , drop = FALSE],
  test_data = Data[fold_inds == i, , drop = FALSE]))

# Given the values of parameters 'ntree', 'mtry' and 'nodesize', return the
# RMSE of the model over the test data
evalParamsRF <- function(train_data, test_data, ntree, mtry, nodesize) {
  # Train
  model <- randomForest(V4 ~ ., data = train_data, ntree = ntree,
                        mtry = mtry, nodesize = nodesize, proximity = TRUE)
  # Test: square root of the mean squared prediction error
  rmse <- sqrt(mean((predict(model, newdata = test_data) - test_data$V4) ^ 2))
  return (rmse)
}

fitnessFuncRF <- function(x, Lst_CV_Data) {
  # Retrieve the RF parameters
  ntree_val <- x[1]
  mtry_val <- x[2]
  nodesize_val <- x[3]
 
  # Use cross-validation to estimate the RMSE for each split of the dataset
  rmse_vals <- sapply(Lst_CV_Data, function(in_data) with(in_data,
    evalParamsRF(train_data, test_data, ntree_val, mtry_val, nodesize_val)))

  # As fitness measure, return minus the average RMSE (over the
  # cross-validation folds), so that by maximizing fitness we are
  # minimizing the RMSE
  return (-mean(rmse_vals))
}

theta_min <- c(ntree = 100, mtry = 2, nodesize = 3)
theta_max <- c(ntree = 1000, mtry = 7, nodesize = 20)

# Run the genetic algorithm
results <- ga(type = "real-valued", fitness = fitnessFuncRF, lst_CV_data,
              names = names(theta_min),
              min = theta_min, max = theta_max,
              popSize = 50, maxiter = 10)

summary(results)
summary(results)$solution
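
To make the integer question concrete, here is a minimal sketch of one
possible workaround, not a verified solution: keep type = "real-valued" and
convert each candidate to integers inside the fitness function. The helper
name toIntegerParams is hypothetical (it is not part of GA or randomForest),
and whether this actually shortens the run time is an assumption that would
need to be checked.

```r
# Hypothetical helper (not provided by GA or randomForest): map a
# real-valued candidate vector to integer hyperparameters by rounding
# and then clamping to the search bounds.
theta_min <- c(ntree = 100, mtry = 2, nodesize = 3)
theta_max <- c(ntree = 1000, mtry = 7, nodesize = 20)

toIntegerParams <- function(x, lower, upper) {
  x_int <- round(x)                         # nearest whole number
  x_int <- pmin(pmax(x_int, lower), upper)  # stay inside [lower, upper]
  setNames(as.integer(x_int), names(lower)) # integer vector, named as the bounds
}

# Example: a candidate as the GA might propose it
params <- toIntegerParams(c(250.7, 3.2, 12.4), theta_min, theta_max)
print(params)
```

Inside fitnessFuncRF, x could be passed through this helper before
params[["ntree"]], params[["mtry"]] and params[["nodesize"]] are handed to
evalParamsRF. Note also that with popSize = 50, maxiter = 10 and K = 5 the
search may fit on the order of 50 x 10 x 5 = 2500 forests, each with up to
1000 trees, so a long run time is plausible even with integer parameters.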



Links:
------
[1] https://cran.r-project.org/web/packages/GA/index.html


------
Aurora González Vidal
Ph.D. student in Data Analytics for Energy Efficiency

Faculty of Computer Sciences
University of Murcia

@. aurora.gonzal...@um.es
T. 868 88 7866
sae.saiblogs.inf.um.es

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
