Hello Everyone! It's been a while since I last posted a question! Hope everyone has been doing well!

~~~ CONTEXT ~~~

I have recently entered a beginner-level competition on Kaggle. The goal of the competition is to build a model that predicts who did/did not survive on the Titanic. I decided to use random forests, as I have been wanting to learn the algorithm and the competition was the perfect impetus. Unfortunately, the model I have built is not very accurate.

~~~ QUESTION ~~~

What can I do to make the model more accurate, in particular to lower the error for class 1 (survived)? Is there a cost matrix that I can input into the model? Should I improve the code? Learn more statistics (please provide resources :) )?

~~~ SOME CODE ~~~

# Response variable: Survived
# 0 = Did not survive
# 1 = Did survive

# First few steps
# 1. Used regsubsets to identify the 5 best variables
# 2. Cleaned the raw data and built a logistic regression to see the
#    significance of the predictors (and their levels if factor)
# 3. Developed a new 'train' dataset with a group of variables based on the
#    significances from the logistic regression

# PLEASE FEEL FREE TO SHARE FEATURE SELECTION/EXTRACTION METHODS AS I AM
# CLEARLY LACKING IN THAT AREA AS WELL :(

> head(train)
  survived age sibsp pclass2 pclass3 sexmale
1        0  22     1       0       1       1
2        1  38     1       0       0       0
3        1  26     0       0       1       0
4        1  35     1       0       0       0
5        0  35     0       0       1       1
6        0  27     0       0       1       1

> sapply(train,class)
 survived       age     sibsp   pclass2   pclass3   sexmale
 "factor" "numeric" "integer"  "factor"  "factor"  "factor"

> sapply(split(train,train$survived),function(x) dim(x)[1])
  0   1
549 342

> rf <- randomForest(train[,-1], train[,1],
+   ntree=10000,classwt=c(549/891,342/891),importance=TRUE,do.trace=FALSE)

        OOB estimate of error rate: 17.73%
Confusion matrix:
    0   1 class.error
0 500  49  0.08925319
1 109 233  0.31871345
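
To make the numbers in the output above easier to read, here is a minimal base-R sketch of how the per-class errors and the OOB error rate follow from the confusion matrix. The counts are copied from the output; the object names (conf, class_error, oob_error, class_share) are only illustrative and were not part of my session.

```r
# Confusion matrix copied from the OOB output above.
# Rows are the true class, columns are the predicted class.
conf <- matrix(c(500,  49,
                 109, 233),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("0", "1")))

# Per-class error: share of each true class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all 891 passengers that were misclassified.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Class proportions, i.e. the values I passed to classwt (549/891, 342/891).
class_share <- rowSums(conf) / sum(conf)

print(round(class_error, 4))
print(round(oob_error, 4))
print(round(class_share, 4))
```

The overall error is dominated by the larger class 0, which is why it looks reasonable even though roughly a third of the actual survivors (109 of 342) are predicted as not surviving.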