Hello Everyone!

It's been a while since I last posted a question! Hope everyone has been
doing well!

~~~ CONTEXT ~~~
     I have recently entered a beginner-level competition on Kaggle. The
goal of the competition is to build a model that predicts who did/did not
survive on the Titanic.

     I decided to use random forests as I have been wanting to learn the
algorithm and the competition was the perfect impetus. Unfortunately, the
model I have built is not very accurate.


~~~ QUESTION ~~~
     What can I do to make the model more accurate, in particular to reduce
the error for class 1 (survived)? Is there a cost matrix that I can pass to
the model? Should I improve the code, or learn more statistics (please
provide resources :) )?


~~~ SOME CODE ~~~
# Response variable: Survived
# 0 = Did not survive
# 1 = Did survive

# First few steps
# 1. Used regsubsets to identify the 5 best variables
# 2. Cleaned the raw data and built a logistic regression to see the
#    significance of the predictors (and their levels if factor)
# 3. Developed a new 'train' dataset with a group of variables based on the
#    significances from the logistic regression
# PLEASE FEEL FREE TO SHARE FEATURE SELECTION/EXTRACTION METHODS AS I AM
# CLEARLY LACKING IN THAT AREA AS WELL :(
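
To make steps 2 and 3 concrete, here is a minimal sketch in base R. It runs
on simulated data, not the real Kaggle file, so the numbers mean nothing.
The names raw, fit, coef_table and train_sketch are made up for this
illustration, and step 1 (regsubsets) is left out. Note that model.matrix
yields numeric 0/1 dummies, whereas the dummies in my actual 'train' are
factors.

```r
# Hypothetical stand-in for the raw competition data (all values simulated).
set.seed(1)
n <- 200
raw <- data.frame(
  survived = factor(rbinom(n, 1, 0.4)),
  age      = round(runif(n, 1, 70)),
  sibsp    = rpois(n, 0.5),
  pclass   = factor(sample(1:3, n, replace = TRUE)),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Step 2: logistic regression to inspect the significance of each predictor
# (factor predictors are expanded into one coefficient per non-baseline level).
fit <- glm(survived ~ age + sibsp + pclass + sex, data = raw, family = binomial)
coef_table <- summary(fit)$coefficients  # estimate, std. error, z value, p-value

# Step 3: build a dummy-coded training frame; [, -1] drops the intercept column.
X <- model.matrix(~ age + sibsp + pclass + sex, data = raw)[, -1]
train_sketch <- data.frame(survived = raw$survived, X)

head(train_sketch)
```

The resulting columns (survived, age, sibsp, pclass2, pclass3, sexmale) match
the layout of the 'train' data shown below.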

> head(train)

  survived age sibsp pclass2 pclass3 sexmale
1        0  22     1       0       1       1
2        1  38     1       0       0       0
3        1  26     0       0       1       0
4        1  35     1       0       0       0
5        0  35     0       0       1       1
6        0  27     0       0       1       1


> sapply(train,class)
 survived       age     sibsp   pclass2   pclass3   sexmale
 "factor" "numeric" "integer"  "factor"  "factor"  "factor"


> sapply(split(train,train$survived),function(x) dim(x)[1])
  0   1
549 342


> rf <- randomForest(train[,-1], train[,1],
+ ntree=10000,classwt=c(549/891,342/891),importance=TRUE,do.trace=FALSE)
> rf
        OOB estimate of  error rate: 17.73%
Confusion matrix:
    0   1 class.error
0 500  49  0.08925319
1 109 233  0.31871345
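
For reference, the error figures above follow directly from the four counts
in the confusion matrix (rows are the actual class, columns the predicted
class). A small sketch of the arithmetic, with the counts typed in by hand
and variable names that are only illustrative:

```r
# Confusion matrix counts copied from the output above.
cm <- matrix(c(500,  49,
               109, 233),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: share of all 891 passengers that were misclassified.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

# Share of actual survivors that the forest predicted as survivors.
sensitivity <- cm["1", "1"] / sum(cm["1", ])

round(class_error, 8)       # 0.08925319 0.31871345
round(100 * oob_error, 2)   # 17.73
```

So 109 of the 342 actual survivors are predicted as non-survivors, which is
the error I would most like to reduce.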


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
