Dear all,

Objective: I am trying to learn about neural networks. I want to see whether I can train an artificial neural network model to discriminate between spam and nonspam emails.

Problem: I created my own model (Example 1 below) and got an error of about 7.7%. I created the same model using the Rattle package (Example 2 below, based on Rattle's log script) and got a much better error of about 0.072%.

Question 1: I don't understand why the Rattle script gives a better result. I must be doing something wrong in my own script (Example 1) and would appreciate some insight :-)

Question 2: As Rattle gives a much better result, I would be happy to use its R code instead of my own. How can I interpret its predictions as being either 'spam' or 'nonspam'? I have looked at the type='class' parameter in ?predict.nnet, but I believe it doesn't apply to this situation.

Below I give commented, minimal, self-contained and reproducible code. (If you ignore the output, it really is very few lines of code, and therefore minimal, I believe.)

## Load library
> library(nnet)

## Load the spam dataset from package kernlab
> data(list = "spam", package = "kernlab")

## Split into a training set (3221 rows) and a test set (the remaining rows)
> set.seed(42)
> my.sample <- sample(nrow(spam), 3221)
> spam.train <- spam[my.sample, ]
> spam.test <- spam[-my.sample, ]

## Example 1 - my own code

# train artificial neural network (nn1)
> ( nn1 <- nnet(type~., data=spam.train, size=3, decay=0.1, maxit=1000) )

# predict spam.test dataset on nn1
> ( nn1.pr.test <- predict(nn1, spam.test, type='class') )
[1] "spam" "spam" "spam" "spam" "nonspam" "spam" "spam"
[etc...]

# error matrix
> (nn1.test.tab <- table(spam.test$type, nn1.pr.test, dnn=c('Actual', 'Predicted')))
         Predicted
Actual    nonspam spam
  nonspam     778   43
  spam         63  496

# Calculate overall error percentage ~ 7.68%
> (nn1.test.perf <- 100 * (nn1.test.tab[2] + nn1.test.tab[3]) / sum(nn1.test.tab))
[1] 7.68116

## Example 2 - code based on Rattle's log script

# train artificial neural network
> nn2 <- nnet(as.numeric(type)-1~., data=spam.train, size=3, decay=0.1, maxit=1000)

# predict spam.test dataset on nn2.
# ?predict.nnet does have the parameter type='class', but I can't use that here as an option
> ( nn2.pr.test <- predict(nn2, spam.test) )
             [,1]
3  0.984972396013
4  0.931149225918
10 0.930001139978
13 0.923271300707
21 0.102282256315
[etc...]

# error matrix
> ( nn2.test.tab <- round(100*table(nn2.pr.test, spam.test$type, dnn=c("Predicted", "Actual"))/length(nn2.pr.test)) )
                    Actual
Predicted            nonspam spam
  -0.741896935969825       0    0
  -0.706473834678304       0    0
  -0.595327594045746       0    0
[etc...]

# Calculate overall error percentage. I am not sure how this line works, to be honest,
# and I think it should be multiplied by 100. I got this from Rattle's log script.
> (function(x){return((x[1,2]+x[2,1])/sum(x))}) (table(nn2.pr.test, spam.test$type, dnn=c("Predicted", "Actual")))
[1] 0.0007246377
# I'm guessing the above should be ~0.072%

I know the above probably seems complicated, but any help that can be offered would be much appreciated.

Thank you kindly in advance,
Tony

OS = Windows Vista Ultimate, running R in admin mode

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] RGtk2_2.12.8 vcd_1.2-2 colorspace_1.0-0 MASS_7.2-45 rattle_2.4.8 nnet_7.2-45

loaded via a namespace (and not attached):
[1] tools_2.8.1
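
For Question 2, the following is a minimal sketch, not Rattle's method, of one way the numeric predictions from Example 2 might be turned into class labels and scored the same way as Example 1. It assumes the target was coded nonspam = 0 and spam = 1 (which is what as.numeric(type)-1 produces for a factor with levels "nonspam", "spam") and that 0.5 is an acceptable cutoff. The helper names to_class and error_pct are hypothetical and do not come from nnet or rattle. It may also be worth noting that 0.0007246377 is exactly 1/1380, i.e. one observation out of the 1380 test rows, which would be consistent with the Example 2 table having one row per distinct numeric prediction, so that x[1,2] + x[2,1] looks at only two of its cells.

```r
# Hypothetical helper (not part of nnet or rattle): map numeric network
# outputs to class labels using a cutoff. Assumes nonspam = 0, spam = 1.
to_class <- function(p, cutoff = 0.5, labels = c("nonspam", "spam")) {
  factor(ifelse(as.vector(p) > cutoff, labels[2], labels[1]), levels = labels)
}

# Hypothetical helper: overall error percentage from a square confusion
# matrix, i.e. all off-diagonal counts divided by the total count.
error_pct <- function(tab) {
  100 * (sum(tab) - sum(diag(tab))) / sum(tab)
}

# The five numeric predictions printed in Example 2
p <- c(0.984972396013, 0.931149225918, 0.930001139978,
       0.923271300707, 0.102282256315)
pred <- to_class(p)
print(pred)

# The confusion matrix printed in Example 1 (filled column by column)
tab1 <- matrix(c(778, 63, 43, 496), nrow = 2,
               dimnames = list(Actual = c("nonspam", "spam"),
                               Predicted = c("nonspam", "spam")))
print(error_pct(tab1))  # about 7.68, the figure reported in Example 1
```

With the objects from the post, the same helpers could be applied as table(spam.test$type, to_class(nn2.pr.test), dnn=c('Actual', 'Predicted')), and the resulting 2x2 table passed to error_pct, so that both examples are compared on the same footing.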