Re: [R] Stepwise SVM Variable selection
On 06/01/11 23:10:59, Noah Silverman wrote:

> I have a data set with about 30,000 training cases and 103 variables.
> I've trained an SVM (using the e1071 package) for a binary classifier
> {0,1}. The accuracy isn't great. I used a grid search over the C and
> gamma parameters with an RBF kernel to find the best settings. [...]
> Can anyone suggest an approach to seek the ideal subset of variables
> for my SVM classifier? The standard feature selection stuff
> (backward/forward etc.) is probably ruled out by the time it takes to
> compute all the sets and subsets.

What you could try is the following:

First, set up a cross-validation: split your data set into a training and a test set (ratio 0.9/0.1 or so).

Second, train your SVM on the training set (try conservative parameters first).

Third, have the trained SVM classify the test set and compute the classification error.

Fourth, iterate over all variables and do the following:

a) choose one variable and permute its values (only) in the test set
b) have your trained SVM (from step 2) classify this permuted test set and measure the classification error
c) repeat a) and b) a (high) number of times for significance
d) go to the next variable

Fifth, you can get an impression of each variable's importance by comparing the classification errors on the permuted test sets with the error on the non-permuted test set. If permuting one variable drastically increases the classification error, that variable is probably important.

Sixth, repeat the cross-validation / random sampling a number of times for significance.

This is more of an ad-hoc approach and there are some pitfalls, but the idea is easily explained and carries over to any other regression model used with cross-validation. The computational burden in SVMs is assumed to lie in the training step, not the prediction step, and you only need a relatively low number of training runs (sixth step) here.

Regards,
Georg.
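PS: steps one to five could be sketched in R roughly as follows (using e1071's svm(); the data frame, variable names and sizes are made up purely for illustration):

```r
library(e1071)

# toy data: binary response driven mostly by x1, a bit by x2, not by x3
set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- factor(ifelse(dat$x1 + 0.5 * dat$x2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# first: split into training (90%) and test (10%) set
idx   <- sample(n, 0.9 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]

# second/third: train and compute the baseline test error
fit      <- svm(y ~ ., data = train, kernel = "radial")
base.err <- mean(predict(fit, test) != test$y)

# fourth/fifth: permute one variable at a time in the test set only
vars <- c("x1", "x2", "x3")
imp  <- sapply(vars, function(v) {
  errs <- replicate(50, {
    perm      <- test
    perm[[v]] <- sample(perm[[v]])          # destroy this variable's information
    mean(predict(fit, perm) != perm$y)
  })
  mean(errs) - base.err                     # increase over the baseline error
})
print(sort(imp, decreasing = TRUE))         # larger increase = more important
```

In this toy setup permuting x1 should raise the error noticeably while permuting x3 should barely matter.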
--
Research Assistant
Otto-von-Guericke-Universität Magdeburg
resea...@georgruss.de
http://research.georgruss.de

______
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reset R to a vanilla state
On 16/12/10 15:12:47, Holger Hoefling wrote:

> Specifically I want all objects in the workspace removed

rm(list=ls()) should do this trick.

> and all non-base packages detached and unloaded

You may obtain the list of loaded packages via

(.packages())

Store this at the beginning of your session, get the diff to the loaded packages at the end of the session and detach(package:packagename) those packages.

> and preferably a .Rprofile executed as well

source(".Rprofile")?

What's the circumstance that requires you to do this? I.e. why don't you just restart R?

Regards,
Georg.
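A sketch of the detach-the-diff idea (the loaded package, tools, is only an example; unload = TRUE additionally unloads the namespace where possible):

```r
# at the beginning of the session: remember what is attached
pkgs.start <- (.packages())

# ... work: attach some packages ...
library(tools)

# at the end of the session: detach (and unload) everything new
pkgs.new <- setdiff((.packages()), pkgs.start)
for (p in pkgs.new) {
  detach(paste0("package:", p), character.only = TRUE, unload = TRUE)
}

# and clear the workspace
rm(list = ls())
```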
Re: [R] Need help on nnet
On 10/12/10 02:56:13, jothy wrote:

> Am working on a neural network. Below is the coding and the output [...]
>
> summary(uplift.nn)
> a 3-3-1 network with 16 weights
> options were -
>   b->h1  i1->h1  i2->h1  i3->h1
>   16.64    6.62  149.93    2.24
>   b->h2  i1->h2  i2->h2  i3->h2
>  -42.79  -17.40 -507.50   -5.14
>   b->h3  i1->h3  i2->h3  i3->h3
>    3.45    1.87   18.89    0.61
>    b->o   h1->o   h2->o   h3->o
>  402.81   41.29  236.76    6.06
>
> Q1: How to interpret the above output?

The summary above lists the internal weights that were learnt during the neural network training in nnet(). From my point of view I wouldn't really try to read any meaning into those weights, especially if you have multiple predictor variables.

> Q2: My objective is to know the contribution of each independent variable.

You may try variable importance (VI) approaches or feature selection approaches.

1) In VI you have a training and a test set, as in normal cross-validation. You train your network on the training set and use the trained network to predict the test values. The trick in VI is then to pick one variable at a time, permute its values in the test set only (!), and see how much the prediction error deviates from the original prediction error on the unpermuted test set. Repeat this a lot of times to get a meaningful result, and also be sure to use a lot of cross-validation permutations. The more the prediction error rises, the more important the respective variable was/is. This approach includes interactions between variables.

2) Feature selection is essentially an exhaustive approach which tries every possible subset of your predictors, trains a network on each, and sees what the prediction error is. The subset with the lowest error is then chosen in the end. As a side-effect, it normally also gives you something like an importance ranking of the variables when using backward or forward feature selection. But be careful about interactions between variables.
> Q3: Which package of neural network provides the AIC or BIC values?

You may try training with the multinom() function, as pointed out in msg09297: http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg09297.html

I hope I could point out some keywords and places to look at.

Regards,
Georg.
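For instance, a minimal sketch with the built-in iris data (multinom() lives in the nnet package and is fitted by maximum likelihood, so an AIC is available on the fitted object):

```r
library(nnet)

# multinomial log-linear model fitted via a neural network
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

print(fit$AIC)   # AIC stored on the fitted object
print(AIC(fit))  # equivalent, via logLik()
```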
Re: [R] Help..Neural Network
On 10/12/10 03:45:46, sadanandan wrote:

> I am trying to develop a neural network with a single target variable and 5 input variables to predict the importance of the input variables using R. I used the packages nnet and RSNNS. But unfortunately I could not interpret the output properly, and the documentation of those packages also does not give proper direction. Please help me find a good package with proper documentation for neural networks.

Hi,

please see the post http://r.789695.n4.nabble.com/Need-help-on-nnet-td3081744.html (title "Need help on nnet" by jothy) and see if that helps solve your problem. Otherwise you may try to provide some more input about what you're trying to do and ask again.

Regards,
Georg.
Re: [R] spatial clusters
On 10/12/10 23:26:28, dorina.lazar wrote:

> I am looking for a clustering method useful to classify the countries into clusters taking account of: a) the geographical distance (in km) between countries and b) some macroeconomic indicators (gdp, life expectancy, ...).

Hi Dorina,

before choosing R packages useful for this task, the task itself must be clarified. What does the data you're working with look like? I'm asking because it looks as if you're trying to mix spatial information (spatial distances) and non-spatial information in a clustering algorithm.

I've done a lot of research in this area because I needed something similar (combining spatial and non-spatial information) and the existing approaches weren't really useful in my case because I had equidistant spatial points with equal spatial density (management zone delineation in precision agriculture). There are a few algorithms which may be suitable for your work; maybe check out the references below (you should find those using only the title, otherwise please let me know):

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering
ICEAGE: Interactive Clustering and Exploration of Large and High-Dimensional Geodata
Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees (SKATER)

I haven't seen too many R implementations yet, though. You may also try the R-sig-geo mailing list, because your data look geo :-)

https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Regards,
Georg.
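PS: one simple ad-hoc way to mix the two kinds of information (not one of the dedicated algorithms above, just a weighted-dissimilarity sketch on made-up data; the weight w and k are free choices):

```r
# toy data: coordinates plus two scaled indicators (illustration only)
set.seed(1)
countries <- data.frame(lon = runif(20, -10, 30), lat = runif(20, 35, 60),
                        gdp = rnorm(20), lifeexp = rnorm(20))

d.geo  <- dist(countries[, c("lon", "lat")])             # spatial part
d.attr <- dist(scale(countries[, c("gdp", "lifeexp")]))  # non-spatial part

# combine both (each rescaled to [0, 1]) with a weight w, then cluster
w <- 0.5
d.mix <- w * d.geo / max(d.geo) + (1 - w) * d.attr / max(d.attr)
cl <- cutree(hclust(d.mix, method = "ward.D2"), k = 4)
print(table(cl))
```

Varying w between 0 (purely macroeconomic clusters) and 1 (purely geographic clusters) lets you inspect how the two views trade off.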
[R] nnet for regression, mixed factors/numeric in data.frame
Hi there,

this is more a comment and a solution rather than a question, but I thought I'd post it since it cost some time to dig down to the issue and maybe someone else could run into this.

I'm using the nnet function for a regression task. I'm inputting the following data frame:

'data.frame': 4970 obs. of 11 variables:
 $ EC25     : num 67.5 67.6 68 69 69.5 ...
 $ YIELD07  : num 5.43 5.68 5.88 5.81 6.47 5.96 5.71 5.92 5.92 6.47 ...
 $ N3       : num 63 63 55 58 59 57 59 55 54 54 ...
 $ N2       : num 45 44 41 42 44 43 46 47 46 43 ...
 $ N1       : num 68 68 69 69 69 69 69 69 69 68 ...
 $ REIP32   : num 725 725 725 725 725 ...
 $ REIP49   : num 727 728 728 728 727 ...
 $ ELEVATION: Factor w/ 1127 levels "67.71","67.73",..: 17 19 23 19 19 16 26 18 33 9 ...

using the formula interface:

formula <- YIELD07 ~ N1 + N2 + N3 + EC25 + REIP32 + REIP49 + ELEVATION

However, using the above data frame, R spits out the following message:

Error in nnet.default(x, y, w, ...) : too many (56701) weights

After changing the ELEVATION variable to a numeric variable via the following line:

f611$ELEVATION <- as.numeric(levels(f611$ELEVATION)[f611$ELEVATION])

the model runs fine. It's funny, though, that all the other models I've used for regression worked fine with ELEVATION being a factor variable. And it's not mentioned in ?nnet (there it only says that if the response variable is a factor, it's going to be a classification network).

Regards,
Georg.
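PS: the weight count is easy to reproduce. The formula interface expands the 1127-level factor into 1126 dummy columns, so the network sees 6 + 1126 = 1132 inputs, and the weight count explodes with the number of hidden units. A small sketch (the size = 50 below is an assumption on my part, chosen because it reproduces the 56701 from the error message):

```r
# a factor with many levels is expanded into (nlevels - 1) dummy columns
f <- factor(sprintf("lvl%04d", 1:1127))
ncol(model.matrix(~ f))              # 1127 columns: intercept + 1126 dummies

# a single-hidden-layer net with p inputs, `size` hidden units and one
# output has (p + 1) * size + (size + 1) weights (biases included)
size <- 50
(1132 + 1) * size + (size + 1) * 1   # = 56701, matching the error message
```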
Re: [R] kmeans() compared to PROC FASTCLUS
On 02/12/10 17:49:37, Andrew Agrimson wrote:

> I've been comparing results from kmeans() in R to PROC FASTCLUS in SAS and I'm getting drastically different results with a real-life data set. [...] Has anybody looked into the differences in the implementations or have any thoughts on the matter?

Hi Andrew,

as per the website below, it looks as if PROC FASTCLUS implements a certain flavor of k-means:

http://www.technion.ac.il/docs/sas/stat/chap27/sect2.htm

As per the manpage ?kmeans, the R implementation of k-means has the option to set one of the algorithms explicitly:

algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

I don't know whether you've tried that, but you may start by setting these algorithm variants explicitly and see what the outcome is.

Regards,
Georg.
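A quick way to compare the variants on your own data (sketched here with toy data; reseeding before each call keeps the random starts comparable):

```r
# toy data: two well-separated Gaussian blobs in 2-D
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

for (alg in c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) {
  set.seed(1)  # same random starts for a fair comparison
  fit <- kmeans(x, centers = 2, algorithm = alg)
  cat(alg, ": tot.withinss =", fit$tot.withinss, "\n")
}
```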
Re: [R] book about support vector machines
On 03/12/10 16:23:33, manuel.martin wrote:

> I am currently looking for a book about support vector machines for regression and classification and am a bit lost, since there are plenty of books dealing with this subject. I am not totally new to the field and would like to get more information on the subject for later use with, for instance, the e1071 package: http://cran.r-project.org/web/packages/e1071/index.html

Hi Manuel,

there are also the references mentioned in ?svm once you've loaded the e1071 library. Nevertheless, those are rather detailed on the implementation side, not on the general picture that I assume you'd like from a book.

library(e1071)
?svm

There's also the downloadable beginners' guide

C.-W. Hsu, C.-C. Chang, C.-J. Lin. A practical guide to support vector classification

mentioned in the additional information section of http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (which, in turn, is linked from ?svm).

Regards,
Georg.
Re: [R] how to know if a file exists on a remote server?
On 30/11/10 10:10:07, Baoqiang Cao wrote:

> I'd like to download some data files from a remote server. The problem here is that some of the files actually don't exist, which I don't know before trying. Just wondering if a function in R could tell me if a file exists on a remote server?

Hi Baoqiang,

try downloading the file with R's download.file() function, then examine the returned value. Citing a part of ?download.file below:

Value:
     An (invisible) integer code, '0' for success and non-zero for
     failure. For the 'wget' and 'lynx' methods this is the status
     code returned by the external program. The 'internal' method
     can return '1', but will in most cases throw an error.

So if you call your download via

v <- download.file(url, destfile, method = "wget")

and v is not equal to zero, then the file is likely to be non-existent (at least the download failed).

Note: the method "internal" doesn't really change the value of v, I just tried that. With "wget" it returns 0 for success and 2048 (or some other value) for non-success.

Regards,
Georg.
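A small wrapper along those lines (the function name and URL are made up; since the internal method usually throws an error rather than returning non-zero, tryCatch is the safer net):

```r
# returns TRUE if the file could be downloaded, FALSE otherwise
remote.file.exists <- function(url) {
  destfile <- tempfile()
  status <- tryCatch(download.file(url, destfile, quiet = TRUE),
                     error   = function(e) 1L,   # thrown for missing files
                     warning = function(w) 1L)
  unlink(destfile)                               # clean up the temp file
  status == 0
}

# usage (hypothetical URL):
# remote.file.exists("http://example.com/data.csv")
```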
Re: [R] Issues with nnet.default for regression/classification
On 29/11/10 11:57:31, Jude Ryan wrote:

> Hi Georg,
>
> The documentation (?nnet) says that y should be a matrix or data frame, but in your case it is a vector. This is most likely the problem, if you do not have other data issues going on. Convert y to a matrix (or data frame) using 'as.matrix' and see if this solves your problem. Library 'nnet' can do both classification and regression. I was able to replicate your problem, using an example from Modern Applied Statistics with S (Venables and Ripley, pages 246 and 247), by turning y into a vector and verifying that all the predicted values are the same when y is a vector. This is not the case when y is part of a data frame. You can see this by running the code below. I tried about 4 neural network packages in the past, including AMORE, but found 'nnet' to be the best for my needs.

Hi Jude,

thanks for the hint. I lately experimented with both the nnet(x, y, ...) and the nnet(formula, dataframe, ...) interfaces to nnet, and both yielded the same results. So changing the format of y from a vector to a matrix or a data frame didn't change anything at all.

However, what _did_ change the outcome was introducing the decay parameter (which I didn't have at all before). By default it is set to 0, which doesn't seem appropriate in my case. Setting it to decay=1e-3 magically turned my output into an acceptable regression response instead of spitting out fixed values. I really love the predict interface for regression in each of the models I'm using. Clear code :-)

So, for the record, the call to nnet for the regression problem is as follows:

net.fitted <- nnet(formula, data = sp...@data[-testset, ], decay = 1e-3, size = 20, linout = TRUE)

(where sp...@data is the data part of a SpatialPointsDataFrame. And yes, in selecting the [-testset, ] data points I'm taking into account the existing spatial autocorrelation.)
> # Neural Network model in Modern Applied Statistics with S, Venables and Ripley, pages 246 and 247

Thanks for your help and the reference, I'm likely to order the book now :-)

Leaving out the decay parameter changes the fitted.values in the rock example you mentioned as well, although not that much. Convergence speed does change as expected, so the parameter is working. I guess my problem is solved now; the rest is due to the specialties of my data sets.

Georg.
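For reference, the decay effect can be seen on the rock data set shipped with R (a sketch only; size and decay are example values, and results depend on the random initial weights):

```r
library(nnet)

# regression network on the rock data (linout = TRUE for regression)
set.seed(1)
fit0 <- nnet(log(perm) ~ scale(area) + scale(peri) + shape, data = rock,
             size = 5, linout = TRUE, decay = 0,    trace = FALSE)
set.seed(1)
fit1 <- nnet(log(perm) ~ scale(area) + scale(peri) + shape, data = rock,
             size = 5, linout = TRUE, decay = 1e-3, trace = FALSE)

# the fitted values differ once weight decay regularizes the network
summary(as.vector(fit0$fitted.values - fit1$fitted.values))
```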
Re: [R] Combind two different vector
On 27/11/10 16:04:35, Serdar Akin wrote:

> I'm trying to combine two vectors that have different lengths, without recycling the shorter one. E.g.,
>
> a <- seq(1:3)
> b <- seq(1:6)

If that means your output should be (1 2 3 1 2 3 4 5 6), then

c <- c(a, b)

should solve this. Looks like _the_ basic vector operation.

Georg.
Re: [R] Combind two different vector
On 27/11/10 19:04:27, Serdar Akin wrote:

> Hi
>
> No, it has to be like this:
>
> a b
> 1 1
> 2 2
> 3 3
>   4
>   5
>   6

Hmm, empty elements in such an array? That seems not really recommended, if it's possible at all. You may try filling up the shorter vector with NA's, or any other values that your application can understand appropriately. Then do rbind or cbind, as necessary.

Georg.

PS: you may also reply to the r-help list
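The NA-padding idea could be sketched like this (length<- extends a vector with NA's in base R):

```r
a <- 1:3
b <- 1:6

# pad the shorter vector with NA's up to the longer length
length(a) <- max(length(a), length(b))   # a is now c(1, 2, 3, NA, NA, NA)

m <- cbind(a, b)                         # 6 x 2 matrix, NA's where a ended
print(m)
```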
[R] Issues with nnet.default for regression/classification
Hi,

I'm currently trying desperately to get the nnet function for training a neural network (with one hidden layer) to perform a regression task. So I run it like the following:

trainednet <- nnet(x = traindata, y = trainresponse, size = 30, linout = TRUE, maxit = 1000)

(where x is a matrix and y a numerical vector consisting of the target values for one variable)

To see whether the network learnt anything at all, I checked the network weights and those have definitely changed. However, when examining trainednet$fitted.values, those are all the same, so it rather looks as if the network is doing a classification. I can even set linout=FALSE, and then it outputs 1 (the class?) for each training example. The trainednet$residuals are correct (difference between predicted/fitted example and actual response), but rather useless. The same happens if I run nnet with the formula/data.frame interface, btw.

As per the note on the ?nnet page ("If the response is not a factor, it is passed on unchanged to 'nnet.default'"), I assume that the network is doing regression, since my trainresponse variable is a numerical vector and _not_ a factor.

I'm currently lost, and I can't see that the AMORE/neuralnet packages are any better (moreover, they don't implement the formula/dataframe/predict things). I've read the manpages of nnet and predict.nnet a gazillion times, but I can't really find an answer there. I don't want to do classification, but regression.

Thanks for any help.

Georg.