Re: [R] Stepwise SVM Variable selection

2011-01-07 Thread Georg Ruß
On 06/01/11 23:10:59, Noah Silverman wrote:
> I have a data set with about 30,000 training cases and 103 variables.
> I've trained an SVM (using the e1071 package) for a binary classifier
> {0,1}.  The accuracy isn't great.  I used a grid search over the C and
> gamma parameters with an RBF kernel to find the best settings. [...]
>
> Can anyone suggest an approach to seek the ideal subset of variables for
> my SVM classifier?

The standard feature selection approaches (backward/forward selection
etc.) are probably ruled out by the time it takes to train on all the
candidate subsets. What you could try is the following:

First, do a cross-validation setup: split up your data set into a training
and testing set (ratio 0.9 / 0.1 or so).

Second, train your SVM on the training set (try conservative parameters
first).

Third, have your trained SVM classify the test set and compute the
classification error.

Fourth, iterate over all variables and do the following:
  a) choose one variable and permute its values (only) in the test set
  b) have your trained SVM (from step 2) classify this test set and 
  measure the classification error
  c) repeat a) and b) a (high) number of times to get stable error estimates
  d) go to next variable

Fifth, you can get an impression of the importance that one variable has
by comparing the errors generated on the permuted test set for each
variable with the non-permuted test set classification error. If the
permutation of one variable drastically increases the classification
error, the variable is probably important.

Sixth: repeat the cross-validation / random sampling a number of times to
get stable results.

This is more like an ad-hoc approach and there are some pitfalls, but the
idea is easily explained and carries over to any other regression model
with cross-validation. The computational burden of an SVM lies in
training, not prediction, and you only need a relatively small number of
training runs (sixth step) here.

Regards,
Georg.
-- 
Research Assistant
Otto-von-Guericke-Universität Magdeburg
resea...@georgruss.de
http://research.georgruss.de

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reset R to a vanilla state

2010-12-16 Thread Georg Ruß
On 16/12/10 15:12:47, Holger Hoefling wrote:
> Specifically I want all objects in the workspace removed

rm(list=ls()) should do this trick.

> and all non-base packages detached and unloaded

You may obtain the list of loaded packages via

(.packages())

Store this at the beginning of your session, diff it against the loaded
packages at the end of the session, and detach those packages via

detach("package:packagename")

> and preferably a .Rprofile executed as well

source(".Rprofile") ?

What's the circumstance that requires you to do this? I.e. why don't you
just restart R?
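The store-diff-detach idea can be sketched as follows. A rough sketch
only: detaching with unload = TRUE can fail for packages that other
loaded packages depend on, so the order of detachment may matter:

```r
# At the start of the session: remember which packages are loaded.
base.pkgs <- (.packages())

# ... work, library(...) calls, etc. ...

# At the end: detach (and unload) everything loaded since the start.
for (pkg in setdiff((.packages()), base.pkgs)) {
  detach(paste0("package:", pkg), character.only = TRUE, unload = TRUE)
}

# Clear the workspace and re-run the profile, if one exists.
rm(list = ls())
if (file.exists(".Rprofile")) source(".Rprofile")
```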

Regards,
Georg.


Re: [R] Need help on nnet

2010-12-13 Thread Georg Ruß
On 10/12/10 02:56:13, jothy wrote:
> Am working on neural network.
> Below is the coding and the output [...]
>
> > summary(uplift.nn)
> a 3-3-1 network with 16 weights
> options were -
>
>    b->h1  i1->h1  i2->h1  i3->h1
>    16.64    6.62  149.93    2.24
>    b->h2  i1->h2  i2->h2  i3->h2
>   -42.79  -17.40 -507.50   -5.14
>    b->h3  i1->h3  i2->h3  i3->h3
>     3.45    1.87   18.89    0.61
>     b->o    h1->o   h2->o   h3->o
>   402.81   41.29  236.76    6.06
>
> Q1: How to interpret the above output

The summary above lists the internal weights that were learnt during
neural network training in nnet(). I wouldn't try to read meaning into
the individual weights, especially if you have multiple predictor
variables.

> Q2: My objective is to know the contribution of each independent variable.

You may try something like variable importance approaches (VI) or feature
selection approaches. 

1) In VI you have a training and test set, as in normal cross-validation.
You train your network on the training set and use the trained network to
predict the test values. The key step in VI is then to pick one variable
at a time, permute its values in the test set only (!) and see how much
the prediction error deviates from the error on the unpermuted test set.
Repeat this many times to get a meaningful estimate, and also use several
cross-validation splits. The more the prediction error rises, the more
important the respective variable is. This approach captures interactions
between variables.

2) Feature selection in its exhaustive form tries every possible subset
of your predictors, trains a network on each and records the prediction
error; the subset with the lowest error is chosen in the end. The greedy
variants (backward or forward selection) are much cheaper and, as a side
effect, also give you something like an importance ranking of the
variables. But be careful about interactions between variables.

> Q3: Which neural network package provides AIC or BIC values?

You may try training with the multinom() function, as pointed out in
msg09297:
http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg09297.html
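A minimal illustration of the AIC route with multinom() (using iris as a
stand-in data set; AIC() works on multinom fits, and a BIC-style value
can be computed by hand from the log-likelihood):

```r
library(nnet)

# multinom() (from the nnet package shipped with R) fits a multinomial
# log-linear model via a neural network and supports AIC():
fit <- multinom(Species ~ ., data = iris, trace = FALSE)
AIC(fit)

# A BIC-style value by hand from the log-likelihood and its
# degrees of freedom:
ll <- logLik(fit)
-2 * as.numeric(ll) + log(nrow(iris)) * attr(ll, "df")
```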

I hope this gives you some keywords and places to look.

Regards,
Georg.


Re: [R] Help..Neural Network

2010-12-13 Thread Georg Ruß
On 10/12/10 03:45:46, sadanandan wrote:
> I am trying to develop a neural network with a single target variable and
> 5 input variables to predict the importance of the input variables using
> R. I used the packages nnet and RSNNS. But unfortunately I could not
> interpret the output properly, and the documentation of those packages
> gives no proper direction either. Please help me find a good package with
> proper documentation for neural networks.

Hi,

please see post
http://r.789695.n4.nabble.com/Need-help-on-nnet-td3081744.html (titled
"Need help on nnet", by jothy) and see if that helps solve your problem.
Otherwise, please provide some more detail about what you're trying to do
and ask again.

Regards,
Georg.


Re: [R] spatial clusters

2010-12-13 Thread Georg Ruß
On 10/12/10 23:26:28, dorina.lazar wrote:
> I am looking for a clustering method useful to classify the countries in
> some clusters, taking account of a) the geographical distance (in km)
> between countries and b) some macroeconomic indicators (gdp, life
> expectancy, ...).

Hi Dorina,

before choosing R packages useful for this task, the task itself must be
clarified. What does the data you're working with look like? I'm asking
because it looks as if you're trying to mix spatial (spatial distances)
and non-spatial information in a clustering algorithm. I've done a lot of
research in this area because I needed something similar (combining
spatial and non-spatial information) and the existing approaches weren't
really useful in my case because I had equidistant spatial points with
equal spatial density (management zone delineation in precision
agriculture).

There are a few algorithms which may be suitable for your work, maybe
check out the references below (you should find those using only the
title, otherwise please let me know):

MOSAIC: A Proximity Graph Approach for Agglomerative Clustering
ICEAGE: Interactive Clustering and Exploration of Large and
High-Dimensional Geodata
Efficient regionalization techniques for socio-economic geographical units
using minimum spanning trees (SKATER)

I haven't seen too many R implementations yet, though.

You may also try the R-sig-geo mailing list, because your data look geo
:-) https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Regards,
Georg.


[R] nnet for regression, mixed factors/numeric in data.frame

2010-12-09 Thread Georg Ruß
Hi there,

this is more a comment and a solution rather than a question, but I
thought I'd post it since it cost some time to dig down to the issue and
maybe someone else could run into this.

I'm using the nnet function for a regression task. I'm inputting the
following data frame:

 'data.frame':  4970 obs. of  11 variables:
  $ EC25     : num  67.5 67.6 68 69 69.5 ...
  $ YIELD07  : num  5.43 5.68 5.88 5.81 6.47 5.96 5.71 5.92 5.92 6.47 ...
  $ N3       : num  63 63 55 58 59 57 59 55 54 54 ...
  $ N2       : num  45 44 41 42 44 43 46 47 46 43 ...
  $ N1       : num  68 68 69 69 69 69 69 69 69 68 ...
  $ REIP32   : num  725 725 725 725 725 ...
  $ REIP49   : num  727 728 728 728 727 ...
  $ ELEVATION: Factor w/ 1127 levels "67.71","67.73",..: 17 19 23 19 19 16 26 18 33 9 ...

using the formula interface:
formula <- YIELD07 ~ N1 + N2 + N3 + EC25 + REIP32 + REIP49 + ELEVATION

However, using the above data.frame, R spits out the following message:
 Error in nnet.default(x, y, w, ...) : too many (56701) weights

After changing the ELEVATION variable to a numeric variable via the
following line:
f611$ELEVATION <- as.numeric(levels(f611$ELEVATION)[f611$ELEVATION])

the model runs fine.

It's funny though that all the other models I've used for regression
worked fine with ELEVATION being a factor variable. And it's not mentioned
in ?nnet (there, it only says that if the response variable is a factor
it's going to be a classification network).
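A small illustration of why a high-cardinality factor blows up the
weight count (the 26-level toy factor here is a stand-in for the
1127-level ELEVATION variable):

```r
# A factor expands to one dummy input column per non-reference level
# under the formula interface:
f <- factor(sample(letters, 100, replace = TRUE), levels = letters)
ncol(model.matrix(~ f))   # 26: intercept + 25 dummy columns

# For a single-hidden-layer nnet with p inputs, `size` hidden units and
# one output, the weight count is (p + 1) * size + (size + 1). With the
# ~1126 dummy columns of a 1127-level factor this quickly exceeds nnet's
# MaxNWts limit, hence the "too many weights" error.
```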

Regards,
Georg.


Re: [R] kmeans() compared to PROC FASTCLUS

2010-12-03 Thread Georg Ruß
On 02/12/10 17:49:37, Andrew Agrimson wrote:
> I've been comparing results from kmeans() in R to PROC FASTCLUS in SAS
> and I'm getting drastically different results with a real-life data set.
> [...] Has anybody looked into the differences in the implementations or
> have any thoughts on the matter?

Hi Andrew,

as per the website below, PROC FASTCLUS appears to implement a particular
flavor of k-means:

http://www.technion.ac.il/docs/sas/stat/chap27/sect2.htm

As per the manpage ?kmeans, the R implementation of k-means lets you set
the algorithm explicitly:

algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

I don't know whether you've tried that, but you may start by setting these
algorithm variants explicitly and see what the outcome is.
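A minimal sketch of such a comparison (the two-cluster toy data is a
stand-in; note that the number of random restarts, nstart, often matters
at least as much as the algorithm variant):

```r
set.seed(42)
# Toy data: two well-separated 2-D clusters of 50 points each.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Run the same clustering with each algorithm flavour and compare the
# within-cluster sums of squares.
for (alg in c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) {
  km <- kmeans(x, centers = 2, nstart = 25, algorithm = alg)
  cat(alg, ": tot.withinss =", km$tot.withinss, "\n")
}
```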

Regards,
Georg.


Re: [R] book about support vector machines

2010-12-03 Thread Georg Ruß
On 03/12/10 16:23:33, manuel.martin wrote:
> I am currently looking for a book about support vector machines for
> regression and classification and am a bit lost, since there are plenty
> of books dealing with this subject. I am not totally new to the field and
> would like to get more information on the subject for later use with the
> e1071 package (http://cran.r-project.org/web/packages/e1071/index.html),
> for instance.

Hi Manuel,

there are also the references mentioned in ?svm once you've loaded the
e1071 package. Those are rather detailed on the implementation side,
though, not the general picture I assume you'd like from a book.

library(e1071)
?svm

There's also the downloadable beginner's guide "A practical guide to
support vector classification" by C.-W. Hsu, C.-C. Chang and C.-J. Lin,
mentioned in the "additional information" section of
http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (which, in turn, is referenced
from ?svm).

Regards,
Georg.


Re: [R] how to know if a file exists on a remote server?

2010-11-30 Thread Georg Ruß
On 30/11/10 10:10:07, Baoqiang Cao wrote:
> I'd like to download some data files from a remote server; the problem
> is that some of the files don't actually exist, which I can't know
> before I try. Just wondering if a function in R could tell me whether a
> file exists on a remote server?

Hi Baoqiang,

try downloading the file with R's download.file() function. Then you
should examine the returned value.

Citing a part of ?download.file below:

 Value:
 An (invisible) integer code, ‘0’ for success and non-zero for
 failure.  For the ‘wget’ and ‘lynx’ methods this is the status
 code returned by the external program.  The ‘internal’ method can
 return ‘1’, but will in most cases throw an error.

So if you call your download via

v <- download.file(url, destfile, method = "wget")

and v is not equal to zero, then the file is likely to be non-existent (at
least the download failed). Note: the "internal" method doesn't really
change the value of v in my tests; with "wget" it returns 0 for success
and 2048 (or some other non-zero value) for failure.
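Wrapped up as a small helper (a sketch; the function name url_exists is
mine, and real-world use may want to distinguish failure causes rather
than collapsing errors and warnings into "not available"):

```r
# Treat a non-zero return value (or an error/warning) from
# download.file() as "file not available on the server".
url_exists <- function(url) {
  dest <- tempfile()
  on.exit(unlink(dest))
  status <- tryCatch(
    download.file(url, dest, quiet = TRUE),
    # The internal method throws an error instead of returning a code.
    error   = function(e) 1L,
    warning = function(w) 1L
  )
  status == 0
}
```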

Regards,
Georg.


Re: [R] Issues with nnet.default for regression/classification

2010-11-29 Thread Georg Ruß
On 29/11/10 11:57:31, Jude Ryan wrote:
> Hi Georg,
>
> The documentation (?nnet) says that y should be a matrix or data frame,
> but in your case it is a vector. This is most likely the problem, if
> you do not have other data issues going on. Convert y to a matrix (or
> data frame) using 'as.matrix' and see if this solves your problem.
> Library 'nnet' can do both classification and regression. I was able to
> replicate your problem, using an example from Modern Applied Statistics
> with S (Venables and Ripley, pages 246 and 247), by turning y into a
> vector and verifying that all the predicted values are the same when y
> is a vector. This is not the case when y is part of a data frame. You
> can see this by running the code below. I tried about 4 neural network
> packages in the past, including AMORE, but found 'nnet' to be the best
> for my needs.

Hi Jude,

thanks for the hint. I lately experimented both with the nnet(x,y, ...)
and the nnet(formula, dataframe ...) interfaces to nnet and both yielded
the same results. So changing the format of y from a vector to a matrix or
a data frame didn't change anything at all. However, what _did_ change the
outcome is to introduce the decay parameter (which I didn't have at all
before). By default it is set to 0 which doesn't seem appropriate in my
case. Setting it to decay=1e-3 magically turned my output into an
acceptable regression response instead of spitting out fixed values.

I really love the predict interface for regression in each of the models
I'm using. Clear code :-)

So, for the record, the call for nnet for the regression problem is as
follows:

net.fitted <- nnet(formula, data = sp...@data[-testset, ], decay = 1e-3,
size = 20, linout = TRUE)

(where sp...@data is the data part of a SpatialPointsDataFrame. And yes,
in selecting the [-testset,] data points I'm taking into account the existing
spatial autocorrelation.)

> # Neural Network model in Modern Applied Statistics with S, Venables
> # and Ripley, pages 246 and 247

Thanks for your help and the reference, I'm likely to order the book now
:-) Leaving out the decay parameter changes the fitted.values in the
rock example you mentioned as well, although not that much. Convergence
speed does change as expected, so the parameter is working. I guess my
problem is solved now; the rest is down to the peculiarities of my data
sets.

Georg.


Re: [R] Combind two different vector

2010-11-27 Thread Georg Ruß
On 27/11/10 16:04:35, Serdar Akin wrote:
> I'm trying to combine two vectors that have different lengths, without
> recycling the shorter one. E.g.,
>
> a <- seq(1:3)
> b <- seq(1:6)

If that means your output should be (1 2 3 1 2 3 4 5 6), then
c <- c(a, b) should solve this. That's _the_ basic vector operation.

Georg.


Re: [R] Combind two different vector

2010-11-27 Thread Georg Ruß
On 27/11/10 19:04:27, Serdar Akin wrote:
> Hi,
> no, it has to be like this:
>
> a b
> 1 1
> 2 2
> 3 3
>   4
>   5
>   6

Hmm, empty elements in such an array? That's not really possible: the
columns of a matrix or data frame must all have the same length. You may
try filling up the shorter vector with NAs (or any other value that your
application can handle appropriately), then do rbind or cbind, as
necessary.
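One base-R way to do that NA padding is via length<- assignment, which
pads a vector with NAs when the new length is longer:

```r
a <- 1:3
b <- 1:6

# Pad the shorter vector with NAs up to the longer length,
# then bind the two columns together.
length(a) <- max(length(a), length(b))
cbind(a, b)
#       a b
# [1,]  1 1
# [2,]  2 2
# [3,]  3 3
# [4,] NA 4
# [5,] NA 5
# [6,] NA 6
```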

Georg.

PS: you may also reply to the r-help list


[R] Issues with nnet.default for regression/classification

2010-11-26 Thread Georg Ruß
Hi,

I'm currently trying desperately to get the nnet function for training a
neural network (with one hidden layer) to perform a regression task.

So I run it like the following:

trainednet <- nnet(x = traindata, y = trainresponse, size = 30,
linout = TRUE, maxit = 1000)
(where x is a matrix and y a numerical vector consisting of the target
values for one variable)

To see whether the network learnt anything at all, I checked the network
weights and those have definitely changed. However, when examining the
trainednet$fitted.values, those are all the same so it rather looks as if
the network is doing a classification. I can even set linout=FALSE and
then it outputs 1 (the class?) for each training example. The
trainednet$residuals are correct (difference between predicted/fitted
example and actual response), but rather useless.

The same happens if I run nnet with the formula/data.frame interface, btw.

As per the suggestion on the ?nnet page ("If the response is not a
factor, it is passed on unchanged to 'nnet.default'"), I assume that the
network is doing regression, since my trainresponse variable is a
numerical vector and _not_ a factor.

I'm currently lost and I can't see that the AMORE/neuralnet packages are
any better (moreover, they don't implement the formula/dataframe/predict
things). I've read the manpages of nnet and predict.nnet a gazillion
times, but I can't really find an answer there. I don't want to do
classification, but regression.

Thanks for any help.

Georg.