Re: [R] probabilities from predict.svm

2010-08-20 Thread Steve Lianoglou
Awesome!

Good news, James. Thanks for letting us know. Glad you were able to
sort this out.

-steve


On Thu, Aug 19, 2010 at 5:00 PM, Watling,James I watli...@ufl.edu wrote:
 Hi Steve--

 I spent some more time tuning the model with alternative gamma and cost 
 values, but still kept coming back to the same issue re: probabilities. I 
 spent some more time playing around with the code, and realized that the 
 error did indeed have to do with the ifelse() function I used to feed the 
 probabilities into the ascii file.  I have rewritten the code with a 
 replace() statement, and the probabilities have 'landed' in the correct place 
 in the ascii file.  The resulting map is exactly what I would expect.

 Thanks for your helpful suggestions that forced me to figure this out!

 Much appreciated

 James


 -Original Message-
 From: Steve Lianoglou [mailto:mailinglist.honey...@gmail.com]
 Sent: Thursday, August 19, 2010 11:39 AM
 To: Watling,James I
 Cc: r-h...@lists.r-project.org
 Subject: Re: [R] probabilities from predict.svm

 On Thu, Aug 19, 2010 at 10:56 AM, Watling,James I watli...@ufl.edu wrote:
 Hi Steve--

 Thanks for your interest in helping me figure this out.  I think the problem 
 has to do with the values of the probabilities returned from the use of the 
 model to predict occurrence in a new dataframe.

 Ok, so if you're sure this is the problem, and not, say, getting the
 correct values for the predictor variables at a given point, then I'd
 be a bit more thorough when building your model.

 Originally you said:

 I have used a training dataset to train the model, and tested it against a 
 validation data set with good results: AUC is high, and the confusion matrix 
 indicates low commission and omission errors.

 Maybe your originally good AUCs were just a function of your train/test split?

 Why not use all of your data and do something like 10 fold cross
 validation to find:

 (1) Your average accuracy over your folds
 (2) The best value for your cost parameter (how did you pick cost=1?)
 (3) Or even the best kernel to use.

 Doing 2 and 3 will likely be time consuming. To help with (2) you
 might try looking at the svmpath package:

 http://cran.r-project.org/web/packages/svmpath/index.html

 It only works on 2-class classification problems, and (I think) only with
 a linear kernel (sorry, I don't remember off hand, but it's written in
 the package help and linked pubs).

 You don't need to use svmpath, but then you'll need to define a grid
 of C values (or maybe a 2D grid, if your svm + kernel combo has more
 params) and train over these values ... takes lots of CPU time, but
 not too much human time.

 Does that make sense?

 --
 Steve Lianoglou
 Graduate Student: Computational Systems Biology
  | Memorial Sloan-Kettering Cancer Center
  | Weill Medical College of Cornell University
 Contact Info: http://cbio.mskcc.org/~lianos/contact




-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] probabilities from predict.svm

2010-08-19 Thread Steve Lianoglou
Hi James,

I'd like to help you out, but I'm not sure I understand what the problem is.

Does the problem lie with building a predictive SVM, or getting the
right values (class probabilities) to land in the right place on your
map/plot?

-steve

On Wed, Aug 18, 2010 at 3:09 PM, Watling,James I watli...@ufl.edu wrote:
 Dear R Community-

 I am a new user of support vector machines for species distribution modeling 
 and am using package e1071 to run svm() and predict.svm().  Briefly, I want 
 to create an svm model for classification of a factor response (species 
 presence or absence) based on climate predictor variables.  I have used a 
 training dataset to train the model, and tested it against a validation data 
 set with good results: AUC is high, and the confusion matrix indicates low 
 commission and omission errors.  The code for the best-fit model is:

 svm.model <- svm(as.factor(acutus) ~ p_feb + p_jan + p_mar + p_sep + t_feb + t_july + t_june + t_mar,
                  cost = 1, gamma = 1, probability = T)

 Because ultimately I want to create prediction maps of probabilities of 
 species occurrence under future climate change, I want to use the results of 
 the validated model to predict probability of presence using data describing 
 future conditions.  I have created a data frame (predict.data) with new 
 values for the same predictor variables used in the original model; each 
 value corresponds to an observation from a raster grid of the study area.  I 
 enabled the probability option when creating the original model, and acquire 
 the probabilities using the predict function:
 pred.map <- predict(svm.model, predict.data, probability=T).  However, when I 
 use probs <- attr(pred.map, "probabilities") to acquire the probabilities for 
 each grid cell, the spatial signature of the probabilities does not make sense.  
 I have extracted the column of probabilities for class = 1 (probability of 
 presence), and the resulting map of the study area is spatially accurate (it 
 has the right shape), but the probability values are incorrect, or at least 
 in the wrong place.  I am attaching a pdf (SVM prediction maps) of the 
 resulting map using probabilities obtained using the code described above 
 (page 1) and a map of what the prediction map should look like given spatial 
 autocorrelation in climate predictors (page 2, map generated using 
 openmodeller).  Note that the openmodeller map was created with the same 
 input data and same svm algorithm (also using code from libsvm) as the model 
 in R, just run using different software.  I don't know why the prediction map 
 of probabilities based on the model is  so different from what I would 
 expect, and would appreciate any thoughts from the group.
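
For reference, a compact sketch of the workflow described above, using e1071. The data = train.data argument is an assumption (the original call does not show where the training variables live); everything else follows the code quoted in this message.

library(e1071)

## Fit the classifier; probability = TRUE is required here so that
## predict() can return class probabilities later on.
svm.model <- svm(as.factor(acutus) ~ p_feb + p_jan + p_mar + p_sep +
                   t_feb + t_july + t_june + t_mar,
                 data = train.data,          # assumed training data frame
                 cost = 1, gamma = 1, probability = TRUE)

## Predict on the new climate data frame and pull out the probability matrix.
pred.map <- predict(svm.model, predict.data, probability = TRUE)
probs    <- attr(pred.map, "probabilities")

## Columns of 'probs' are named after the factor levels of the response, in
## the order stored in the fitted model -- index by name, not by position.
p.presence <- probs[, "1"]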

 All the best

 James

 ***
 James I Watling, PhD
 Postdoctoral Research Associate
 University of Florida
 Ft. Lauderdale Research & Education Center
 3205 College Avenue
 Ft Lauderdale, FL 33314 USA
 954.577.6316 (phone)
 954.475.4125 (fax)









-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] probabilities from predict.svm

2010-08-19 Thread Steve Lianoglou
On Thu, Aug 19, 2010 at 10:56 AM, Watling,James I watli...@ufl.edu wrote:
 Hi Steve--

 Thanks for your interest in helping me figure this out.  I think the problem 
 has to do with the values of the probabilities returned from the use of the 
 model to predict occurrence in a new dataframe.

Ok, so if you're sure this is the problem, and not, say, getting the
correct values for the predictor variables at a given point, then I'd
be a bit more thorough when building your model.

Originally you said:

 I have used a training dataset to train the model, and tested it against a 
 validation data set with good results: AUC is high, and the confusion matrix 
 indicates low commission and omission errors.

Maybe your originally good AUCs were just a function of your train/test split?

Why not use all of your data and do something like 10 fold cross
validation to find:

(1) Your average accuracy over your folds
(2) The best value for your cost parameter (how did you pick cost=1?)
(3) Or even the best kernel to use.

Doing 2 and 3 will likely be time consuming. To help with (2) you
might try looking at the svmpath package:

http://cran.r-project.org/web/packages/svmpath/index.html

It only works on 2-class classification problems, and (I think) only with
a linear kernel (sorry, I don't remember off hand, but it's written in
the package help and linked pubs).

You don't need to use svmpath, but then you'll need to define a grid
of C values (or maybe a 2D grid, if your svm + kernel combo has more
params) and train over these values ... takes lots of CPU time, but
not too much human time.

Does that make sense?
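
One way to run the kind of search described here is with e1071's own tune.svm(); in the sketch below, the formula and train.data are placeholders taken from the model in the original post, and the gamma/cost grids are arbitrary starting points.

library(e1071)

## 10-fold cross-validated grid search over gamma and cost for the RBF SVM.
## 'train.data' is a placeholder for the training data frame.
set.seed(1)
tuned <- tune.svm(as.factor(acutus) ~ p_feb + p_jan + p_mar + p_sep +
                    t_feb + t_july + t_june + t_mar,
                  data = train.data,
                  gamma = 10^(-3:1), cost = 10^(-1:2),
                  probability = TRUE,
                  tunecontrol = tune.control(sampling = "cross", cross = 10))

summary(tuned)                   # cross-validated error for every gamma/cost pair
tuned$best.parameters            # the winning combination
best.model <- tuned$best.model   # model refit with those parameters

plot(tuned) also gives a quick picture of the error surface over the grid.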

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] probabilities from predict.svm

2010-08-19 Thread Watling,James I
Hi Steve--

Thanks for your interest in helping me figure this out.  I think the problem 
has to do with the values of the probabilities returned from the use of the 
model to predict occurrence in a new dataframe.  The svm model I referenced in 
the original message (svm.model) does a good job classifying species presence 
and absence in the test data set I used, so I don't think the problem is with 
building the predictive svm per se.  The problem comes when I take that model 
and use it to calculate probabilities based on the climate predictors--the 
resulting probabilities range from 0-1, but the probability of presence 
associated with specific cells just does not make sense.

If you take a look at the maps I attached in the original message I think the 
problem becomes very clear; the maps model the probability of occurrence for 
the American Crocodile--a species with an entirely tropical distribution.  The 
second map looks exactly like the prediction map for the species should--the 
warmer colors essentially delineate the geographic range of the species.  The 
first map, with probabilities extracted by using svm.model to predict 
occurrence as a function of climate variables in the same area (the 
predict.data dataframe), does not make any sense.  I don't think the problem 
is with getting the probabilities in the right place, because the relative 
position of predicted values and NAs used to define the map makes sense--the 
map looks like a map of southern North America and northern South America, 
just as it should.  So the probabilities are in the right place on the map.  
The problem is that the probabilities associated with each individual cell 
are, in a word, wrong.

The original model (svm.model) was parameterized with 10,000 pseudoabsences 
drawn from throughout the entire region, so the range of climate values used 
to create the original model is the same as that reflected in the data I am 
using to build the prediction map.  I can't think of any reason that the 
probabilities returned from pred.map <- predict(svm.model, predict.data, 
probability=T) should be so off-base, but it seems like they are. 
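
One diagnostic worth sketching here, using the objects named above (a sanity check, not a fix): the columns of the probability matrix are named after the factor levels stored in the model and are not guaranteed to be in 0/1 order, so it is safer to index by name and to cross-tabulate against the hard class predictions.

## Check which column really is the presence class before mapping it.
probs <- attr(pred.map, "probabilities")
colnames(probs)              # e.g. "0" "1" -- order follows the model's factor levels

p.presence <- probs[, "1"]   # index by class label, not by column position

## The hard class predictions and the thresholded probabilities should broadly
## agree; a wildly inconsistent table would point at an indexing/ordering slip.
table(predicted = pred.map, high.prob = p.presence > 0.5)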

Any thoughts?

James

   
-Original Message-
From: Steve Lianoglou [mailto:mailinglist.honey...@gmail.com] 
Sent: Thursday, August 19, 2010 10:24 AM
To: Watling,James I
Cc: r-h...@lists.r-project.org
Subject: Re: [R] probabilities from predict.svm

Hi James,

I'd like to help you out, but I'm not sure I understand what the problem is.

Does the problem lie with building a predictive SVM, or getting the
right values (class probabilities) to land in the right place on your
map/plot?

-steve

On Wed, Aug 18, 2010 at 3:09 PM, Watling,James I watli...@ufl.edu wrote:
 Dear R Community-

 I am a new user of support vector machines for species distribution modeling 
 and am using package e1071 to run svm() and predict.svm().  Briefly, I want 
 to create an svm model for classification of a factor response (species 
 presence or absence) based on climate predictor variables.  I have used a 
 training dataset to train the model, and tested it against a validation data 
 set with good results: AUC is high, and the confusion matrix indicates low 
 commission and omission errors.  The code for the best-fit model is:

 svm.model <- svm(as.factor(acutus) ~ p_feb + p_jan + p_mar + p_sep + t_feb + t_july + t_june + t_mar,
                  cost = 1, gamma = 1, probability = T)

 Because ultimately I want to create prediction maps of probabilities of 
 species occurrence under future climate change, I want to use the results of 
 the validated model to predict probability of presence using data describing 
 future conditions.  I have created a data frame (predict.data) with new 
 values for the same predictor variables used in the original model; each 
 value corresponds to an observation from a raster grid of the study area.  I 
 enabled the probability option when creating the original model, and acquire 
 the probabilities using the predict function:
 pred.map <- predict(svm.model, predict.data, probability=T).  However, when I 
 use probs <- attr(pred.map, "probabilities") to acquire the probabilities for 
 each grid cell, the spatial signature of the probabilities does not make sense.  
 I have extracted the column of probabilities for class = 1 (probability of 
 presence), and the resulting map of the study area is spatially accurate (it 
 has the right shape), but the probability values are incorrect, or at least 
 in the wrong place.  I am attaching a pdf (SVM prediction maps) of the 
 resulting map using probabilities obtained using the code described above 
 (page 1) and a map of what the prediction map should look like given spatial 
 autocorrelation in climate predictors (page 2, map generated using 
 openmodeller).  Note that the openmodeller map was created with the same 
 input data and same svm algorithm (also using code from libsvm) as the model 
 in R, just run using different software.  I don't know why the prediction map

Re: [R] probabilities from predict.svm

2010-08-19 Thread Watling,James I
Hi Steve--

I spent some more time tuning the model with alternative gamma and cost values, 
but still kept coming back to the same issue re: probabilities. I spent some 
more time playing around with the code, and realized that the error did indeed 
have to do with the ifelse() function I used to feed the probabilities into the 
ascii file.  I have rewritten the code with a replace() statement, and the 
probabilities have 'landed' in the correct place in the ascii file.  The 
resulting map is exactly what I would expect.
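
James's actual raster-writing code isn't shown, but for the archive, here is one way the replace() step might look for an ESRI-style ASCII grid; the file names, the six-line header convention, and the -9999 NODATA value are all assumptions.

## Hypothetical sketch: put predicted probabilities back into the cells of a
## flattened ASCII grid, leaving NODATA cells untouched.  Assumes the rows of
## predict.data (and hence p.presence) line up with the non-NODATA cells.
grid.vals  <- scan("future_climate.asc", skip = 6)   # 6 header lines in an ESRI ASCII grid
data.cells <- which(grid.vals != -9999)              # cells that actually had climate data

out.vals <- replace(grid.vals, data.cells, p.presence)

## Copy the original header, then append the new values.
writeLines(readLines("future_climate.asc", n = 6), "prediction.asc")
write(out.vals, file = "prediction.asc", append = TRUE,
      ncolumns = 1000)   # placeholder: should equal the grid's 'ncols' header value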

Thanks for your helpful suggestions that forced me to figure this out!

Much appreciated

James


-Original Message-
From: Steve Lianoglou [mailto:mailinglist.honey...@gmail.com] 
Sent: Thursday, August 19, 2010 11:39 AM
To: Watling,James I
Cc: r-h...@lists.r-project.org
Subject: Re: [R] probabilities from predict.svm

On Thu, Aug 19, 2010 at 10:56 AM, Watling,James I watli...@ufl.edu wrote:
 Hi Steve--

 Thanks for your interest in helping me figure this out.  I think the problem 
 has to do with the values of the probabilities returned from the use of the 
 model to predict occurrence in a new dataframe.

Ok, so if you're sure this is the problem, and not, say, getting the
correct values for the predictor variables at a given point, then I'd
be a bit more thorough when building your model.

Originally you said:

 I have used a training dataset to train the model, and tested it against a 
 validation data set with good results: AUC is high, and the confusion matrix 
 indicates low commission and omission errors.

Maybe your originally good AUCs were just a function of your train/test split?

Why not use all of your data and do something like 10 fold cross
validation to find:

(1) Your average accuracy over your folds
(2) The best value for your cost parameter (how did you pick cost=1?)
(3) Or even the best kernel to use.

Doing 2 and 3 will likely be time consuming. To help with (2) you
might try looking at the svmpath package:

http://cran.r-project.org/web/packages/svmpath/index.html

It only works on 2-class classification problems, and (I think) only with
a linear kernel (sorry, I don't remember off hand, but it's written in
the package help and linked pubs).

You don't need to use svmpath, but then you'll need to define a grid
of C values (or maybe a 2D grid, if your svm + kernel combo has more
params) and train over these values ... takes lots of CPU time, but
not too much human time.

Does that make sense?

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.