[R] splitting dataset based on variable and re-combining

2012-12-10 Thread Brian Feeny

I have a dataset and I wish to use two different models to predict.  Both 
models are SVM.  The reason for two different models is based
on the sex of the observation.  I wish to be able to make predictions and have 
the results be in the same order as my original dataset.  To
illustrate I will use iris:

# Take Iris and create a dataframe of just two Species, setosa and versicolor,
# shuffle them
data(iris)
iris <- iris[(iris$Species=="setosa" | iris$Species=="versicolor"),]
irisindex <- sample(1:nrow(iris), nrow(iris))
iris <- iris[irisindex,]

# Make predictions on setosa using the mySetosaModel model, and on versicolor
# using the myVersicolorModel:

predict(mySetosaModel, iris[iris$Species=="setosa",])
predict(myVersicolorModel, iris[iris$Species=="versicolor",])

The problem is this will give me a vector of just the setosa results, and then 
one of just the versicolor results.

I wish to take the results and have them be in the same order as the original 
dataset.  So if the original dataset had:


Species
setosa
setosa
versicolor
setosa
versicolor
setosa

I wish for my results to have:
prediction for setosa
prediction for setosa
prediction for versicolor
prediction for setosa
prediction for versicolor
prediction for setosa

But instead, what I am ending up with is two result sets, and no way I can 
think of to combine them.  I am sure this comes up a lot where you have a factor 
you wish to split your models on, say sex (male vs. female), and you need to 
present the results back so that they match the order of the original dataset.

I have tried to think of ways to use an index, to try to keep things in order, 
but I can't figure it out.

Any help is greatly appreciated.

Brian

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] splitting dataset based on variable and re-combining

2012-12-10 Thread Brian Feeny


I will look into that, thanks.  I am afraid I don't quite understand what is 
going on there with the multiplication, so I will need to read up.  What I 
ended up doing was like so:

For train data, it's easy, as I can subset to have the model only work off the 
data I want:

rbfSVM_setosa <- train(Sepal.Length~., data = trainset,
                       subset = trainset$Species=="setosa", ...)
rbfSVM_versicolor <- train(Sepal.Length~., data = trainset,
                           subset = trainset$Species=="versicolor", ...)

For my test data (testset), I ended up doing like so which appears to work:

index_setosa <- which(testset$Species == "setosa")

svmPred <- as.vector(rep(NA,nrow(testset)))
svmPred[index_setosa] <- predict(rbfSVM_setosa,
                                 testset[testset$Species == "setosa",])
svmPred[is.na(svmPred)] <- predict(rbfSVM_versicolor,
                                   testset[testset$Species == "versicolor",])

The above works when there are just two classes.  I am going to read on some of 
these other ways suggested and give them a try.
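
The index approach above can be extended beyond two classes. The following is a minimal sketch, not the code from the thread: it assumes lm() as a stand-in for the SVM models (any model with a predict() method should behave the same way), and the names dat, models and pred are made up for illustration.

```r
# Sketch: one model per Species, predictions written back in original row order.
# lm() is only a stand-in for the SVMs discussed in the thread.
data(iris)
set.seed(42)
dat <- iris[sample(nrow(iris)), ]          # shuffled rows, all three species

# Fit one model per level of the splitting factor.
models <- lapply(split(dat, dat$Species), function(d) {
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d)
})

# Pre-allocate the result, then fill each group's slots by logical index.
pred <- rep(NA_real_, nrow(dat))
for (lev in names(models)) {
  rows <- dat$Species == lev
  pred[rows] <- predict(models[[lev]], newdata = dat[rows, ])
}
```

Because each group is written into its own positions, the number of classes does not matter and no slot is filled by elimination. Base R's unsplit() can do the same reassembly from a list of per-group prediction vectors.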

Brian




On Dec 10, 2012, at 10:38 PM, Thomas Stewart tgs.public.m...@gmail.com wrote:

 Why not use an indicator variable? 
 
 P1 <- ... # prediction from model 1 (Setosa) for entire dataset
 
 P2 <- ... # prediction from model 2 for entire dataset
 
 I <- Species=="setosa" #
 
 Predictions <- P1 * I + P2 * ( 1 - I )
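
To make the multiplication concrete, here is a small worked sketch. The numbers are invented and stand in for real model output; species, P1 and P2 are illustrative only. A logical vector counts as 1/0 in arithmetic, so each row keeps exactly one of the two predictions.

```r
# Toy numbers standing in for predictions from two models over the full dataset.
species <- c("setosa", "setosa", "versicolor", "setosa", "versicolor", "setosa")
P1 <- c(5.1, 4.9, 9.9, 5.0, 9.9, 5.4)   # model 1 (setosa) applied to every row
P2 <- c(0.0, 0.0, 6.4, 0.0, 5.7, 0.0)   # model 2 (versicolor) applied to every row

I <- species == "setosa"                 # TRUE/FALSE, treated as 1/0 in arithmetic
Predictions <- P1 * I + P2 * (1 - I)     # keeps P1 where I is 1, P2 where I is 0

# Same selection without arithmetic; also works for non-numeric predictions.
Predictions2 <- ifelse(I, P1, P2)
```

One caveat with the arithmetic form: if a model returns NA for a row it is not meant for, NA * 0 is still NA, so ifelse() is the safer of the two.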
 




[R] Assignment of values with different indexes

2012-12-05 Thread Brian Feeny

I would like to take the values of observations and map them to a new index.  I 
am not sure how to accomplish this.  The result would look like so:

x[1,2,3,4,5,6,7,8,9,10]
becomes
y[2,4,6,8,10,12,14,16,18,20]

The newindex would not necessarily be this sequence, but a sequence I have 
stored in a vector, so it could be all kinds of values.  here is what happens:

> x <- rnorm(10)
> myindex <- seq(from = 1,to = 20, by = 2)
> y <- numeric()
> y[myindex] <- x
> y
 [1] -0.03745988          NA -0.09078822          NA  0.92484413          NA  0.32057426          NA
 [9]  0.01536279          NA  0.02200198          NA  0.37535438          NA  1.46606535          NA
[17]  1.44855796          NA -0.05048738

So yes, it maps the values to my new indexes, but I have NA's.  The result I 
want would look like this instead:


 [1] -0.03745988  
 [3] -0.09078822   
 [5] 0.92484413   
 [7] 0.32057426 
 [9]  0.01536279   
 [11] 0.02200198   
 [13] 0.37535438   
 [15] 1.46606535 
 [17]  1.44855796  
 [19] -0.05048738


and remove the NA's.  I tried this with na.omit() on x, but it looks like so:

> x <- rnorm(10)
> myindex <- seq(from = 1,to = 20, by = 2)
> y <- numeric()
> y[myindex] <- na.omit(x)
> y
 [1]  0.87399523          NA -0.39908184          NA  0.14583051          NA  0.01850755          NA
 [9] -0.47413632          NA  0.88410517          NA -1.64939190          NA  0.57650807          NA
[17]  0.44016971          NA -0.56313802

Brian



Re: [R] Assignment of values with different indexes

2012-12-05 Thread Brian Feeny

No, because it does not assign the indexes of myindex.

If it's not possible, which I am assuming it's not, that's OK.  I thought that if 
I had, say, 10 observations, sequentially ordered (or in any order, it doesn't 
matter), and I wanted to assign them specific indexes without having NA's, that 
it was possible.
I am OK with knowing that I can assign them the specific indexes, and that 
there will be empty spots, which are marked NA.  Most functions I would need to 
use can handle NA's by telling the function to ignore them. I appreciate all the 
help that has been given.
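
For anyone reading along, a sketch of three ways to carry arbitrary indices, assuming the same x and myindex as in the original post. A plain vector always has contiguous positions 1..n, which is why the gaps show up as NA. The names y_df and y_named are made up for illustration.

```r
set.seed(1)
x <- rnorm(10)
myindex <- seq(from = 1, to = 20, by = 2)   # 1, 3, ..., 19

# Option 1: keep index and value side by side; no gaps, no NAs.
y_df <- data.frame(index = myindex, value = x)

# Option 2: use the indices as names and look values up by name.
y_named <- setNames(x, myindex)
y_named["7"]                                # value stored under index 7

# Option 3: keep the sparse vector and skip the NAs when computing.
y <- numeric()
y[myindex] <- x
mean(y, na.rm = TRUE)
```

Options 1 and 2 store only the ten real values, so nothing needs to be omitted later. Option 3 is what the thread settled on.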

Brian

On Dec 5, 2012, at 11:49 PM, arun smartpink...@yahoo.com wrote:

 
 
 Hi,
 
 Would it be okay to use:
  y<-na.omit(y[myindex]<-x)
 y
 # [1] -1.36025132 -0.57529211  1.18132359  0.41038489  1.83108252 -0.03563686
  #[7]  1.25267314  1.08311857  1.56973422 -0.30752939
 
 A.K.
 
 




[R] How to re-combine values based on an index?

2012-12-01 Thread Brian Feeny
I am able to split my df into two like so:

dataset <- trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
trainset <- dataset[-testindex,]
testset <- dataset[testindex,-1]

So I have the index information, how could I re-combine the data using that 
back into a single df?

I tried what I thought might work, but failed with:

newdataset[testindex] = testset[testindex]
  object 'dataset' not found
newdataset[-testindex] = trainset[-testindex]
  object 'dataset' not found

Brian



Re: [R] How to re-combine values based on an index?

2012-12-01 Thread Brian Feeny
Thank you for your response,  here is a better example of what I am trying to 
do:

data(iris)
index_setosa <- which(iris$Species == "setosa")
iris_setosa <- data.frame()
iris_setosa[index_setosa,] <- iris[index_setosa,]
iris_others <- data.frame()
iris_others[-index_setosa,] <- iris[-index_setosa,]

So the idea would be that iris_setosa is a dataframe of size 150, with 50 
observations of setosa,
using their original same indices, and 100 observations of NA.  Likewise 
iris_others would be
100 observations of species besides setosa, using their original indices, and 
there would be 50 NA's.

The above doesn't work.  When I execute it, I am left with iris_setosa having 0 
columns, I wish it to have all 
the original columns of iris.

That said, once I get past the above (being able to split them out and keep 
original indices), I wish to be able to combine
iris_setosa and iris_others so that iris_combined is a data frame with no NA's 
and all the original data.

Does this make sense?  So I am basically taking a dataframe, splitting it based 
on some criteria, and working on the two
split dataframes separately, and then I wish to recombine.
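
One possible way to get the layout described above, sketched under the assumption that NA-padded copies are really what is wanted. Starting from a copy of iris, rather than from an empty data.frame(), means the columns already exist to assign into. iris_combined is an illustrative name.

```r
data(iris)
index_setosa <- which(iris$Species == "setosa")

# NA-padded copies that keep every column and the original row positions.
iris_setosa <- iris
iris_setosa[-index_setosa, ] <- NA          # 50 setosa rows, 100 rows of NA
iris_others <- iris
iris_others[index_setosa, ] <- NA           # 100 other rows, 50 rows of NA

# Recombine: start from one piece and fill its gaps from the other.
iris_combined <- iris_setosa
iris_combined[-index_setosa, ] <- iris_others[-index_setosa, ]
```

If the NA padding is not needed, split(iris, iris$Species) followed by unsplit() on the same factor is a shorter route to the same round trip.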

Brian


So at this point, I have iris_setosa a dataframe of size 
On Dec 1, 2012, at 11:34 PM, William Dunlap wrote:

 newdataset[testindex] = testset[testindex]
  object 'dataset' not found
 
 Is that really what R printed?  I get
 newdataset[testindex] = testset[testindex]
  Error in newdataset[testindex] = testset[testindex] : 
object 'newdataset' not found
 but perhaps you have a different problem.  Copy and paste
 (and read) the error message you got.
 
 Bill Dunlap
 Spotfire, TIBCO Software
 wdunlap tibco.com
 
 



[R] Help with this error kernlab class probability calculations failed; returning NAs

2012-11-29 Thread Brian Feeny
I have never been able to get class probabilities to work and I am relatively 
new to using these tools, and I am looking for some insight as to what may be 
wrong.

I am using caret with kernlab/ksvm.  I will simplify my problem to a basic data 
set which produces the same problem.  I have read the caret vignettes as well 
as documentation for ?train.  I appreciate any direction you can give.  I 
realize this is a very small dataset, the actual data is much larger, I am just 
using 10 rows as an example:

trainset <- data.frame( 
  outcome=factor(c(0,1,0,1,0,1,1,1,1,0)),
  age=c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
  amount=c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
)

> str(trainset)
'data.frame':   7 obs. of  3 variables:
 $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
 $ age    : num  23 5 28 48 82 11 9
 $ amount : num  22.2 494.2 2 39.2 39.2 ...

> colSums(is.na(trainset))
outcome     age  amount 
      0       0       0 


## SAMPLING AND FORMULA
dataset <- trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
trainset <- dataset[-testindex,]
testset <- dataset[testindex,-1]


## TUNE caret / kernlab
set.seed(1)
MyTrainControl=trainControl(
  method = "repeatedcv",
  number=10,
  repeats=5,
  returnResamp = "all",
  classProbs = TRUE
)


## MODEL
rbfSVM <- train(outcome~., data = trainset, 
   method="svmRadial",
   preProc = c("scale"),
   tuneLength = 10,
   trControl=MyTrainControl,
   fit = FALSE
)

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In train.default(x, y, weights = w, ...) :
  At least one of the class levels are not valid R variables names; This may 
cause errors if class probabilities are generated because the variables names 
will be converted to: X0, X1
2:  In caret:::predictionFunction(method = method, modelFit = mod$fit,  ... :
  kernlab class prediction calculations failed; returning NAs



Re: [R] Help with this error kernlab class probability calculations failed; returning NAs

2012-11-29 Thread Brian Feeny


Yes I am still getting this error, here is my sessionInfo:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] e1071_1.6-1     class_7.3-5     kernlab_0.9-14  caret_5.15-045  foreach_1.4.0   cluster_1.14.3 
[7] reshape_0.8.4   plyr_1.7.1      lattice_0.20-10

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2     iterators_1.0.6 tools_2.15.2   


Is there an example that uses classProbs?  I could try to run it to see 
whether it works on my system.

Brian

On Nov 29, 2012, at 10:10 PM, Max Kuhn mxk...@gmail.com wrote:

 You didn't provide the results of sessionInfo().
 
 Upgrade to the version just released on cran and see if you still have the 
 issue.
 
 Max
 
 




Re: [R] Help with this error kernlab class probability calculations failed; returning NAs

2012-11-29 Thread Brian Feeny
Max,

Thank you for the assistance.  That was it.  My dependent variable was just 
using 1 and 0 as levels; I changed them to yes and no:

levels(trainset$outcome) <- list(no="0", yes="1")

and I no longer get the warning.
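
A small base R sketch of why the rename helps, using only the outcome values from the example data. make.names() shows how level names that are not valid R variable names get rewritten, which matches the "X0, X1" in the warning. The variable names before, converted and valid are illustrative.

```r
outcome <- factor(c(0, 1, 0, 1, 0, 1, 1, 1, 1, 0))

before <- levels(outcome)            # "0" "1": not syntactically valid R names
converted <- make.names(before)      # "X0" "X1": the names the warning mentions

# Rename explicitly so the labels stay meaningful.
levels(outcome) <- list(no = "0", yes = "1")

# A level set is safe when make.names() leaves it unchanged.
valid <- all(levels(outcome) == make.names(levels(outcome)))
```

Running the same check on any outcome factor before calling train() with classProbs = TRUE should flag this problem early.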

Brian


On Nov 29, 2012, at 10:29 PM, Max Kuhn mxk...@gmail.com wrote:

 Your output has:
 
 At least one of the class levels are not valid R variables names; This may 
 cause errors if class probabilities are generated because the variables names 
 will be converted to: X0, X1
 
 Try changing the factor levels to avoid leading numbers and try again.
 
 Max
 
 
 
 




Re: [R] What is the . in formula ~. syntax?

2012-11-23 Thread Brian Feeny

Thank you!  I searched in the manual, but I did not see where this is 
mentioned, I looked under operators 
and in some of the formula documentation.
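
For reference, the dot is described in ?formula: when a data argument is supplied it means all columns not otherwise in the formula. A short sketch with a made-up data frame (df and its values are illustrative) shows the expansion via terms():

```r
df <- data.frame(y  = c(1.2, 2.3, 3.1, 4.8),
                 x1 = c(1, 2, 3, 4),
                 x2 = c(0.5, 0.1, 0.9, 0.3))

# terms() expands the dot against the data: every column not already used.
expanded <- formula(terms(y ~ ., data = df))
expanded                                   # y ~ x1 + x2

# The dot combines with other operators, e.g. everything except x2.
without_x2 <- attr(terms(y ~ . - x2, data = df), "term.labels")
```

Without a data argument the dot has nothing to expand against, which is why it only works inside calls that are given the data frame.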

Brian

On Nov 23, 2012, at 3:15 AM, Michael Weylandt michael.weyla...@gmail.com 
wrote:

 
 
 On Nov 23, 2012, at 4:26 AM, Brian Feeny bfe...@mac.com wrote:
 
 I know if I have a dataframe with columns y, x1, x2 and I wish to have y as 
 my y value and x1 and x2 as x values I can do:
 y ~ x1 + x2
 
 or 
 
 y ~.
 
 but can someone explain what . actually is or what its transposed into?
 
 Everything not already stated. 
 
 rmw
 
 
 I searched for this with no success, reading the formula manual pages.
 
 Brian
 


[R] caret train and trainControl

2012-11-23 Thread Brian Feeny

I am used to packages like e1071 where you have a tune step and then pass your 
tunings to train.

It seems with caret, tuning and training are both handled by train.

I am using train and trainControl to find my hyper parameters like so:

MyTrainControl=trainControl(
  method = "cv",
  number=5,
  returnResamp = "all",
  classProbs = TRUE
)

rbfSVM <- train(label~., data = trainset, 
   method="svmRadial",
   tuneGrid = expand.grid(.sigma=c(0.0118),.C=c(8,16,32,64,128)),  
   trControl=MyTrainControl,
   fit = FALSE
)

Once this returns my ideal parameters, in this case a cost of 64, do I simply 
re-run the whole process, passing a grid containing only those specific 
parameters? Like so:


rbfSVM <- train(label~., data = trainset, 
   method="svmRadial",
   tuneGrid = expand.grid(.sigma=0.0118,.C=64),  
   trControl=MyTrainControl,
   fit = FALSE
)

This is what I have been doing but I am new to caret and want to make sure I am 
doing this correctly.

Brian



Re: [R] caret train and trainControl

2012-11-23 Thread Brian Feeny

Max, 

Thanks, I do understand that the final model is fitted.  I think I was not 
clear in my posting.  I am changing datasets between tuning and real training.  
So maybe I tune on trainset when it's only 5000 rows, doing my grid search and 
all that, and then once I have the hyperparameters, I will use an increased 
trainset size, say 25000.

So between tuning and training, I am modifying the trainset, specifically 
making it bigger.  So I have to re-fit the model, and I guess what I am trying 
to make sure of is that the only thing I need to carry over, so to speak, is 
the hyperparameters I was looking for.  Is this correct?  That is what I am 
doing: simply passing a grid with my 2 specific hyperparameters, to avoid it 
doing any type of search.
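
A hedged sketch of that refit step, not confirmed anywhere in this thread. It assumes a recent caret release, where trainControl(method = "none") skips resampling and the grid columns are named sigma and C without the leading dots used by the 2012 version in these posts. iris stands in for the larger training set, and best and final_fit are illustrative names.

```r
# Hypothetical values carried over from the tuning run on the smaller sample.
best <- data.frame(sigma = 0.0118, C = 64)   # older caret used .sigma and .C

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("kernlab", quietly = TRUE)) {
  data(iris)
  bigger_trainset <- iris                    # stand-in for the larger training set
  final_fit <- caret::train(
    Species ~ ., data = bigger_trainset,
    method    = "svmRadial",
    tuneGrid  = best,                        # a single row: nothing left to search
    trControl = caret::trainControl(method = "none")  # fit once, no resampling
  )
}
```

Keeping a cross-validation control with the one-row grid would also work; it fits the same final model but additionally reports a resampled performance estimate on the larger data.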



Brian

On Nov 23, 2012, at 6:06 PM, Max Kuhn wrote:

 Brian,
 
 This is all outlined in the package documentation. The final model is fit 
 automatically. For example, using 'verboseIter' provides details. From ?train
 
 > knnFit1 <- train(TrainData, TrainClasses,
 +                  method = "knn",
 +                  preProcess = c("center", "scale"),
 +                  tuneLength = 10,
 +                  trControl = trainControl(method = "cv", verboseIter = TRUE))
 + Fold01: k= 5 
 - Fold01: k= 5 
 + Fold01: k= 7 
 - Fold01: k= 7 
 + Fold01: k= 9 
 - Fold01: k= 9 
 + Fold01: k=11 
 - Fold01: k=11 
 <snip>
 + Fold10: k=17 
 - Fold10: k=17 
 + Fold10: k=19 
 - Fold10: k=19 
 + Fold10: k=21 
 - Fold10: k=21 
 + Fold10: k=23 
 - Fold10: k=23 
 Aggregating results
 Selecting tuning parameters
 Fitting model on full training set
 
 
 
 Max
 
 
 




[R] Building factors across two columns, is this possible?

2012-11-23 Thread Brian Feeny

I am trying to make it so two columns with similar data use the same internal 
numbers for same factors, here is the example:

> read.csv("test.csv",header =FALSE,sep=",")
     V1    V2       V3
1   sun  moon    stars
2 stars  moon      sun
3   cat   dog   catdog
4   dog  moon      sun
5  bird plane superman
6  1000   dog     2000
> data <- read.csv("test.csv",header =FALSE,sep=",")
> str(data)
'data.frame':   6 obs. of  3 variables:
 $ V1: Factor w/ 6 levels "1000","bird",..: 6 5 3 4 2 1
 $ V2: Factor w/ 3 levels "dog","moon","plane": 2 2 1 2 3 1
 $ V3: Factor w/ 5 levels "2000","catdog",..: 3 4 2 4 5 1

> as.numeric(data$V1)
[1] 6 5 3 4 2 1
> as.numeric(data$V2)
[1] 2 2 1 2 3 1
> as.factor(data$V1)
[1] sun   stars cat   dog   bird  1000 
Levels: 1000 bird cat dog stars sun
> as.factor(data$V2)
[1] moon  moon  dog   moon  plane dog  
Levels: dog moon plane


So notice dog is 4 in V1, yet it's 1 in V2.  Is there a way, either on import 
or after, to have factors computed for both columns and assigned
the same internal values?

Brian



Re: [R] Building factors across two columns, is this possible?

2012-11-23 Thread Brian Feeny

To clarify on my previous post, here is a representation of what I am trying to 
accomplish:

I would like every unique value in either column to be assigned a number so 
like so:

     V1    V2       V3
1   sun  moon    stars
2 stars  moon      sun
3   cat   dog   catdog
4   dog  moon      sun
5  bird plane superman
6  1000   dog     2000

Level        Value
sun      ->   0
stars    ->   1
cat      ->   2
dog      ->   3
bird     ->   4
1000     ->   5
moon     ->   6
plane    ->   7
catdog   ->   8
superman ->   9
2000     ->  10
etc
etc

so internally it's represented as:

  V1 V2 V3
1  0  6  1
2  1  6  0
3  2  3  8
4  3  6  0
5  4  7  9
6  5  3 10

does this make sense?  I am hoping there is a way to accomplish this.
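
One way this might be done, as a sketch rather than a definitive answer: build a single level set from all columns and rebuild each column as a factor over it. The data frame is typed in directly instead of read from test.csv, and the names shared_levels, df_f and codes are made up for illustration.

```r
df <- data.frame(
  V1 = c("sun", "stars", "cat", "dog", "bird", "1000"),
  V2 = c("moon", "moon", "dog", "moon", "plane", "dog"),
  V3 = c("stars", "sun", "catdog", "sun", "superman", "2000"),
  stringsAsFactors = FALSE
)

# One level set for all columns, in order of first appearance (column by column).
shared_levels <- unique(unlist(df, use.names = FALSE))

# Rebuild every column as a factor over the shared levels.
df_f <- df
df_f[] <- lapply(df, factor, levels = shared_levels)

# R's integer codes start at 1; subtract 1 to get the zero-based table above.
codes <- sapply(df_f, as.integer) - 1L
```

If the columns were already read in as factors, converting them with as.character() first keeps the original strings rather than the per-column codes.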

Brian



[R] What is the . in formula ~. syntax?

2012-11-22 Thread Brian Feeny
I know that if I have a data frame with columns y, x1 and x2, and I want y as 
the response and x1 and x2 as the predictors, I can write:
y ~ x1 + x2

or 

y ~ .

but can someone explain what . actually is, or what it expands to?

I searched for this and read the formula manual pages, with no success.
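
For what it is worth, ?formula describes the usual meaning of . (when a data argument is supplied) as "all columns not otherwise in the formula". A small sketch that makes the expansion visible with terms(); the data frame df and its values are made up for illustration:

```r
# Made-up data frame with a response y and two predictors
df <- data.frame(
  y  = c(1, 2, 3, 4),
  x1 = c(2, 4, 5, 9),
  x2 = c(1, 0, 1, 0)
)

# terms() resolves "." against the columns of df
expanded <- terms(y ~ ., data = df)
print(attr(expanded, "term.labels"))

# Both formulas should therefore fit the same model
fit_dot      <- lm(y ~ ., data = df)
fit_explicit <- lm(y ~ x1 + x2, data = df)
print(all.equal(coef(fit_dot), coef(fit_explicit)))
```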

Brian



[R] Scaling values 0-255 -> -1 , 1 - how can this be done?

2012-11-21 Thread Brian Feeny

I have a data frame whose values range from 0 to 255. I wish to transform them 
such that:

if value > 127.5 value = 1
if value < 127.5 value = -1

I did something similar using the binarize function of the biclust package, 
which transforms my data frame to 0 and 1 values, but I want to use -1 and 1 
and am looking for a way to do this in R.
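
A minimal sketch of one way to do this in base R with ifelse(), assuming that values above 127.5 should become 1 and everything else -1 (the direction of the comparison is an assumption). The data frame pixels is a made-up stand-in for the real data:

```r
# Made-up stand-in for a data frame of 0-255 values
pixels <- data.frame(p1 = c(0, 200, 127), p2 = c(255, 10, 128))

# Assumption: above the 127.5 threshold maps to 1, otherwise -1
scaled <- pixels
scaled[] <- lapply(pixels, function(col) ifelse(col > 127.5, 1, -1))

print(scaled)
```

Assigning into scaled[] keeps the data frame structure and column names.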

Brian



Re: [R] cluster analysis in R

2012-11-21 Thread Brian Feeny


http://cran.r-project.org/web/views/Cluster.html

might be a good start

Brian

On Nov 21, 2012, at 1:36 PM, KitKat wrote:

 Thank you for replying! 
 I made a new post asking if there are any websites or files on how to
 download package mclust (or other Bayesian cluster analysis packages) and
 the appropriate R functions? Sorry I don't know how this forum works yet
 
 
 


[R] Using doMC to run parallel SVM grid search?

2012-11-21 Thread Brian Feeny
Has anyone used doMC to speed up an SVM grid search?  I am considering doing 
something like this:

library(doMC)
registerDoMC()
foreach (i=0:3) %dopar% {
    tuned_part1 <- tune.svm(label~., data = trainset, gamma = 10^(-10:-6),
                            cost = 10^(-1:1))
    tuned_part2 <- tune.svm(label~., data = trainset, gamma = 10^(-5:0),
                            cost = 10^(-1:1))
    tuned_part3 <- tune.svm(label~., data = trainset, gamma = 10^(1:-5),
                            cost = 10^(-1:1))
    tuned_part4 <- tune.svm(label~., data = trainset, gamma = 10^(5:10),
                            cost = 10^(-1:1))
}


I have a quad-core processor, so if I understand correctly the above could 
split the work across the cores.

My goal is a coarse grid search. I am not sure the parameters above are good 
for that; they just seemed like reasonable starting points.

I would just look at each of the results manually, although it would be nice 
if a variable ended up holding the best values. 

Has anyone used doMC for something like this?  Is there a better library than 
doMC for splitting an SVM grid search over multiple cores?
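
A hedged sketch of how the split might be arranged. As written above, the loop body never uses i, so each of the four iterations would run all four tune.svm() calls, and values assigned inside %dopar% are not visible afterwards because foreach returns results rather than modifying the calling environment. The version below gives each task one chunk of the gamma grid and combines the returned performance tables. It assumes the e1071, foreach and doMC packages are installed; iris stands in for trainset, and the gamma range and core count are arbitrary choices for illustration.

```r
library(e1071)    # tune.svm()
library(foreach)  # foreach(), %dopar%
library(doMC)     # registerDoMC()

registerDoMC(cores = 2)  # assumed core count; adjust to the machine

# Stand-in data: iris with the class column renamed to label
trainset <- iris
names(trainset)[names(trainset) == "Species"] <- "label"

# Split one coarse gamma grid into four chunks, one chunk per task
gammas <- 10^(-6:1)
gamma_chunks <- split(gammas, rep(1:4, each = 2))

# Each task tunes over its own chunk and returns its performance table
results <- foreach(g = gamma_chunks, .combine = rbind) %dopar% {
  tuned <- tune.svm(label ~ ., data = trainset, gamma = g, cost = 10^(-1:1))
  tuned$performances
}

# Row with the lowest cross-validation error across all chunks
best <- results[which.min(results$error), ]
print(best)
```

The best row then holds the gamma and cost to use for a finer search around those values.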

Brian


[R] data after write() is off by 1 ?

2012-11-20 Thread Brian Feeny
I am new to R, so I am sure I am making a simple mistake.  I am including 
complete information in the hope that someone can help me.

Basically my data in R looks good, but when I write it to a file every value 
is off by 1.

Here is my flow:

> str(prediction)
 Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
 - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
> print(prediction)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
 2  0  9  9  3  7  0  3  0  3  5  7  4  0  4  3  3  1  9  0  9  1  1 

OK, so it shows my values are 2, 0, 9, 9, 3, etc.

# I write my file out
> write(prediction, file="prediction.csv")

# look at the first 10 values
$ head -10 prediction.csv 
3 1 10 10 4
8 1 4 1 4
6 8 5 1 5
4 4 2 10 1
10 2 2 6 8
5 3 8 5 8
8 6 5 3 7
3 6 6 2 7
8 8 5 10 9
8 9 3 7 8

The complete work of what I did was as follows:

# First I load in a dataset and convert the first column (label) to a factor
> dataset <- read.csv('train.csv',head=TRUE)
> dataset$label <- as.factor(dataset$label)

# it has 42000 obs. of 785 variables
> str(dataset)
'data.frame':   42000 obs. of  785 variables:
 $ label   : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
 $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
  [list output truncated]

# I make a sampling testset and trainset
> index <- 1:nrow(dataset)
> testindex <- sample(index, trunc(length(index)*30/100))
> testset <- dataset[testindex,]
> trainset <- dataset[-testindex,]

# build model, predict, view
> model  <- svm(label~., data = trainset, type="C-classification", 
  kernel="radial", gamma=0.001, cost=16)
> prediction <- predict(model, testset)
> tab <- table(pred = prediction, true = testset[,1])
    true
pred    0    1    2    3    4    5    6    7    8    9
   0 1210    0    3    1    0    5    7    2    5    8
   1    0 1415    2    0    2    1    0    7    5    0
   2    0    2 1127   12    3    0    2    7    2    0
   3    0    0    7 1296    0   10    0    2   15    6
   4    1    1    8    2 1201    2    4    3    5   16
   5    3    1    0   13    0 1100    3    1    2    3
   6    3    0    3    0    5    9 1263    0    1    0
   7    0    2    9    6    6    1    0 1296    1   13
   8    3    5    7   11    1    2    0    2 1190    4
   9    1    1    2    3   17    2    0    4    4 1190


OK, everything looks great up to this point, so I try to apply my model 
to a real testset, which is in the same format as my previous dataset 
except that it does not have the label/factor column, so it is 28000 obs. of 
784 variables:

> testset <- read.csv('test.csv',head=TRUE)
> str(testset)
'data.frame':   28000 obs. of  784 variables:
 $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
  [list output truncated]

> prediction <- predict(model, testset)
> summary(prediction)
   0    1    2    3    4    5    6    7    8    9 
2780 3204 2824 2767 2771 2516 2744 2898 2736 2760 
> print(prediction)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
 2  0  9  9  3  7  0  3  0  3  5  7  4  0  4  3  3  1  9  0  9  1  1 
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 
 5  7  4  2  7  4  7  7  5  4  2  6  2  5  5  1  6  7  7  4  9  8  7 
  [list output truncated]

> write(prediction, file="prediction.csv")
$ head -10 prediction.csv 
3 1 10 10 4
8 1 4 1 4
6 8 5 1 5
4 4 2 10 1
10 2 2 6 8
5 3 8 5 8
8 6 5 3 7
3 6 6 2 7
8 8 5 10 9
8 9 3 7 8


I am obviously making a mistake.  Everything is off by a value of 1.


Can someone tell me what I am doing wrong?

Brian





Re: [R] data after write() is off by 1 ?

2012-11-20 Thread Brian Feeny

A follow-up to my own post: I believe I figured this out, but if I should be 
doing something different, please correct me:

> prediction.out <- levels(prediction)[prediction]
> write(prediction.out, file="prediction.csv")

This gives me the correct values.
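
A small self-contained sketch of what seems to be going on: write() hands its argument to cat(), and cat() prints a factor's internal integer codes rather than its labels. Because the labels here are "0" to "9" and the codes are 1 to 10, everything looks shifted by one. The factor below reproduces the first five predictions; out_file is a temporary file used only for illustration.

```r
# First five predictions from the post, as a factor with levels "0".."9"
prediction <- factor(c("2", "0", "9", "9", "3"), levels = as.character(0:9))

# The internal codes are positions in the level set, not the labels
print(as.integer(prediction))

# Converting to character first writes the labels themselves
out_file <- tempfile(fileext = ".csv")
write(as.character(prediction), file = out_file)
print(readLines(out_file))

# If numbers are needed in R, go through character, not straight to numeric
digits <- as.numeric(as.character(prediction))
print(digits)
```

as.character(prediction) gives the same values as levels(prediction)[prediction]; either form avoids writing the codes.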

Brian


[R] Removing columns that are na or constant

2012-11-20 Thread Brian Feeny

I have a dataset that has many columns which are all NA or constant, so I 
remove them like so:


same <- sapply(dataset, function(.col){ 
  all(is.na(.col))  || all(.col[1L] == .col) 
}) 
dataset <- dataset[!same] 

This works GREAT (I found it in the r-users list archive).

However, when I then do my data sampling like so:

testSize <- floor(nrow(x) * 10/100)
test <- sample(1:nrow(x), testSize)

train_data <- x[-test,]
test_data <- x[test, -1]
test_class <- x[test, 1]

it is now possible that test_data or train_data contains columns that are 
constant, even though they were not constant in the full dataset.

So the solution for me is to just re-run the lines that remove all constants. 
Not a problem, but is this normal?  Is this how I should be handling this in 
R?  Many models I am attempting to use (SVM, lda, etc.) don't like it if a 
column has all the same value. As a beginner, this is how I am handling it in 
R, but I am looking for someone to sanity check that what I am doing is sound.
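
One possible arrangement, sketched under the assumption that the model is fitted on the training rows only: decide which columns are constant from the training set, then drop those same columns from both sets so they keep identical columns. The helper is_constant and the toy data are illustrative, and the split uses fixed row indices so the result is reproducible.

```r
# TRUE when a column has at most one distinct non-NA value
is_constant <- function(col) {
  length(unique(col[!is.na(col)])) <= 1
}

# Toy data: f2 varies only in row 20, f3 is entirely NA
set.seed(1)
x <- data.frame(
  label = factor(rep(c("a", "b"), 10)),
  f1    = rnorm(20),
  f2    = c(rep(0, 19), 1),
  f3    = NA
)

# Fixed test rows for illustration (normally drawn with sample())
test_idx   <- c(5, 20)
train_data <- x[-test_idx, ]
test_data  <- x[test_idx, ]

# Decide on the training rows, then apply the same choice to both sets
drop_cols  <- vapply(train_data, is_constant, logical(1))
keep       <- names(train_data)[!drop_cols]
train_data <- train_data[keep]
test_data  <- test_data[keep]

print(names(train_data))
```

Here f2 is dropped because it is constant within the training rows, even though it is not constant in the full data.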

Brian



Re: [R] e1071 SVM: Cross-validation error confusion matrix

2012-11-20 Thread Brian Feeny
Responding to my own question: I see that the ?svm help page states that 
fitted() and predict() can do the same thing:

# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)



On Nov 21, 2012, at 1:08 AM, signal bfe...@mac.com wrote:

 Did you ever receive a response to this? I did not see one public.
 
 I would think that if your dataset was of a large enough size, that 10-fold
 validation would show an improvement over N:N.
 
 Also, any ideas if there is any difference really in using fitted() vs.
 predict() in your second step? I am pretty sure they do the same thing.  
 
 Brian
 
 
 


Re: [R] How to subset my data and at the same time keep the balance?

2012-11-19 Thread Brian Feeny

Just curious: once you have a model that works well, does it make sense to 
then tune it against 100% of the dataset (with known outcomes) so you can 
apply it to data you wish to predict for, or is that a bad approach?

I have done what is explained in this thread many times: taken a sample, 
trained on it, and then tested on the remainder.  But this uses data for 
which we know the predicted variable and can compare to validate.  So after 
you're done, should you re-tune with the entire training set?

As for which method, I am mostly using SVM.
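
To make the question concrete, here is a sketch of the pattern being asked about: estimate the error on a held-out 30%, then refit with the same settings on all labelled rows before predicting new data. It uses lm() on the built-in mtcars data purely as a stand-in for an SVM, and the formula and seed are arbitrary. Whether the refit is advisable is exactly what is being asked, so this only shows the mechanics.

```r
set.seed(42)
dat <- mtcars

# 70/30 split of the labelled data
train_idx <- sample(nrow(dat), floor(nrow(dat) * 0.7))
train     <- dat[train_idx, ]
holdout   <- dat[-train_idx, ]

# Fit on the 70% and estimate error on the held-out 30%
fit <- lm(mpg ~ wt + hp, data = train)
holdout_rmse <- sqrt(mean((holdout$mpg - predict(fit, holdout))^2))
print(holdout_rmse)

# Refit with the same specification on all labelled rows
final_fit <- lm(mpg ~ wt + hp, data = dat)
print(coef(final_fit))
```

The held-out error then serves as the performance estimate, since the final model has no untouched data left to test on.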

Brian

On Nov 19, 2012, at 2:07 PM, Eddie Smith eddie...@gmail.com wrote:

 Thanks a lot! I got some ideas from all the replies and here is the final one.
 
 newdata
 
 select <- sample(nrow(newdata), nrow(newdata) * .7)
 data70 <- newdata[select,]  # select
 write.csv(data70, "data70.csv", row.names=FALSE)
 
 data30 <- newdata[-select,]  # testing
 write.csv(data30, "data30.csv", row.names=FALSE)
 
 Cheers
 


Re: [R] help interpreting dudi.pco

2012-11-18 Thread Brian Feeny
I am new to R as well, but it sounds like you would want to look at 
clustering, perhaps k-means clustering.


Brian

On Nov 18, 2012, at 12:19 AM, avadhoot velankar avadhoot.velan...@gmail.com 
wrote:

 I am working on morphometry of hairs and want to see if selected variables
 are giving significantly distinct groups.
 
 I am new to both R and principal coordinate analysis. The scatter plot is
 showing distinct groups, but I don't know how to refine the analysis and
 interpret it.
 


[R] Best prediction to use for basic problem?

2012-11-18 Thread Brian Feeny
I have a rather basic set of data.  It is simply a variable that can be 0, 1 or 
2, and its value over a series of times t0 - t9, like so:

y:  1   1   2   0   1   2   2   1   2   1
x:  t0  t1  t2  t3  t4  t5  t6  t7  t8  t9

I need to predict what the value of y will be at t10 through t13.

As you can see, it's rather basic.  I am rather new to solving these types of 
problems, so I am looking for some good, straightforward things to try.

My research into this (Google, wikis, etc.) leads me to believe that perhaps 
logistic regression would be good, since I am predicting a categorical 
variable (0, 1, 2).  

I don't have much data for the model to learn from, as I only have 10 time 
slots and I need to predict the next 4.

Is logistic regression a good candidate, or should I perhaps be looking at 
something else?


Brian



Re: [R] Using cbind to combine data frames and preserve header/names

2012-11-17 Thread Brian Feeny

David and Rainer, 

Thank you both for your responses. You got me on track, and I ended up just 
doing this:

trainset <- read.csv('train.csv',head=TRUE)
trainset[,-1] <- binarize(trainset[,-1])
trainset$label <- as.factor(trainset$label)

I appreciate your help.
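
A runnable sketch of why this keeps the header: assigning into trainset[,-1] replaces the values of the existing columns in place, so the column names are never rebuilt. To avoid depending on biclust, binarize_stub below is a hypothetical stand-in that thresholds at 127.5; the real binarize() may choose its threshold differently. The toy data frame is made up.

```r
# Hypothetical stand-in for biclust's binarize(): 1 above the threshold, else 0
binarize_stub <- function(m, threshold = 127.5) {
  (as.matrix(m) > threshold) * 1
}

# Made-up stand-in for train.csv
trainset <- data.frame(
  label  = c(3, 7),
  pixel0 = c(0, 255),
  pixel1 = c(200, 12)
)

# Overwrite every column except the first, in place
trainset[, -1] <- binarize_stub(trainset[, -1])
trainset$label <- as.factor(trainset$label)

print(names(trainset))
print(trainset)
```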

Brian

On Nov 17, 2012, at 11:25 AM, David Winsemius wrote:

> On Nov 16, 2012, at 9:39 PM, Brian Feeny wrote:
> 
>> I have a dataframe that has a header like so:
>> 
>> class   value1  value2  value3
>> 
>> class is a factor
>> 
>> the actual values in the columns value1, value2 and value3 are 0-255, I wish 
>> to binarize these using biclust.
>> I can do this like so:
>> 
>> binarize(dataframe[,-1])
>> 
>> this will return a dataframe, but then I lose my first column class, so I 
>> thought I could combine it like so:
>> 
>> dataframe <- cbind(dataframe$label, binarize(dataframe[,-1]))
> 
> There is no column with the name label. There is also no function named 
> label in base R although I cannot speak about biclust. Even if there were, 
> you cannot apply functions to data.frames with the $ function.
> 
>> but then I lose my header (names).  How can I do the above 
>> operation and keep my header intact?
>> 
>> Basically I just want to binarize everything but the first column (since it's 
>> a factor column and not numeric).
> 
> I have no idea how 'binarize' works but if you wanted to 'defactorize' a 
> factor then you should learn to use 'as.character' to turn factors into 
> character vectors. Perhaps:
> 
> dfrm <- cbind( as.character(dataframe[1]), binarize(dataframe[,-1]))
> 
> You should make sure this is still a dataframe since cbind.default returns a 
> matrix and this would be a character matrix. I'm taking your word that the 
> second argument is a dataframe, and that would mean the cbind.data.frame 
> method would be dispatched.
> 
> It is a rather unfortunate practice to call your dataframes dataframe and 
> also bad to name your columns class since the first is a fundamental term 
> and the second a basic function. If you persist, people will start talking to 
> you about dogs named Dog.
> 
> 
> David Winsemius, MD
> Alameda, CA, USA



[R] Strange problem with reading a pipe delimited file

2012-11-17 Thread Brian Feeny
I am trying to read in a pipe-delimited file whose rows have varying numbers 
of columns. Here is my sample data:

A|B|C|D
A|B|C|D|E|F
A|B|C|D|E
A|B|C|D|E|F|G|H|I
A|B|C|D
A|B|C|D|E|F|G|H|I|J

You can see line 6 has 10 columns.  Yet I can't explain why R does this:

> test <- read.delim("mypaths4.txt", sep="|", quote=NULL, header=F, 
  colClasses="character")
> test
  V1 V2 V3 V4 V5 V6 V7 V8 V9
1  A  B  C  D
2  A  B  C  D  E  F
3  A  B  C  D  E
4  A  B  C  D  E  F  G  H  I
5  A  B  C  D
6  A  B  C  D  E  F  G  H  I
7  J

You can see it moved J to row 7. I don't understand why it is not left in 
position [6,10].

Stranger still, if I remove line 1, so that my data file contains:

A|B|C|D|E|F
A|B|C|D|E
A|B|C|D|E|F|G|H|I
A|B|C|D
A|B|C|D|E|F|G|H|I|J

I get a totally different result:

> test <- read.delim("mypaths5.txt", sep="|", quote=NULL, header=F, 
  colClasses="character")
> test
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  A  B  C  D  E  F
2  A  B  C  D  E
3  A  B  C  D  E  F  G  H  I
4  A  B  C  D
5  A  B  C  D  E  F  G  H  I   J

What is it that I am doing that changes the fate of that final J?  This 
is just a basic ASCII text file, pipe delimited as shown.

I have been racking my brain on this for a day!

Brian



Re: [R] Strange problem with reading a pipe delimited file

2012-11-17 Thread Brian Feeny

On Nov 17, 2012, at 4:27 PM, Duncan Murdoch wrote:

> I would suggest reading the help file: read.delim only looks at the first 5 
> lines to determine the number of columns if you don't specify the colClasses.
> 
> Duncan Murdoch

Duncan, 

I have tried to pass colClasses but R complains:

> test <- read.delim("mypaths4.txt", sep="|", quote=NULL, header=F, 
  colClasses=c("character","character","character","character","character",
               "character","character","character","character","character"),
  fill=TRUE)
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  cols = 9 != length(data) = 10


once again this is with data:

A|B|C|D
A|B|C|D|E|F
A|B|C|D|E
A|B|C|D|E|F|G|H|I
A|B|C|D
A|B|C|D|E|F|G|H|I|J

Any idea?

Brian




Re: [R] Strange problem with reading a pipe delimited file

2012-11-17 Thread Brian Feeny
Duncan,

I believe I follow you now. I have done the following, with the expected 
results:

ncol <- max(count.fields("paths.txt", sep = "|"))
test <- read.delim("paths.txt", sep="|", quote=NULL, header=F, 
  colClasses="character", fill=TRUE, 
  col.names = paste("V", seq_len(ncol), sep = ""))
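
For anyone reproducing this, a self-contained sketch of the same approach. According to ?read.table, the number of columns is taken from the first five lines of input unless col.names is supplied and is longer, which is why counting the fields first and passing col.names settles it (colClasses alone does not change the count). The temporary file, the variable names, and the use of quote = "" are choices made for this sketch only.

```r
# Recreate the sample file from this thread in a temporary location
sample_lines <- c(
  "A|B|C|D",
  "A|B|C|D|E|F",
  "A|B|C|D|E",
  "A|B|C|D|E|F|G|H|I",
  "A|B|C|D",
  "A|B|C|D|E|F|G|H|I|J"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Count the fields on every line, not only the first five
n_fields <- count.fields(path, sep = "|", quote = "")
n_col    <- max(n_fields)

# Supplying col.names of the full width fixes the column count up front
fixed <- read.delim(path, sep = "|", quote = "", header = FALSE,
                    colClasses = "character", fill = TRUE,
                    col.names = paste0("V", seq_len(n_col)))

print(n_fields)
print(dim(fixed))
print(fixed[6, "V10"])
```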


Thank you for your help

Brian



[R] library/function to compare two phrases?

2012-11-17 Thread Brian Feeny
I am looking for a library/function in R that can compare two phrases and give 
me a score, or otherwise match them up as accurately as possible.

The phrases are obfuscated/messy.  I am not concerned about which one is 
"correct" (for example, spell checking); I only want to group them so that I 
know which are the closest match.

Example:

I have ROW1 and ROW2 like so:

ROW1                  ROW2
hamburger helper      bigmc heartkcatta
chicken nuggets       chicke, nuggets, jss
bigmac heartattack    some sombody somehwere
somebody somehwere    repleh regrubmah

I am looking for something that can tell me that the best match for "hamburger 
helper" is "repleh regrubmah", and the same for each other row.

So my goal is to write a program that, for each phrase in ROW1, runs this 
function against ROW2 and gives me the phrase that scored best.

I have read over many of the NLP packages at 
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

I thought lsa might be a good fit, but I am not sure.  I have limited time, so 
I am hoping someone can point me in the right direction.

I have been searching for "text classifiers"; perhaps this problem is referred 
to as something else.

Brian



Re: [R] library/function to compare two phrases?

2012-11-17 Thread Brian Feeny

Thank you Michael and David.  I am onto agrep and adist and they look very 
useful for what I am wanting to do.  My initial results are promising!
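
A rough sketch of the adist() approach on the example phrases, in case it helps someone searching the archive: adist() (in the utils package) returns a matrix of generalized Levenshtein distances, and the smallest distance in each row picks the closest candidate. The variable names are illustrative. Plain edit distance has no notion of reversed or scrambled text, so phrases like "repleh regrubmah" would probably need extra preprocessing that is not shown here.

```r
row1 <- c("hamburger helper", "chicken nuggets",
          "bigmac heartattack", "somebody somehwere")
row2 <- c("bigmc heartkcatta", "chicke, nuggets, jss",
          "some sombody somehwere", "repleh regrubmah")

# Edit distances: one row per phrase in row1, one column per phrase in row2
d <- adist(row1, row2)

# For each phrase in row1, take the row2 phrase with the smallest distance
best_match <- row2[apply(d, 1, which.min)]

print(d)
print(data.frame(phrase = row1, closest = best_match))
```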

Brian

On Nov 17, 2012, at 6:20 PM, R. Michael Weylandt wrote:

 
 This is outside my expertise, but if memory serves, you might benefit
 from googling the Levenshtein (spelling?) distance which allows this
 sort of fuzzy matching of strings.
 
 MW



[R] Using cbind to combine data frames and preserve header/names

2012-11-16 Thread Brian Feeny
I have a data frame that has a header like so:

class   value1  value2  value3

class is a factor

The actual values in the columns value1, value2 and value3 are 0-255, and I 
wish to binarize these using biclust.
I can do this like so:

binarize(dataframe[,-1])

This will return a data frame, but then I lose my first column, class, so I 
thought I could combine it like so:

dataframe <- cbind(dataframe$label, binarize(dataframe[,-1]))

but then I lose my header (names).  How can I do the above operation 
and keep my header intact?

Basically I just want to binarize everything but the first column (since it is 
a factor column and not numeric).

Thank you for any help you can give me. I am relatively new to R.

Brian
