[R] splitting dataset based on variable and re-combining
I have a dataset and I wish to use two different models to predict. Both models are SVM; the reason for two different models is the sex of the observation. I want to make predictions and have the results come back in the same order as my original dataset. To illustrate I will use iris:

# Take iris, keep just two species (setosa and versicolor), and shuffle
data(iris)
iris <- iris[(iris$Species == "setosa" | iris$Species == "versicolor"), ]
irisindex <- sample(1:nrow(iris), nrow(iris))
iris <- iris[irisindex, ]

# Make predictions on setosa using mySetosaModel, and on versicolor using myVersicolorModel:
predict(mySetosaModel, iris[iris$Species == "setosa", ])
predict(myVersicolorModel, iris[iris$Species == "versicolor", ])

The problem is that this gives me one vector of just the setosa results and another of just the versicolor results. I want the results in the same order as the original dataset. So if the original dataset had:

Species
setosa
setosa
versicolor
setosa
versicolor
setosa

I wish for my results to have:

prediction for setosa
prediction for setosa
prediction for versicolor
prediction for setosa
prediction for versicolor
prediction for setosa

Instead I end up with two result sets and no way I can think of to combine them. I am sure this comes up a lot when you have a factor you wish to split your models on, say sex (male vs. female), and you need to present the results back so they match the order of the original dataset. I have tried to think of ways to use an index to keep things in order, but I can't figure it out. Any help is greatly appreciated.

Brian

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] splitting dataset based on variable and re-combining
I will look into that, thanks. I am afraid I don't quite understand what is going on there with the multiplication, so I will need to read up. What I ended up doing was like so. For the training data it is easy, as I can use subset to have each model work off only the data I want:

rbfSVM_setosa <- train(Sepal.Length ~ ., data = trainset,
                       subset = trainset$Species == "setosa", ...)
rbfSVM_versicolor <- train(Sepal.Length ~ ., data = trainset,
                           subset = trainset$Species == "versicolor", ...)

For my test data (testset), I ended up doing the following, which appears to work:

index_setosa <- which(testset$Species == "setosa")
svmPred <- as.vector(rep(NA, nrow(testset)))
svmPred[index_setosa] <- predict(rbfSVM_setosa, testset[testset$Species == "setosa", ])
svmPred[is.na(svmPred)] <- predict(rbfSVM_versicolor, testset[testset$Species == "versicolor", ])

The above works when there are just two classes. I am going to read up on some of the other ways suggested and give them a try.

Brian

On Dec 10, 2012, at 10:38 PM, Thomas Stewart tgs.public.m...@gmail.com wrote:

Why not use an indicator variable?

P1 <- ...  # prediction from model 1 (setosa) for the entire dataset
P2 <- ...  # prediction from model 2 for the entire dataset
I <- Species == "setosa"
Predictions <- P1 * I + P2 * (1 - I)
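The index-assignment pattern from this thread can be sketched end to end. This is a hedged sketch: plain lm() fits stand in for the poster's SVM models so the code is self-contained, and the names (idx_set, m_set, and so on) are illustrative, not from the original post.

```r
# Two per-group models, predictions placed back into original row order
data(iris)
iris2 <- iris[iris$Species %in% c("setosa", "versicolor"), ]
iris2 <- iris2[sample(nrow(iris2)), ]            # shuffle rows

idx_set <- which(iris2$Species == "setosa")
idx_ver <- which(iris2$Species == "versicolor")

m_set <- lm(Sepal.Length ~ Sepal.Width, data = iris2[idx_set, ])
m_ver <- lm(Sepal.Length ~ Sepal.Width, data = iris2[idx_ver, ])

pred <- numeric(nrow(iris2))                     # one slot per row of iris2
pred[idx_set] <- predict(m_set, iris2[idx_set, ])
pred[idx_ver] <- predict(m_ver, iris2[idx_ver, ])
# pred[i] is now the prediction for row i of iris2, in the original row order
```

Because each group's predictions are written through its own index vector, no reordering step is needed afterward.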
[R] Assignment of values with different indexes
I would like to take the values of observations and map them to a new index. I am not sure how to accomplish this. The result would look like so: x[1,2,3,4,5,6,7,8,9,10] becomes y[2,4,6,8,10,12,14,16,18,20]. The new index would not necessarily be this sequence; it is a sequence I have stored in a vector, so it could be all kinds of values. Here is what happens:

x <- rnorm(10)
myindex <- seq(from = 1, to = 20, by = 2)
y <- numeric()
y[myindex] <- x
y
 [1] -0.03745988          NA -0.09078822          NA  0.92484413          NA  0.32057426          NA
 [9]  0.01536279          NA  0.02200198          NA  0.37535438          NA  1.46606535          NA
[17]  1.44855796          NA -0.05048738

So yes, it maps the values to my new indexes, but I have NA's. The result I want would look like this instead:

 [1] -0.03745988
 [3] -0.09078822
 [5]  0.92484413
 [7]  0.32057426
 [9]  0.01536279
[11]  0.02200198
[13]  0.37535438
[15]  1.46606535
[17]  1.44855796
[19] -0.05048738

with the NA's removed. I tried this with na.omit() on x, but it looks like so:

x <- rnorm(10)
myindex <- seq(from = 1, to = 20, by = 2)
y <- numeric()
y[myindex] <- na.omit(x)
y
 [1]  0.87399523          NA -0.39908184          NA  0.14583051          NA  0.01850755          NA
 [9] -0.47413632          NA  0.88410517          NA -1.64939190          NA  0.57650807          NA
[17]  0.44016971          NA -0.56313802

Brian
Re: [R] Assignment of values with different indexes
No, because it does not assign the indexes of myindex. If it's not possible, which I am assuming it's not, that's OK. I thought that if I had say 10 observations, sequentially ordered (or in any order, it doesn't matter), and I wanted to assign them specific indexes without getting NA's, that it was possible. I am OK with knowing that I can assign them the specific indexes and that there will be empty spots, which are marked NA. Most functions I would need to use can handle NA's by telling the function to ignore them. I appreciate all the help that has been given.

Brian

On Dec 5, 2012, at 11:49 PM, arun smartpink...@yahoo.com wrote:

Hi,

Would it be okay to use:

y <- na.omit(y[myindex] <- x)
y
# [1] -1.36025132 -0.57529211  1.18132359  0.41038489  1.83108252 -0.03563686
# [7]  1.25267314  1.08311857  1.56973422 -0.30752939

A.K.
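If the goal is just to associate each value with an arbitrary target index, rather than to physically place it at that position in a longer vector, another option is to keep the vector dense and record the intended index as names. A small sketch (setNames() is base R; the lookup key "5" is illustrative):

```r
x <- rnorm(10)
myindex <- seq(from = 1, to = 20, by = 2)

# Dense vector: no NA gaps; the names carry the intended index of each value
y <- setNames(x, myindex)
y["5"]            # the value mapped to index 5 (the third element here)
```

This keeps the length at 10 with no NA padding, at the cost of looking values up by name instead of by position.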
[R] How to re-combine values based on an index?
I am able to split my df into two like so:

dataset <- trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
trainset <- dataset[-testindex,]
testset <- dataset[testindex,-1]

So I have the index information; how could I re-combine the data using that back into a single df? I tried what I thought might work, but failed with:

newdataset[testindex] = testset[testindex]
object 'dataset' not found
newdataset[-testindex] = trainset[-testindex]
object 'dataset' not found

Brian
Re: [R] How to re-combine values based on an index?
Thank you for your response, here is a better example of what I am trying to do:

data(iris)
index_setosa <- which(iris$Species == "setosa")
iris_setosa <- data.frame()
iris_setosa[index_setosa,] <- iris[index_setosa,]
iris_others <- data.frame()
iris_others[-index_setosa,] <- iris[-index_setosa,]

So the idea would be that iris_setosa is a dataframe of size 150, with 50 observations of setosa at their original indices and 100 rows of NA. Likewise iris_others would be 100 observations of species besides setosa at their original indices, with 50 rows of NA. The above doesn't work: when I execute it, I am left with iris_setosa having 0 columns, and I wish it to have all the original columns of iris.

Once I get past the above (being able to split them out and keep the original indices), I wish to be able to combine iris_setosa and iris_others so that iris_combined is a data frame with no NA's and all the original data. Does this make sense? I am basically taking a dataframe, splitting it based on some criteria, working on the two split dataframes separately, and then recombining.

Brian

On Dec 1, 2012, at 11:34 PM, William Dunlap wrote:

newdataset[testindex] = testset[testindex]
object 'dataset' not found

Is that really what R printed? I get

newdataset[testindex] = testset[testindex]
Error in newdataset[testindex] = testset[testindex] :
  object 'newdataset' not found

but perhaps you have a different problem. Copy and paste (and read) the error message you got.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
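One way to re-combine the split rows is to stack them with rbind() and then reorder by the saved positions. A hedged sketch using iris as a stand-in for the poster's dataset (and keeping all columns in testset, unlike the original code, which dropped column 1):

```r
data(iris)
dataset <- iris
testindex <- sample(nrow(dataset), trunc(nrow(dataset) * 30 / 100))
trainset <- dataset[-testindex, ]
testset  <- dataset[testindex, ]

# Stack the pieces, then sort back into the original row order:
# row i of the stacked frame came from original position pos[i]
pos <- c(seq_len(nrow(dataset))[-testindex], testindex)
recombined <- rbind(trainset, testset)[order(pos), ]
```

After the reorder, recombined holds every original row at its original position, so downstream results line up with the source data frame.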
[R] Help with this error kernlab class probability calculations failed; returning NAs
I have never been able to get class probabilities to work, and I am relatively new to using these tools, so I am looking for some insight as to what may be wrong. I am using caret with kernlab/ksvm. I will simplify my problem to a basic data set which produces the same problem. I have read the caret vignettes as well as the documentation for ?train. I appreciate any direction you can give. I realize this is a very small dataset; the actual data is much larger, I am just using 10 rows as an example:

trainset <- data.frame(
  outcome=factor(c(0,1,0,1,0,1,1,1,1,0)),
  age=c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
  amount=c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
)

str(trainset)
'data.frame': 7 obs. of 3 variables:
 $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
 $ age    : num 23 5 28 48 82 11 9
 $ amount : num 22.2 494.2 2 39.2 39.2 ...

colSums(is.na(trainset))
outcome     age  amount
      0       0       0

## SAMPLING AND FORMULA
dataset <- trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
trainset <- dataset[-testindex,]
testset <- dataset[testindex,-1]

## TUNE caret / kernlab
set.seed(1)
MyTrainControl = trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  returnResamp = "all",
  classProbs = TRUE
)

## MODEL
rbfSVM <- train(outcome~., data = trainset,
  method = "svmRadial",
  preProc = c("scale"),
  tuneLength = 10,
  trControl = MyTrainControl,
  fit = FALSE
)
There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In train.default(x, y, weights = w, ...) :
  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
2: In caret:::predictionFunction(method = method, modelFit = mod$fit, ... :
  kernlab class prediction calculations failed; returning NAs
Re: [R] Help with this error kernlab class probability calculations failed; returning NAs
Yes, I am still getting this error. Here is my sessionInfo:

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] e1071_1.6-1     class_7.3-5     kernlab_0.9-14  caret_5.15-045  foreach_1.4.0   cluster_1.14.3
[7] reshape_0.8.4   plyr_1.7.1      lattice_0.20-10

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2     iterators_1.0.6 tools_2.15.2

Is there an example that shows classProbs in use? I could run it to see whether it works on my system.

Brian

On Nov 29, 2012, at 10:10 PM, Max Kuhn mxk...@gmail.com wrote:

You didn't provide the results of sessionInfo(). Upgrade to the version just released on CRAN and see if you still have the issue.

Max
Re: [R] Help with this error kernlab class probability calculations failed; returning NAs
Max,

Thank you for the assistance, that was it. My dependent variable was using "0" and "1" as levels; I changed them to "no"/"yes":

levels(trainset$outcome) <- list(no=0, yes=1)

and I no longer get the warning.

Brian

On Nov 29, 2012, at 10:29 PM, Max Kuhn mxk...@gmail.com wrote:

Your output has:

At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1

Try changing the factor levels to avoid leading numbers and try again.

Max
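A generic version of the same fix, without choosing labels by hand, is make.names(), which produces exactly the X0/X1 conversion mentioned in caret's warning. A small base-R sketch:

```r
# Factor levels "0" and "1" are not valid R variable names
outcome <- factor(c(0, 1, 0, 1, 0, 1))
levels(outcome)                     # "0" "1"

# make.names() prefixes them the same way caret's warning describes: X0, X1
levels(outcome) <- make.names(levels(outcome))
levels(outcome)                     # "X0" "X1"
```

This works for any set of levels, not just 0/1, since make.names() always returns syntactically valid names.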
Re: [R] What is the . in formula ~. syntax?
Thank you! I searched in the manual, but I did not see where this is mentioned; I looked under operators and in some of the formula documentation.

Brian

On Nov 23, 2012, at 3:15 AM, Michael Weylandt michael.weyla...@gmail.com wrote:

On Nov 23, 2012, at 4:26 AM, Brian Feeny bfe...@mac.com wrote:

but can someone explain what . actually is or what it gets expanded into?

Everything not already stated.

rmw
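One way to see the expansion concretely is to ask terms() for the term labels: the dot resolves to every column of data that is not already used in the formula. A small sketch:

```r
df <- data.frame(y = 1:4, x1 = c(2, 4, 6, 8), x2 = c(1, 3, 5, 7))

# For this data frame, y ~ . expands to y ~ x1 + x2
labels_dot  <- attr(terms(y ~ .,       data = df), "term.labels")
labels_full <- attr(terms(y ~ x1 + x2, data = df), "term.labels")
labels_dot   # "x1" "x2"
```

The dot is also handy on the right of an update, e.g. update(fit, . ~ . - x2) keeps the existing terms except x2.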
[R] caret train and trainControl
I am used to packages like e1071, where you have a tune step and then pass your tunings to train. It seems with caret, tuning and training are both handled by train. I am using train and trainControl to find my hyperparameters like so:

MyTrainControl = trainControl(
  method = "cv",
  number = 5,
  returnResamp = "all",
  classProbs = TRUE
)

rbfSVM <- train(label~., data = trainset,
  method = "svmRadial",
  tuneGrid = expand.grid(.sigma=c(0.0118), .C=c(8,16,32,64,128)),
  trControl = MyTrainControl,
  fit = FALSE
)

Once this returns my ideal parameters, in this case a cost of 64, do I simply re-run the whole process, passing a grid containing only those specific parameters? Like so?

rbfSVM <- train(label~., data = trainset,
  method = "svmRadial",
  tuneGrid = expand.grid(.sigma=0.0118, .C=64),
  trControl = MyTrainControl,
  fit = FALSE
)

This is what I have been doing, but I am new to caret and want to make sure I am doing this correctly.

Brian
Re: [R] caret train and trainControl
Max,

Thanks, I do understand that the final model is fitted. I think I was not clear in my posting: I am changing datasets between tuning and real training. I might tune on a trainset of only 5000 rows, doing my grid search and all that, and then once I have the hyperparameters, use a larger trainset of, say, 25000 rows. Since the trainset changes between tuning and training, I have to re-fit the model, and what I am trying to confirm is that the only thing I need to carry over, so to speak, is the hyperparameters. Is this correct? That is what I am doing: passing a grid with just my two specific hyperparameters, to avoid any kind of search.

Brian

On Nov 23, 2012, at 6:06 PM, Max Kuhn wrote:

Brian,

This is all outlined in the package documentation. The final model is fit automatically. For example, using 'verboseIter' provides details. From ?train:

knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv", verboseIter = TRUE))
+ Fold01: k= 5
- Fold01: k= 5
+ Fold01: k= 7
- Fold01: k= 7
+ Fold01: k= 9
- Fold01: k= 9
+ Fold01: k=11
- Fold01: k=11
<snip>
+ Fold10: k=17
- Fold10: k=17
+ Fold10: k=19
- Fold10: k=19
+ Fold10: k=21
- Fold10: k=21
+ Fold10: k=23
- Fold10: k=23
Aggregating results
Selecting tuning parameters
Fitting model on full training set

Max
[R] Building factors across two columns, is this possible?
I am trying to make it so two columns with similar data use the same internal numbers for the same factor levels. Here is the example:

read.csv("test.csv", header = FALSE, sep = ",")
     V1    V2       V3
1   sun  moon    stars
2 stars  moon      sun
3   cat   dog   catdog
4   dog  moon      sun
5  bird plane superman
6  1000   dog     2000

data <- read.csv("test.csv", header = FALSE, sep = ",")
str(data)
'data.frame': 6 obs. of 3 variables:
 $ V1: Factor w/ 6 levels "1000","bird",..: 6 5 3 4 2 1
 $ V2: Factor w/ 3 levels "dog","moon","plane": 2 2 1 2 3 1
 $ V3: Factor w/ 5 levels "2000","catdog",..: 3 4 2 4 5 1

as.numeric(data$V1)
[1] 6 5 3 4 2 1
as.numeric(data$V2)
[1] 2 2 1 2 3 1
as.factor(data$V1)
[1] sun   stars cat   dog   bird  1000
Levels: 1000 bird cat dog stars sun
as.factor(data$V2)
[1] moon  moon  dog   moon  plane dog
Levels: dog moon plane

So notice "dog" is 4 in V1, yet it is 1 in V2. Is there a way, either on import or afterwards, to have factors computed across both columns and assigned the same internal values?

Brian
Re: [R] Building factors across two columns, is this possible?
To clarify my previous post, here is a representation of what I am trying to accomplish. I would like every unique value in any column to be assigned a number, like so:

     V1    V2       V3
1   sun  moon    stars
2 stars  moon      sun
3   cat   dog   catdog
4   dog  moon      sun
5  bird plane superman
6  1000   dog     2000

Level      Value
sun        0
stars      1
cat        2
dog        3
bird       4
1000       5
moon       6
plane      7
catdog     8
superman   9
2000       10

so that internally it is represented as:

  V1 V2 V3
1  0  6  1
2  1  6  0
3  2  3  8
4  3  6  0
5  4  7  9
6  5  3 10

Does this make sense? I am hoping there is a way to accomplish this.

Brian
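Base R can do this by building one shared level set across the columns and refactoring each column with it. A sketch, with the example data reconstructed inline (note that R's internal factor codes start at 1 rather than 0):

```r
data <- data.frame(V1 = c("sun", "stars", "cat", "dog", "bird", "1000"),
                   V2 = c("moon", "moon", "dog", "moon", "plane", "dog"),
                   V3 = c("stars", "sun", "catdog", "sun", "superman", "2000"),
                   stringsAsFactors = TRUE)

# Union of every value seen in any column, used as one common level set
all_levels <- sort(unique(unlist(lapply(data, as.character))))
data[] <- lapply(data, factor, levels = all_levels)

as.numeric(data$V1)   # codes now refer to the shared level set
as.numeric(data$V2)   # "dog" gets the same code in every column
```

Since every column now has identical levels, a value such as "dog" maps to the same internal integer wherever it appears.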
[R] What is the . in formula ~. syntax?
I know that if I have a dataframe with columns y, x1, x2, and I wish to have y as my response and x1 and x2 as predictors, I can do:

y ~ x1 + x2

or

y ~ .

but can someone explain what "." actually is, or what it gets expanded into? I searched for this with no success, reading the formula manual pages.

Brian
[R] Scaling values 0-255 to -1, 1 - how can this be done?
I have a dataframe in which I have values 0-255. I wish to transform them such that:

if value > 127.5, value = 1
if value < 127.5, value = -1

I did something similar using the binarize function of the biclust package, which transforms my dataframe to 0 and 1 values, but I wish to use -1 and 1 and am looking for a way in R to do this.

Brian
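A base-R way to do this, assuming the value 127.5 itself never occurs (true for integer 0-255 data), is to threshold with ifelse(), applied column by column. The small data frame here is made up for illustration:

```r
df <- data.frame(a = c(0, 100, 200), b = c(255, 127, 128))

# Values above the midpoint become 1, values below become -1
scaled <- as.data.frame(lapply(df, function(v) ifelse(v > 127.5, 1, -1)))
scaled
```

When every column is numeric, sign(df - 127.5) should give the same result in one step, since the Math group generics apply to data frames.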
Re: [R] cluster analysis in R
http://cran.r-project.org/web/views/Cluster.html might be a good start

Brian

On Nov 21, 2012, at 1:36 PM, KitKat wrote:

Thank you for replying! I made a new post asking if there are any websites or files on how to download package mclust (or other Bayesian cluster analysis packages) and the appropriate R functions? Sorry I don't know how this forum works yet
[R] Using doMC to run parallel SVM grid search?
Has anyone used doMC to speed up an SVM grid search? I am considering doing like so:

library(doMC)
registerDoMC()

foreach (i=0:3) %dopar% {
  tuned_part1 <- tune.svm(label~., data = trainset, gamma = 10^(-10:-6), cost = 10^(-1:1))
  tuned_part2 <- tune.svm(label~., data = trainset, gamma = 10^(-5:0), cost = 10^(-1:1))
  tuned_part3 <- tune.svm(label~., data = trainset, gamma = 10^(1:-5), cost = 10^(-1:1))
  tuned_part4 <- tune.svm(label~., data = trainset, gamma = 10^(5:10), cost = 10^(-1:1))
}

I have a Quad Core processor, so if I understand correctly the above could split that up across the cores. My goal would be a coarse grid search; I am not sure if the above parameters are good for that, they just seemed like some good starting points. I would just manually look at each of the resulting objects, although it would be cool if it resulted in an instance variable being set to the best values. Has anyone used doMC for something like this? Is there a better library to potentially use than doMC for doing something like splitting up an SVM grid search over multiple cores?

Brian
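As written, every one of the four loop iterations would run all four tune.svm calls, so the work is quadrupled rather than split. To actually parallelize, give each iteration its own slice of the gamma grid and collect the results (a sketch assuming the poster's `trainset` and `label`, with the overlapping gamma ranges tidied into disjoint chunks):

```r
library(doMC)
library(foreach)
library(e1071)
registerDoMC(cores = 4)

# One gamma range per iteration; the cost range is shared
gamma_chunks <- list(10^(-10:-6), 10^(-5:-1), 10^(0:4), 10^(5:10))

results <- foreach(g = gamma_chunks) %dopar% {
  tune.svm(label ~ ., data = trainset, gamma = g, cost = 10^(-1:1))
}

# Pick the chunk whose best model had the lowest cross-validation error
best <- results[[which.min(sapply(results, function(r) r$best.performance))]]
best$best.parameters
```

foreach returns the list of per-iteration results, which also answers the "instance variable with the best values" wish: `best$best.parameters` holds the winning gamma/cost combination.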
[R] data after write() is off by 1 ?
I am new to R, so I am sure I am making a simple mistake. I am including complete information in hopes someone can help me. Basically my data in R looks good, I write it to a file, and every value is off by 1. Here is my flow:

> str(prediction)
 Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
 - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
> print(prediction)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 2  0  9  9  3  7  0  3  0  3  5  7  4  0  4  3  3  1  9  0  9  1  1

ok, so it shows my values are 2, 0, 9, 9, 3 etc.

# I write my file out
write(prediction, file="prediction.csv")

# look at the first 10 values
$ head -10 prediction.csv
3 1 10 10 4
8 1 4 1 4
6 8 5 1 5
4 4 2 10 1
10 2 2 6 8
5 3 8 5 8
8 6 5 3 7
3 6 6 2 7
8 8 5 10 9
8 9 3 7 8

The complete work of what I did was as follows:

# First I load in a dataset, label the first column as a factor
dataset <- read.csv('train.csv', head=TRUE)
dataset$label <- as.factor(dataset$label)

# it has 42000 obs. of 785 variables
> str(dataset)
'data.frame': 42000 obs. of 785 variables:
 $ label : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
 $ pixel0 : int 0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1 : int 0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2 : int 0 0 0 0 0 0 0 0 0 0 ...
 [list output truncated]

# I make a sampling testset and trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
testset <- dataset[testindex,]
trainset <- dataset[-testindex,]

# build model, predict, view
model <- svm(label~., data = trainset, type="C-classification", kernel="radial", gamma=0.001, cost=16)
prediction <- predict(model, testset)
tab <- table(pred = prediction, true = testset[,1])

true pred0123456789 0 1210031057258 10 141520210750 202 1127 12302720 3007 12960 1002 156 41182 12012435 16 5310 130 11003123 6303059 1263010 70296610 12961 13 8357 111202 11904 91123 172044 1190

Ok, everything looks great up to this point... so I try to apply my model to a real testset, which is the same format as my previous dataset, except it does not have the label/factor column, so it is 28000 obs. of 784 variables:

testset <- read.csv('test.csv', head=TRUE)
> str(testset)
'data.frame': 28000 obs. of 784 variables:
 $ pixel0 : int 0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1 : int 0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2 : int 0 0 0 0 0 0 0 0 0 0 ...
 [list output truncated]

prediction <- predict(model, testset)
> summary(prediction)
   0    1    2    3    4    5    6    7    8    9
2780 3204 2824 2767 2771 2516 2744 2898 2736 2760
> print(prediction)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 2  0  9  9  3  7  0  3  0  3  5  7  4  0  4  3  3  1  9  0  9  1  1
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
 5  7  4  2  7  4  7  7  5  4  2  6  2  5  5  1  6  7  7  4  9  8  7
 [list output truncated]

write(prediction, file="prediction.csv")

$ head -10 prediction.csv
3 1 10 10 4
8 1 4 1 4
6 8 5 1 5
4 4 2 10 1
10 2 2 6 8
5 3 8 5 8
8 6 5 3 7
3 6 6 2 7
8 8 5 10 9
8 9 3 7 8

I am obviously making a mistake. Everything is off by a value of 1. Can someone tell me what I am doing wrong?

Brian
Re: [R] data after write() is off by 1 ?
A followup to my own post: I believe I figured this out, but if I should be doing something different please correct:

prediction.out <- levels(prediction)[prediction]
write(prediction.out, file="prediction.csv")

This gives me my correctly adjusted values.

Brian
[R] Removing columns that are NA or constant
I have a dataset that has many columns which are NA or constant, and so I remove them like so:

same <- sapply(dataset, function(.col){ all(is.na(.col)) || all(.col[1L] == .col) })
dataset <- dataset[!same]

This works GREAT (thanks to the r-help list archive, where I found this); however, when I then do my data sampling like so:

testSize <- floor(nrow(x) * 10/100)
test <- sample(1:nrow(x), testSize)
train_data <- x[-test,]
test_data <- x[test, -1]
test_class <- x[test, 1]

it is now possible that test_data or train_data contain columns that are constant, even though as one dataset they did not. So the solution for me is to just re-run the lines to remove all constants. Not a problem, but is this normal? Is this how I should be handling it in R? Many models I am attempting to use (SVM, lda, etc.) don't like it if a column has all the same value... so as a beginner, this is how I am handling it in R, but I am looking for someone to sanity-check that what I am doing is sound.

Brian
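Wrapping the removal in a function makes it easy to reapply after each split (a sketch built from the post's own test; `dataset` and `train_data` are the poster's objects):

```r
# Drop columns that are all NA or hold a single constant value
drop_constant <- function(d) {
  same <- sapply(d, function(.col) {
    all(is.na(.col)) || all(.col[1L] == .col, na.rm = TRUE)
  })
  d[!same]
}

dataset    <- drop_constant(dataset)
# ... split into train/test as above, then reapply to each piece:
train_data <- drop_constant(train_data)
```

One caveat: dropping constants independently in each split can leave train and test with different columns. It is usually safer to find the constant columns in the training set only and drop those same columns from both pieces, so the model and the prediction data stay aligned.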
Re: [R] e1071 SVM: Cross-validation error confusion matrix
Responding to my own question, I see the ?svm man page states fitted() and predict() can do the same thing:

# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)

On Nov 21, 2012, at 1:08 AM, signal bfe...@mac.com wrote:

Did you ever receive a response to this? I did not see one public. I would think that if your dataset was of a large enough size, 10-fold validation would show an improvement over N:N. Also, any idea if there is any difference really in using fitted() vs. predict() in your second step? I am pretty sure they do the same thing.

Brian
Re: [R] How to subset my data and at the same time keep the balance?
Just curious: once you have a model that works well, does it make sense to then tune it against 100% of the dataset (with known outcomes) so you can apply it to data you wish to predict for, or is that a bad approach? I have done as explained in this thread many times: taken a sample, learned against it, and then tested on the remainder. But this is using data for which we know the predicted variable, so we can compare to validate. So after you're done, should you re-tune with the entire training set? As for which method, I am using mostly SVM.

Brian

On Nov 19, 2012, at 2:07 PM, Eddie Smith eddie...@gmail.com wrote:

Thanks a lot! I got some ideas from all the replies and here is the final one:

select <- sample(nrow(newdata), nrow(newdata) * .7)
data70 <- newdata[select,]   # select
write.csv(data70, "data70.csv", row.names=FALSE)
data30 <- newdata[-select,]  # testing
write.csv(data30, "data30.csv", row.names=FALSE)

Cheers
Re: [R] help interpreting dudi.pco
I am new to R as well; it sounds like you would want to look at clustering, perhaps k-means clustering.

Brian

On Nov 18, 2012, at 12:19 AM, avadhoot velankar avadhoot.velan...@gmail.com wrote:

I am working on morphometry of hairs and want to see if selected variables are giving significantly distinct groups. I am new to both R and principal coordinate analysis. The scatter plot is showing distinct groups but I don't know how to refine the analysis and interpret it.
[R] Best prediction to use to use for basic problem?
I have a rather basic set of data. It is simply a variable that can be 0, 1 or 2, and its value over a series of times t0-t9, like so:

y: 1  1  2  0  1  2  2  1  2  1
x: t0 t1 t2 t3 t4 t5 t6 t7 t8 t9

I need to predict what the value of y will be at t10 through t13. As you can see it is rather basic. I am rather new to solving these types of problems, so I am looking for some good, straightforward things to try. My research into this (Google, wikis, etc.) leads me to believe that perhaps logistic regression would be good, since I am predicting a categorical variable (0, 1, 2). I don't have much data for the formula to learn from, as I only have 10 time slots and I need to predict the next 4. Is logistic regression a good candidate, or should I be looking at perhaps something else?

Brian
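Ordinary logistic regression handles two classes; for a three-level outcome the usual analogue is multinomial logistic regression, e.g. nnet::multinom. A sketch using the posted values — purely illustrative, since with only 10 observations any fit will be very unreliable:

```r
library(nnet)  # multinom() lives here

y <- factor(c(1, 1, 2, 0, 1, 2, 2, 1, 2, 1))
t <- 0:9

fit <- multinom(y ~ t, trace = FALSE)

# Predicted class at the next four time points, and class probabilities
predict(fit, newdata = data.frame(t = 10:13))
predict(fit, newdata = data.frame(t = 10:13), type = "probs")
```

Note that this treats time as a plain numeric predictor; if y depends on its own past values rather than on t, a Markov-chain style model of the transition counts may be a better fit for data this small.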
Re: [R] Using cbind to combine data frames and preserve header/names
David and Rainer,

Thank you both for your responses, you got me on track. I ended up just doing like so:

trainset <- read.csv('train.csv', head=TRUE)
trainset[,-1] <- binarize(trainset[,-1])
trainset$label <- as.factor(trainset$label)

I appreciate your help.

Brian

On Nov 17, 2012, at 11:25 AM, David Winsemius wrote:

On Nov 16, 2012, at 9:39 PM, Brian Feeny wrote:

I have a dataframe that has a header like so:

class value1 value2 value3

class is a factor; the actual values in the columns value1, value2 and value3 are 0-255. I wish to binarize these using biclust. I can do this like so:

binarize(dataframe[,-1])

this will return a dataframe, but then I lose my first column class, so I thought I could combine it like so:

dataframe <- cbind(dataframe$label, binarize(dataframe[,-1]))

There is no column with the name label. There is also no function named label in base R, although I cannot speak about biclust. Even if there were, you cannot apply functions to data.frames with the $ function.

but then I lose my header (names). How can I do the above operation and keep my header intact? Basically I just want to binarize everything but the first column (since it's a factor column and not numeric).

I have no idea how 'binarize' works, but if you wanted to 'defactorize' a factor then you should learn to use 'as.character' to turn factors into character vectors. Perhaps:

dfrm <- cbind( as.character(dataframe[1]), binarize(dataframe[,-1]))

You should make sure this is still a dataframe, since cbind.default returns a matrix and this would be a character matrix. I'm taking your word that the second argument is a dataframe, and that would mean the cbind.data.frame method would be dispatched. It is a rather unfortunate practice to call your dataframes "dataframe" and also bad to name your columns "class", since the first is a fundamental term and the second a basic function. If you persist, people will start talking to you about dogs named Dog.

David Winsemius, MD
Alameda, CA, USA
[R] Strange problem with reading a pipe delimited file
I am trying to read in a pipe-delimited file that has rows with varying numbers of columns. Here is my sample data:

A|B|C|D
A|B|C|D|E|F
A|B|C|D|E
A|B|C|D|E|F|G|H|I
A|B|C|D
A|B|C|D|E|F|G|H|I|J

You can see line 6 has 10 columns. Yet I can't explain why R does like so:

test <- read.delim("mypaths4.txt", sep="|", quote=NULL, header=F, colClasses="character")
> test
  V1 V2 V3 V4 V5 V6 V7 V8 V9
1  A  B  C  D
2  A  B  C  D  E  F
3  A  B  C  D  E
4  A  B  C  D  E  F  G  H  I
5  A  B  C  D
6  A  B  C  D  E  F  G  H  I
7  J

You can see it moved J to row 7; I don't understand why it is not left in position 6,10. More strange to me: if I remove line 1, so my data file contains:

A|B|C|D|E|F
A|B|C|D|E
A|B|C|D|E|F|G|H|I
A|B|C|D
A|B|C|D|E|F|G|H|I|J

I get a totally different result:

test <- read.delim("mypaths5.txt", sep="|", quote=NULL, header=F, colClasses="character")
> test
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  A  B  C  D  E  F
2  A  B  C  D  E
3  A  B  C  D  E  F  G  H  I
4  A  B  C  D
5  A  B  C  D  E  F  G  H  I   J

What is it that I am doing that is changing the fate of that final J? This is just a basic ASCII text file, pipe-delimited as shown. I have been racking my brain on this for a day!

Brian
Re: [R] Strange problem with reading a pipe delimited file
On Nov 17, 2012, at 4:27 PM, Duncan Murdoch wrote:

I would suggest reading the help file: read.delim only looks at the first 5 lines to determine the number of columns if you don't specify the colClasses.

Duncan Murdoch

Duncan,

I have tried to pass colClasses but R complains:

test <- read.delim("mypaths4.txt", sep="|", quote=NULL, header=F, colClasses=c("character","character","character","character","character","character","character","character","character","character"), fill=TRUE)
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
  cols = 9 != length(data) = 10

once again this is with data:

A|B|C|D
A|B|C|D|E|F
A|B|C|D|E
A|B|C|D|E|F|G|H|I
A|B|C|D
A|B|C|D|E|F|G|H|I|J

Any idea?

Brian
Re: [R] Strange problem with reading a pipe delimited file
Duncan,

I believe I follow you now; I have done like so, with the expected results:

ncol <- max(count.fields("paths.txt", sep = "|"))
test <- read.delim("paths.txt", sep="|", quote=NULL, header=F, colClasses="character", fill=TRUE, col.names = paste("V", seq_len(ncol), sep = ""))

Thank you for your help.

Brian
[R] library/function to compare two phrases?
I am looking for a library/function in R that can compare two phrases and give me a score, or somehow classify them, as accurately as possible. The phrases are obfuscated/messy. I am not concerned about which is correct (for example spell checking); I am only concerned with grouping them so that I know they are the closest match. Example: I have ROW1 and ROW2 like so:

ROW1                 ROW2
hamburger helper     bigmc heartkcatta
chicken nuggets      chicke, nuggets, jss
bigmac heartattack   some sombody somehwere
somebody somehwere   repleh regrubmah

I am looking for something that can tell me that the best match for "hamburger helper" is "repleh regrubmah", and the same for each other row. So my goal is to write a program that, for each phrase in ROW1, runs this function against ROW2 and gives me the phrase that scored best. I have read over much of the NLP packages at http://cran.r-project.org/web/views/NaturalLanguageProcessing.html and thought lsa might be a good fit, but I am not sure. I have limited time, so I am hoping someone can point me in the direction of what I am looking for. I have been searching for text classifiers; perhaps this problem is referred to as something else.

Brian
Re: [R] library/function to compare two phrases?
Thank you Michael and David. I am onto agrep and adist and they look very useful for what I am wanting to do. My initial results are promising!

Brian

On Nov 17, 2012, at 6:20 PM, R. Michael Weylandt wrote:

This is outside my expertise, but if memory serves, you might benefit from googling the Levenshtein (spelling?) distance, which allows this sort of fuzzy matching of strings.

MW
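The adist function mentioned above scores all pairs at once, so best-match pairing is a two-liner (a sketch using the rows from the question; the pairing shown is not guaranteed, since a reversed phrase like "repleh regrubmah" is actually far from "hamburger helper" in plain edit distance and may need extra handling, e.g. also comparing against reversed candidates):

```r
row1 <- c("hamburger helper", "chicken nuggets",
          "bigmac heartattack", "somebody somehwere")
row2 <- c("bigmc heartkcatta", "chicke, nuggets, jss",
          "some sombody somehwere", "repleh regrubmah")

# Generalized Levenshtein distance between every ROW1/ROW2 pair
d <- adist(row1, row2)

# For each ROW1 phrase, the ROW2 phrase with the smallest edit distance
data.frame(row1, best_match = row2[apply(d, 1, which.min)])
```

adist is in base R (the utils package); agrep does approximate matching with a tolerance threshold instead of returning a full distance matrix.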
[R] Using cbind to combine data frames and preserve header/names
I have a dataframe that has a header like so:

class value1 value2 value3

class is a factor; the actual values in the columns value1, value2 and value3 are 0-255. I wish to binarize these using biclust. I can do this like so:

binarize(dataframe[,-1])

this will return a dataframe, but then I lose my first column class, so I thought I could combine it like so:

dataframe <- cbind(dataframe$label, binarize(dataframe[,-1]))

but then I lose my header (names). How can I do the above operation and keep my header intact? Basically I just want to binarize everything but the first column (since it's a factor column and not numeric). Thank you for any help you can give me; I am relatively new to R.

Brian
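One way to keep both the factor column and the names is to rebuild the frame with data.frame() instead of cbind-ing a bare vector (a sketch; `binarize` is from biclust, and the first column is assumed to be named `class` as in the header above — note the question's code refers to `dataframe$label`, which does not match that header):

```r
library(biclust)

# Rebuild the frame: keep the factor column by name, binarize the rest.
# data.frame() keeps the column names of both pieces, whereas cbind() on a
# bare vector plus a matrix drops them.
dataframe <- data.frame(class = dataframe[["class"]],
                        binarize(dataframe[, -1]))
```

Naming the first argument (`class = ...`) is what preserves the header; the remaining columns keep their names from the binarized piece.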