[R] failure with merge

2016-07-14 Thread Max Kuhn
I am merging two data frames:

tuneAcc <- structure(list(select = c(FALSE, TRUE), method =
structure(c(1L, 1L), .Label = "GCV.Cp", class = "factor"), RMSE =
c(29.2102056093962, 28.9743318817886), Rsquared =
c(0.0322612161559773, 0.0281713457306074), RMSESD = c(0.981573768028697,
0.791307778398384), RsquaredSD = c(0.0388188469162352,
0.0322578925071113)),
.Names = c("select", "method", "RMSE", "Rsquared", "RMSESD",
"RsquaredSD"),
class = "data.frame", row.names = 1:2)

finalTune <- structure(list(select = TRUE, method = structure(1L,
.Label = "GCV.Cp", class = "factor"), Selected = "*"), .Names =
c("select", "method", "Selected"), row.names = 2L, class = "data.frame")

using

   merge(x = tuneAcc, y = finalTune, all.x = TRUE)

The error is

  "Error in match.arg(method) : 'arg' must be NULL or a character vector"

This is R version 3.3.1 (2016-06-21), Platform: x86_64-apple-darwin13.4.0
(64-bit), Running under: OS X 10.11.5 (El Capitan).



These do not stop execution:

  merge(x = tuneAcc, y = finalTune)
  merge(x = tuneAcc, y = finalTune, all.x = TRUE, sort = FALSE)

The latter produces (what I consider to be) incorrect results.

Walking through the code, the original call with just `all.x = TRUE` fails
when sorting at the line:

  res <- res[if (all.x || all.y) do.call("order", x[, seq_len(l.b), drop = FALSE])
             else sort.list(bx[m$xi]), , drop = FALSE]

Specifically, on the `do.call` bit. For these data:

  Browse[3]> x
    select method     RMSE   Rsquared    RMSESD RsquaredSD
  2   TRUE GCV.Cp 28.97433 0.02817135 0.7913078 0.03225789
  1  FALSE GCV.Cp 29.21021 0.03226122 0.9815738 0.03881885


  Browse[3]> x[, seq_len(l.b), drop = FALSE]
    select method
  2   TRUE GCV.Cp
  1  FALSE GCV.Cp

and this line executes:

  Browse[3]> order(x[, seq_len(l.b), drop = FALSE])
  [1] 1 2 3 4

although nrow(x) = 2, so this is an issue.

Calling it this way stops execution:

Browse[3]> do.call("order", x[, seq_len(l.b), drop = FALSE])
Error in match.arg(method) : 'arg' must be NULL or a character vector
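
For illustration, a minimal sketch of where I think this comes from (my
guess, not something verified line-by-line in the R sources): do.call()
passes the data frame's columns as *named* arguments, so a by-column named
"method" is matched to order()'s own `method` argument instead of being
treated as a sort key.

  by_cols <- data.frame(select = c(TRUE, FALSE),
                        method = factor(c("GCV.Cp", "GCV.Cp")))

  ## reproduces the error: the "method" column is captured by match.arg(method)
  do.call("order", by_cols)
  ## Error in match.arg(method) : 'arg' must be NULL or a character vector

  ## passing the columns positionally (unnamed) sorts as intended
  do.call("order", unname(as.list(by_cols)))
  ## [1] 2 1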

Thanks,

Max

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Installing Caret

2016-06-16 Thread Max Kuhn
The problem is not with `caret`. Your output says:

 > installation of package ‘minqa’ had non-zero exit status

`caret` has a dependency that has a dependency on `minqa`. The same is true
for `RcppEigen` and the others.

What code did you use to do the install? What OS and version of R, etc.?
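
As an aside, a quick way to see the underlying failure (a sketch, not a
guaranteed fix) is to install the first failing dependency on its own, read
its compile error, and then retry caret with its suggested packages:

  install.packages("minqa")
  install.packages("caret", dependencies = c("Depends", "Imports", "Suggests"))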


On Thu, Jun 16, 2016 at 4:49 AM, TJUN KIAT TEO  wrote:

> I am trying to install the package but I keep getting these error
> messages
>
>
>
> 1: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘minqa’ had non-zero exit status
> 2: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘RcppEigen’ had non-zero exit status
> 3: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘SparseM’ had non-zero exit status
> 4: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘lme4’ had non-zero exit status
> 5: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘quantreg’ had non-zero exit status
> 6: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘pbkrtest’ had non-zero exit status
> 7: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘car’ had non-zero exit status
> 8: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘caret’ had non-zero exit status
>
>
> Anyone have any idea what's wrong?
>
> Tjun Kiat
>
>
>
> [[alternative HTML version deleted]]
>
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Max Kuhn
I've brought this up numerous times... you shouldn't use `predict.rpart`
(or whatever modeling function) from the `finalModel` object. That object
has no idea what was done to the data prior to its invocation.

The issue here is that `train(formula)` converts the factors to dummy
variables. `rpart` does not require that and the `finalModel` object has no
idea that that happened. Using `predict.train` works just fine so why not
use it?

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.583   -1217.3
3 3 6 3
-886.6667  -408.375-375.7 -240.307692307692
5 1 4 5
-201.612903225806 -19.6071428571429  30.80833  43.9
   307266 9
151.5  209.647058823529
628

On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <
muhammad2.bi...@live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
>
> #Loading the data frame
>
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions *[This command raises the error]*
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #
> tree_pred.sse
>
> ...
>
> Many Thanks and
>
>
> Kind Regards
>
>
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> *muhammad2.bi...@live.uwe.ac.uk* 
>
>
> --
> *From:* Max Kuhn 
> *Sent:* 09 May 2016 17:22:22
> *To:* Muhammad Bilal
> *Cc:* Bert Gunter; r-help@r-project.org
>
> *Subject:* Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
> muhammad2.bi...@live.uwe.ac.uk> wrote:
>
>> Hi Bert,
>>
>> Thanks for the response.
>>
>> I checked the datasets, however, the Hospitals level appears in both of
>> them. See the output below:
>>
>> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>> sector count(*)
>> 1  Defense9
>> 2Hospitals  101
>> 3  Housing   32
>> 4   Others   99
>> 5 Public Buildings   39
>> 6  Schools  148
>> 7  Social Care   10
>> 8  Transportation   27
>> 9Waste   26
>> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>> sector count(*)
>> 1  Defense5
>> 2Hospitals   47
>> 3  Housing   11
>> 4   Others   44
>> 5 Public Buildings   18
>> 6  Schools   69
>> 7  Social Care9
>> 8   Transportation8
>> 9Waste   12
>>
>> Any thing else to try?
>>
>> --
>> Muhammad Bilal
>> Research Fellow and Doctoral Researcher,
>> Bristol Enterprise, Research, and Innovation Centre (BERIC),
>> University of the West of England (UWE),
>> Frenchay Campus,
>> Bristol,
>> BS16 1QY
>>
>> muhammad2.bi...@live.uwe.ac.uk
>>
>>
>> 
>> From: Bert Gunter 
>> Sent: 09 May 2016 01:42:39
>> To: Muhammad Bilal
>> Cc: r-hel

Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Max Kuhn
It is extremely difficult to tell what the issue might be without a
reproducible example.

The only thing that I can suggest is to use the non-formula interface to
`train` so that you can avoid creating dummy variables.
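
For example, a minimal sketch of the non-formula interface (the column names
are taken from your code, so adjust as needed):

  preds <- c("project_lon", "project_lat", "project_duration",
             "sector", "contract_type", "capital_value")

  tr_m <- train(x = trainPFI[, preds],
                y = trainPFI$project_delay,
                method = "rpart",
                trControl = tr.control,
                tuneGrid = cp.grid)

  ## then predict through predict.train, not the finalModel object
  best_tree_pred <- predict(tr_m, newdata = testPFI)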

On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
muhammad2.bi...@live.uwe.ac.uk> wrote:

> Hi Bert,
>
> Thanks for the response.
>
> I checked the datasets, however, the Hospitals level appears in both of
> them. See the output below:
>
> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
> sector count(*)
> 1  Defense9
> 2Hospitals  101
> 3  Housing   32
> 4   Others   99
> 5 Public Buildings   39
> 6  Schools  148
> 7  Social Care   10
> 8  Transportation   27
> 9Waste   26
> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
> sector count(*)
> 1  Defense5
> 2Hospitals   47
> 3  Housing   11
> 4   Others   44
> 5 Public Buildings   18
> 6  Schools   69
> 7  Social Care9
> 8   Transportation8
> 9Waste   12
>
> Any thing else to try?
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
>
> 
> From: Bert Gunter 
> Sent: 09 May 2016 01:42:39
> To: Muhammad Bilal
> Cc: r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It seems that the data that you used for prediction contained a level
> "Hospitals" for the sector factor that did not appear in the training
> data (or maybe it's the other way round). Check this.
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
>  wrote:
> > Hi All,
> >
> > I have the following script, that raises error at the last command. I am
> new to R and require some clarification on what is going wrong.
> >
> > #Creating the training and testing data sets
> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
> >
> >
> > #Structure of the trainPFI data frame
> >> str(trainPFI)
> > ***
> > 'data.frame': 491 obs. of  16 variables:
> >  $ project_id : int  1 2 3 6 7 9 10 12 13 14 ...
> >  $ project_lat: num  51.4 51.5 52.2 51.9 52.5 ...
> >  $ project_lon: num  -0.642 -1.85 0.08 -0.401 -1.888 ...
> >  $ sector : Factor w/ 9 levels "Defense","Hospitals",..:
> 4 4 4 6 6 6 6 6 6 6 ...
> >  $ contract_type  : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey"
> ...
> >  $ project_duration   : int  1826 3652 121 730 730 790 522 819 998
> 372 ...
> >  $ project_delay  : int  -323 0 -60 0 0 0 -91 0 0 7 ...
> >  $ capital_value  : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5
> 60.5 78 ...
> >  $ project_delay_pct  : num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> >  $ delay_type : Ord.factor w/ 9 levels "7 months early &
> beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> >
> > library(caret)
> > library(e1071)
> >
> > set.seed(100)
> >
> > tr.control <- trainControl(method="cv", number=10)
> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
> >
> > #Fitting the model using regression tree
> > tr_m <- train(project_delay ~ project_lon + project_lat +
> project_duration + sector + contract_type + capital_value, data = trainPFI,
> method="rpart", trControl=tr.control, tuneGrid = cp.grid)
> >
> > tr_m
> >
> > CART
> > 491 samples
> > 15 predictor
> > No pre-processing
> > Resampling: Cross-Validated (10 fold)
> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> > Resampling results across tuning parameters:
> >   cp RMSE  Rsquared
> >   0.000  441.1524  0.5417064
> >   0.001  439.6319  0.5451104
> >   0.002  437.4039  0.5487203
> >   0.003  432.3675  0.551
> >   0.004  434.2138  0.5519964
> >   0.005  431.6635  0.551
> >   0.006  436.6163  0.5474135
> >   0.007  440.5473  0.5407240
> >   0.008  441.0876  0.5399614
> >   0.009  441.5715  0.5401718
> >   0.010  441.1401  0.5407121
> > RMSE was used to select the optimal model using  the smallest value.
> > The final value used for the model was cp = 0.005.
> >
> > #Fetching the best tree
> > best_tree <- tr_m$finalModel
> >
> > Alright, all the aforementioned commands worked fine.
> >
> > Except the subsequent command raises error, when the developed model is
> used to make predictions:
> > best_tree_pred <- predict(best_tree, newdata = testPFI)
> > Error in eval(expr, envir, enclos) : object 'sectorHospital

Re: [R] Mixture Discriminant Analysis and Penalized LDA

2016-01-25 Thread Max Kuhn
There is a function called `smda` in the sparseLDA package that implements
the model described in Clemmensen, L., Hastie, T., Witten, D. and Ersbøll,
B. Sparse discriminant analysis, Technometrics, 53(4): 406-413, 2011

Max

On Sun, Jan 24, 2016 at 10:45 PM, TJUN KIAT TEO 
wrote:

> Hi
>
> I noticed we have MDA and Mclust for Mixture Discriminant Analysis and
> Penalized LDA. Do we have an R package for Penalized MDA?
>
> Tjun Kiat
>
>
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] Caret - Recursive Feature Elimination Error

2015-12-23 Thread Max Kuhn
Providing a reproducible example and the results of `sessionInfo` will help
get your question answered.

Also, what is the point of using glmnet with RFE? It already does feature
selection.
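
For example, a rough sketch of tuning glmnet directly with train() and
letting its penalty do the selection (reusing the objects from your code
below):

  ctrl <- trainControl(method = "cv",
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)

  enet_fit <- train(x = x, y = y,
                    method = "glmnet",
                    metric = "ROC",
                    tuneGrid = expand.grid(alpha = 0, lambda = c(0.01, 0.02)),
                    trControl = ctrl)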

On Wed, Dec 23, 2015 at 1:48 AM, Manish MAHESHWARI  wrote:

> Hi,
>
> I am trying to use caret, for feature selection on glmnet. I get a strange
> error like below - "arguments imply differing number of rows: 2, 3".
>
>
> x <- data.matrix(train[,features])
>
> y <- train$quoteconversion_flag
>
>
>
> > str(x)
>
>  num [1:260753, 1:297] NA NA NA NA NA NA NA NA NA NA ...
>
>  - attr(*, "dimnames")=List of 2
>
>   ..$ : NULL
>
>   ..$ : chr [1:297] "original_quote_date" "field6" "field7" "field8" ...
>
> > str(y)
>
>  Factor w/ 2 levels "X0","X1": 1 1 1 1 1 1 1 1 1 1 ...
>
> > RFE <- rfe(x,y,sizes = seq(50,300,by=10),
> +metric = "ROC",maximize=TRUE,rfeControl = MyRFEcontrol,
> +method='glmnet',
> +tuneGrid = expand.grid(.alpha=0,.lambda=c(0.01,0.02)),
> +trControl = MyTrainControl)
> +(rfe) fit Resample01 size: 297
> +(rfe) fit Resample02 size: 297
> +(rfe) fit Resample03 size: 297
> +(rfe) fit Resample04 size: 297
> +(rfe) fit Resample05 size: 297
> +(rfe) fit Resample06 size: 297
> +(rfe) fit Resample07 size: 297
> +(rfe) fit Resample08 size: 297
> +(rfe) fit Resample09 size: 297
> +(rfe) fit Resample10 size: 297
> +(rfe) fit Resample11 size: 297
> +(rfe) fit Resample12 size: 297
> +(rfe) fit Resample13 size: 297
> +(rfe) fit Resample14 size: 297
> +(rfe) fit Resample15 size: 297
> +(rfe) fit Resample16 size: 297
> +(rfe) fit Resample17 size: 297
> +(rfe) fit Resample18 size: 297
> +(rfe) fit Resample19 size: 297
> +(rfe) fit Resample20 size: 297
> +(rfe) fit Resample21 size: 297
> +(rfe) fit Resample22 size: 297
> +(rfe) fit Resample23 size: 297
> +(rfe) fit Resample24 size: 297
> +(rfe) fit Resample25 size: 297
> Error in { :
>   task 1 failed - "task 1 failed - "arguments imply differing number of
> rows: 2, 3""
> In addition: There were 50 or more warnings (use warnings() to see the
> first 50)
>
> Any idea what does this mean?
>
> Thanks,
> Manish
>
> CONFIDENTIAL NOTE:
> The information contained in this email is intended on...{{dropped:13}}



Re: [R] Error in 'Contrasts<-' while using GBM.

2015-11-29 Thread Max Kuhn
Providing a reproducible example and the results of `sessionInfo` will help
get your question answered.

My only guess is that one or more of your predictors are factors and that
the in-sample data (used to build the model during resampling) have
different levels than the holdout samples.
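
An illustrative way to check for that (using your `training` data frame as a
placeholder):

  ## factors with fewer than two observed levels will break `contrasts<-`
  sapply(Filter(is.factor, training), nlevels)

  ## dropping unused levels after splitting the data can also help
  training <- droplevels(training)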

Max

On Sat, Nov 28, 2015 at 10:04 PM, Karteek Pradyumna Bulusu <
kartikpradyumn...@gmail.com> wrote:

> Hey,
>
> I was trying to implement Stochastic Gradient Boosting in R. Following is
> my code in rstudio:
>
>
>
> library(caret);
>
> library(gbm);
>
> library(plyr);
>
> library(survival);
>
> library(splines);
>
> library(mlbench);
>
> set.seed(35);
>
> stack = read.csv("E:/Semester 3/BDA/PROJECT/Sample_SO.csv", head
> =TRUE,sep=",");
>
> dim(stack); #displaying dimensions of the dataset
>
>
>
> #SPLITTING TRAINING AND TESTING SET
>
> totraining <- createDataPartition(stack$ID, p = .6, list = FALSE);
>
> training <- stack[ totraining,]
>
> test <- stack[-totraining,]
>
>
>
> #PARAMETER SETTING
>
> t_control <- trainControl(method = "cv", number = 10);
>
>
>
>
>
> # GLM
>
> start <- proc.time();
>
>
>
> glm = train(ID ~ ., data = training,
>
>  method = "gbm",
>
>  metric = "ROC",
>
>  trControl = t_control,
>
>  verbose = FALSE)
>
>
>
> When I am compiling last line, I am getting following error:
>
>
>
> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>
>   contrasts can be applied only to factors with 2 or more levels
>
>
>
>
>
> Can anyone tell me where I am going wrong and how to rectify it? I would be
> grateful.
>
>
>
> Thank you. Looking forward to it.
>
>
>
> Regards,
> Karteek Pradyumna Bulusu.
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


Re: [R] Ensure distribution of classes is the same as prior distribution in Cross Validation

2015-11-24 Thread Max Kuhn
Right now, using `method = "cv"` or `method = "repeatedcv"` does stratified
sampling. Depending on what you mean by "ensure" and the nature of your
outcome (categorical?), it probably already does.
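
If you want explicit control, a short sketch (object names are placeholders):
createFolds() stratifies on a factor outcome, and the folds can be handed to
trainControl() directly.

  folds <- createFolds(y, k = 10, returnTrain = TRUE)
  ctrl  <- trainControl(method = "cv", index = folds)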

On Mon, Nov 23, 2015 at 7:04 PM, TJUN KIAT TEO  wrote:

> In the caret trainControl function, is it possible to ensure that the
> distribution of classes in the folds of cross-validation is the same as the
> prior distribution? I know it can be done using createFolds, but I was
> wondering if it is possible using trainControl?
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Caret Internal Data Representation

2015-11-06 Thread Max Kuhn
Providing a reproducible example and the results of `sessionInfo` will help
get your question answered.  For example, did you use the formula or
non-formula interface to `train` and so on
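
If the question is about seeing the encoded predictors, a minimal sketch
(the data set name is a placeholder) is caret's dummyVars():

  dv  <- dummyVars(~ ., data = my_data)
  enc <- predict(dv, newdata = my_data)   # numeric matrix with the dummy columns
  head(enc)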

On Thu, Nov 5, 2015 at 1:10 PM, Bert Gunter  wrote:

> I am not familiar with caret/Cubist, but assuming they follow the
> usual R procedures that encode categorical factors for conditional
> fitting, you need to do some homework on your own by reading up on the
> use of contrasts in regression.
>
> See ?factor and ?contrasts (and other linked Help as necessary) to see
> what are R's usual procedures, but you will undoubtedly need to
> consult outside statistical references -- the help files will point
> you to some -- to fully understand what's going on. It is not trivial.
>
> Cheers,
> Bert
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>-- Clifford Stoll
>
>
> On Thu, Nov 5, 2015 at 9:38 AM, Lorenzo Isella 
> wrote:
> > Dear All,
> > I have a data set which contains both categorical and numerical
> > variables which I analyze using Cubist+the caret framework.
> > Now, from the generated rules, it is clear that cubist does something
> > to the categorical variables and probably uses some dummy coding for
> > them.
> > However, I cannot right now access the data the way it is transformed
> > by cubist.
> > If caret (or the package) need to do some dummy coding of the factors,
> > how can I access the newly encoded data set?
> > I suppose this applies to plenty of other packages.
> > Any suggestion is welcome.
> > Cheers
> >
> > Lorenzo
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Imbalanced random forest

2015-07-29 Thread Max Kuhn
This might help:

http://bit.ly/1MUP0Lj
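
For reference, one common caret approach (a sketch only -- it may or may not
be what the linked post covers, and the object names are placeholders) is to
down-sample the majority class within each resample:

  ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary,
                       sampling = "down")

  rf_fit <- train(Class ~ ., data = dat, method = "rf",
                  metric = "ROC", trControl = ctrl)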

On Wed, Jul 29, 2015 at 11:00 AM, jpara3 
wrote:

> How can I set up a study with random forest where the response is highly
> imbalanced?
>
>
>
> -
>
> Guided Tours Basque Country
>
> Guided tours in the three capitals of the Basque Country: Bilbao,
> Vitoria-Gasteiz and San Sebastian, as well as in their provinces. Available
> languages.
>
> Travel planners for groups and design of tourist routes across the Basque
> Country.
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Imbalanced-random-forest-tp4710524.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


Re: [R] what constitutes a 'complete sentence'?

2015-07-07 Thread Max Kuhn
On Tue, Jul 7, 2015 at 8:19 AM, John Fox  wrote:

> Dear Peter,
>
> You're correct that these examples aren't verb phrases (though the second
> one contains a verb phrase). I don't want to make the discussion even more
> pedantic (moving it in this direction was my fault), but "Paragraph" isn't
> quite right, unless explained, because conventionally a paragraph consists
> of sentences.
>
> How about something like this? "One can use several complete sentences or
> punctuated telegraphic phrases, but only one paragraph (that is, block of
> continuous text with no intervening blank lines). The description should
> end with a full stop (period)."
>
>
Before we start crafting better definitions of the rule, it seems important
to understand what issue we are trying to solve. I don't see any place
where this has been communicated. As I said previously, I usually give them
the benefit of the doubt. However, this requirement is poorly implemented
and we need to know more.

For example, does CRAN need to parse the text and the code failed because
there was no period? It seems plausible that someone could have worded that
requirement in the current form, but it is poorly written (which is
unusual).

If the goal is to improve the quality of the description text, then that is
a more difficult issue to define, and good luck coding your way to a lucid
and effective set of rules. It also seems a bit over the top to me and a
poor choice of where everyone should be spending their time.

What are we trying to fix?

> It would likely be helpful to add some examples of good and bad
> descriptions, and to explain how the check actually works.
>
> Best,
>  John
>
> On Tue, 7 Jul 2015 12:20:38 +0200
>  peter dalgaard  wrote:
> > ...except that there is not necessarily a verb either. What we're
> looking for is something like "advertisement style" as in
> >
> > UGLY MUGS 7.95.
> >
> > An invaluable addition to your display cabinet. Comes in an assortment
> of warts and wrinkles, crafted by professional artist Foo Yung.
> >
> > However, I'm drawing blanks when searching for an established term for
> it.
> >
> > Could we perhaps sidestep the issue by requesting a "single descriptive
> paragraph, with punctuation" or thereabouts?
> >
> > 
> >
> > I'm still puzzled about what threw Federico's example in the first
> place. The actual code is
> >
> > if(strict && !is.na(val <- db["Description"])
> >&& !grepl("[.!?]['\")]?$", trimws(val)))
> > out$bad_Description <- TRUE
> >
> > and  I can do this
> >
> > > strict <- TRUE
> > > db <- tools:::.read_description("/tmp/dd")
> > >if(strict && !is.na(val <- db["Description"])
> > +&& !grepl("[.!?]['\")]?$", trimws(val)))
> > + out$bad_Description <- TRUE
> > > out
> > Error: object 'out' not found
> >
> > I.e., the complaint should _not_ be triggered. I suppose that something
> like a non-breakable space at the end could confuse trimws(), but beyond
> that I'm out of ideas.
> >
> >
> > On 07 Jul 2015, at 03:28 , John Fox  wrote:
> >
> > > Dear Peter,
> > >
> > > I think that the grammatical term you're looking for is "verb phrase."
> > >
> > > Best,
> > > John
> > >
> > > On Tue, 7 Jul 2015 00:12:25 +0200
> > > peter dalgaard  wrote:
> > >>
> > >>> On 06 Jul 2015, at 23:19 , Duncan Murdoch 
> wrote:
> > >>>
> > >>> On 06/07/2015 5:09 PM, Rolf Turner wrote:
> >  On 07/07/15 07:10, William Dunlap wrote:
> > 
> >  [Rolf Turner wrote.]
> > 
> > >> The CRAN guidelines should be rewritten so that they say what
> they *mean*.
> > >> If a complete sentence is not actually required --- and it seems
> abundantly clear
> > >> that it is not --- then guidelines should not say so.  Rather
> they should say,
> > >> clearly and comprehensibly, what actually *is* required.
> > >
> > > This may be true, but also think of the user when you write the
> description.
> > > If you are scanning a long list of descriptions looking for a
> package to
> > > use,
> > > seeing a description that starts with 'A package for' just slows
> you down.
> > > Seeing a description that includes 'designed to' leaves you
> wondering if the
> > > implementation is woefully incomplete.  You want to go beyond what
> CRAN
> > > can test for.
> > 
> >  All very true and sound and wise, but what has this got to do with
> >  complete sentences?  The package checker issues a message saying
> that it
> >  wants a complete sentence when this has nothing to do with what it
> >  *really* wants.
> > >>>
> > >>> That's false.  If you haven't given a complete sentence, you might
> still
> > >>> pass, but if you have, you will pass.  That's not "nothing to do"
> with
> > >>> what it really wants, it's just an imperfect test that fails to
> detect
> > >>> violations of the guidelines.
> > >>>
> > >>> As we've seen, it sometimes also makes mistakes in the other
> direction.
> > >>> I'd say those are more serious.
> > >>>
>

Re: [R] Caret and custom summary function

2015-05-11 Thread Max Kuhn
The version of caret just put on CRAN has a function called mnLogLoss that
does this.
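
A minimal sketch of how it plugs in (reusing the objects defined in your
code below; as far as I recall, the reported metric is named "logLoss"):

  tc <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     classProbs = TRUE, summaryFunction = mnLogLoss)

  model <- train(donation ~ ., data = train, method = "C5.0",
                 metric = "logLoss", maximize = FALSE,
                 trControl = tc, tuneGrid = c50Grid)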

Max

On Mon, May 11, 2015 at 11:17 AM, Lorenzo Isella 
wrote:

> Dear All,
> I am trying to implement my own metric (a log loss metric) for a
> binary classification problem in Caret.
> I must be making some mistake, because I cannot get anything sensible
> out of it.
> I paste below a numerical example which should run in more or less one
> minute on any laptop.
> When I run it, I finally have an output of the kind
>
>
>
>
> Aggregating results
> Something is wrong; all the LogLoss metric values are missing:
>     LogLoss
>  Min.   : NA
>  1st Qu.: NA
>  Median : NA
>  Mean   :NaN
>  3rd Qu.: NA
>  Max.   : NA
>  NA's   :40
>Error in train.default(x, y, weights = w, ...) : Stopping
>In addition: Warning message:
>In nominalTrainWorkflow(x = x, y = y, wts = weights, info =
>trainInfo,  :
>  There were missing values in resampled performance
>  measures.
>
>
>
>
> Any suggestion is appreciated.
> Many thanks
>
> Lorenzo
>
>
>
>
>
>
> library(caret)
> library(C50)
>
>
> LogLoss <- function(data, lev = NULL, model = NULL)
> {
>   probs     <- pmax(pmin(as.numeric(data$T), 1 - 1e-15), 1e-15)
>   logPreds  <- log(probs)
>   log1Preds <- log(1 - probs)
>   real      <- (as.numeric(data$obs) - 1)
>   out       <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
>   names(out) <- c("LogLoss")
>   out
> }
>
>
>
>
>
>
> train <- matrix(ncol=5,nrow=200,NA)
>
> train <- as.data.frame(train)
> names(train) <- c("donation", "x1","x2","x3","x4")
>
> set.seed(134)
>
> sel <- sample(nrow(train), 0.5*nrow(train))
>
>
> train$donation[sel] <- "yes"
> train$donation[-sel] <- "no"
>
> train$x1 <- seq(nrow(train))
> train$x2 <- rnorm(nrow(train))
> train$x3 <- 1/train$x1
> train$x4 <- sample(nrow(train))
>
> train$donation <- as.factor(train$donation)
>
> c50Grid <- expand.grid(trials = 1:10,
>                        model  = c("tree", "rules"),
>                        winnow = c(TRUE, FALSE))
>
>
>
>
>
> tc <- trainControl(method = "repeatedCV", summaryFunction=LogLoss,
>   number = 10, repeats = 10, verboseIter=TRUE,
>   classProbs=TRUE)
>
>
> model <- train(donation~., data=train, method="C5.0", trControl=tc,
>   metric="LogLoss", maximize=FALSE, tuneGrid=c50Grid)
>
>
>


Re: [R] Repeated failures to install "caret" package (of Max Kuhn)

2015-04-04 Thread Max Kuhn
I thought that this might be relevant:

https://stackoverflow.com/questions/28985759/cant-install-the-caret-package-in-r-in-my-linux-machine

but it seems that you installed nloptr.

I would also suggest doing the install in base R and trying a different
mirror. I would avoid installing via RStudio unless you have just started a
new R session.
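
For example (a sketch, not a guaranteed fix), from a fresh R session started
outside RStudio and with an explicit mirror:

  install.packages("caret",
                   dependencies = TRUE,
                   repos = "https://cloud.r-project.org")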



On Sat, Apr 4, 2015 at 11:11 AM, John Kane  wrote:

> Try installing from somewhere outside of RStudio or reboot and retry in
> RStudio.  I find that if RStudio is open for a long time I occasionally get
> some weird (buggy?) results but I cannot reproduce them to send in a bug report.
>
> Load R  and from the command line or Windows RGui try installing.  As a
> test I just installed it successfully with the command
> "install.packages("caret")" executed in R (using gedit with its
> R-plug-in) and running Ubuntu 14.04
>
>
> For future reference:
> Reproducibility
> https://github.com/hadley/devtools/wiki/Reproducibility
>
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>
>
>
>
>
> John Kane
> Kingston ON Canada
>
>
> > -Original Message-
> > From: wyl...@ischool.utexas.edu
> > Sent: Fri, 03 Apr 2015 16:07:57 -0500
> > To: r-help@r-project.org
> > Subject: [R] Repeated failures to install "caret" package (of Max Kuhn)
> >
> > For an edx course, MIT's "The Analtics Edge", I need to install the
> > "caret" package that was originated and is maintained by Dr. Max Kuhn of
> > Pfizer. So far, every effort I've made to try to
> > install.packages("caret") has failed.  (I'm using R v. 3.1.3 and RStudio
> > v. 0.98.1103 in LinuxMint 17.1)
> >
> > Here are some of the things I've tried unsuccessfully:
> > install.packages("caret", repos=c("http://rstudio.org/_packages";,
> > "http://cran.rstudio.com";))
> > install.packages("caret", dependencies=TRUE)
> > install.packages("caret", repos=c("http://rstudio.org/_packages";,
> > "http://cran.rstudio.com";), dependencies=TRUE)
> > install.packages("caret", dependencies = c("Depends", "Suggests"))
> > install.packages("caret", repos="http://cran.rstudio.com/";)
> >
> > I've changed my CRAN mirror from UCLA to Revolution Analytics in Dallas,
> > and tried the above installs again, unsuccessfully.
> >
> > I've succeeded in individually installing a number of packages on which
> > "caret" appears to be dependent.  Specifically, I've been able to
> > install  "nloptr", "minqa", "Rcpp", "reshape2", "stringr", and
> > "scales".  But I've had no success with trying to do individual installs
> > of "BradleyTerry2", "car", "lme4", "quantreg", and "RcppEigen".
> >
> > Any suggestions will be very gratefully received (and tried out quickly).
> >
> > Thanks in advance.
> >
> > Ron Wyllys
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> 
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] #library("CHAID") - Cross validation for chaid

2015-01-05 Thread Max Kuhn
You can create your own:

   http://topepo.github.io/caret/custom_models.html

I put a prototype together. Source this file:

   https://github.com/topepo/caret/blob/master/models/files/chaid.R

then try this:

library("CHAID")

### fit tree to subsample
set.seed(290875)
USvoteS <- USvote[sample(1:nrow(USvote), 1000),]


## You probably don't want to use `train.formula` as
## it will convert the factors to dummy variables
mod <- train(x = USvoteS[,-1], y = USvoteS$vote3,
 method = modelInfo,
 trControl = trainControl(method = "cv"))

Max

On Mon, Jan 5, 2015 at 7:11 AM, Rodica Coderie via R-help
 wrote:
> Hello,
>
> Is there an option of cross validation for CHAID decision tree? An example of 
> CHAID is below:
> library("CHAID")
> example("chaid", package = "CHAID")
>
> How can I use a 10 fold cross-validation for CHAID?
> I've read that the caret package can cross-validate many types of models,
> but CHAID is not in caret's built-in model library.
>
> library(caret)
> model <- train(vote3 ~., data = USvoteS, method='CHAID', 
> tuneLength=10,trControl=trainControl(method='cv', number=10, classProbs=TRUE, 
> summaryFunction=twoClassSummary))
>
> Thanks,
> Rodica
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [R] Help with caret, please

2014-10-11 Thread Max Kuhn
What you are asking is a bad idea on multiple levels. You will grossly
over-estimate the area under the ROC curve. Consider the 1-NN model: you
will have perfect predictions every time.

To do this, you will need to run train again and modify the index and
indexOut objects:

library(caret)

  set.seed(1)
  dat <- twoClassSim(200)

  set.seed(2)
  folds <- createFolds(dat$Class, returnTrain = TRUE)

  Control <- trainControl(method="cv",
  summaryFunction=twoClassSummary,
  classProb=T,
  index = folds,
  indexOut = folds)

  tGrid=data.frame(k=1:100)

  set.seed(3)
  a_bad_idea <- train(Class ~ ., data=dat,
  method = "knn",
  tuneGrid=tGrid,
  trControl=Control, metric =  "ROC")

Max

On Sat, Oct 11, 2014 at 7:58 PM, Iván Vallés Pérez <
ivanvallespe...@gmail.com> wrote:

> Hello,
>
> I am using caret package in order to train a K-Nearest Neigbors algorithm.
> For this, I am running this code:
>
> Control <- trainControl(method="cv", summaryFunction=twoClassSummary,
> classProb=T)
>
> tGrid=data.frame(k=1:100)
>
> trainingInfo <- train(Formula, data=trainData, method =
> "knn",tuneGrid=tGrid,
>   trControl=Control, metric =  "ROC")
> As you can see, I am interested in obtaining the AUC parameter of the ROC.
> This code works well but returns the testing error (which the algorithm
> uses for tuning the k parameter of the model) as the mean of the error
> across the cross-validation folds. In addition to the testing error, I would
> like it to return the training error (the mean across folds of the error
> obtained on the training data). How can I do it?
>
> Thank you
> [[alternative HTML version deleted]]
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>



Re: [R] Training a model using glm

2014-09-17 Thread Max Kuhn
You have not shown all of your code and it is difficult to diagnose the
issue.

I assume that you are using the data from:

   library(AppliedPredictiveModeling)
   data(AlzheimerDisease)

If so, there is example code to analyze these data in that package. See
?scriptLocation.

We have no idea how you got to the `training` object (package versions
would be nice too).

I suspect that Dennis is correct. Try using more normal syntax without the
$ indexing in the formula. I wouldn't say it is (absolutely) wrong but it
doesn't look right either.
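
A sketch of the more conventional call (using the objects from the question;
predict.train returns the predicted classes by default, so no type argument
is needed for confusionMatrix):

  modelFit <- train(diagnosis ~ ., data = training1, method = "glm")
  confusionMatrix(test1$diagnosis, predict(modelFit, newdata = test1))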

Max


On Wed, Sep 17, 2014 at 2:04 PM, Mohan Radhakrishnan <
radhakrishnan.mo...@gmail.com> wrote:

> Hi Dennis,
>
>  Why is there that warning? I think my syntax is
> right, isn't it? So the warning can be ignored?
>
> Thanks,
> Mohan
>
> On Wed, Sep 17, 2014 at 9:48 PM, Dennis Murphy  wrote:
>
> > No reproducible example (i.e., no data) supplied, but the following
> > should work in general, so I'm presuming this maps to the caret
> > package as well. Thoroughly untested.
> >
> > library(caret)# something you failed to mention
> >
> > ...
> > modelFit <- train(diagnosis ~ ., data = training1)# presumably a
> > logistic regression
> > confusionMatrix(test1$diagnosis, predict(modelFit, newdata = test1,
> > type = "response"))
> >
> > For GLMs, there are several types of possible predictions. The default
> > is 'link', which associates with the linear predictor. caret may have
> > a different syntax so you should check its help pages re the supported
> > predict methods.
> >
> > Hint: If a function takes a data = argument, you don't need to specify
> > the variables as components of the data frame - the variable names are
> > sufficient. You should also do some reading to understand why the
> > model formula I used is correct if you're modeling one variable as
> > response and all others in the data frame as covariates.
> >
> > Dennis
> >
> > On Tue, Sep 16, 2014 at 11:15 PM, Mohan Radhakrishnan
> >  wrote:
> > > I answered this question which was part of the online course correctly
> by
> > > executing some commands and guessing.
> > >
> > > But I didn't get the gist of this approach though my R code works.
> > >
> > > I have a training and test dataset.
> > >
> > >> nrow(training)
> > >
> > > [1] 251
> > >
> > >> nrow(testing)
> > >
> > > [1] 82
> > >
> > >> head(training1)
> > >
> > >    diagnosis    IL_11    IL_13    IL_16   IL_17E IL_1alpha      IL_3     IL_4
> > > 6   Impaired 6.103215 1.282549 2.671032 3.637051 -8.180721 -3.863233 1.208960
> > > 10  Impaired 4.593226 1.269463 3.476091 3.637051 -7.369791 -4.017384 1.808289
> > > 11  Impaired 6.919778 1.274133 2.154845 4.749337 -7.849364 -4.509860 1.568616
> > > 12  Impaired 3.218759 1.286356 3.593860 3.867347 -8.047190 -3.575551 1.916923
> > > 13  Impaired 4.102821 1.274133 2.876338 5.731246 -7.849364 -4.509860 1.808289
> > > 16  Impaired 4.360856 1.278484 2.776394 5.170380 -7.662778 -4.017384 1.547563
> > >          IL_5       IL_6 IL_6_Receptor     IL_7     IL_8
> > > 6  -0.4004776  0.1856864   -0.51727788 2.776394 1.708270
> > > 10  0.1823216 -1.5342758    0.09668586 2.154845 1.701858
> > > 11  0.1823216 -1.0965412    0.35404039 2.924466 1.719944
> > > 12  0.3364722 -0.3987186    0.09668586 2.924466 1.675557
> > > 13  0.0000000  0.4223589   -0.53219115 1.564217 1.691393
> > > 16  0.2623643  0.4223589    0.18739989 1.269636 1.705116
> > >
> > > The testing dataset is similar with 13 columns. Number of rows vary.
> > >
> > >
> > > training1 <- training[,grepl("^IL|^diagnosis",names(training))]
> > >
> > > test1 <- testing[,grepl("^IL|^diagnosis",names(testing))]
> > >
> > > modelFit <- train(training1$diagnosis ~ training1$IL_11 +
> > training1$IL_13 +
> > > training1$IL_16 + training1$IL_17E + training1$IL_1alpha +
> > training1$IL_3 +
> > > training1$IL_4 + training1$IL_5 + training1$IL_6 +
> > training1$IL_6_Receptor
> > > + training1$IL_7 + training1$IL_8,method="glm",data=training1)
> > >
> > > confusionMatrix(test1$diagnosis,predict(modelFit, test1))
> > >
> > > I get this error when I run the above command to get the confusion
> > matrix.
> > >
> > > *'newdata' had 82 rows but variables found have 251 rows '*
> > >
> > > I thought this was simple. I train a model using the training dataset
> and
> > > predict using the test dataset and get the accuracy.
> > >
> > > Am I missing the obvious here ?
> > >
> > > Thanks,
> > >
> > > Mohan
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > __
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
>
> [[alternative HTML version deleted]]
>
> _

Re: [R] Use of library(X) in the code of library X.

2014-06-06 Thread Max Kuhn
That is legacy code but there was a good reason back then.

caret is written to use parallel processing via the foreach package.
There were some cases where the worker processes did not load the
required packages (even when I used foreach's ".packages" argument) so
I would do it explicitly. I don't recall which parallel backend had
the issue.
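
For context, an illustrative sketch of the foreach idiom in question (the
cluster setup here is just an example, not code from caret):

  library(doParallel)

  cl <- makeCluster(2)
  registerDoParallel(cl)

  ## .packages asks each worker to load caret before evaluating the body
  res <- foreach(i = 1:2, .packages = "caret") %dopar% {
    exists("train")
  }

  stopCluster(cl)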

The more important lesson is that if you want to "understand some R
code written by others" you'll learn more bad habits than good ones if
you examine my packages…

Max

On Fri, Jun 6, 2014 at 2:42 PM, Duncan Murdoch  wrote:
> On 06/06/2014 10:26 AM, Bart Kastermans wrote:
>>
>> To improve my R skills I try to understand some R code written by others.
>> Mostly
>> I am looking at the code of packages I use.  Today I looked at the code
>> for the
>> caret package
>>
>> http://cran.r-project.org/src/contrib/caret_6.0-30.tar.gz
>>
>> in particular at the file R/adaptive.R
>>
>> This file starts with:
>>
>> adaptiveWorkflow <- function(x, y, wts, info, method, ppOpts, ctrl, lev,
>>                              metric, maximize, testing = FALSE, ...) {
>>   library(caret)
>>   loadNamespace("caret")
>>
>>  From ?library and googling I can’t figure out what this code would do.
>>
>> Why would you call library(caret) in the caret package?
>
>
> I don't know that package, and since adaptiveWorkflow is not documented at
> the user level, I can't tell exactly what the author had in mind.  However,
> code like that could be present for debugging purposes (and is
> unintentionally present in the CRAN copy), or could be intentional.  The
> library(caret) call has the effect of ensuring that the package is on the
> search list.  (It might have been loaded invisibly by another package.)
> This is generally considered to be bad form nowadays; packages should
> function properly without being on the search list.
>
> I can't think of a situation where loadNamespace() would do anything --- it
> would have been called by library().
>
> Duncan Murdoch
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [R] cforest sampling methods

2014-03-19 Thread Max Kuhn
You might look at the 'bag' function in the caret package. It will not
do the subsampling of variables at each split but you can bag a tree
and down-sample the data at each iteration. The help page has an
example of bagging ctree (although you might want to play with the tree
depth a little).
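
A rough sketch of that idea (the argument names are from memory of the bag()
and bagControl() help pages, so treat it as a starting point rather than
working code; the data objects come from your example):

  library(caret)
  library(party)

  set.seed(1)
  fit <- bag(x = train[, -1], y = train$y,
             B = 100,                        # number of bagged trees
             bagControl = bagControl(fit        = ctreeBag$fit,
                                     predict    = ctreeBag$pred,
                                     aggregate  = ctreeBag$aggregate,
                                     downSample = TRUE))  # balance classes per bag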

Max

On Wed, Mar 19, 2014 at 3:32 PM, Maggie Makar  wrote:
> Hi all,
>
> I've been using the randomForest package and I'm trying to make the switch
> over to party. My problem is that I have an extremely unbalanced outcome
> (only 1% of the data has a positive outcome) which makes resampling methods
> necessary.
>
> randomForest has a very useful argument that is sampsize which allows me to
> use a balanced subsample to build each tree in my forest. lets say the
> number of positive cases is 100, my forest would look something like this:
>
> rf<-randomForest(y~. ,data=train, ntree=800,replace=TRUE,sampsize = c(100,
> 100))
>
> so I use 100 cases and 100 controls to build each individual tree. Can I do
> the same for cforests? I know I can always upsample but I'd rather not.
>
> I've tried playing around with the weights argument but I'm either not
> getting it right or it's just the wrong thing to use.
>
> Any advice on how to adapt cforests to datasets with imbalanced outcomes is
> greatly appreciated...
>
>
>
> Thanks!
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [R] how is the model resample performance calculated by caret?

2014-02-28 Thread Max Kuhn
On Fri, Feb 28, 2014 at 1:13 AM, zhenjiang zech xu
 wrote:
> Dear all,
>
> I did a 5-repeat of 10-fold cross validation using partial least square
> regression model provided by caret package. Can anyone tell me how are the
> values in plsTune$resample calculated? Is that predicted on each hold-out
> set using the model which is trained on the rest data with the optimized
> parameter tuned from previous cross validation?

Yes, those values are the performance estimates across each hold-out
using the final model. There is an option in trainControl() that will
have it return the resamples from all models too.
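
For example (a sketch): returnResamp = "all" keeps the resampling estimates
for every candidate value of ncomp rather than only the final one.

  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                       returnResamp = "all")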

> So in the following
> example, firstly, 5-repeat of 10-fold cross validation gives 2 for ncomp as
> the best, and then using ncomp of 2 and the training data to build a model
> and then predict the hold-out data with the model to give a RMSE and
> RSQUARE - is what I am thinking true?

It is.

Max

>
>
>> plsTune
> 524 samples
> 615 predictors
>
> Pre-processing: centered, scaled
> Resampling: Cross-Validation (10 fold, repeated 5 times)
>
> Summary of sample sizes: 472, 472, 471, 471, 471, 471, ...
>
> Resampling results across tuning parameters:
>
>   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
>    1     16.8   0.434     1.47     0.0616
>    2     14.3   0.612     2.21     0.0768
>    3     13.5   0.704     6.33     0.145
>    4     14.6   0.706     9.29     0.163
>    5     15.2   0.703    10.9      0.172
>    6     16.5   0.69     13.4      0.181
>    7     18.4   0.672    17.8      0.194
>    8     20     0.651    20.4      0.199
>    9     20.9   0.634    20.9      0.199
>   10     22.1   0.613    22.1      0.197
>   11     23.3   0.599    23.8      0.198
>   12     24     0.588    24.7      0.198
>   13     24.9   0.572    25.2      0.197
>   14     25.8   0.557    26.2      0.194
>   15     26.2   0.544    25.8      0.191
>   16     26.6   0.532    25.5      0.187
>
> RMSE was used to select the optimal model using  the one SE rule.
> The final value used for the model was ncomp = 2.
>>
>> plsTune$resample
>    ncomp     RMSE  Rsquared    Resample
> 1  2 13.61569 0.6349700 Fold06.Rep4
> 2  2 16.02091 0.5808985 Fold05.Rep1
> 3  2 12.59985 0.6008357 Fold03.Rep5
> 4  2 13.20069 0.6296245 Fold02.Rep3
> 5  2 12.43419 0.6560434 Fold04.Rep2
> 6  2 15.36510 0.5954177 Fold04.Rep5
> 7  2 12.70028 0.6894489 Fold03.Rep2
> 8  2 13.34882 0.6468300 Fold09.Rep3
> 9  2 14.80217 0.5575010 Fold08.Rep3
> 10 2 19.03705 0.4907630 Fold05.Rep4
> 11 2 14.26704 0.6579390 Fold10.Rep2
> 12 2 13.79060 0.5806663 Fold05.Rep3
> 13 2 14.83641 0.5918039 Fold05.Rep2
> 14 2 12.48721 0.7011439 Fold01.Rep3
> 15 2 14.98765 0.5866102 Fold07.Rep4
> 16 2 10.88100 0.7597167 Fold06.Rep1
> 17 2 13.60705 0.6321377 Fold08.Rep5
> 18 2 13.42618 0.6136031 Fold08.Rep4
> 19 2 13.26066 0.6784586 Fold07.Rep1
> 20 2 13.20623 0.6812341 Fold03.Rep3
> 21 2 18.54275 0.4404729 Fold08.Rep2
> 22 2 11.80312 0.7177681 Fold05.Rep5
> 23 2 18.56271 0.4661072 Fold03.Rep1
> 24 2 13.54879 0.5850439 Fold10.Rep3
> 25 2 14.10859 0.5994811 Fold06.Rep5
> 26 2 13.68329 0.6701091 Fold01.Rep5
> 27 2 16.12123 0.5401200 Fold10.Rep1
> 28 2 12.92250 0.6917220 Fold06.Rep3
> 29 2 12.94366 0.6400066 Fold06.Rep2
> 30 2 12.39889 0.6790578 Fold01.Rep2
> 31 2 13.48499 0.6759649 Fold01.Rep1
> 32 2 12.52938 0.6728476 Fold03.Rep4
> 33 2 16.43352 0.5795160 Fold09.Rep5
> 34 2 12.53991 0.6550694 Fold09.Rep4
> 35 2 12.78708 0.6304606 Fold08.Rep1
> 36 2 13.97559 0.6655688 Fold04.Rep3
> 37 2 15.31642 0.5124997 Fold09.Rep2
> 38 2 15.24194 0.5324943 Fold09.Rep1
> 39 2 12.90107 0.6318960 Fold04.Rep1
> 40 2 13.59574 0.6277869 Fold01.Rep4
> 41 2 19.73633 0.4154821 Fold07.Rep5
> 42 2 12.03759 0.6537381 Fold02.Rep5
> 43 2 15.47139 0.5597097 Fold02.Rep4
> 44 2 22.55060 0.3816672 Fold07.Rep3
> 45 2 14.57875 0.6269560 Fold07.Rep2
> 46 2 13.02385 0.6395148 Fold02.Rep2
> 47 2 13.81020 0.6116137 Fold02.Rep1
> 48 2 13.46100 0.6200828 Fold04.Rep4
> 49 2 13.95487 0.6709253 Fold10.Rep5
> 50 2 12.65981 0.6606435 Fold10.Rep4
>
> Best,
> Zhenjiang
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [R] boxcox alternative

2014-02-24 Thread Max Kuhn
Michael,

On Mon, Feb 24, 2014 at 5:51 AM, Michael Haenlein
 wrote:
>
> Dear all,
>
> I am working with a set of variables that are very non-normally
> distributed. To improve the performance of my model, I'm currently applying
> a boxcox transformation to them. While this improves things, the
> performance is still not great.
>

Are these predictors that you are transforming?

> So my question: Are there any alternatives to boxcox in R? I would need a
> model that estimates the "best" transformation automatically without input
> from the user since my approach should be flexible enough to deal with any
> kind of distribution. boxcox allows me to do this by picking the lambda
> that leads to the "best fit" but I wonder whether there are other options
> out there.
>

If they are predictors, caret has a function called 'preProcess' that
might interest you. See:

   http://caret.r-forge.r-project.org/preprocess.html#trans
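
A short sketch of that usage ('predictors' is a placeholder data frame); the
Yeo-Johnson option also handles zero or negative values, unlike Box-Cox:

  pp <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))
  transformed <- predict(pp, newdata = predictors)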

Max



Re: [R] Predictor Importance in Random Forests and bootstrap

2014-01-28 Thread Max Kuhn
I think that the fundamental problem is that you are using the default
value of ntree (500). You should always use at least 1500 and more if n or
p are large.

Also, this link will give you more up-to-date information on that package
and feature selection:

http://caret.r-forge.r-project.org/featureSelection.html

Max


On Tue, Jan 28, 2014 at 5:32 PM, Dimitri Liakhovitski <
dimitri.liakhovit...@gmail.com> wrote:

> Here is a great response I got from SO:
>
> There is an important difference between the two importance measures:
> MeanDecreaseAccuracy is calculated using out of bag (OOB) data,
> MeanDecreaseGini is not. For each tree MeanDecreaseAccuracy is calculated
> on observations not used to form that particular tree. In contrast,
> MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are.
> It is calculated using the same data used to fit trees.
>
> When you bootstrap data, you are creating multiple copies of the same
> observations. Therefore the same observation can be split into two copies,
> one to form a tree, and one treated as OOB and used to calculate accuracy
> measures. Therefore, data that randomForest thinks is OOB for
> MeanDecreaseAccuracy is not necessarily truly OOB in your bootstrap sample,
> making the estimate of MeanDecreaseAccuracy overly optimistic in the
> bootstrap iterations. Gini index is immune to this, because it is not
> relying on evaluating importance on observations different from those used
> to fit the data.
>
> I suspect what you are trying to do is use the bootstrap to generate
> inference (p-values/confidence intervals) indicating which variables are
> "important" in the sense that they are actually predictive of your outcome.
> The bootstrap is not appropriate in this context, because Random Forests
> expects that OOB data is truly OOB and this is important for building the
> forest in the first place. In general, bootstrap is not universally
> applicable, and is only useful in cases where it can be shown that the
> parameter you're estimating has nice asymptotic properties and is not
> sensitive to "ties" in the data. A procedure like Random Forest which
> relies on the availability of OOB data is necessarily sensitive to ties.
>
> You may want to look at the caret package in R, which uses random forest
> (or one of a set of many other algorithms) inside a cross-validation loop
> to determine which variables are consistently important. See:
>
>
>
>
> http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf
>
>
> On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski <
> dimitri.liakhovit...@gmail.com> wrote:
>
> > Thank you, Bert. I'll definitely ask there.
> > In the meantime I just wanted to ensure that my R code (my function for
> > bootstrap and the bootstrap run) is correct and my abnormal bootstrap
> > results are not a function of my erroneous code.
> > Thank you!
> >
> >
> >
> > On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter  >wrote:
> >
> >> I **think** this kind of methodological issue might be better at SO
> >> (stats.stackexchange.com).  It's not really about R programming, which
> >> is the main focus of this list. And yes, I know they do intersect.
> >> Nevertheless...
> >>
> >> Cheers,
> >> Bert
> >>
> >> Bert Gunter
> >> Genentech Nonclinical Biostatistics
> >> (650) 467-7374
> >>
> >> "Data is not information. Information is not knowledge. And knowledge
> >> is certainly not wisdom."
> >> H. Gilbert Welch
> >>
> >>
> >>
> >>
> >> On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski
> >>  wrote:
> >> > Hello!
> >> > Below, I:
> >> > 1. Create a data set with a bunch of factors. All of them are
> predictors
> >> > and 'y' is the dependent variable.
> >> > 2. I run a classification Random Forests run with predictor
> importance.
> >> I
> >> > look at 2 measures of importance - MeanDecreaseAccuracy and
> >> MeanDecreaseGini
> >> > 3. I run 2 boostrap runs for 2 Random Forests measures of importance
> >> > mentioned above.
> >> >
> >> > Question: Could anyone please explain why I am getting such a huge
> >> positive
> >> > bias across the board (for all predictors) for MeanDecreaseAccuracy?
> >> >
> >> > Thanks a lot!
> >> > Dimitri
> >> >
> >> >
> >> > #
> >> > # Creating a a data set:
> >> > #-
> >> >
> >> > N<-1000
> >> > myset1<-c(1,2,3,4,5)
> >> > probs1a<-c(.05,.10,.15,.40,.30)
> >> > probs1b<-c(.05,.15,.10,.30,.40)
> >> > probs1c<-c(.05,.05,.10,.15,.65)
> >> > myset2<-c(1,2,3,4,5,6,7)
> >> > probs2a<-c(.02,.03,.10,.15,.20,.30,.20)
> >> > probs2b<-c(.02,.03,.10,.15,.20,.20,.30)
> >> > probs2c<-c(.02,.03,.10,.10,.10,.25,.40)
> >> > myset.y<-c(1,2)
> >> > probs.y<-c(.65,.30)
> >> >
> >> > set.seed(1)
> >> > y<-as.factor(sample(myset.y,N,replace=TRUE,probs.y))
> >> > set.seed(2)
> >> > a<-as.factor(sample(myset1, N, replace = TRUE,probs1a))
> >> > set.seed(3)
> >> > b<-as.factor(sampl

Re: [R] R crashes with memory errors on a 256GB machine (and system shows only 60GB usage)

2014-01-02 Thread Max Kuhn
Describing the problem would help a lot more. For example, if you were
using some of the parallel processing options in R, this can make extra
copies of objects and drive memory usage up very quickly.

Max


On Thu, Jan 2, 2014 at 3:35 PM, Ben Bolker  wrote:

> Xebar Saram  gmail.com> writes:
>
> >
> > Hi All,
> >
> > I have a terrible issue i cant seem to debug which is halting my work
> > completely. I have R 3.02 installed on a linux machine (arch
> linux-latest)
> > which I built specifically for running high memory use models. the system
> > is a 16 core, 256 GB RAM machine. it worked well at the start but in the
> > recent days i keep getting errors and crashes regarding memory use, such
> as
> > "cannot create vector size of XXX, not enough memory" etc
> >
> > when looking at top (linux system monitor) i see i barley scrape the 60
> GB
> > of ram (out of 256GB)
> >
> > i really don't know how to debug this and my whole work is halted due to
> > this so any help would be greatly appreciated
>
>   I'm very sympathetic, but it will be almost impossible to debug
> this sort of a problem remotely, without a reproducible example.
> The only guess that I can make, if you *really* are running *exactly*
> the same code as you previously ran successfully, is that you might
> have some very large objects hidden away in a saved workspace in a
> .RData file that's being loaded automatically ...
>
>   I would check whether gc(), memory.profile(), etc. give sensible results
> in a clean R session (R --vanilla).
>
>   Ben Bolker
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Variable importance - ANN

2013-12-04 Thread Max Kuhn
If you are using the nnet package, the caret package has a variable
importance method based on Gevrey, M., Dimopoulos, I., & Lek, S. (2003).
Review and comparison of methods to study the contribution of variables in
artificial neural network models. Ecological Modelling, 160(3), 249-264. It
is based on the estimated weights.
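
A small sketch (iris stands in for real data; varImp() uses the weight-based
method above when the underlying fit comes from nnet):

    library(caret)
    set.seed(1)
    fit <- train(Species ~ ., data = iris, method = "nnet",
                 trace = FALSE, tuneLength = 2)
    varImp(fit)   # per-predictor importance from the Gevrey et al. calculation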

Max


On Wed, Dec 4, 2013 at 6:41 AM, Giulia Di Lauro wrote:

> Hi everybody,
> I created a neural network for a regression analysis with package ANN, but
> now I need to know which is the significance of each predictor variable in
> explaining the dependent variable. I thought to analyze the weight, but I
> don't know how to do it.
>
> Thanks in advance,
> Giulia Di Lauro.
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Inconsistent results between caret+kernlab versions

2013-11-17 Thread Max Kuhn
Andrew,

> What I still don't quite understand is which accuracy values from train() I 
> should trust: those using classProbs=T or classProbs=F?

It depends on whether you need the class probabilities and class
predictions to match (which they would if classProbs = TRUE).

Another option is to use a model where this discrepancy does not exist.

>  train often crashes with 'memory map' errors!)?

I've never seen that. You should describe it more.

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Inconsistent results between caret+kernlab versions

2013-11-15 Thread Max Kuhn
Or not!

The issue is with kernlab.

Background: SVM models do not naturally produce class probabilities. A
secondary model (via Platt) is fit to the raw model output and a
logistic function is used to translate the raw SVM output to
probability-like numbers (i.e. sum to one, between 0 and 1). In
ksvm(), you need to use the option prob.model = TRUE to get that
second model.

I discovered some time ago that there can be a discrepancy in the
predicted classes that naturally come from the SVM model and those
derived by using the class associated with the largest class
probability. This is most likely due to natural error in the secondary
probability model and should not be unexpected.

That is the case for your data. In you use the same tuning parameters
as those suggested by train() and go straight to ksvm():

> newSVM <- ksvm(x = as.matrix(df[,-1]),
+                 y = df[,1],
+                 kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
+                 C = svm.m1$bestTune$.C,
+                 prob.model = TRUE)
>
> predict(newSVM, df[43,-1])
[1] O32078
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
> predict(newSVM, df[43,-1], type = "probabilities")
         O27479     O31403    O32057    O32059     O32060    O32078
[1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
         O32089     O32663     O32668     O32676
[1,] 0.04890477 0.05210836 0.09838892 0.07284396

Note that, based on the probability model, the class with the largest
probability is O32057 (p = 0.24) while the basic SVM model predicts
O32078 (p = 0.16).

Somebody (maybe me) saw this discrepancy and that led me to follow this rule:

if(prob.model = TRUE) use the class with the maximum probability
   else use the class prediction from ksvm().

Therefore:

> predict(svm.m1, df[43,-1])
[1] O32057
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676

That change occurred between the two caret versions that you tested with.

(On a side note, this can also occur with ksvm() and rpart() if
cost-sensitive training is used because the class designation takes
into account the costs but the class probability predictions do not. I
alerted both package maintainers to the issue some time ago.)

HTH,

Max

On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn  wrote:
> I've looked into this a bit and the issue seems to be with caret. I've
> been looking at the svn check-ins and nothing stands out to me as the
> issue so far. The final models that are generated are the same and
> I'll try to figure out the difference.
>
> Two small notes:
>
> 1) you should set the seed to ensure reproducibility.
> 2) you really shouldn't use character strings with all numbers as
> factor levels with caret when you want class probabilities. It should
> give you a warning about this
>
> Max
>
> On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby  wrote:
>>
>> I'm using caret to assess classifier performance (and it's great!). However, 
>> I've found that my results differ between R2.* and R3.* - reported 
>> accuracies are reduced dramatically. I suspect that a code change to kernlab 
>> ksvm may be responsible (see version 5.16-24 here: 
>> http://cran.r-project.org/web/packages/caret/news.html). I get very 
>> different results between caret_5.15-61 + kernlab_0.9-17 and caret_5.17-7 + 
>> kernlab_0.9-19 (see below).
>>
>> Can anyone please shed any light on this?
>>
>> Thanks very much!
>>
>>
>> ### To replicate:
>>
>> require(repmis)  # For downloading from https
>> df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv', 
>> sep=',')
>> require(caret)
>> svm.m1 <- 
>> train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tunelength=5,trControl=trainControl(method='repeatedcv',
>>  number=10, repeats=10, classProbs=TRUE))
>> svm.m1
>> sessionInfo()
>>
>> ### Results - R2.15.2
>>
>>> svm.m1
>> 1241 samples
>>7 predictors
>>   10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, 
>> ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>
>> No pre-processing
>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>
>> Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...
>>
>> Resampling results across tuning parameters:
>>
>>   C Accuracy  Kappa  Accuracy SD  Kappa SD
>>   0.25  0.684 0.63   0.0353   0.0416
>>   0.5   0.729 0.685  0.0379   0.0445
>>   1 0.756 0.716  0.0357   0.0418
>>
>> Tuning parameter ‘sigma’ was held constant at a value of 0.247
>> Kappa was used to select the optimal model using  the larg

Re: [R] C50 Node Assignment

2013-11-09 Thread Max Kuhn
There is a sub-object called 'rules' that has the output of C5.0 for this model:

> library(C50)
> mod <- C5.0(Species ~ ., data = iris, rules = TRUE)
> cat(mod$rules)
id="See5/C5.0 2.07 GPL Edition 2013-11-09"
entries="1"
rules="4" default="setosa"
conds="1" cover="50" ok="50" lift="2.94231" class="setosa"
type="2" att="Petal.Length" cut="1.9" result="<"
conds="3" cover="48" ok="47" lift="2.88" class="versicolor"
type="2" att="Petal.Length" cut="1.9" result=">"
type="2" att="Petal.Length" cut="4.901" result="<"
type="2" att="Petal.Width" cut="1.7" result="<"
conds="1" cover="46" ok="45" lift="2.875" class="virginica"
type="2" att="Petal.Width" cut="1.7" result=">"
conds="1" cover="46" ok="44" lift="2.8125" class="virginica"
type="2" att="Petal.Length" cut="4.901" result=">"

You would either have to parse this or parse the summary results:

> summary(mod)

Call:
C5.0.formula(formula = Species ~ ., data = iris, rules = TRUE)


Rules:

Rule 1: (50, lift 2.9)
Petal.Length <= 1.9
->  class setosa  [0.981]

Rule 2: (48/1, lift 2.9)
Petal.Length > 1.9
Petal.Length <= 4.9
Petal.Width <= 1.7
->  class versicolor  [0.960]
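
If the goal is per-rule information rather than the full printout, one hedged
starting point is to split the raw string shown above (the "conds=" prefix is
specific to this rule-based output):

    rule_lines <- strsplit(mod$rules, "\n")[[1]]
    rule_heads <- grep("^conds=", rule_lines, value = TRUE)
    rule_heads   # one element per rule: conds, cover, ok, lift and the class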


Max

On Sat, Nov 9, 2013 at 1:11 PM, Carl Witthoft  wrote:
>
> Just to clarify:  I'm guessing the OP is referring to the CRAN package C50
> here.   A quick skim suggests the rules are a list element of a C5.0-class
> object, so maybe that's where to start?
>
>
> David Winsemius wrote
>> In my role as a moderator I am attempting to bypass the automatic mail
>> filters that are blocking this posting. Please reply to the list and to:
>> =
>> Kevin Shaney <
>
>> kevin.shaney@
>
>> >
>>
>> C50 Node Assignment
>>
>> I am using C50 to classify individuals into 5 groups / categories (factor
>> variable).  The tree / set of rules has 10 rules for classification.  I am
>> trying to extract the RULE for which each individual qualifies (a number
>> between 1 and 10), and cannot figure out how to do so.  I can extract the
>> predicted group and predicted group probability, but not the RULE to which
>> an individual qualifies.  Please let me know if you can help!
>>
>> Kevin
>> =
>>
>>
>> --
>> David Winsemius
>> Alameda, CA, USA
>>
>> __
>
>> R-help@
>
>>  mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/C50-Node-Assignment-tp4680071p4680127.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cross validation in R

2013-07-02 Thread Max Kuhn
> How do i make a loop so that the process could be repeated several time,
> producing randomly ROC curve and under ROC values?


Using the caret package

http://caret.r-forge.r-project.org/
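
A minimal sketch of that (the data frame 'dat' and its two-level factor
'Class' are placeholders; twoClassSummary() reports the area under the ROC
curve for every resample):

    library(caret)
    set.seed(1)
    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)
    fit  <- train(Class ~ ., data = dat, method = "glm",
                  metric = "ROC", trControl = ctrl)
    fit$resample   # ROC, Sens and Spec for each of the repeated CV folds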

--

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error running caret's gbm train function with new version of caret

2013-05-06 Thread Max Kuhn
Katrina,

I made some changes to accommodate gbm's new feature for 3+ categories,
then had to "harmonize" how gbm and caret work together.

I have a new version of caret that is not released yet (maybe within a
month), but you should get it from:

   install.packages("caret", repos="http://R-Forge.R-project.org";)

You may also need to upgrade gbm. That package page is:

   https://code.google.com/p/gradientboostedmodels/downloads/list

Let me know if you have any issues.

Max

On Sat, May 4, 2013 at 5:33 PM, Katrina Bennett  wrote:
> I am running caret for model exploration. I developed my code a number of
> months ago and I've been running it with no issues. Recently, I updated my
> version of caret however, and now I am getting a new error. I'm wondering
> if this is due to the new release.
>
> The error I am getting is when I am running GBM.
>
> print(paste("calculating GBM for", i))
> #gbm runs over and over again
> set.seed(1)
> trainModelGBM <- train(trainClass3, trainAsym, "gbm", metric="RMSE",
> tuneLength = 5, trControl = con)
>
> The error I am getting is at the end of the run once all the iterations
> have been processed:
> Error in { :
>   task 1 failed - "arguments imply differing number of rows: 5, 121"
>
> trainClass3 and trainAsym have 311 values in them. I'm using 5 variables in
> my matrix. I'm not sure where the 117 is coming from.
>
> I found solutions online that suggested that updated the version of glmnet,
> Matrix and doing something with cv.folds would work. None of these
> solutions have worked for me.
>
> Here is my R session info.
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> caret version 5.15-61
>
> Thank you,
>
> Katrina
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] C50 package in R

2013-04-26 Thread Max Kuhn
There isn't much out there. Quinlan didn't open source the code until about
a year ago.

I've been through the code line by line and we have a fairly descriptive
summary of the model in our book (that's almost out):

  http://appliedpredictivemodeling.com/

I will say that the pruning is mostly the same as described in Quinlan's
C4.5 book. The big differences between C4.5 and C5.0 are boosting and winnowing.
The former is mechanically very different from gradient boosting machines
and is more similar to the re-weighting approach of the original AdaBoost
algorithm (but is still pretty different).
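
Both features are exposed directly in the R interface; a small illustration
(iris is just a stand-in):

    library(C50)
    boostMod <- C5.0(Species ~ ., data = iris, trials = 10,
                     control = C5.0Control(winnow = TRUE))
    summary(boostMod)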

I've submitted a talk on C5.0 for this year's UseR! conference. If there is
enough time I will be able to go through some of the technical details.

Two other related notes:

- the J48 implementation in Weka lacks one or two of C4.5's features, which
makes the results substantially different from what C4.5 would have
produced. The differences are significant enough that Quinlan asked us to
call the results of that function "J48" and not "C4.5". Using C5.0 with
a single tree is much more similar to C4.5 than J48 is.

- the differences between model trees and Cubist are also substantial and
largely undocumented.

HTH,

Max




On Thu, Apr 25, 2013 at 9:40 AM, Indrajit Sen Gupta <
indrajit...@rediffmail.com> wrote:

> Hi All,
>
>
>
> I am trying to use the C50 package to build classification trees in R.
> Unfortunately there is not enought documentation around its use. Can anyone
> explain to me - how to prune the decision trees?
>
>
>
> Regards,
>
> Indrajit
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave: Some questions about potential formatting options

2013-04-17 Thread Max Kuhn
Paul,

#1: I've never tried but you might be able to escape the required tags in
your text (e.g. in html you could write out the  in your text).

#3: Which output? Is this in text?

#2: It may be possible and maybe easy to implement. So if you want to dig
into it, have at it. For me, I'm completely buried for
the foreseeable future and won't be able to pay much attention to it. To be
honest, odfWeave has been fairly neglected by me and lately I've had
thoughts of orphaning the package :-/

Thanks,

Max



On Tue, Apr 16, 2013 at 1:15 PM, Paul Miller  wrote:

> Hi Milan and Max,
>
> Thanks to each of you for your reply to my post. Thus far, I've managed to
> find answers to some of the questions I asked initially.
>
> I am now able to control the justification of the leftmost column in my
> tables, as well as to add borders to the top and bottom. I also downloaded
> Milan's revised version of odfWeave at the link below, and found that it
> does a nice job of controlling column widths.
>
> http://nalimilan.perso.neuf.fr/transfert/odfWeave.tar.gz
>
> There are some other things I'm still struggling with though.
>
> 1. Is it possible to get odfTableCaption and odfFigureCaption to make the
> titles they produce bold? I understand it might be possible to accomplish
> this by changing something in the styles but am not sure what. If someone
> can give me a hint, I can likely do the rest.
>
> 2. Is there any way to get odfFigureCaption to put titles at the top of
> the figure instead of the bottom? I've noticed that odfTableCaption is able
> to do this but apparently not odfFigureCaption.
>
> 3. Is it possible to add special characters to the output? Below is a
> sample Kaplan-Meier analysis. There's a footnote in there that reads "Note:
> X2(1) = xx.xx, p = .". Is there any way to make the X a lowercase Chi
> and to superscript the 2? I did quite a bit of digging on this topic. It
> sounds like it might be difficult, especially if one is using Windows as I
> am.
>
> Thanks,
>
> Paul
>
> ##
>  Get data 
> ##
>
>  Load packages 
>
> require(survival)
> require(MASS)
>
>  Sample analysis 
>
> attach(gehan)
> gehan.surv <- survfit(Surv(time, cens) ~ treat, data= gehan, conf.type =
> "log-log")
> print(gehan.surv)
>
> survTable <- summary(gehan.surv)$table
> survTable <- data.frame(Treatment = rownames(survTable), survTable,
> row.names=NULL)
> survTable <- subset(survTable, select = -c(records, n.max))
>
> ##
>  odfWeave 
> ##
>
>  Load odfWeave 
>
> require(odfWeave)
>
>  Modify StyleDefs 
>
> currentDefs <- getStyleDefs()
>
> currentDefs$firstColumn$type <- "Table Column"
> currentDefs$firstColumn$columnWidth <- "5 cm"
> currentDefs$secondColumn$type <- "Table Column"
> currentDefs$secondColumn$columnWidth <- "3 cm"
>
> currentDefs$ArialCenteredBold$fontSize <- "10pt"
> currentDefs$ArialNormal$fontSize <- "10pt"
> currentDefs$ArialCentered$fontSize <- "10pt"
> currentDefs$ArialHighlight$fontSize <- "10pt"
>
> currentDefs$ArialLeftBold <- currentDefs$ArialCenteredBold
> currentDefs$ArialLeftBold$textAlign <- "left"
>
> currentDefs$cgroupBorder <- currentDefs$lowerBorder
> currentDefs$cgroupBorder$topBorder <- "0.0007in solid #00"
>
> setStyleDefs(currentDefs)
>
>  Modify ImageDefs 
>
> imageDefs <- getImageDefs()
> imageDefs$dispWidth <- 5.5
> imageDefs$dispHeight<- 5.5
> setImageDefs(imageDefs)
>
>  Modify Styles 
>
> currentStyles <- getStyles()
> currentStyles$figureFrame <- "frameWithBorders"
> setStyles(currentStyles)
>
>  Set odt table styles 
>
> tableStyles <- tableStyles(survTable, useRowNames = FALSE, header = "")
> tableStyles$headerCell[1,] <- "cgroupBorder"
> tableStyles$header[,1] <- "ArialLeftBold"
> tableStyles$text[,1] <- "ArialNormal"
> tableStyles$cell[2,] <- "lowerBorder"
>
>  Weave odt source file 
>
> fp <- "N:/Studies/HCRPC1211/Report/odfWeaveTest/"
> inFile <- paste(fp, "testWeaveIn.odt", sep="")
> outFile <- paste(fp, "testWeaveOut.odt", sep="")
> odfWeave(inFile, outFile)
>
> ##
>  Contents of .odt source file 
> ##
>
> Here is a sample Kaplan-Meier table.
>
> <>=
> odfTableCaption("A Sample Kaplan-Meier Analysis Table")
> odfTable(survTable, useRowNames = FALSE, digits = 3,
> colnames = c("Treatment", "Number", "Events", "Median", "95% LCL", "95%
> UCL"),
> colStyles = c("firstColumn", "secondColumn", "secondColumn",
> "secondColumn", "secondColumn", "secondColumn"),
> styles = tableStyles)
> odfCat("Note: X2(1) = xx.xx, p = .")
> @
>
> Here is a sample Kaplan-Meier graph.
>
> <>=
> odfFigureCaption("A Sample Kaplan-Meier Analysis Graph", label = "Figure")
> plot(gehan.surv, xlab = "Time", ylab= "Survivorship")
> @
>
>
>


-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
ht

Re: [R] Parallelizing GBM

2013-03-24 Thread Max Kuhn
See this:

   https://code.google.com/p/gradientboostedmodels/issues/detail?id=3

and this:


https://code.google.com/p/gradientboostedmodels/source/browse/?name=parallel
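
Those links cover gbm's own experimental parallel branch. A different,
hedged option (assuming the doParallel package and the object names from the
question) is to leave gbm serial and parallelize the resampling around it
with caret, which uses foreach:

    library(doParallel)
    registerDoParallel(cores = 4)     # the four cores mentioned in the question
    library(caret)
    gbmTuned <- train(trainRF, prices_train, method = "gbm", verbose = FALSE)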


Max


On Sun, Mar 24, 2013 at 7:31 AM, Lorenzo Isella wrote:

> Dear All,
> I am far from being a guru about parallel programming.
> Most of the time, I rely or randomForest for data mining large datasets.
> I would like to give a try also to the gradient boosted methods in GBM,
> but I have a need for parallelization.
> I normally rely on gbm.fit for speed reasons, and I usually call it this
> way
>
>
>
> gbm_model <- gbm.fit(trainRF,prices_train,
> offset = NULL,
> misc = NULL,
> distribution = "multinomial",
> w = NULL,
> var.monotone = NULL,
> n.trees = 50,
> interaction.depth = 5,
> n.minobsinnode = 10,
> shrinkage = 0.001,
> bag.fraction = 0.5,
> nTrain = (n_train/2),
> keep.data = FALSE,
> verbose = TRUE,
> var.names = NULL,
> response.name = NULL)
>
>
> Does anybody know an easy way to parallelize the model (in this case it
> means simply having 4 cores on the same machine working on the problem)?
> Any suggestion is welcome.
> Cheers
>
> Lorenzo
>
> __**
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/**listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/**
> posting-guide.html 
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] CARET and NNET fail to train a model when the input is high dimensional

2013-03-06 Thread Max Kuhn
James,

I did a fresh install from CRAN to get caret_5.15-61 and ran your code with
method.name = "nnet" and grid.len = 3.

I don't get an error, although there were issues:

   In nominalTrainWorkflow(dat = trainData, info = trainInfo,  ... :
 There were missing values in resampled performance measures.

The results had:

Resampling results across tuning parameters:

  size  decay  ROC    Sens   Spec   ROC SD   Sens SD  Spec SD
  1     0      0.521  0.52   0.521  0.0148   0.0312   0.00901
  1     1e-04  0.513  0.528  0.498  0.00616  0.00386  0.00552
  1     0.1    0.515  0.522  0.514  0.0169   0.0284   0.0426
  3     0      NaN    NaN    NaN    NA       NA       NA
  3     1e-04  NaN    NaN    NaN    NA       NA       NA
  3     0.1    NaN    NaN    NaN    NA       NA       NA
  5     0      NaN    NaN    NaN    NA       NA       NA
  5     1e-04  NaN    NaN    NaN    NA       NA       NA
  5     0.1    NaN    NaN    NaN    NA       NA       NA

To test more, I ran:

   > test <- nnet(trX, trY, size = 3, decay = 0)
   Error in nnet.default(trX, trY, size = 3, decay = 0) :
 too many (2107) weights

So, you need to pass in MaxNWts to nnet() with a value that lets you fit
the model. Off the top of my head, you could use something like:

   MaxNWts  = length(levels(trY))*(max(my.grid$.size) * (nCol + 1) +
max(my.grid$.size) + 1)
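
That value can go straight into the train() call, since extra arguments are
passed through to nnet(); a sketch reusing the names from the question:

    my.model <- train(trX, trY, method = method.name, trace = FALSE,
                      trControl = myCtrl, tuneGrid = my.grid, metric = "ROC",
                      MaxNWts = length(levels(trY)) *
                                (max(my.grid$.size) * (nCol + 1) +
                                 max(my.grid$.size) + 1))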

Also, this is one of the methods for getting help (the other is to just email
me). I also try to keep up on stack exchange too.

Max



On Tue, Mar 5, 2013 at 9:47 PM, James Jong  wrote:

> The following code fails to train a nnet model in a random dataset using
> caret:
>
> nR <- 700
> nCol <- 2000
>   myCtrl <- trainControl(method="cv", number=3, preProcOptions=NULL,
> classProbs = TRUE, summaryFunction = twoClassSummary)
>   trX <- data.frame(replicate(nR, rnorm(nCol)))
>   trY <- runif(1)*trX[,1]*trX[,2]^2+runif(1)*trX[,3]/trX[,4]
>   trY <- as.factor(ifelse(sign(trY)>0,'X1','X0'))
>   my.grid <- createGrid(method.name, grid.len, data=trX)
>   my.model <- train(trX,trY,method=method.name
> ,trace=FALSE,trControl=myCtrl,tuneGrid=my.grid,
> metric="ROC")
>   print("Done")
>
> The error I get is:
> task 2 failed - "arguments imply differing number of rows: 1334, 666"
>
> However, everything works if I reduce nR to, say 20.
>
> Any thoughts on what may be causing this? Is there a place where I could
> report this bug other than this mailing list?
>
> Here is my session info:
> > sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> other attached packages:
> [1] nnet_7.3-5  pROC_1.5.4  caret_5.15-052  foreach_1.4.0
> [5] cluster_1.14.3  plyr_1.8reshape2_1.2.2  lattice_0.20-13
>
> loaded via a namespace (and not attached):
> [1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2 iterators_1.0.6
> [5] stringr_0.6.2   tools_2.15.2
>
> Thanks,
>
> James
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret pls model statistics

2013-03-03 Thread Max Kuhn
That's the most common formula, but not the only one. See

  Kvålseth, T. (1985). Cautionary note about $R^2$. *American Statistician*,
*39*(4), 279–285.

Traditionally, the symbol 'R' is used for the Pearson correlation
coefficient and one way to calculate R^2 is... R^2.
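
A tiny numeric illustration of why the choice matters (values invented; the
two definitions generally disagree unless the predictions come from a least
squares fit of these data):

    obs  <- c(1.2, 3.4, 2.2, 5.1, 4.0)
    pred <- c(1.5, 3.0, 2.9, 4.2, 4.4)
    1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # "1 - RSS/TSS" version
    cor(obs, pred)^2                                    # squared correlation version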

Max


On Sun, Mar 3, 2013 at 3:16 PM, Charles Determan Jr wrote:

> I was under the impression that in PLS analysis, R2 was calculated by 1-
> (Residual sum of squares) / (Sum of squares).  Is this still what you are
> referring to?  I am aware of the linear R2 which is how well two variables
> are correlated but the prior equation seems different to me.  Could you
> explain if this is the same concept?
>
> Charles
>
>
> On Sun, Mar 3, 2013 at 12:46 PM, Max Kuhn  wrote:
>
>> > Is there some literature that you make that statement?
>>
>> No, but there isn't literature on changing a lightbulb with a duck either.
>>
>> > Are these papers incorrect in using these statistics?
>>
>> Definitely, if they convert 3+ categories to integers (but there are
>> specialized R^2 metrics for binary classification models). Otherwise, they
>> are just using an ill-suited "score".
>>
>>  How would you explain such an R^2 value to someone? R^2 is
>> a function of correlation between the two random variables. For two
>> classes, one of them is binary. What does it mean?
>>
>> Historically, models rooted in computer science (eg neural networks) used
>> RMSE or SSE to fit models with binary outcomes and that *can* work
>> well.
>>
>> However, I don't think that communicating R^2 is effective. Other metrics
>> (e.g. accuracy, Kappa, area under the ROC curve, etc) are designed to
>> measure the ability of a model to classify and work well. With 3+
>> categories, I tend to use Kappa.
>>
>> Max
>>
>>
>>
>>
>> On Sun, Mar 3, 2013 at 10:53 AM, Charles Determan Jr wrote:
>>
>>> Thank you for your response Max.  Is there some literature that you make
>>> that statement?  I am confused as I have seen many publications that
>>> contain R^2 and Q^2 following PLSDA analysis.  The analysis usually is to
>>> discriminate groups (ie. classification).  Are these papers incorrect in
>>> using these statistics?
>>>
>>> Regards,
>>> Charles
>>>
>>>
>>> On Sat, Mar 2, 2013 at 10:39 PM, Max Kuhn  wrote:
>>>
>>>> Charles,
>>>>
>>>> You should not be treating the classes as numeric (is virginica really
>>>> three times setosa?). Q^2 and/or R^2 are not appropriate for 
>>>> classification.
>>>>
>>>> Max
>>>>
>>>>
>>>> On Sat, Mar 2, 2013 at 5:21 PM, Charles Determan Jr 
>>>> wrote:
>>>>
>>>>> I have discovered on of my errors.  The timematrix was unnecessary and
>>>>> an
>>>>> unfortunate habit I brought from another package.  The following
>>>>> provides
>>>>> the same R2 values as it should, however, I still don't know how to
>>>>> retrieve Q2 values.  Any insight would again be appreciated:
>>>>>
>>>>> library(caret)
>>>>> library(pls)
>>>>>
>>>>> data(iris)
>>>>>
>>>>> #needed to convert to numeric in order to do regression
>>>>> #I don't fully understand this but if I left as a factor I would get an
>>>>> error following the summary function
>>>>> iris$Species=as.numeric(iris$Species)
>>>>> inTrain1=createDataPartition(y=iris$Species,
>>>>> p=.75,
>>>>> list=FALSE)
>>>>>
>>>>> training1=iris[inTrain1,]
>>>>> testing1=iris[-inTrain1,]
>>>>>
>>>>> ctrl1=trainControl(method="cv",
>>>>> number=10)
>>>>>
>>>>> plsFit2=train(Species~.,
>>>>> data=training1,
>>>>> method="pls",
>>>>> trControl=ctrl1,
>>>>> metric="Rsquared",
>>>>> preProc=c("scale"))
>>>>>
>>>>> data(iris)
>>>>> training1=iris[inTrain1,]
>>>>> datvars=training1[,1:4]
>>>>> dat.sc=scale(datvars)
>>>>>
>>>>> pls.dat=plsr(as.numeric(training1$Species)~dat.sc,
>>>>> ncomp=3, method="oscorespls", data=training1)
>>>>>

Re: [R] caret pls model statistics

2013-03-02 Thread Max Kuhn
Charles,

You should not be treating the classes as numeric (is virginica really
three times setosa?). Q^2 and/or R^2 are not appropriate for classification.
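
A hedged sketch of the classification route (Species stays a factor, so
train() reports Accuracy and Kappa rather than R^2):

    library(caret)
    data(iris)
    set.seed(1)
    ctrl   <- trainControl(method = "cv", number = 10)
    plsFit <- train(Species ~ ., data = iris, method = "pls",
                    tuneLength = 3, trControl = ctrl, preProcess = "scale")
    plsFit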

Max


On Sat, Mar 2, 2013 at 5:21 PM, Charles Determan Jr wrote:

> I have discovered on of my errors.  The timematrix was unnecessary and an
> unfortunate habit I brought from another package.  The following provides
> the same R2 values as it should, however, I still don't know how to
> retrieve Q2 values.  Any insight would again be appreciated:
>
> library(caret)
> library(pls)
>
> data(iris)
>
> #needed to convert to numeric in order to do regression
> #I don't fully understand this but if I left as a factor I would get an
> error following the summary function
> iris$Species=as.numeric(iris$Species)
> inTrain1=createDataPartition(y=iris$Species,
> p=.75,
> list=FALSE)
>
> training1=iris[inTrain1,]
> testing1=iris[-inTrain1,]
>
> ctrl1=trainControl(method="cv",
> number=10)
>
> plsFit2=train(Species~.,
> data=training1,
> method="pls",
> trControl=ctrl1,
> metric="Rsquared",
> preProc=c("scale"))
>
> data(iris)
> training1=iris[inTrain1,]
> datvars=training1[,1:4]
> dat.sc=scale(datvars)
>
> pls.dat=plsr(as.numeric(training1$Species)~dat.sc,
> ncomp=3, method="oscorespls", data=training1)
>
> x=crossval(pls.dat, segments=10)
>
> summary(x)
> summary(plsFit2)
>
> Regards,
> Charles
>
> On Sat, Mar 2, 2013 at 3:55 PM, Charles Determan Jr  >wrote:
>
> > Greetings,
> >
> > I have been exploring the use of the caret package to conduct some plsda
> > modeling.  Previously, I have come across methods that result in a R2 and
> > Q2 for the model.  Using the 'iris' data set, I wanted to see if I could
> > accomplish this with the caret package.  I use the following code:
> >
> > library(caret)
> > data(iris)
> >
> > #needed to convert to numeric in order to do regression
> > #I don't fully understand this but if I left as a factor I would get an
> > error following the summary function
> > iris$Species=as.numeric(iris$Species)
> > inTrain1=createDataPartition(y=iris$Species,
> > p=.75,
> > list=FALSE)
> >
> > training1=iris[inTrain1,]
> > testing1=iris[-inTrain1,]
> >
> > ctrl1=trainControl(method="cv",
> > number=10)
> >
> > plsFit2=train(Species~.,
> > data=training1,
> > method="pls",
> > trControl=ctrl1,
> > metric="Rsquared",
> > preProc=c("scale"))
> >
> > data(iris)
> > training1=iris[inTrain1,]
> > datvars=training1[,1:4]
> > dat.sc=scale(datvars)
> >
> > n=nrow(dat.sc)
> > dat.indices=seq(1,n)
> >
> > timematrix=with(training1,
> > classvec2classmat(Species[dat.indices]))
> >
> > pls.dat=plsr(timematrix ~ dat.sc,
> > ncomp=3, method="oscorespls", data=training1)
> >
> > x=crossval(pls.dat, segments=10)
> >
> > summary(x)
> > summary(plsFit2)
> >
> > I see two different R2 values and I cannot figure out how to get the Q2
> > value.  Any insight as to what my errors may be would be appreciated.
> >
> > Regards,
> >
> > --
> > Charles
> >
>
>
>
> --
> Charles Determan
> Integrated Biosciences PhD Student
> University of Minnesota
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave: Trouble Getting the Package to Work

2013-02-18 Thread Max Kuhn
That's not a reproducible example. There is no sessionInfo() and you
omitted code (where did 'fp' come from?).

It works fine for me (see sessionInfo below) using the code in ?odfWeave.

As for the file paths: you can point to different paths for the files
(although don't change the working directory in the odt file). If you read
the documentation for workDir: "a path to a directory where the source file
will be unpacked and processed. If it does not exist, it will be created.
If it exists, it should be empty, since all its contents will be included
in the generated file". The default value should be sufficient.
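
For reference, the help-page example that runs cleanly here looks roughly
like this (output location is up to you; workDir is left at its temporary
default):

    library(odfWeave)
    demoFile <- system.file("examples", "simple.odt", package = "odfWeave")
    outFile  <- file.path(getwd(), "output.odt")
    odfWeave(demoFile, outFile)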

Max

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid  stats graphics  grDevices utils datasets  methods
base

other attached packages:
[1] MASS_7.3-22 odfWeave_0.8.2  XML_3.95-0.1lattice_0.20-10

loaded via a namespace (and not attached):
[1] tools_2.15.2




On Mon, Feb 18, 2013 at 8:52 AM, Paul Miller  wrote:

> Hello All,
>
> Have recently started learning Sweave and Knitr. Am now trying to learn
> odfWeave as well. Things went pretty smoothly with Sweave and Knitr but I'm
> having some trouble with odfWeave.
>
> My understanding was that odfWeave should work in pretty much the same way
> as Sweave. With odfWeave, you set up an input .odt file in a folder, run
> that file through the odfWeave function, and then the function produces an
> output .odt file in the same folder.
>
> So I decided to try that using a file called simple.odt that comes with
> the odfWeave package. Unfortunately, things didn't work out quite as I had
> hoped. Below is the result of my attempt to odfWeave that file via Emacs.
>
> For some reason, odfWeave is setting the wd to a location on the C drive
> when my input file is on the N drive. I tried altering this by setting the
> location of workDir to my folder on the N drive. odfWeave threw up an
> error saying that this folder already exists. So perhaps the files are
> supposed to be processed in a location other than the one where the input
> file resides.
>
> The other thing is that odfWeave is finding an unexpected "&". There is
> text in the "simple.odt" input file that looks like
> "paste(levels(iris$Species), collapse = " but it has no "&". So presumably
> something is wrong in the xml markup that is being produced.
>
> If anyone can help me understand what is going wrong here, that would be
> greatly appreciated.
>
> Thanks,
>
> Paul
>
> > library(odfWeave)
> Loading required package: lattice
> Loading required package: XML
> > inFile  <- paste(fp, "simple.odt", sep="")
> > outFile <- paste(fp, "output.odt", sep="")
> > odfWeave(inFile, outFile)
>   Copying  N:/Studies/HCRPC1211/Documentation/R Documentation/odfWeave
> Documentation/Examples/Example 1/simple.odt
>   Setting wd to
> C:\Users\pmiller\AppData\Local\Temp\3\RtmpMlDMHV/odfWeave18071055703
>   Unzipping ODF file using unzip -o "simple.odt"
> Archive:  simple.odt
>  extracting: mimetype
>   inflating: meta.xml
>   inflating: settings.xml
>   inflating: content.xml
>  extracting: Thumbnails/thumbnail.png
>   inflating: layout-cache
>   inflating: manifest.rdf
>creating: Configurations2/popupmenu/
>creating: Configurations2/images/Bitmaps/
>creating: Configurations2/toolpanel/
>creating: Configurations2/statusbar/
>creating: Configurations2/toolbar/
>creating: Configurations2/progressbar/
>creating: Configurations2/menubar/
>creating: Configurations2/floater/
>   inflating: Configurations2/accelerator/current.xml
>   inflating: styles.xml
>   inflating: META-INF/manifest.xml
>   Removing  simple.odt
>   Creating a Pictures directory
>   Pre-processing the contents
>   Sweaving  content.Rnw
>   Writing to file content_1.xml
>   Processing code chunks ...
> Error in parse(text = cmd) : :1:40: unexpected '&'
> 1: paste(levels(iris$Species), collapse = &
>   ^
> >
> [[alternative HTML version deleted]]
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>


-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] CARET: Any way to access other tuning parameters?

2013-02-13 Thread Max Kuhn
> @Max - Thanks a lot for your help. I have already been using that website
> as a reference, and it's incredibly helpful. I have also been experimenting
> with tuneGrid already. My question was specifically if tuneGrid (or caret
> in general) supports passing method parameters to the method functions from
> each package other than those listed in the CARET documentation (e.g. I
> would like to specify sampsize and nodesize for randomForest, and not just
> mtry).
>
>
Yes. A custom method is how you do that.
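
One distinction worth keeping in mind (a sketch only, using the dotted
tuning-grid column names from this version of caret): values you only want
to hold fixed, such as nodesize or sampsize, can be passed through train()'s
'...', while tuning over them needs the custom model route above.

    library(caret)
    rfFit <- train(Species ~ ., data = iris, method = "rf",
                   tuneGrid = data.frame(.mtry = 2),   # tuned parameter
                   nodesize = 5, sampsize = 100,       # fixed, passed to randomForest()
                   trControl = trainControl(method = "cv", number = 5))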


> Thanks,
>
> James
>
>
>
>
>
>
> On Wed, Feb 13, 2013 at 1:07 PM, Max Kuhn  wrote:
>
>> James,
>>
>> You really need to read the documentation. Almost every question that you
>> have has been addressed in the existing material. For this one, there is a
>> section on custom models here:
>>
>>http://caret.r-forge.r-project.org/training.html
>>
>> Max
>>
>>
>> On Wed, Feb 13, 2013 at 9:58 AM, James Jong wrote:
>>
>>> The documentation for caret::train shows a list of parameters that one
>>> can
>>>  tune for each method classification/regression method. For example, for
>>> the method randomForest one can tune mtry in the call to train. But the
>>>  function call to train random forests in the original package has many
>>> other parameters, e.g. sampsize, maxnodes, etc.
>>>
>>> Is there **any** way to access these parameters using train in caret? (Is
>>> the function caret::createGrid limited to the list of parameters
>>> specified
>>> in the caret documentation, it's not super clear if the list of parameter
>>> is for all the caret APIs).
>>>
>>> Thanks,
>>>
>>> James,
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>>
>> Max
>>
>
>


-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] CARET: Any way to access other tuning parameters?

2013-02-13 Thread Max Kuhn
James,

You really need to read the documentation. Almost every question that you
have has been addressed in the existing material. For this one, there is a
section on custom models here:

   http://caret.r-forge.r-project.org/training.html

Max


On Wed, Feb 13, 2013 at 9:58 AM, James Jong  wrote:

> The documentation for caret::train shows a list of parameters that one can
>  tune for each method classification/regression method. For example, for
> the method randomForest one can tune mtry in the call to train. But the
>  function call to train random forests in the original package has many
> other parameters, e.g. sampsize, maxnodes, etc.
>
> Is there **any** way to access these parameters using train in caret? (Is
> the function caret::createGrid limited to the list of parameters specified
> in the caret documentation, it's not super clear if the list of parameter
> is for all the caret APIs).
>
> Thanks,
>
> James,
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] pROC and ROCR give different values for AUC

2012-12-19 Thread Max Kuhn
A reproducible example sent to the package maintainer(s)
might yield results.

Max


On Wed, Dec 19, 2012 at 7:47 AM, Ivana Cace  wrote:

> Packages pROC and ROCR both calculate/approximate the Area Under (Receiver
> Operator) Curve. However the results are different.
>
> I am computing a new variable as a predictor for a label. The new variable
> is a (non-linear) function of a set of input values, and I'm checking how
> different parameter settings contribute to prediction. All my settings are
> predictive, but some are better.
>
> The AUC i got with pROC was much lower then expected, so i tried ROCR.
> Here are some comparisons:
> AUC from pROC AUC from ROCR
> 0.49465  0.79311
> 0.49465  0.79349
> 0.49701  0.79446
> 0.49701  0.79764
>
> When i draw the ROC (with pROC) i get the curve i expect. But why is the
> AUC according to pROC so different?
>
> Ivana
>
>
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Help with this error "kernlab class probability calculations failed; returning NAs"

2012-11-29 Thread Max Kuhn
Your output has:

"At least one of the class levels are not valid R variables names; This may
cause errors if class probabilities are generated because the variables
names will be converted to: X0, X1"

Try changing the factor levels to avoid leading numbers and try again.
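
One hedged way to do that for the "0"/"1" outcome in the example (the new
labels are arbitrary, they just need to be valid R names):

    trainset$outcome <- factor(trainset$outcome, levels = c("0", "1"),
                               labels = c("class0", "class1"))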

Max




On Thu, Nov 29, 2012 at 10:18 PM, Brian Feeny  wrote:

>
>
> Yes I am still getting this error, here is my sessionInfo:
>
> > sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> other attached packages:
> [1] e1071_1.6-1 class_7.3-5 kernlab_0.9-14  caret_5.15-045
>  foreach_1.4.0   cluster_1.14.3
> [7] reshape_0.8.4   plyr_1.7.1  lattice_0.20-10
>
> loaded via a namespace (and not attached):
> [1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2 iterators_1.0.6
> tools_2.15.2
>
>
> Is there an example that shows a classProbs example, I could try to run it
> to replicate and see if it works on my system.
>
> Brian
>
> On Nov 29, 2012, at 10:10 PM, Max Kuhn  wrote:
>
> You didn't provide the results of sessionInfo().
>
> Upgrade to the version just released on cran and see if you still have the
> issue.
>
> Max
>
>
> On Thu, Nov 29, 2012 at 6:55 PM, Brian Feeny  wrote:
>
>> I have never been able to get class probabilities to work and I am
>> relatively new to using these tools, and I am looking for some insight as
>> to what may be wrong.
>>
>> I am using caret with kernlab/ksvm.  I will simplify my problem to a
>> basic data set which produces the same problem.  I have read the caret
>> vignettes as well as documentation for ?train.  I appreciate any direction
>> you can give.  I realize this is a very small dataset, the actual data is
>> much larger, I am just using 10 rows as an example:
>>
>> trainset <- data.frame(
>>   outcome=factor(c("0","1","0","1","0","1","1","1","1","0")),
>>   age=c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
>>   amount=c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
>> )
>>
>> > str(trainset)
>> 'data.frame':   7 obs. of  3 variables:
>>  $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
>>  $ age: num  23 5 28 48 82 11 9
>>  $ amount : num  22.2 494.2 2 39.2 39.2 ...
>>
>> > colSums(is.na(trainset))
>> outcome age  amount
>>   0   0   0
>>
>>
>> ## SAMPLING AND FORMULA
>> dataset <- trainset
>> index <- 1:nrow(dataset)
>> testindex <- sample(index, trunc(length(index)*30/100))
>> trainset <- dataset[-testindex,]
>> testset <- dataset[testindex,-1]
>>
>>
>> ## TUNE caret / kernlab
>> set.seed(1)
>> MyTrainControl=trainControl(
>>   method = "repeatedcv",
>>   number=10,
>>   repeats=5,
>>   returnResamp = "all",
>>   classProbs = TRUE
>> )
>>
>>
>> ## MODEL
>> rbfSVM <- train(outcome~., data = trainset,
>>method="svmRadial",
>>preProc = c("scale"),
>>tuneLength = 10,
>>trControl=MyTrainControl,
>>fit = FALSE
>> )
>>
>> There were 50 or more warnings (use warnings() to see the first 50)
>> > warnings()
>> Warning messages:
>> 1: In train.default(x, y, weights = w, ...) :
>>   At least one of the class levels are not valid R variables names; This
>> may cause errors if class probabilities are generated because the variables
>> names will be converted to: X0, X1
>> 2:  In caret:::predictionFunction(method = method, modelFit = mod$fit,
>>  ... :
>>   kernlab class prediction calculations failed; returning NAs
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
>
> Max
>
>
>


-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Help with this error "kernlab class probability calculations failed; returning NAs"

2012-11-29 Thread Max Kuhn
You didn't provide the results of sessionInfo().

Upgrade to the version just released on cran and see if you still have the
issue.

Max


On Thu, Nov 29, 2012 at 6:55 PM, Brian Feeny  wrote:

> I have never been able to get class probabilities to work and I am
> relatively new to using these tools, and I am looking for some insight as
> to what may be wrong.
>
> I am using caret with kernlab/ksvm.  I will simplify my problem to a basic
> data set which produces the same problem.  I have read the caret vignettes
> as well as documentation for ?train.  I appreciate any direction you can
> give.  I realize this is a very small dataset, the actual data is much
> larger, I am just using 10 rows as an example:
>
> trainset <- data.frame(
>   outcome=factor(c("0","1","0","1","0","1","1","1","1","0")),
>   age=c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
>   amount=c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
> )
>
> > str(trainset)
> 'data.frame':   7 obs. of  3 variables:
>  $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
>  $ age: num  23 5 28 48 82 11 9
>  $ amount : num  22.2 494.2 2 39.2 39.2 ...
>
> > colSums(is.na(trainset))
> outcome age  amount
>   0   0   0
>
>
> ## SAMPLING AND FORMULA
> dataset <- trainset
> index <- 1:nrow(dataset)
> testindex <- sample(index, trunc(length(index)*30/100))
> trainset <- dataset[-testindex,]
> testset <- dataset[testindex,-1]
>
>
> ## TUNE caret / kernlab
> set.seed(1)
> MyTrainControl=trainControl(
>   method = "repeatedcv",
>   number=10,
>   repeats=5,
>   returnResamp = "all",
>   classProbs = TRUE
> )
>
>
> ## MODEL
> rbfSVM <- train(outcome~., data = trainset,
>method="svmRadial",
>preProc = c("scale"),
>tuneLength = 10,
>trControl=MyTrainControl,
>fit = FALSE
> )
>
> There were 50 or more warnings (use warnings() to see the first 50)
> > warnings()
> Warning messages:
> 1: In train.default(x, y, weights = w, ...) :
>   At least one of the class levels are not valid R variables names; This
> may cause errors if class probabilities are generated because the variables
> names will be converted to: X0, X1
> 2:  In caret:::predictionFunction(method = method, modelFit = mod$fit,
>  ... :
>   kernlab class prediction calculations failed; returning NAs
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret train and trainControl

2012-11-23 Thread Max Kuhn
Brian,

This is all outlined in the package documentation. The final model is fit
automatically. For example, using 'verboseIter' provides details. From
?train

> knnFit1 <- train(TrainData, TrainClasses,

+  method = "knn",

+  preProcess = c("center", "scale"),

+  tuneLength = 10,

+  trControl = trainControl(method = "cv", verboseIter =
TRUE))

+ Fold01: k= 5

- Fold01: k= 5

+ Fold01: k= 7

- Fold01: k= 7

+ Fold01: k= 9

- Fold01: k= 9

+ Fold01: k=11

- Fold01: k=11



+ Fold10: k=17

- Fold10: k=17

+ Fold10: k=19

- Fold10: k=19

+ Fold10: k=21

- Fold10: k=21

+ Fold10: k=23

- Fold10: k=23

Aggregating results

Selecting tuning parameters

Fitting model on full training set
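
A short follow-on sketch using the objects from that help-page example: the
refit model is already stored, so there is no second call to train().

    knnFit1$finalModel                     # the model refit on the full training set
    predict(knnFit1, newdata = TrainData)  # predictions use the selected k automatically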


Max


On Fri, Nov 23, 2012 at 5:52 PM, Brian Feeny  wrote:

>
> I am used to packages like e1071 where you have a tune step and then pass
> your tunings to train.
>
> It seems with caret, tuning and training are both handled by train.
>
> I am using train and trainControl to find my hyper parameters like so:
>
> MyTrainControl=trainControl(
>   method = "cv",
>   number=5,
>   returnResamp = "all",
>classProbs = TRUE
> )
>
> rbfSVM <- train(label~., data = trainset,
>method="svmRadial",
>tuneGrid =
> expand.grid(.sigma=c(0.0118),.C=c(8,16,32,64,128)),
>trControl=MyTrainControl,
>fit = FALSE
> )
>
> Once this returns my ideal parameters, in this case Cost of 64, do I
> simply just re-run the whole process again, passing a grid only containing
> the specific parameters? like so?
>
>
> rbfSVM <- train(label~., data = trainset,
>method="svmRadial",
>tuneGrid = expand.grid(.sigma=0.0118,.C=64),
>trControl=MyTrainControl,
>fit = FALSE
> )
>
> This is what I have been doing but I am new to caret and want to make sure
> I am doing this correctly.
>
> Brian
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Decision Tree: Am I Missing Anything?

2012-09-22 Thread Max Kuhn
Vik,

On Fri, Sep 21, 2012 at 12:42 PM, Vik Rubenfeld  wrote:
> Max, I installed C50. I have a question about the syntax. Per the C50 manual:
>
> ## Default S3 method:
> C5.0(x, y, trials = 1, rules= FALSE,
> weights = NULL,
> control = C5.0Control(),
> costs = NULL, ...)
>
> ## S3 method for class ’formula’
> C5.0(formula, data, weights, subset,
> na.action = na.pass, ...)
>
> I believe I need the method for class 'formula'. But I don't yet see in the 
> manual how to tell C50 that I want to use that method. If I run:
>
> respLevel = read.csv("Resp Level Data.csv")
> respLevelTree = C5.0(BRAND_NAME ~ PRI + PROM + REVW + MODE + FORM + FAMI + 
> DRRE + FREC + SPED, data = respLevel)
>
> ...I get an error message:
>
> Error in gsub(":", ".", x, fixed = TRUE) :
>   input string 18 is invalid in this locale

You're not doing it wrong.

Can you send me the results of sessionInfo()? I think there are a few
issues with the function on windows, so a reproducible example would
help solve the issue.

-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Caret: Use timingSamps leads to error

2012-07-12 Thread Max Kuhn
I can reproduce the errors. I'll take a look.

Thanks,

Max

On Thu, Jul 12, 2012 at 5:24 AM, Dominik Bruhn  wrote:
> I want to use the caret package and found out about the timingSamps
> obtion to obtain the time which is needed to predict results. But, as
> soon as I set a value for this option, the whole model generation fails.
> Check this example:
>
> -
> library(caret)
>
> tc=trainControl(method='LGOCV', timingSamps=10)
> tcWithout=trainControl(method='LGOCV')
>
> x=train(Volume~Girth+Height, method="lm", data=trees, trControl=tcWithout)
>
> x=train(Volume~Girth+Height, method="lm", data=trees, trControl=tc)
> Error in eval(expr, envir, enclos) : object 'Girth' not found
> Timing stopped at: 0 0 0.003
> 
>
> As you can see, the model generation works without the timingSamps
> option but fails if it is specified.
>
> What am I doing wrong?
>
> My sessioninfo:
> --
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8   LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=en_GB.UTF-8LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=C LC_NAME=C
>  [9] LC_ADDRESS=C   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> other attached packages:
> [1] MASS_7.3-18caret_5.15-023 foreach_1.4.0  cluster_1.14.2
> reshape_0.8.4
> [6] plyr_1.7.1 lattice_0.20-6
>
> loaded via a namespace (and not attached):
> [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6
> [5] tools_2.15.0
> -
>
> Thanks!
> --
> Dominik Bruhn
> mailto: domi...@dbruhn.de
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret() train based on cross validation - split dataset to keep sites together?

2012-05-30 Thread Max Kuhn
Tyrell,

If you want to have the folds contain data from only one site at a
time, you can develop a set of row indices and pass these to the index
argument in trainControl. For example

   index = list(site1 = c(1, 6, 8, 12), site2 = c(120, 152, 176, 178),
site3 = c(754, 789, 981))

The first fold would fit a model on the site 1 data in the first
element and predict everything else, and so on.

I'm not sure if this is what you need, but there you go.
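
For what it's worth, a rough, untested sketch of building such a list from a
site id column and handing it to trainControl() ('site' and 'dat' are just
placeholders here):

  library(caret)
  dat <- data.frame(site = rep(1:3, each = 10), x = rnorm(30), y = rnorm(30))
  siteIndex <- split(seq_len(nrow(dat)), dat$site)   # one element per site
  ctrl <- trainControl(method = "cv", index = siteIndex)
  ## to hold out one whole site per resample instead, use the complements:
  ## lapply(siteIndex, function(i) setdiff(seq_len(nrow(dat)), i))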

Max

On Wed, May 30, 2012 at 7:55 AM, Tyrell Deweber  wrote:
> Hello all,
>
> I have searched and have not yet identified a solution so now I am sending
> this message. In short, I need to split my data into training, validation,
> and testing subsets that keep all observations from the same sites together
> – preferably as part of a cross validation procedure. Now for the longer
> version. And I must confess that although my R skills are improving, they
> are not so highly developed.
>
> I am using 10 fold cross validation with 3 repeats in the train function of
> the caret() package to identify an optimal nnet (neural network) model to
> predict daily river water temperature at unsampled sites. I am also
> withholding data from 10% of sites to have a better understanding of
> generalization error. However, the focus on predictions at other sites is
> turning out to be not easily facilitated – as far as I can see.  My data
> structure (example at bottom of email) consists of columns identifying the
> site, the date, the water temperature on that day for the site (response
> variable), and many predictors.  There are over 220,000 individual
> observations at ~1,000 sites, and each site has a minimum of 30
> observations.  It is important to keep sites separate because selecting a
> model based on predictions at an already sampled site is likely
> overly-optimistic.
>
> Is there a way to split data for (or preferably during) cross validation
> procedure to:
>
> 1.) Selects a separate validation dataset from 10% of sites
> 2.) Splits remaining training data into cross validation subsets and most
> importantly, keeping all observations from a site together
> 3.) Secondarily, constrain partitions to be similar - ideally based on
> distributions of all variables
>
> It seems that some combination of the sample.split function of the caTools()
> package and the createdataPartition function of caret() might do this, but I
> am at a loss for how to code that.
>
> If this is not possible, I would be content to skip the cross validation
> procedure and create three similar splits of my data that keep all
> observations from a site together – one for training, one for testing, and
> one for validation.  The alternative goal here would be to split the data
> where 80% of sites are training, 10% of sites are for testing (model
> selection), and 10% of sites for validation.
>
> Thank you and please let me know if there are any remaining questions.  This
> is my first post as well, so if I left anything out that would be good to
> know as well.
>
> Tyrell Deweber
>
>
>
> R version 2.13.1 (2011-07-08)
> Copyright (C) 2011 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> Comid   tempymd    watmntemp   airtemp predictorb    …
> 15433    1980-05-01  11.4  22.1 …
> 15433    1980-05-02  11.6  23.6     …
> 15433    1980-05-03  11.2  28.5
> 15687    1980-06-01  13.5  26.5
> 15687    1980-06-02  14.2  26.9
> 15687    1980-06-03  13.8  28.9
> 18994    1980-04-05  8.4   16.4
> 18994    1980-04-06  8.3   12.6
> 90342    1980-07-13  18.9  22.3
> 90342    1980-07-14  19.3  28.4
>
>
> EXAMPLE SCRIPT FOR MODEL FITTING
>
>
> fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3)
>
> tuning <- read.table("temptunegrid.txt",head=T,sep=",")
> tuning
>
>
> # # Model with 100 iterations
> registerDoMC(4)
> tempmod100its <- train(watmntemp~tempa + tempb + tempc + tempd + tempe +
> netarea + netbuffor + strmslope +
>        netsoilprm + netslope + gwndx + mnaspect + urb + ag + forest +
> buffor + tempa7day + tempb7day +
>        tempc7day + tempd7day + tempe7day +  tempa30day + tempb30day +
> tempc30day + tempd30day +
>        tempe30day, data = temp.train, method = "nnet", linout=T, maxit =
> 100,
>        MaxNWts = 10, metric = "RMSE", trControl = fitControl, tuneGrid
> = tuning, trace = T)
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] caret: Error when using rpart and CV != LOOCV

2012-05-17 Thread Max Kuhn
Dominik,

There are a number of formulations of this statistic (see the
Kvålseth[*] reference below).

I tend to think of R^2 as the proportion of variance explained by the
model[**]. With the "traditional" formula, it is possible to get
negative proportions (if there are extreme outliers in the
predictions, the negative proportion can be very large). I used this
formulation because it is always in [0, 1]. It is called "R^2" after
all!

Here is an example:

> set.seed(1)
> simObserved <- rnorm(100)
> simPredicted <- simObserved + rnorm(100)*.1
>
> cor(simObserved, simPredicted)^2
[1] 0.9887525
> customSummary(data.frame(obs = simObserved,
+  pred = simPredicted))
  RMSE   Rsquared
0.09538273 0.98860908
>
> simPredicted[1]
[1] -0.6884905
> simPredicted[1] <- 10
>
> cor(simObserved, simPredicted)^2
[1] 0.3669257
> customSummary(data.frame(obs = simObserved,
+  pred = simPredicted))
 RMSE  Rsquared
 1.066900 -0.425169

It is somewhat extreme, but it does happen.

Max


* Kvålseth, T. (1985). Cautionary note about $R^2$. American
Statistician, 39(4), 279–285.
** This is a very controversial statement when non-linear models are
used. I'd rather use RMSE, but many scientists I work with still think
in terms of R^2 regardless of the model. The randomForest function
also computes this statistic, but calls it "% Var explained" instead
of explicitly labeling it as "R^2". This statistic has generated
heated debates and I hope that I will not have to wear a scarlet R in
Nashville in a few weeks.


On Thu, May 17, 2012 at 1:35 PM, Dominik Bruhn  wrote:
> Hy Max,
> thanks again for the answer.
>
> I checked the caret implementation and you were right. If the
> predictions for the model constant (or sd(pred)==0) then the
> implementation returns a NA for the rSquare (in postResample). This is
> mainly because the caret implementation uses `cor` (from the
> stats-package) which would throw a error for values with sd(pred)==0.
>
> Do you know why this is implemented in this way? I wrote my own
> summaryFunction which calculates rSquare by hand and it works fine. It
> nevertheless does NOT(!) generate the same values as the original
> implementation. The calculation of Rsquared does not seem to
> be consistent. I took mine from Wikipedia [1].
>
> Here is my code:
> ---
> customSummary <- function (data, lev = NULL, model = NULL) {
>         #Calulate rSquare
>         ssTot <- sum((data$obs-mean(data$obs))^2)
>         ssErr <- sum((data$obs-data$pred)^2)
>         rSquare <- 1-(ssErr/ssTot)
>
>         #Calculate MSE
>         mse <- mean((data$pred - data$obs)^2)
>
>         #Aggregate
>         out <- c(sqrt(mse), 1-(ssErr/ssTot))
>         names(out) <- c("RMSE", "Rsquared")
>
>         return(out)
> }
> ---
>
> [1]: http://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions
>
> Thanks!
> Dominik
>
>
>
>
> On 17/05/12 04:10, Max Kuhn wrote:
>> Dominik,
>>
>> See this line:
>>
>>>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  30.37   30.37   30.37   30.37   30.37   30.37
>>
>> The variance of the predictions is zero. caret uses the formula for
>> R^2 by calculating the correlation between the observed data and the
>> predictions which uses sd(pred) which is zero. I believe that the same
>> would occur with other formulas for R^2.
>>
>> Max
>>
>> On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn  wrote:
>>> Thanks Max for your answer.
>>>
>>> First, I do not understand your post. Why is it a problem if two of
>>> predictions match? From the formula for calculating R^2 I can see that
>>> there will be a DivByZero iff the total sum of squares is 0. This is
>>> only true if the predictions of all the predicted points from the
>>> test-set are equal to the mean of the test-set. Why should this happen?
>>>
>>> Anyway, I wrote the following code to check what you tried to tell:
>>>
>>> --
>>> library(caret)
>>> data(trees)
>>> formula=Volume~Girth+Height
>>>
>>> customSummary <- function (data, lev = NULL, model = NULL) {
>>>    print(summary(data$pred))
>>>    return(defaultSummary(data, lev, model))
>>> }
>>>
>>> tc=trainControl(method='cv', summaryFunction=customSummary)
>>> train(formula, data=trees,  method='rpart', trControl=tc)
>>> --
>>>
>>> This outputs:
>>> ---
>>>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  18.45   18.45   18.45   30.12   35.95   53.44

Re: [R] caret: Error when using rpart and CV != LOOCV

2012-05-16 Thread Max Kuhn
Dominik,

See this line:

>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  30.37   30.37   30.37   30.37   30.37   30.37

The variance of the predictions is zero. caret uses the formula for
R^2 by calculating the correlation between the observed data and the
predictions which uses sd(pred) which is zero. I believe that the same
would occur with other formulas for R^2.
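
For example, with constant predictions:

  obs  <- c(10, 20, 30, 40)
  pred <- rep(25, 4)
  cor(obs, pred)^2   # NA, with a warning that the standard deviation is zero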

Max

On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn  wrote:
> Thanks Max for your answer.
>
> First, I do not understand your post. Why is it a problem if two of
> predictions match? From the formula for calculating R^2 I can see that
> there will be a DivByZero iff the total sum of squares is 0. This is
> only true if the predictions of all the predicted points from the
> test-set are equal to the mean of the test-set. Why should this happen?
>
> Anyway, I wrote the following code to check what you tried to tell:
>
> --
> library(caret)
> data(trees)
> formula=Volume~Girth+Height
>
> customSummary <- function (data, lev = NULL, model = NULL) {
>    print(summary(data$pred))
>    return(defaultSummary(data, lev, model))
> }
>
> tc=trainControl(method='cv', summaryFunction=customSummary)
> train(formula, data=trees,  method='rpart', trControl=tc)
> --
>
> This outputs:
> ---
>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  18.45   18.45   18.45   30.12   35.95   53.44
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  22.69   22.69   22.69   32.94   38.06   53.44
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  30.37   30.37   30.37   30.37   30.37   30.37
> [cut many values like this]
> Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
> method = method,  :
>  There were missing values in resampled performance measures.
> -
>
> As I didn't understand your post, I don't know if this confirms your
> assumption.
>
> Thanks anyway,
> Dominik
>
>
> On 16/05/12 17:30, Max Kuhn wrote:
>> More information is needed to be sure, but it is most likely that some
>> of the resampled rpart models produce the same prediction for the
>> hold-out samples (likely the result of no viable split being found).
>>
>> Almost every incarnation of R^2 requires the variance of the
>> prediction. This particular failure mode would result in a divide by
>> zero.
>>
>> Try using you own summary function (see ?trainControl) and put a
>> print(summary(data$pred)) in there to verify my claim.
>>
>> Max
>>
>> On Wed, May 16, 2012 at 11:30 AM, Max Kuhn  wrote:
>>> More information is needed to be sure, but it is most likely that some
>>> of the resampled rpart models produce the same prediction for the
>>> hold-out samples (likely the result of no viable split being found).
>>>
>>> Almost every incarnation of R^2 requires the variance of the
>>> prediction. This particular failure mode would result in a divide by
>>> zero.
>>>
>>> Try using you own summary function (see ?trainControl) and put a
>>> print(summary(data$pred)) in there to verify my claim.
>>>
>>> Max
>>>
>>> On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn  wrote:
>>>> Hy,
>>>> I got the following problem when trying to build an rpart model and using
>>>> everything but LOOCV. Originally, I wanted to use k-fold partitioning,
>>>> but every partitioning except LOOCV throws the following warning:
>>>>
>>>> 
>>>> Warning message: In nominalTrainWorkflow(dat = trainData, info =
>>>> trainInfo, method = method, : There were missing values in resampled
>>>> performance measures.
>>>> -
>>>>
>>>> Below are some simplified test cases which reproduce the warning on my
>>>> system.
>>>>
>>>> Question: What does this error mean? How can I avoid it?
>>>>
>>>> System-Information:
>>>> -
>>>>> sessionInfo()
>>>> R version 2.15.0 (2012-03-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base

Re: [R] caret: Error when using rpart and CV != LOOCV

2012-05-16 Thread Max Kuhn
More information is needed to be sure, but it is most likely that some
of the resampled rpart models produce the same prediction for the
hold-out samples (likely the result of no viable split being found).

Almost every incarnation of R^2 requires the variance of the
prediction. This particular failure mode would result in a divide by
zero.

Try using you own summary function (see ?trainControl) and put a
print(summary(data$pred)) in there to verify my claim.

Max

On Wed, May 16, 2012 at 11:30 AM, Max Kuhn  wrote:
> More information is needed to be sure, but it is most likely that some
> of the resampled rpart models produce the same prediction for the
> hold-out samples (likely the result of no viable split being found).
>
> Almost every incarnation of R^2 requires the variance of the
> prediction. This particular failure mode would result in a divide by
> zero.
>
> Try using you own summary function (see ?trainControl) and put a
> print(summary(data$pred)) in there to verify my claim.
>
> Max
>
> On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn  wrote:
>> Hy,
>> I got the following problem when trying to build an rpart model and using
>> everything but LOOCV. Originally, I wanted to use k-fold partitioning,
>> but every partitioning except LOOCV throws the following warning:
>>
>> 
>> Warning message: In nominalTrainWorkflow(dat = trainData, info =
>> trainInfo, method = method, : There were missing values in resampled
>> performance measures.
>> -
>>
>> Below are some simplified test cases which reproduce the warning on my
>> system.
>>
>> Question: What does this error mean? How can I avoid it?
>>
>> System-Information:
>> -
>>> sessionInfo()
>> R version 2.15.0 (2012-03-30)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] rpart_3.1-52   caret_5.15-023 foreach_1.4.0  cluster_1.14.2
>> reshape_0.8.4
>> [6] plyr_1.7.1     lattice_0.20-6
>>
>> loaded via a namespace (and not attached):
>> [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0     iterators_1.0.6
>> [5] tools_2.15.0
>> ---
>>
>>
>> Simplified Testcase I: Throws warning
>> ---
>> library(caret)
>> data(trees)
>> formula=Volume~Girth+Height
>> train(formula, data=trees,  method='rpart')
>> ---
>>
>> Simplified Testcase II: Every other CV-method also throws the warning,
>> for example using 'cv':
>> ---
>> library(caret)
>> data(trees)
>> formula=Volume~Girth+Height
>> tc=trainControl(method='cv')
>> train(formula, data=trees,  method='rpart', trControl=tc)
>> ---
>>
>> Simplified Testcase III: The only CV-method which is working is 'LOOCV':
>> ---
>> library(caret)
>> data(trees)
>> formula=Volume~Girth+Height
>> tc=trainControl(method='LOOCV')
>> train(formula, data=trees,  method='rpart', trControl=tc)
>> ---
>>
>>
>> Thanks!
>> --
>> Dominik Bruhn
>> mailto: domi...@dbruhn.de
>>
>>
>>
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
>
> Max



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret package: custom summary function in trainControl doesn't work with oob?

2012-04-13 Thread Max Kuhn
Matt,

> I've been using a custom summary function to optimise regression model
> methods using the caret package. This has worked smoothly. I've been using
> the default bootstrapping resampling method. For bagging models
> (specifically randomForest in this case) caret can, in theory, use the
> out-of-bag (oob) error estimate from the model instead of resampling, which
> (in theory) is largely redundant for such models. Since they take a while
> to build in the first place, it really slows things down when estimating
> performance using boostrap.
>
> I can successfully run either using the oob 'resampling method' with the
> default RMSE optimisation, or run using bootstrap and my custom
> summaryFunction as the thing to optimise, but they don't work together. If
> I try and use oob and supply a summaryFunction caret throws an error saying
> it can't find the relevant metric.
>
> Now, if caret is simply polling the randomForest object for the stored oob
> error I can understand this limitation

That is exactly what it does. See caret:::rfStats (not a public function)

train() was written to be fairly general and this level of control
would be very difficult to implement, especially since each model that
does some type of bagging uses different internal structures etc.

> but in the case of randomForest
> (and probably other bagging methods?) the training function can be asked to
> return information about the individual tree predictions and whether data
> points were oob in each case. With this information you can reconstruct an
> oob 'error' using whatever function you choose to target for optimisation.
> As far as I can tell, caret is not doing this and I can't see anywhere that
> it can be coerced to do so.

It will not be able to do this. I'm not sure that you can either.
randomForest() will return the individual forests and
predict.randomForest() can return the per-tree results but I don't
know if it saves the indices that tell you which bootstrap samples
contained which training set points. Perhaps Andy would know.

> Have I missed something? Can anyone suggest how this could be achieved? It
> wouldn't be *that* hard to code up something that essentially operates in
> the same way as caret.train but can handle this feature for bagging models,
> but if it is already there and I've missed something please let me know.

Well, everything is easy for the person not doing it =]

If you save the proximity measures, you might gain the sampling
indices. With these, you would use predict.randomForest(...,
predict.all=TRUE) to get the individual predictions.
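
A rough, untested sketch of that last step ('rfFit' is just an example fit):

  library(randomForest)
  rfFit <- randomForest(mpg ~ ., data = mtcars, ntree = 25)
  perTree <- predict(rfFit, mtcars, predict.all = TRUE)
  dim(perTree$individual)   # one row per sample, one column per tree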

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] nonparametric densities for bounded distributions

2012-03-09 Thread Max Kuhn
Can anyone recommend a good nonparametric density estimation approach for
bounded data (say between 0 and 1)?

For example, using the basic Gaussian density approach doesn't generate a
very realistic shape (nor should it):

> set.seed(1)
> dat <- rbeta(100, 1, 2)
> plot(density(dat))

(note the area outside of 0/1)

The data I have may be bimodal or have other odd properties (e.g. a point
mass at zero). I've tried transforming via the logit, estimating the
density, then plotting the curve in the original units, but this seems to do
poorly in the tails (and I have data at exactly zero and one).
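
Roughly what I tried (a sketch; the clamping at the boundaries is ad hoc):

  set.seed(1)
  dat <- rbeta(100, 1, 2)
  eps <- 1e-4
  z <- qlogis(pmin(pmax(dat, eps), 1 - eps))     # logit transform
  dz <- density(z)
  xg <- plogis(dz$x)
  plot(xg, dz$y / (xg * (1 - xg)), type = "l")   # back-transform with the Jacobian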

Thanks,

Max


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Custom caret metric based on prob-predictions/rankings

2012-02-10 Thread Max Kuhn
I think you need to read the man pages and the four vignettes. A lot
of your questions have answers there.

If you don't specify the resampling indices, they ones generated for
you are saved in the train object:

> data(iris)
> TrainData <- iris[,1:4]
> TrainClasses <- iris[,5]
>
> knnFit1 <- train(TrainData, TrainClasses,
+  method = "knn",
+  preProcess = c("center", "scale"),
+  tuneLength = 10,
+  trControl = trainControl(method = "cv"))
Loading required package: class

Attaching package: ‘class’

The following object(s) are masked from ‘package:reshape’:

condense

Warning message:
executing %dopar% sequentially: no parallel backend registered
> str(knnFit1$control$index)
List of 10
 $ Fold01: int [1:135] 1 2 3 4 5 6 7 9 10 11 ...
 $ Fold02: int [1:135] 1 2 3 4 5 6 8 9 10 12 ...
 $ Fold03: int [1:135] 1 3 4 5 6 7 8 9 10 11 ...
 $ Fold04: int [1:135] 1 2 3 5 6 7 8 9 10 11 ...
 $ Fold05: int [1:135] 1 2 3 4 6 7 8 9 11 12 ...
 $ Fold06: int [1:135] 1 2 3 4 5 6 7 8 9 10 ...
 $ Fold07: int [1:135] 1 2 3 4 5 7 8 9 10 11 ...
 $ Fold08: int [1:135] 2 3 4 5 6 7 8 9 10 11 ...
 $ Fold09: int [1:135] 1 2 3 4 5 6 7 8 9 10 ...
 $ Fold10: int [1:135] 1 2 4 5 6 7 8 10 11 12 ...

There is also a savePredictions argument that gives you the hold-out results.
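
For example, an untested sketch continuing the knn fit above:

  ctrl <- trainControl(method = "cv", classProbs = TRUE, savePredictions = TRUE)
  knnFit2 <- train(TrainData, TrainClasses,
                   method = "knn",
                   preProcess = c("center", "scale"),
                   trControl = ctrl)
  head(knnFit2$pred)   # held-out predictions (and class probabilities) per resample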

I'm not sure which weights you are referring to.

On Fri, Feb 10, 2012 at 4:38 AM, Yang Zhang  wrote:
> Actually, is there any way to get at additional information beyond the
> classProbs?  In particular, is there any way to find out the
> associated weights, or otherwise the row indices into the original
> model matrix corresponding to the tested instances?
>
> On Thu, Feb 9, 2012 at 4:37 PM, Yang Zhang  wrote:
>> Oops, found trainControl's classProbs right after I sent!
>>
>> On Thu, Feb 9, 2012 at 4:30 PM, Yang Zhang  wrote:
>>> I'm dealing with classification problems, and I'm trying to specify a
>>> custom scoring metric (recall@p, ROC, etc.) that depends on not just
>>> the class output but the probability estimates, so that caret::train
>>> can choose the optimal tuning parameters based on this metric.
>>>
>>> However, when I supply a trainControl summaryFunction, the data given
>>> to it contains only class predictions, so the only metrics possible
>>> are things like accuracy, kappa, etc.
>>>
>>> Is there any way to do this that I'm missing?  If not, could I put
>>> this in as a feature request?  Thanks!
>>>
>>> --
>>> Yang Zhang
>>> http://yz.mit.edu/
>>
>>
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/
>
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Choosing glmnet lambda values via caret

2012-02-09 Thread Max Kuhn
You can adjust the candidate set of tuning parameters via the tuneGrid
argument in trian() and the process by which the optimal choice is
made (via the 'selectionFunction' argument in trainControl()). Check
out the package vignettes.

The latest version also has an update.train() function that lets the
user manually specify the tuning parameters after the call to train().
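
For example, a rough, untested sketch with user-specified values ('x' and 'y'
are placeholders for your data):

  myGrid <- expand.grid(.alpha = c(0.1, 0.5, 1),
                        .lambda = 10^seq(-4, 0, length = 10))
  fit <- train(x, y, method = "glmnet",
               tuneGrid = myGrid,
               trControl = trainControl(method = "cv"))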

On Thu, Feb 9, 2012 at 7:00 PM, Yang Zhang  wrote:
> Usually when using raw glmnet I let the implementation choose the
> lambdas.  However when training via caret::train the lambda values are
> predetermined.  Is there any way to have caret defer the lambda
> choices to glmnet and thus choose the optimal lambda
> dynamically?
>
> --
> Yang Zhang
> http://yz.mit.edu/
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] lattice key in blank panel

2011-12-15 Thread Max Kuhn
Somewhere I've seen an example of an xyplot() where the key was placed
in a location of a missing panel. For example, if there were 3
conditioning levels, the panel grid would look like:

3 4
1 2

In this (possibly imaginary) example, there were scatter plots in
locations 1:3 and location 4 had no conditioning bar at the top, only
the key.

I can find examples of putting the legend outside of the panel
locations (e.g. to the right of locations 2 and 4 above), but that's
not really what I'd like to do.

Thanks,

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] palettes for the color-blind

2011-11-02 Thread Max Kuhn
Yes, I was aware of the different type and their respective prevalences.

The dichromat package helped me find what I needed.
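
For example (a quick sketch, using col2 from my earlier message):

  library(dichromat)
  plot(seq(along = col2), pch = 16, cex = 1.5,
       col = dichromat(col2, type = "deutan"))   # simulate deuteranopia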

Thanks,

Max

On Wed, Nov 2, 2011 at 6:38 PM, Thomas Lumley  wrote:
> On Thu, Nov 3, 2011 at 11:04 AM, Carl Witthoft  wrote:
>>
>> Before you pick out a palette: you are aware that there are several
>> different types of color-blindness, aren't you?
>
> Yes, but to first approximation there are only two, and they have
> broadly similar, though not identical impact on choice of color
> palettes.  The dichromat package knows about them, and so does
> Professor Brewer.
>
> More people will be unable to read your graphs due to some kind of
> gross visual impairment (cataracts, uncorrected focusing problems,
> macular degeneration, etc) than will have tritanopia or monochromacy.
>
>   -thomas
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] palettes for the color-blind

2011-11-02 Thread Max Kuhn
Everyone,

I'm working with scatter plots with different colored symbols (via
lattice). I'm currently using these colors for points and lines:

col1 <- c(rgb(1, 0, 0), rgb(0, 0, 1),
 rgb(0, 1, 0),
 rgb(0.55482458, 0.40350876, 0.0416),
 rgb(0, 0, 0))
plot(seq(along = col1), pch = 16, col = col1, cex = 1.5)

I'm also using these with transparency (alpha between .5-.8 depending
on the number of points).

I'd like to make sure that these colors are interpretable by the color
blind. Doing a little looking around, this might be a good palette:

col2 <- c(rgb(0, 0.4470588, 0.6980392),
  rgb(0.8352941, 0.3686275, 0),
  rgb(0.800, 0.4745098, 0.6549020),
  rgb(0.1686275, 0.6235294, 0.4705882),
  rgb(0.9019608, 0.6235294, 0.000))

plot(seq(along = col2), pch = 16, col = col2, cex = 1.5)

but to be honest, I'd like to use something a little more vibrant.

First, can anyone verify that the colors in col2 are
differentiable to someone who is color blind?

Second, are there any other specific palettes that can be recommended?
How do the RColorBrewer palettes rate in this respect?

Thanks,

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Contrasts with an interaction. How does one specify the dummy variables for the interaction

2011-10-31 Thread Max Kuhn
This is failing because it is a saturated model and the contrast
package tries to do a t-test (instead of a z test). I can add code to
do this, but it will take a few days.

Max

On Fri, Oct 28, 2011 at 2:16 PM, John Sorkin
 wrote:
> Forgive my resending this post. To date I have received only one response 
> (thank you Bert Gunter), and I still do not have an answer to my question.
> Respectfully,
> John
>
>
> Windows XP
> R 2.12.1
> contrast package.
>
>
> I am trying to understand how to create contrasts for a model that contains 
> an interaction. I can get contrasts to work for a model without interaction, 
> but not after adding the interaction. Please see code below. The last two 
> contrast statements show the problem. I would appreciate someone letting me 
> know what is wrong with the syntax of my contrast statements.
> Thank you,
> John
>
>
> library(contrast)
>
> # Create 2x2 contingency table.
> counts=c(50,50,30,70)
> row <-    gl(2,2,4)
> column <- gl(2,1,4)
> mydata <- data.frame(row,column,counts)
> print(mydata)
>
> # Show levels of 2x2 table
> levels(mydata$row)
> levels(mydata$column)
>
>
> # Models, no interaction, and interaction
> fitglm0 <- glm(counts ~ row + column,              family=poisson(link="log"))
> fitglm  <- glm(counts ~ row + column + row*column, family=poisson(link="log"))
>
> # Contrasts for model without interaction works fine!
> anova(fitglm0)
> summary(fitglm0)
> con0<-contrast(fitglm0,list(row="1",column="1"))
> print(con0,X=TRUE)
>
> # Contrast for model with interaction does not work.
> anova(fitglm)
> summary(fitglm)
> con<-contrast(fitglm,list(row="1",column="1"))
> print(con,X=TRUE)
>
> # Nor does this work.
> con<-contrast(fitglm,list(row="1",column="1",row:column=c("0","0")))
> print(con,X=TRUE)
>
>
>
>
> John David Sorkin M.D., Ph.D.
> Chief, Biostatistics and Informatics
> University of Maryland School of Medicine Division of Gerontology
> Baltimore VA Medical Center
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> (Phone) 410-605-7119
> (Fax) 410-605-7913 (Please call phone number above prior to faxing)
>
> Confidentiality Statement:
> This email message, including any attachments, is for ...{{dropped:16}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help with parallel processing code

2011-10-31 Thread Max Kuhn
I'm not sure what you mean by full code or the iteration. This uses
foreach to parallelize the loops over different tuning parameters and
resampled data sets.

The only way I could see to split up the parallelism is if you are
fitting different models to the same data. In that case, you could
launch separate jobs for each model. If the data is large and quickly
read from disk, that might be better than storing it in memory and
sequentially running models in the same script. We have decent sized
machines here, so we launch different jobs per model and then
parallelize each (even if it is using 2-3 cores it helps).

Thanks,

Max

On Fri, Oct 28, 2011 at 10:49 AM, 1Rnwb  wrote:
> The part of the question that dawned on me now is: should I try to do the parallel
> processing of the full code or only the iteration part? If it is the full code
> then I am at the complete mercy of the R help community, or I give up on this
> and let the computation run the serial way, which has been running since last
> Sat.
> Sharad
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/help-with-parallel-processing-code-tp3944303p3948118.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help with parallel processing code

2011-10-27 Thread Max Kuhn
I have had issues with some parallel backends not finding functions
within a namespace for packages listed in the ".packages" argument or
explicitly loaded in the body of the foreach loop. This has occurred
with MPI but not with multicore. I can get around this to some extent
by calling the functions using the namespace (e.g. foo:::bar) but this
is pretty kludgy.
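
For example, a stripped-down sketch of the workaround (the fitdistr() call is
just for illustration):

  library(foreach)
  out <- foreach(i = 1:4) %dopar% {
    MASS::fitdistr(rnorm(50), "normal")$estimate   # qualify the namespace explicitly
  }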

> sessionInfo()
R version 2.13.2 (2011-09-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] doMPI_0.1-5 Rmpi_0.5-9  doMC_1.2.3  multicore_0.1-7
foreach_1.3.2   codetools_0.2-8 iterators_1.0.5

Max

On Thu, Oct 27, 2011 at 4:30 PM, 1Rnwb  wrote:
> If I understand correctly you mean to write the line as below:
>
> foreach(icount(itr),.combine=combine,.options.smp=smpopts,.packages='MASS')%dopar%
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/help-with-parallel-processing-code-tp3944303p3945954.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] difference between createPartition and createfold functions

2011-10-03 Thread Max Kuhn
No, it is an argument to createFolds. Type ?createFolds to see the
appropriate syntax: "returnTrain a logical. When true, the values
returned are the sample positions corresponding to the data used
during training. This argument only works in conjunction with list =
TRUE"
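
For instance (a small sketch; 'y' is a placeholder for your outcome vector):

  heldOut   <- createFolds(y, k = 10, list = TRUE)                      # positions held out in each fold
  trainRows <- createFolds(y, k = 10, list = TRUE, returnTrain = TRUE)  # positions used for training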

On Mon, Oct 3, 2011 at 11:10 AM,   wrote:
> Hi Max,
>
> Thanks for the note. In your last paragraph, did you mean "in
> createDataPartition"? I'm a little vague about what returnTrain option does.
>
> Bonnie
>
> Quoting Max Kuhn :
>
>> Basically, createDataPartition is used when you need to make one or
>> more simple two-way splits of your data. For example, if you want to
>> make a training and test set and keep your classes balanced, this is
>> what you could use. It can also make multiple splits of this kind (or
>> leave-group-out CV aka Monte Carlo CV aka repeated training/test
>> splits).
>>
>> createFolds is exclusively for k-fold CV. Their usage is similar when
>> you use the returnTrain = TRUE option in createFolds.
>>
>> Max
>>
>> On Sun, Oct 2, 2011 at 4:00 PM, Steve Lianoglou
>>  wrote:
>>>
>>> Hi,
>>>
>>> On Sun, Oct 2, 2011 at 3:54 PM,   wrote:
>>>>
>>>> Hi Steve,
>>>>
>>>> Thanks for the note. I did try the example and the result didn't make
>>>> sense
>>>> to me. For splitting a vector, what you describe is a big difference btw
>>>> them. For splitting a dataframe, I now wonder if these 2 functions are
>>>> the
>>>> wrong choices. They seem to split the columns, at least in the few
>>>> things I
>>>> tried.
>>>
>>> Sorry, I'm a bit confused now as to what you are after.
>>>
>>> You don't pass in a data.frame into any of the
>>> createFolds/DataPartition functions from the caret package.
>>>
>>> You pass in a *vector* of labels, and these functions tells you which
>>> indices into the vector to use as examples to hold out (or keep
>>> (depending on the value you pass in for the `returnTrain` argument))
>>> between each fold/partition of your learning scenario (eg. cross
>>> validation with createFolds).
>>>
>>> You would then use these indices to keep (remove) the rows of a
>>> data.frame, if that is how you are storing your examples.
>>>
>>> Does that make sense?
>>>
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>>
>> Max
>>
>>
>
>
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] difference between createPartition and createfold functions

2011-10-02 Thread Max Kuhn
Basically, createDataPartition is used when you need to make one or
more simple two-way splits of your data. For example, if you want to
make a training and test set and keep your classes balanced, this is
what you could use. It can also make multiple splits of this kind (or
leave-group-out CV aka Monte Carlo CV aka repeated training/test
splits).

createFolds is exclusively for k-fold CV. Their usage is similar when
you use the returnTrain = TRUE option in createFolds.
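
A small illustration (untested):

  library(caret)
  y <- factor(rep(c("a", "b"), times = c(40, 60)))
  inTrain <- createDataPartition(y, p = 0.75, list = FALSE)  # one stratified 75/25 split
  folds   <- createFolds(y, k = 10, returnTrain = TRUE)      # training positions for 10-fold CV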

Max

On Sun, Oct 2, 2011 at 4:00 PM, Steve Lianoglou
 wrote:
> Hi,
>
> On Sun, Oct 2, 2011 at 3:54 PM,   wrote:
>> Hi Steve,
>>
>> Thanks for the note. I did try the example and the result didn't make sense
>> to me. For splitting a vector, what you describe is a big difference btw
>> them. For splitting a dataframe, I now wonder if these 2 functions are the
>> wrong choices. They seem to split the columns, at least in the few things I
>> tried.
>
> Sorry, I'm a bit confused now as to what you are after.
>
> You don't pass in a data.frame into any of the
> createFolds/DataPartition functions from the caret package.
>
> You pass in a *vector* of labels, and these functions tells you which
> indices into the vector to use as examples to hold out (or keep
> (depending on the value you pass in for the `returnTrain` argument))
> between each fold/partition of your learning scenario (eg. cross
> validation with createFolds).
>
> You would then use these indices to keep (remove) the rows of a
> data.frame, if that is how you are storing your examples.
>
> Does that make sense?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave: Combining multiple output statements in a function

2011-09-16 Thread Max Kuhn
formatting.odt, page 7. The results are in formattingOut.odt

On Thu, Sep 15, 2011 at 2:44 PM, Jan van der Laan  wrote:
> Max,
>
> Thank you for your answer. I have had another look at the examples (I
> already had before mailing the list), but could not find the example you
> mention. Could you perhaps tell me which example I should have a look at?
>
> Regards,
> Jan
>
>
>
> On 09/15/2011 04:47 PM, Max Kuhn wrote:
>>
>> There are examples in the package directory that explain this.
>>
>> On Thu, Sep 15, 2011 at 8:16 AM, Jan van der Laan
>>  wrote:
>>>
>>> What is the correct way to combine multiple calls to odfCat, odfItemize,
>>> odfTable etc. inside a function?
>>>
>>> As an example lets say I have a function that needs to write two
>>> paragraphs
>>> of text and a list to the resulting odf-document (the real function has
>>> much
>>> more complex logic, but I don't think thats relevant). My first guess
>>> would
>>> be:
>>>
>>> exampleOutput<- function() {
>>>   odfCat("This is the first paragraph")
>>>   odfCat("This is the second paragraph")
>>>   odfItemize(letters[1:5])
>>> }
>>>
>>> However, calling this function in my odf-document only generates the last
>>> list as only the output of the odfItemize function is returned by
>>> exampleOutput. How do I combine the three results into one to be returned
>>> by
>>> exampleOutput?
>>>
>>> I tried to wrap the calls to the odf* functions into a print statement:
>>>
>>> exampleOutput2<- function() {
>>>   print(odfCat("This is the first paragraph"))
>>>   print(odfCat("This is the second paragraph"))
>>>   print(odfItemize(letters[1:5]))
>>> }
>>>
>>> In another document this seemed to work, but in my current document
>>> strange
>>> odf-output is generated.
>>>
>>> Regards,
>>>
>>> Jan
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave: Combining multiple output statements in a function

2011-09-15 Thread Max Kuhn
There are examples in the package directory that explain this.

On Thu, Sep 15, 2011 at 8:16 AM, Jan van der Laan  wrote:
>
> What is the correct way to combine multiple calls to odfCat, odfItemize,
> odfTable etc. inside a function?
>
> As an example lets say I have a function that needs to write two paragraphs
> of text and a list to the resulting odf-document (the real function has much
> more complex logic, but I don't think thats relevant). My first guess would
> be:
>
> exampleOutput <- function() {
>   odfCat("This is the first paragraph")
>   odfCat("This is the second paragraph")
>   odfItemize(letters[1:5])
> }
>
> However, calling this function in my odf-document only generates the last
> list as only the output of the odfItemize function is returned by
> exampleOutput. How do I combine the three results into one to be returned by
> exampleOutput?
>
> I tried to wrap the calls to the odf* functions into a print statement:
>
> exampleOutput2 <- function() {
>   print(odfCat("This is the first paragraph"))
>   print(odfCat("This is the second paragraph"))
>   print(odfItemize(letters[1:5]))
> }
>
> In another document this seemed to work, but in my current document strange
> odf-output is generated.
>
> Regards,
>
> Jan
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Trying to extract probabilities in CARET (caret) package with a glmStepAIC model

2011-08-28 Thread Max Kuhn
Can you provide a reproducible example and the results of
sessionInfo()? What are the levels of your classes?

On Sat, Aug 27, 2011 at 10:43 PM, Jon Toledo  wrote:
>
> Dear developers,
> I have just started working with caret and all the nice features it offers.
> But I just encountered a problem:
> I am working with a dataset that includes 4 predictor variables in Descr and a
> two-category outcome in Categ (codified as a factor).
> Everything was working fine; I got the results, confusion matrix, etc.
> BUT for obtaining the AUC and predicted probabilities I had to add "
> classProbs = TRUE," in the trainControl. Thereafter every time I run train I
> get this message:
> "undefined columns selected"
>
> I copy the syntax:
> fitControl <- trainControl(method = "cv", number = 10, classProbs = 
> TRUE,returnResamp = "all", verboseIter = FALSE)
> glmFit <- train(Descr, Categ, method = "glmStepAIC",tuneLength = 4,trControl 
> = fitControl)
> Thank you.
> Best regards,
>
> Jon Toledo, MD
>
> Postdoctoral fellow
> University of Pennsylvania School of Medicine
> Center for Neurodegenerative Disease Research
> 3600 Spruce Street
> 3rd Floor Maloney Building
> Philadelphia, Pa 19104
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] aucRoc in caret package [SEC=UNCLASSIFIED]

2011-06-01 Thread Max Kuhn
David,

The ROC curve should really be computed with some sort of numeric data
(as opposed to classes). It varies the cutoff to get a continuum of
sensitivity and specificity values.  Using the classes as 1's and 2's
implies that the second class is twice the value of the first, which
doesn't really make sense.

Try getting the class probabilities for predicted1 and predicted2 and
use those instead.
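
A rough sketch ('fit1' stands in for whatever model produced predicted1, and
it has to be able to return class probabilities):

  prob1 <- predict(fit1, trainx, type = "prob")[, "hard"]   # P(class == "hard")
  aucRoc(roc(prob1, trainy))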

Thanks,

Max


On Wed, Jun 1, 2011 at 9:24 PM,  wrote:
>
> Please note that predicted1 and predicted2 are two sets of predictions 
> instead of predictors. As you can see, the predictions have only two levels: 1 
> is for hard and 2 for soft. I need to assess which one is more accurate. Hope 
> this is clear now. Thanks.
> Jin
>
> -Original Message-
> From: David Winsemius [mailto:dwinsem...@comcast.net]
> Sent: Thursday, 2 June 2011 10:55 AM
> To: Li Jin
> Cc: R-help@r-project.org
> Subject: Re: [R] aucRoc in caret package [SEC=UNCLASSIFIED]
>
> Using AUC for discrete predictor variables with only two levels
> doesn't seem very sensible. What are you planning to do with this
> measure?
>
> --
> David.
>
> On Jun 1, 2011, at 8:47 PM,   wrote:
>
> > Hi all,
> > I used the following code and data to get auc values for two sets of
> > predictions:
> >            library(caret)
> >> table(predicted1, trainy)
> >   trainy
> >    hard soft
> >  1   27    0
> >  2   11   99
> >> aucRoc(roc(predicted1, trainy))
> > [1] 0.5
> >
> >
> >> table(predicted2, trainy)
> >   trainy
> >    hard soft
> >  1   27    2
> >  2   11   97
> >> aucRoc(roc(predicted2, trainy))
> > [1] 0.8451621
> >
> > predicted1:
> > 1 1 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2
> > 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
> > 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 2 2
> > 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> >
> > predicted2:
> > 1 1 2 1 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2
> > 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
> > 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 2 2
> > 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> >
> > trainy:
> > hard hard hard soft soft hard hard hard hard soft soft soft soft
> > soft soft hard soft soft soft soft soft soft hard soft soft soft
> > soft soft soft soft soft soft hard soft soft soft soft soft hard
> > soft soft soft soft hard hard soft soft soft hard soft hard soft
> > soft soft soft soft hard soft soft soft soft soft soft soft soft
> > hard soft soft soft soft soft hard soft soft soft soft soft soft
> > soft hard soft soft soft hard hard hard hard hard soft soft hard
> > hard hard soft hard soft soft soft hard hard soft soft soft soft
> > soft hard hard hard hard hard hard hard soft soft soft soft soft
> > soft soft soft soft soft soft soft soft soft soft soft hard soft
> > soft soft soft soft soft soft soft
> > Levels: hard soft
> >
> >> Sys.info()
> >                     sysname
> > release                      version                     nodename
> >                   "Windows"                      "XP"        "build
> > 2600, Service Pack 3"        "PC-60772"
> >                     machine
> >                       "x86"
> >
> > I would expect predicted1 to be more accurate than predicted2. But
> > the auc values show the opposite. I was wondering whether this is a
> > bug or I have done something wrong.  Thanks for your help in advance!
> >
> > Cheers,
> >
> > Jin
> > 
> > Jin Li, PhD
> > Spatial Modeller/Computational Statistician
> > Marine & Coastal Environment
> > Geoscience Australia
> > GPO Box 378, Canberra, ACT 2601, Australia
> >
> > Ph: 61 (02) 6249 9899; email:
> > jin...@ga.gov.au
> > ___
> >
> >
> >
> >       [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



--

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] issue with odfWeave running on Windows XP; question about installing packages under Linux

2011-05-18 Thread Max Kuhn
Sorry for the delayed response.

An upgrade of the XML package has broken odfWeave; see this thread:

   https://stat.ethz.ch/pipermail/r-help/2011-May/278063.html

That may be your issue. We're working on the problem now. I'll post to
R-Packages when we have a working update. If you like, I can send you
the eventual fixes to test.

Thanks,

Max


On Tue, May 17, 2011 at 3:35 PM,   wrote:
> I also have a problem using odfWeave on Windows XP with R > R2.11.1. odfWeave 
> fails, giving mysterious error messages. (Not quite the same as yours, but 
> similar. I sent the info to Max Kuhn privately, but did not get a response 
> after two tries.) My odfWeave reporting system worked fine prior to R2.12 and 
> then the same code that ran fine under R2.11.1 stopped working. Using the 
> very same machine and running the very same code under R2.11.1 it still runs 
> fine today. So, something is not quite right with odfWeave on Windows XP for 
> R > R2.11.1, and I don't know what it is. My "solution" is to keep R2.11.1 
> around until it can be resolved.
>
> Eric
>
>
>
> - Original message -
> From: "Cormac Long" 
> To: r-help@r-project.org
> Date: Fri, 13 May 2011 10:45:06 +0100
> Subject: [R] issue with odfWeave running on Windows XP; question about 
> installing packages under Linux
>
> Good morning R community,
>
> I have two questions (and a comment):
> 1)
> A problem with odfWeave. I have an odf document
> with a table that spans multiple pages. Each cell in the table is
> populated using \sexpr{}. This worked fine on my
> own machine (windows 7 box using any R2.x.y, for x>=11) and
> on a colleagues machine (Windows XP box running R2.11.1).
> However, on a third machine (Windows XP box running R2.12.0
> or R2.13.0), odfWeave fails with the following error:
>    Error in parse(text = cmd) : :1:36: unexpected '>'
>    1: GLOBAL_CONTAI
> A poke around in the unzipped odt file reveals the culprit:
>    \Sexpr{GLOBAL_CONTAINER$repDat$Dec[i]}
> which should read
>    \Sexpr{GLOBAL_CONTAINER$repDat$Dec[i]}
>
> The page break coincides with where the table overruns from
> one page to the next.
>
> Now, if this was a constant error across all machines, that
> would be annoying, but ok. My questions are:
>   a) Can anyone think of a sensible suggestion as to why this has
>       happened only on one machine, and not on other machines?
>   b) Is there any way of handling such silent xml modifications
>      (apart from odfTable, which I have only just bumped into, or
>      extremely judicious choice of table construction, which is
>      tedious and unreliable)?
>
> 2)
> When installing some packages on linux (notably RODBC and XML),
> you need to ensure that you linux distro has extra header files installed.
> This is a particular issue in Ubuntu. The question is: is there any way
> that a package can check for necessary external header files and issue
> suitable warnings? For example, if you try to install RODBC on Ubuntu
> without first installing unixodbc-dev, the installation will fail with the
> error:
>    configure: error: "ODBC headers sql.h and sqlext.h not found
> which is useful, but not particularly suggestive of requiring unixodbc-dev
>
>
> A further comment on odfWeave: odfWeave uses system calls to
> zip and unzip when processing the odt documents. Would it not
> be a good idea for the odfWeave package to check for the presence
> of zip and unzip utilities and report accordingly when trying to install?
> By default, Windows XP boxes do not have these utilities installed
> (installing Rtools does away with this problem).
>
>
> Many thanks in advance,
> Dr. Cormac Long.
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?

2011-05-13 Thread Max Kuhn
Frank,

It depends on how you define "optimal". While I'm not a big fan of
using the area under the ROC to characterize performance, there are a
lot of times when likelihood measures are clearly sub-optimal in
performance. Using resampled accuracy (or Kappa) instead of deviance
(out-of-bag or not) is likely to produce more accurate models (not
shocking, right?).

The best example is determining the number of boosting iterations.
From Friedman (2001): ``[...] degrading the likelihood by overfitting
actually improves misclassification error rates. Although perhaps
counterintuitive, this is not a contradiction; likelihood and error
rate measure different aspects of fit quality.''
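
In caret terms, that amounts to something like the following rough
sketch (x and y stand in for your predictors and two-class outcome):

   gbmGrid <- expand.grid(.n.trees = c(100, 500, 1000, 2000),
                          .interaction.depth = 2,
                          .shrinkage = 0.01)
   gbmTune <- train(x, y, method = "gbm",
                    metric = "Kappa",     # resampled Kappa instead of deviance
                    tuneGrid = gbmGrid,
                    trControl = trainControl(method = "cv"),
                    verbose = FALSE)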

My argument here assumes that you are fitting a model for the purposes
of prediction rather than interpretation. This particular case
involves random forests, so I'm hoping that statistical inference is
not the goal.


Ref: Friedman. Greedy function approximation: a gradient boosting
machine. Annals of Statistics (2001) pp. 1189-1232


Thanks,

Max

On Fri, May 13, 2011 at 8:11 AM, Frank Harrell  wrote:
> Using anything other than deviance (or likelihood) as the objective function
> will result in a suboptimal model.
> Frank
>
> -
> Frank Harrell
> Department of Biostatistics, Vanderbilt University
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Can-ROC-be-used-as-a-metric-for-optimal-model-selection-for-randomForest-tp3519003p3520043.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?

2011-05-13 Thread Max Kuhn
XiaoLiu,

I can't see the options in bootControl you used here. Your error is
consistent with leaving classProbs and summaryFunction unspecified.
Please double check that you set them with classProbs = TRUE and
summaryFunction = twoClassSummary before you ran.
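
In other words, bootControl should look something like this (assuming
a two-level factor outcome):

   bootControl <- trainControl(method = "boot", number = 25,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)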

Max

On Thu, May 12, 2011 at 7:04 PM, Jing Liu  wrote:
>
> Dear all,
>
> I am using the "caret" Package for predictors selection with a randomForest 
> model. The following is the train function:
>
> rfFit<- train(x=trainRatios, y=trainClass, method="rf", importance = TRUE, 
> do.trace = 100, keep.inbag = TRUE,
>    tuneGrid = grid, trControl=bootControl, scale = TRUE, metric = "ROC")
>
> I wanted to use ROC as the metric for variable selection. I know that this 
> works with the logit model by making sure that classProbs = TRUE and 
> summaryFunction = twoClassSummary in the trainControl function. However if I 
> do the same with randomForest, I get a warning saying that
>
> "In train.default(x = trainPred, y = trainDep, method = "rf",  :
>  The metric "ROC" was not in the result set. Accuracy will be used instead."
>
> I wonder if ROC metric can be used for randomForest? Have I missed something? 
> Very very grateful if anyone can help!
>
> Best regards,
> XiaoLiu
>
>
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Bigining with a Program of SVR

2011-05-07 Thread Max Kuhn
As far as caret goes, you should read

   http://cran.r-project.org/web/packages/caret/vignettes/caretVarImp.pdf

and look at rfe() and sbf().
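
For example, a minimal rfe() call with random forest-based ranking
looks roughly like this (x and y are placeholders for your predictors
and outcome):

   library(caret)
   ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
   rfProfile <- rfe(x, y, sizes = c(4, 8, 16), rfeControl = ctrl)
   rfProfile$optVariables   # the predictors kept in the best subset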


On Fri, May 6, 2011 at 2:53 PM, ypriverol  wrote:
> Thanks Max. I'm using now the library caret with my data. But the models
> showed a correlation under 0.7. Maybe the problem is with the variables that
> I'm using to generate the model. For that reason I'm asking for some
> packages that allow me to reduce the number of features and to remove the
> worst features. I recently read an article that combines a genetic algorithm
> with support vector regression to do that.
>
> Best Regards
> Yasset
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3503918.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Bigining with a Program of SVR

2011-05-04 Thread Max Kuhn
train() uses vectors, matrices and data frames as input. I really
think you need to read materials on basic R before proceeding. Go to
the R web page. There are introductory materials there.

On Tue, May 3, 2011 at 11:19 AM, ypriverol  wrote:
> I saw the format of the caret data some days ago. It is possible to convert
> my csv data to the same format as the caret dataset. My idea is to
> use firstly the same scripts as caret tutorial, then i want to remove
> problems related with data formats and incompatibilities.
>
> Thanks for your time
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3492746.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Bigining with a Program of SVR

2011-05-03 Thread Max Kuhn
See the examples at the end of:

   http://cran.r-project.org/web/packages/caret/vignettes/caretTrain.pdf

for a QSAR data set for modeling the log blood-brain barrier
concentration. SVMs are not used there but, if you use train(), the
syntax is very similar.
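
For data laid out like yours, the call would look roughly like this
(an untested sketch; 'dat' is assumed to hold the VDP, V1 and V2
columns):

   library(caret)
   svrFit <- train(x = dat[, c("V1", "V2")], y = dat$VDP,
                   method = "svmRadial",
                   tuneLength = 5,
                   trControl = trainControl(method = "cv", number = 10))
   svrFit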

On Tue, May 3, 2011 at 9:38 AM, ypriverol  wrote:
> well, first of all thank for your answer. I need some example that works with
> Support Vector Regression. This is the format of my data:
>  VDP   V1       V2
>  9.15  1234.5   10
>  9.15  2345.6   15
>  6.7    789.0   12
>  6.7    234.6   11
>  3.2    123.6    5
>  3.2    235.7    8
>
> VDP is the experimental value of the property that i want to predict with
> the model and more accurate. The other variables V1, V2 ... are the
> properties to generate the model. I need some examples that introduce me in
> this field. I read some examples from e1071 but all of them are for
> classification problems.
>
> thanks for your help in advance
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3492487.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret - prevent resampling when no parameters to find

2011-05-02 Thread Max Kuhn
Yeah, that didn't work. Use

   fitControl<-trainControl(index = list(seq(along = mdrrClass)))

See ?trainControl to understand what this does in detail.
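
Putting it together with your glm example (a sketch, assuming the mdrr
data from caret are loaded via data(mdrr)):

   data(mdrr)
   fitControl <- trainControl(index = list(seq(along = mdrrClass)))
   LOGISTIC_model <- train(mdrrDescr, mdrrClass,
                           method = "glm",
                           trControl = fitControl)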

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret - prevent resampling when no parameters to find

2011-05-01 Thread Max Kuhn
Not all modeling functions have both the formula and "matrix"
interface. For example, glm() and rpart() only have a formula method,
enet() has only the matrix interface and ksvm() and others have both.
This was one reason I created the package (so we don't have to
remember all this).

train() lets you specify the model either way. When the actual model
is fit, it favors the matrix interface whenever possible (since it is
more efficient) and works out the details behind the scenes.

For your example, you can fit the model you want using train():

   train(mdrrDescr,mdrrClass,method='glm')

If y is a factor, it automatically adds the 'family = binomial' option
when the model is fit (so you don't have to).

Max


On Sun, May 1, 2011 at 7:18 PM, pdb  wrote:
> glm.fit - answered my own question by reading the manual!--
> View this message in context: 
> http://r.789695.n4.nabble.com/caret-prevent-resampling-when-no-parameters-to-find-tp3488761p3488923.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret - prevent resampling when no parameters to find

2011-05-01 Thread Max Kuhn
No, the sampling is done on rows. The definition of a bootstrap
(re)sample is one which is the same size as the original data but
taken with replacement. The "Accuracy SD" and "Kappa SD" columns give
you a sense of how the model performance varied across these bootstrap
data sets (i.e. they are not the same data set).

In the end, the original training set is used to fit the final model
that is used for prediction.
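
To see what a single bootstrap resample looks like (just an
illustration, not caret code):

   set.seed(1)
   n <- 528                        # the training set size in your output
   bootRows <- sample(n, n, replace = TRUE)
   length(unique(bootRows)) / n    # usually ~63% of rows appear at least once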

Max

On Sun, May 1, 2011 at 6:41 PM, pdb  wrote:
> Hi Max,
>
> But in this example, it says the sample size is the same as the total number
> of samples, so unless the sampling is done by columns, wouldn't you get
> exactly the same model each time for logistic regression?
>
> ps - great package btw. I'm just beginning to explore its potential now.--
> View this message in context: 
> http://r.789695.n4.nabble.com/caret-prevent-resampling-when-no-parameters-to-find-tp3488761p341.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Bigining with a Program of SVR

2011-05-01 Thread Max Kuhn
When you say "variable" do you mean predictors or responses?

In either case, they do. You can generally tell by reading the help
files and looking at the examples.

Max

On Fri, Apr 29, 2011 at 3:47 PM, ypriverol  wrote:
> Hi:
>  I'm starting a research of Support Vector Regression. I want to obtain a
> model to predict a property A with
>  a set of property B, C, D, ...  This problem is very common for example in
> QSAR models. I want to know
>  some examples and package that could help me in this way. I know about
> caret and e1071. But I don't
> know if these packages can work with continuous variables?
>
> Thanks in advance
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3484476.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret - prevent resampling when no parameters to find

2011-05-01 Thread Max Kuhn
It isn't building the same model since each fit is created from
different data sets.

The resampling is sort of the point of the function, but if you really
want to avoid it, supply your own index in trainControl that has every
index (eg, index = seq(along = mdrrClass)). In this case, the
performance it gives is the apparent error rate.

Max

On Sun, May 1, 2011 at 5:57 PM, pdb  wrote:
> I want to use caret to build a model with an algorithm that actually has no
> parameters to find.
>
> How do I stop it from repeatedly building the same model 25 times?
>
>
> library(caret)
> data(mdrr)
> LOGISTIC_model <- train(mdrrDescr,mdrrClass
>                        ,method='glm'
>                        ,family=binomial(link="logit")
>                        )
> LOGISTIC_model
>
> 528 samples
> 342 predictors
>  2 classes: 'Active', 'Inactive'
>
> Pre-processing: None
> Resampling: Bootstrap (25 reps)
>
> Summary of sample sizes: 528, 528, 528, 528, 528, 528, ...
>
> Resampling results
>
>  Accuracy  Kappa   Accuracy SD  Kappa SD
>  0.552     0.0999  0.0388       0.0776  --
> View this message in context: 
> http://r.789695.n4.nabble.com/caret-prevent-resampling-when-no-parameters-to-find-tp3488761p3488761.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave Error unzipping file in Win 7

2011-03-21 Thread Max Kuhn
I don't think that this is the issue, but test it on a file without spaces.

On Mon, Mar 21, 2011 at 2:25 PM,   wrote:
>
> I have a very similar error that cropped up when I upgraded to R 2.12 and 
> persists at R 2.12.1. I am running R on Windows XP and OO is at version 3.2. 
> I did not make any changes to my R code or ODF code or configuration to 
> produce this error. Only upgraded R.
>
> Many Thanks,
>
> Eric
>
> R session:
>
>
>> odfWeave ( 'Report input template.odt' , 'August 2011.odt')
>  Copying  Report input template.odt
>  Setting wd to  
> C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2/odfWeave2153483
>  Unzipping ODF file using unzip -o "Report input template.odt"
> Error in odfWeave("Report input template.odt", "August 2011.odt") :
>  Error unzipping file
>
> 
>
>
> When I start a shell and go to the temp directory in question and copy the 
> exact command that the error message says produced an error the command runs 
> fine. Here is that session:
>
> Microsoft Windows XP [Version 5.1.2600]
> (C) Copyright 1985-2001 Microsoft Corp.
>
> H:\>c:
>
> C:\>cd C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2/odfWeave2153483
>
> C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>dir
>  Volume in drive C has no label.
>  Volume Serial Number is 7464-62CA
>
>  Directory of 
> C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483
>
> 03/21/2011  11:11 AM              .
> 03/21/2011  11:11 AM              ..
> 03/21/2011  11:11 AM            13,780 Report input template.odt
>               1 File(s)         13,780 bytes
>               2 Dir(s)   7,987,343,360 bytes free
>
> C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>unzip -o 
> "Report input template.odt"
> Archive:  Report input template.odt
>  extracting: mimetype
>   creating: Configurations2/statusbar/
>  inflating: Configurations2/accelerator/current.xml
>   creating: Configurations2/floater/
>   creating: Configurations2/popupmenu/
>   creating: Configurations2/progressbar/
>   creating: Configurations2/menubar/
>   creating: Configurations2/toolbar/
>   creating: Configurations2/images/Bitmaps/
>  inflating: content.xml
>  inflating: manifest.rdf
>  inflating: styles.xml
>  extracting: meta.xml
>  inflating: Thumbnails/thumbnail.png
>  inflating: settings.xml
>  inflating: META-INF/manifest.xml
>
> C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>
>
>
>
>
>
>
>
> - Original message -
> From: "psycho-ld" 
> To: r-help@r-project.org
> Date: Sun, 23 Jan 2011 01:47:44 -0800 (PST)
> Subject: [R] odfWeave Error unzipping file in Win 7
>
>
> Hey guys,
>
I'm just getting started with R (version 2.12.0) and odfWeave and kinda
> stumble from one problem to the next, the current one is the following:
>
> trying to use odfWeave:
>
>> odfctrl <- odfWeaveControl(
> +             zipCmd = c("C:/Program Files/unz552dN/VBunzip.exe $$file$$ .",
> +              "C:/Program Files/unz552dN/VBunzip.exe $$file$$"))
>>
>> odfWeave("C:/testat.odt", "C:/iris.odt", control = odfctrl)
>  Copying  C:/testat.odt
>  Setting wd to
> D:\Users\egf\AppData\Local\Temp\Rtmpmp4E1J/odfWeave23103351832
>  Unzipping ODF file using C:/Program Files/unz552dN/VBunzip.exe
> "testat.odt"
> Fehler in odfWeave("C:/testat.odt", "C:/iris.odt", control = odfctrl) :
>  Error unzipping file
>
> so I tried a few other unzipping programs like jar and 7-zip, but still the
> same problem occurs, I also tried to install zip and unzip, but then I get
> some error message that registration failed (Error 1904 )
>
> so if there are anymore questions, just ask, would be great if someone could
> help me though
>
> cheers
> psycho-ld
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/odfWeave-Error-unzipping-file-in-Win-7-tp3232359p3232359.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Specify feature weights in model prediction (CARET)

2011-03-16 Thread Max Kuhn
> Using the 'CARET' package, is it possible to specify weights for features
> used in model prediction?

For what model?

> And for the 'knn' implementation, is there a way
> to choose a distance metric (i.e. Mahalanobis distance)?
>

No, sorry.

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] use "caret" to rank predictors by random forest model

2011-03-07 Thread Max Kuhn
It would help if you provided the code that you used for the caret functions.

The most likely issues is not using importance = TRUE in the call to train()

I believe that I've only implemented code for plotting the varImp
objects resulting from train() (eg. there is plot.varImp.train but not
plot.varImp).
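
Something along these lines should work (a sketch, reusing the x and y
from your randomForest() call):

   rfFit <- train(x, y, method = "rf", importance = TRUE)
   rfImp <- varImp(rfFit)    # uses the method for train objects
   plot(rfImp, top = 20)     # i.e. plot.varImp.train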

Max

On Mon, Mar 7, 2011 at 3:27 PM, Xiaoqi Cui  wrote:
> Hi,
>
> I'm using package "caret" to rank predictors using random forest model and 
> draw predictors importance plot. I used below commands:
>
> rf.fit<-randomForest(x,y,ntree=500,importance=TRUE)
> ## "x" is matrix whose columns are predictors, "y" is a binary resonse vector
> ## Then I got the ranked predictors by ranking 
> "rf1$importance[,"MeanDecreaseAccuracy"]"
> ## Then draw the importance plot
> varImpPlot(rf.fit)
>
> As you can see, all the functions I used are directly from the package 
> "randomForest", instead of from "caret". so I'm wondering if the package 
> "caret" has some functions who can do the above ranking and ploting.
>
> In fact, I tried functions "train", "varImp" and "plot" from package "caret", 
> the random forest model that built by "train" can not be input correctly to 
> "varImp", which gave error message like "subscripts out of bounds". Also 
> function "plot" doesn't work neither.
>
> So I'm wondering if anybody has encountered the same problem before, and 
> could shed some light on this. I would really appreciate your help.
>
> Thanks,
> Xiaoqi
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Course: R for Predictive Modeling: A Hands-On Introduction

2011-03-04 Thread Max Kuhn
R for Predictive Modeling: A Hands-On Introduction

Predictive Analytics World in San Francisco
Sunday March 13, 9am to 4:30pm

This one-day session provides a hands-on introduction to R, the
well-known open-source platform for data analysis. Real examples are
employed in order to methodically expose attendees to best practices
driving R and its rich set of predictive modeling packages, providing
hands-on experience and know-how. R is compared to other data analysis
platforms, and common pitfalls in using R are addressed.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC from R-SVM?

2011-02-22 Thread Max Kuhn
The objective functions for kernel methods are unrelated to the area
under the ROC curve. However, you can try to choose the cost and
kernel parameters to maximize the ROC AUC.

See the caret package, specifically the train function.
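
With caret, that looks roughly like this (x and y are placeholders; y
must be a two-level factor):

   library(caret)
   ctrl <- trainControl(method = "cv", number = 10,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary)
   svmTune <- train(x, y, method = "svmRadial",
                    metric = "ROC",
                    tuneLength = 8,
                    trControl = ctrl)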

Max

On Mon, Feb 21, 2011 at 5:34 PM, Angel Russo  wrote:
> *Hi,
>
> *Does anyone know how can I show an *ROC curve for R-SVM*? I understand in
> R-SVM we are not optimizing over SVM cost parameter. Any example ROC for
> R-SVM code or guidance can be really useful.
>
> Thanks, Angel.
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest & Cross Validation

2011-02-20 Thread Max Kuhn
> I am using randomForest package to do some prediction job on GWAS data. I
> firstly split the data into training and testing set (70% vs 30%), then
> using training set to grow the trees (ntree=10). It looks that the OOB
> error in training set is good (<10%). However, it is not very good for the
> test set with a AUC only about 50%.

Did you do any feature selection in the training set? If so, you also
need to include that step in the cross-validation to get realistic
performance estimates (see Ambroise and McLachlan. Selection bias in
gene extraction on the basis of microarray gene-expression data.
Proceedings of the National Academy of Sciences (2002) vol. 99 (10)
pp. 6562-6566).

In the caret package, train() can be used to get cross-validation
estimates for RF and the sbf() function (for selection by filter) can
be used to include simple univariate filters in the CV procedure.
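
For example, a minimal sbf() call that repeats a univariate filter plus
random forest inside 10-fold CV (x and y are placeholders for your
predictors and outcome):

   filterCtrl <- sbfControl(functions = rfSBF, method = "cv", number = 10)
   rfWithFilter <- sbf(x, y, sbfControl = filterCtrl)
   rfWithFilter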

> Although some people said no cross-validation was necessary for RF, I still
> felt unsafe and thought a testing set is important. I felt really frustrated
> with the results.

CV is needed when you want an assessment of performance on a test set.
In this sense, RF is like any other method.

-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] caret::train() and ctree()

2011-02-16 Thread Max Kuhn
Andrew,

ctree only tunes over mincriterion and ctree2 tunes over maxdepth
(while fixing mincriterion = 0).
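
So, as a sketch (with x and y standing in for your predictors and
outcome), give each method only the column it actually tunes:

   ctreeFit  <- train(x, y, method = "ctree",
                      tuneGrid = data.frame(.mincriterion = c(0.90, 0.95, 0.99)))
   ctree2Fit <- train(x, y, method = "ctree2",
                      tuneGrid = data.frame(.maxdepth = 1:5))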

Seeing both listed as the function is being executed is a bug. I'll
setup checks to make sure that the columns specified in tuneGrid are
actually the tuning parameters that are used.

Max

On Wed, Feb 16, 2011 at 12:01 PM, Andrew Ziem  wrote:
> Like earth can be trained simultaneously for degree and nprune, is there a 
> way to train ctree simultaneously for mincriterion and maxdepth?
>
> Also, I notice there are separate methods ctree and ctree2, and if both 
> options are attempted to tune with one method, the summary averages the 
> option it doesn't support.  The full log is attached, and notice these lines 
> below for method="ctree" where maxdepth=c(2,4) are averaged to maxdepth=3.
>
> Fitting: maxdepth=2, mincriterion=0.95
> Fitting: maxdepth=4, mincriterion=0.95
> Fitting: maxdepth=2, mincriterion=0.99
> Fitting: maxdepth=4, mincriterion=0.99
>
>  mincriterion  Accuracy  Kappa  maxdepth  Accuracy SD  Kappa SD  maxdepth SD
>  0.95          0.939     0.867  3         0.0156       0.0337    1.01
>  0.99          0.94      0.868  3         0.0157       0.0337    1.01
>
> I use R 2.12.1 and caret 4.78.
>
> Andrew
>
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Train error:: subscript out of bonds

2011-01-26 Thread Max Kuhn
No. Any valid seed should work. In this case, train() should only be
using it to determine which training set samples are in the CV or
bootstrap data sets.

Max

On Wed, Jan 26, 2011 at 9:56 AM, Neeti  wrote:
>
> Thank you so much for your reply. In my case it is giving error in some seed
> value for example if I set seed value to 357 this gives an error. Does train
> have some specific seed range?
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Train-error-subscript-out-of-bonds-tp3234510p3238197.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Train error:: subscript out of bonds

2011-01-26 Thread Max Kuhn
Sort of. It lets you define a grid of candidate values to test and to
define the rule to choose the best. For some models, it is easy to
come up with default values that work well (e.g. RBF SVM's, PLS, KNN)
while others are more data dependent. In the latter case, the defaults
may not work well.

Max

On Wed, Jan 26, 2011 at 5:45 AM, Neeti  wrote:
>
> What I have understood in CARET train() method is that train() itself does
> the model selection and tune the parameter. (please correct me if I am
> wrong). That was my first motivation to select this package and method for
> fitting the model. And use the parameter to e1071 svm() method and compare
> the result.
>
> fit1<-train(train1,as.factor(trainset[,ncol(trainset)]),"svmpoly",trControl
> = trainControl((method = "cv"),10,verboseIter = F),tuneLength=3)
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Train-error-subscript-out-of-bonds-tp3234510p3237800.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Train error:: subscript out of bonds

2011-01-25 Thread Max Kuhn
What version of caret and R? We'll also need a reproducible example.


On Mon, Jan 24, 2011 at 12:44 PM, Neeti  wrote:
>
> Hi,
> I am trying to construct a svmpoly model using the "caret" package (please
> see code below). Using the same data, without changing any setting, I am
> just changing the seed value. Sometimes it constructs the model
> successfully, and sometimes I get an “Error in indexes[[j]] : subscript out
> of bounds”.
> For example when I set seed to 357 following code produced result only for 8
> iterations and for 9th iteration it reaches to an error that “subscript out
> of bonds” error. I don’t understand why
>
> Any help would be great
> thanks
> ###
> for (i in 1:10)
>  {
> fit1<-NULL;
> x<-NULL;
>  x<-which(number==i)
>        trainset<-d[-x,]
>        testset<-d[x,]
> train1<-trainset[,-ncol(trainset)]
>        train1<-train1[,-(1)]
>        test_t<-testset[,-ncol(testset)]
>        species_test<-as.factor(testset[,ncol(testset)])
>        test_t<-test_t[,-(1)]
>        
>        #CARET::TRAIN
>        
>
>        
> fit1<-train(train1,as.factor(trainset[,ncol(trainset)]),"svmpoly",trControl
> = trainControl((method = "cv"),10,verboseIter = F),tuneLength=3)
>        pred<-predict(fit1,test_t)
>        t_train[[i]]<-table(predicted=pred,observed=testset[,ncol(testset)])
> tune_result[[i]]<-fit1$results;
>        tune_best<-fit1$bestTune;
>        scale1[i]<-tune_best[[3]]
>        degree[i]<-tune_best[[2]]
>        c1[i]<-tune_best[[1]]
>
>        }
>
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Train-error-subscript-out-of-bonds-tp3234510p3234510.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] circular reference lines in splom

2011-01-20 Thread Max Kuhn
This did the trick:

panel.circ3 <- function(...)
  {
args <- list(...)
circ1 <- ellipse(diag(rep(1, 2)), t = 1)
panel.xyplot(circ1[,1], circ1[,2],
 type = "l",
 lty = trellis.par.get("reference.line")$lty,
 col = trellis.par.get("reference.line")$col,
 lwd = trellis.par.get("reference.line")$lwd)
circ2 <- ellipse(diag(rep(1, 2)), t = 2)
panel.xyplot(circ2[,1], circ2[,2],
 type = "l",
 lty = trellis.par.get("reference.line")$lty,
 col = trellis.par.get("reference.line")$col,
 lwd = trellis.par.get("reference.line")$lwd)
panel.xyplot(args$x, args$y,
 groups = args$groups,
 subscripts = args$subscripts)
  }


splom(~dat, groups = grps,
  lower.panel = panel.circ3,
  upper.panel = panel.circ3)


Thanks,

Max

On Thu, Jan 20, 2011 at 11:13 AM, Peter Ehlers  wrote:
> On 2011-01-19 20:15, Max Kuhn wrote:
>>
>> Hello everyone,
>>
>> I'm stumped. I'd like to create a scatterplot matrix with circular
>> reference lines. Here is an example in 2d:
>>
>> library(ellipse)
>>
>> set.seed(1)
>> dat<- matrix(rnorm(300), ncol = 3)
>> colnames(dat)<- c("X1", "X2", "X3")
>> dat<- as.data.frame(dat)
>> grps<- factor(rep(letters[1:4], 25))
>>
>> panel.circ<- function(x, y, ...)
>>   {
>>     circ1<- ellipse(diag(rep(1, 2)), t = 1)
>>     panel.xyplot(circ1[,1], circ1[,2],
>>                  type = "l",
>>                  lty = 2)
>>     circ2<- ellipse(diag(rep(1, 2)), t = 2)
>>     panel.xyplot(circ2[,1], circ2[,2],
>>                  type = "l",
>>                  lty = 2)
>>     panel.xyplot(x, y)
>>   }
>>
>> xyplot(X2 ~ X1, data = dat,
>>        panel = panel.circ,
>>        aspect = 1)
>>
>> I'd like to do the same with splom, but with groups.
>>
>> My latest attempt:
>>
>> panel.circ2<- function(x, y, groups, ...)
>>   {
>>     circ1<- ellipse(diag(rep(1, 2)), t = 1)
>>     panel.xyplot(circ1[,1], circ1[,2],
>>                  type = "l",
>>                  lty = 2)
>>     circ2<- ellipse(diag(rep(1, 2)), t = 2)
>>     panel.xyplot(circ2[,1], circ2[,2],
>>                  type = "l",
>>                  lty = 2)
>>     panel.xyplot(x, y, type = "p", groups)
>>   }
>>
>>
>>
>> splom(~dat,
>>       panel = panel.superpose,
>>       panel.groups = panel.circ2)
>>
>> produces nothing but warnings:
>>
>>> warnings()
>>
>> Warning messages:
>> 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
>>
>> It does not appear to me that panel.circ2 is even being called.
>>
>> Thanks,
>>
>> Max
>
> I don't see a function panel.groups() in lattice.
> Does this do what you want or am I missing the point:
>
>  splom(~dat|grps, panel = panel.circ2)
>
> Peter Ehlers
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] circular reference lines in splom

2011-01-19 Thread Max Kuhn
Hello everyone,

I'm stumped. I'd like to create a scatterplot matrix with circular
reference lines. Here is an example in 2d:

library(ellipse)

set.seed(1)
dat <- matrix(rnorm(300), ncol = 3)
colnames(dat) <- c("X1", "X2", "X3")
dat <- as.data.frame(dat)
grps <- factor(rep(letters[1:4], 25))

panel.circ <- function(x, y, ...)
  {
circ1 <- ellipse(diag(rep(1, 2)), t = 1)
panel.xyplot(circ1[,1], circ1[,2],
 type = "l",
 lty = 2)
circ2 <- ellipse(diag(rep(1, 2)), t = 2)
panel.xyplot(circ2[,1], circ2[,2],
 type = "l",
 lty = 2)
panel.xyplot(x, y)
  }

xyplot(X2 ~ X1, data = dat,
   panel = panel.circ,
   aspect = 1)

I'd like to do the same with splom, but with groups.

My latest attempt:

panel.circ2 <- function(x, y, groups, ...)
  {
circ1 <- ellipse(diag(rep(1, 2)), t = 1)
panel.xyplot(circ1[,1], circ1[,2],
 type = "l",
 lty = 2)
circ2 <- ellipse(diag(rep(1, 2)), t = 2)
panel.xyplot(circ2[,1], circ2[,2],
 type = "l",
 lty = 2)
panel.xyplot(x, y, type = "p", groups)
  }



splom(~dat,
  panel = panel.superpose,
  panel.groups = panel.circ2)

produces nothing but warnings:

> warnings()
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

It does not appear to me that panel.circ2 is even being called.

Thanks,

Max

> sessionInfo()
R version 2.11.1 Patched (2010-09-30 r53356)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] lattice_0.19-11 ellipse_0.3-5

loaded via a namespace (and not attached):
[1] grid_2.11.1  tools_2.11.1



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] less than full rank contrast methods

2010-12-06 Thread Max Kuhn
I'd like to make a less than full rank design using dummy variables
for factors. Here is some example data:

when <- data.frame(time = c("afternoon", "night", "afternoon",
"morning", "morning", "morning",
"morning", "afternoon", "afternoon"),
   day = c("Monday", "Monday", "Monday",
   "Wednesday", "Wednesday", "Friday",
   "Saturday", "Saturday", "Friday"))

For a single factor, I can do this this using

> head(model.matrix(~time -1, data = when))
  timeafternoon timemorning timenight
1             1           0         0
2             0           0         1
3             1           0         0
4             0           1         0
5             0           1         0
6             0           1         0

but this breaks down for multi-variable formulas such as "time + day" or
"time + day + time:day".

I've looked for alternate contrast functions to do this and I haven't
figured out a way to coerce existing functions to get the desired
output. Hopefully I haven't missed anything obvious.

Thanks,

Max

> sessionInfo()
R version 2.11.1 Patched (2010-09-11 r52910)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] cross validation using e1071:SVM

2010-11-23 Thread Max Kuhn
Neeti,

I'm pretty sure that the error is related to the confusionMatrix call,
which is in the caret package, not e1071.

The error message is pretty clear: you need to pass in two factor
objects that have the same levels. You can check by running the
commands:

   str(pred_true1)
   str(species_test)
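
If the levels differ, something like this (a sketch using your object
names) usually sorts it out:

   lvls <- levels(species_test)                     # assuming species_test is a factor
   pred_true1 <- factor(pred_true1, levels = lvls)
   confusionMatrix(pred_true1, species_test)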

Also, caret can do the resampling for you instead of you writing the
loop yourself.

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Sporadic errors when training models using CARET

2010-11-23 Thread Max Kuhn
Kendric,

I've seen these too and traceback() usually goes back to ksvm(). This
doesn't mean that the error is there, but the results fo traceback()
from you would be helpful.

thanks,

Max

On Mon, Nov 22, 2010 at 6:18 PM, Kendric Wang
 wrote:
> Hi. I am trying to construct a svmLinear model using the "caret" package
> (see code below). Using the same data, without changing any setting,
> sometimes it constructs the model successfully, and sometimes I get an index
> out of bounds error. Is this unexpected behaviour? I would appreciate any
insights on this issue.
>
>
> Thanks.
> ~Kendric
>
>
>> train.y
>  [1] S S S S R R R R R R R R R R R R R R R R R R R R
> Levels: R S
>
>> train.x
>        m1      m2
> 1   0.1756  0.6502
> 2   0.1110 -0.2217
> 3   0.0837 -0.1809
> 4  -0.3703 -0.2476
> 5   8.3825  2.8814
> 6   5.6400 12.9922
> 7   7.5537  7.4809
> 8   3.5005  5.7844
> 9  16.8541 16.6326
> 10  9.1851  8.7814
> 11  1.4405 11.0132
> 12  9.8795  2.6182
> 13  8.7151  4.5476
> 14 -0.2092 -0.7601
> 15  3.6876  2.5772
> 16  8.3776  5.0882
> 17  8.6567  7.2640
> 18 20.9386 20.1107
> 19 12.2903  4.7864
> 20 10.5920  7.5204
> 21 10.2679  9.5493
> 22  6.2023 11.2333
> 23 -5.0720 -4.8701
> 24  6.6417 11.5139
>
>> svmLinearGrid <- expand.grid(.C=0.1)
>> svmLinearFit <- train(train.x, train.y, method="svmLinear",
> tuneGrid=svmLinearGrid)
> Fitting: C=0.1
> Error in indexes[[j]] : subscript out of bounds
>
>> svmLinearFit <- train(train.x, train.y, method="svmLinear",
> tuneGrid=svmLinearGrid)
> Fitting: C=0.1
> maximum number of iterations reached 0.0005031579 0.0005026807maximum number
> of iterations reached 0.0002505857 0.0002506714Error in indexes[[j]] :
> subscript out of bounds
>
>> svmLinearFit <- train(train.x, train.y, method="svmLinear",
> tuneGrid=svmLinearGrid)
> Fitting: C=0.1
> maximum number of iterations reached 0.0003270061 0.0003269764maximum number
> of iterations reached 7.887867e-05 7.866367e-05maximum number of iterations
> reached 0.0004087571 0.0004087466Aggregating results
> Selecting tuning parameters
> Fitting model on full training set
>
>
> R version 2.11.1 (2010-05-31)
> x86_64-redhat-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] kernlab_0.9-12  pamr_1.47       survival_2.35-8 cluster_1.12.3
>  [5] e1071_1.5-24    class_7.3-2     caret_4.70      reshape_0.8.3
>  [9] plyr_1.2.1      lattice_0.18-8
>
> loaded via a namespace (and not attached):
> [1] grid_2.11.1
>
>
> --
> MSc. Candidate
> CIHR/MSFHR Training Program in Bioinformatics
> University of British Columbia
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave - "Format error discovered in the file in sub-document content.xml at 2, 4047 (row, col)"

2010-11-16 Thread Max Kuhn
Can you try it with version 7.16 on R-Forge? Use

 install.packages("odfWeave", repos="http://R-Forge.R-project.org";)

to get it.

Thanks,

Max

On Tue, Nov 16, 2010 at 8:26 AM, Søren Højsgaard
 wrote:
> Dear Mike,
>
> Good point - thanks. The lines that caused the error mentioned above are 
> simply:
>
> <<>>=
> x <- 1:10
> x
> @
>
> I could add that the document 'simple.odt' (which comes with odfWeave) causes 
> the same error - but at row=109, col=1577
>
>> sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    
> LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                    
> LC_TIME=Danish_Denmark.1252
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] MASS_7.3-8      odfWeave_0.7.14 XML_3.2-0.1     lattice_0.19-13
>
> loaded via a namespace (and not attached):
> [1] tools_2.12.0
>
>
> Regards
> Søren
>
> -----Original Message-----
> From: Mike Marchywka [mailto:marchy...@hotmail.com]
> Sent: 16 November 2010 12:56
> To: Søren Højsgaard; r-h...@stat.math.ethz.ch
> Subject: RE: [R] odfWeave - "Format error discovered in the file in sub-document
> content.xml at 2, 4047 (row, col)"
>
>
>
>
>
>
>
>
> 
> From: soren.hojsga...@agrsci.dk
> To: r-h...@stat.math.ethz.ch
> Date: Tue, 16 Nov 2010 11:32:06 +0100
> Subject: [R] odfWeave - "Format error discovered in the file in sub-document 
> content.xml at 2, 4047 (row, col)"
>
>
> When using odfWeave on an OpenOffice input document, I can not open the 
> output document. I get the message
>
> "Format error discovered in the file in sub-document content.xml at 2,4047 
> (row,col)"
>
> Can anyone help me on this? (Apologies if this has been discussed before; I 
> have not been able to find any info...)
>
> well, if it really means line 2 you could post the first few lines. Did you 
> expect a line
> with 4047 columns?
>
>
>
>
> Info:
> I am using R.2.12.0 on Windows 7 (64 bit). I have downloaded the XML package 
> from http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.12/ and I have 
> compiled odfWeave myself
>
> Best regards
> Søren
>
>        [[alternative HTML version deleted]]
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] to determine the variable importance in svm

2010-10-26 Thread Max Kuhn
> The caret package has answers to all your questions.

>> 1) How to obtain a variable (attribute) importance using
>> e1071:SVM (or other
>> svm methods)?

I haven't implemented a model-specific method for variable importance
for SVM models. I know of one package (svmpath) that will return the
regression coefficients (e.g. the \beta values of x'\beta) for two
class models. There are probably other methods for non-linear kernels,
but I haven't coded anything (any volunteers?).

When there is no variable importance method implemented for
classification models, caret calculates an ROC curve for each
predictor and returns the AUC. For 3+ classes, it returns the maximum
AUC on the one-vs-all ROC curves.
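
That model-free calculation is exposed as filterVarImp(); roughly
(trainX and trainY are placeholders for your predictors and class
factor):

   rocImp <- filterVarImp(x = trainX, y = trainY)
   head(rocImp[order(rocImp[, 1], decreasing = TRUE), , drop = FALSE])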

Note also that caret uses ksvm in kernlab for no other reason than that it
has a bunch of available kernels and similar methods (rvm, etc)

>> 2) how to validate the results of svm?

If you use caret, you can look at:

  http://user2010.org/slides/Kuhn.pdf
  http://www.jstatsoft.org/v28/i05

and the four package vignettes.

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest AUC

2010-10-22 Thread Max Kuhn
Ravishankar,

> I used Random Forest with a couple of data sets I had to predict for binary
> response. In all the cases, the AUC of the training set is coming to be 1.
> Is this always the case with random forests? Can someone please clarify
> this?

This is pretty typical for this model.

> I have given a simple example, first using logistic regression and then
> using random forests to explain the problem. AUC of the random forest is
> coming out to be 1.

Logistic regression isn't as flexible as RF and some other methods, so
the area under the ROC curve is likely to be less than one, but still
much higher than it really is (since you are re-predicting the same data).

For you example:

> performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
[1] 0.9972

but using simple 10-fold CV:

> library(caret)
> ctrl <- trainControl(method = "cv",
+  classProbs = TRUE,
+  summaryFunction = twoClassSummary)
>
> set.seed(1)
> cvEstimate <- train(Species ~ ., data = iris,
+ method = "glm",
+ metric = "ROC",
+ trControl = ctrl)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
> cvEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "glm",
metric = "ROC", trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC   Sens SD  Spec SD  ROC SD
  0.96  0.98  0.86  0.0843   0.0632   0.126

and for random forest:

> set.seed(1)
> rfEstimate <- train(Species ~ .,
+ data = iris,
+ method = "rf",
+ metric = "ROC",
+ tuneGrid = data.frame(.mtry = 2),
+ trControl = ctrl)
Fitting: mtry=2
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> rfEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "rf",
metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROCSens SD  Spec SD  ROC SD
  0.94  0.92  0.898  0.0966   0.14 0.00632

Tuning parameter 'mtry' was held constant at a value of 2

-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Understanding linear contrasts in Anova using R

2010-09-30 Thread Max Kuhn
These two resources might also help:

   http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
   http://cran.r-project.org/web/packages/contrast/vignettes/contrast.pdf

Max


On Thu, Sep 30, 2010 at 1:33 PM, Ista Zahn  wrote:
> Hi Professor Howell,
> I think the issue here is simply in the assumption that the regression
> coefficients will always be equal to the product of the means and the
> contrast codes. I tend to think of regression coefficients as the
> quotient of the covariance of x and y divided by the variance of x,
> and this definition agrees with the coefficients calculated by lm().
> See below for a long-winded example.
>
> On Wed, Sep 29, 2010 at 3:42 PM, David Howell  wrote:
>>  #I am trying to understand how R fits models for contrasts in a
>> #simple one-way anova. This is an example, I am not stupid enough to want
>> #to simultaneously apply all of these contrasts to real data. With a few
>> #exceptions, the tests that I would compute by hand (or by other software)
>> #will give the same t or F statistics. It is the contrast estimates that R
>> produces
>> #that I can't seem to understand.
>> #
>> # In searching for answers to this problem, I found a great PowerPoint slide
>> (I think by John Fox).
>> # The slide pointed to the coefficients, said something like "these are
>> coeff. that no one could love," and
>> #then suggested looking at the means to understand where they came from. I
>> have stared
>> # and stared at his means and then my means, but can't find a relationship.
>>
>> # The following code and output illustrates the problem.
>>
>> # Various examples of Anova using R
>>
>> dv <- c(1.28,  1.35,  3.31,  3.06,  2.59,  3.25,  2.98,  1.53, -2.68,  2.64,
>>  1.26,  1.06,
>>       -1.18,  0.15,  1.36,  2.61,  0.66,  1.32,  0.73, -1.06,  0.24,  0.27,
>>  0.72,  2.28,
>>       -0.41, -1.25, -1.33, -0.47, -0.60, -1.72, -1.74, -0.77, -0.41, -1.20,
>> -0.31, -0.74,
>>       -0.45,  0.54, -0.98,  1.68,  2.25, -0.19, -0.90,  0.78,  0.05,  2.69,
>>  0.15,  0.91,
>>        2.01,  0.40,  2.34, -1.80,  5.00,  2.27,  6.47,  2.94,  0.47,  3.22,
>>  0.01, -0.66)
>>
>> group <- factor(rep(1:5, each = 12))
>>
>>
>> # Use treatment contrasts to compare each group to the first group.
>> options(contrasts = c("contr.treatment","contr.poly"))  # The default
>> model2 <- lm(dv ~ group)
>> summary(model2)
>>  # Summary table is the same--as it should be
>>  # Intercept is Group 1 mean and other coeff. are deviations from that.
>>  # This is what I would expect.
>>  #summary(model1)
>>  #              Df Sum Sq Mean Sq F value    Pr(>F)
>>  #  group        4  62.46 15.6151  6.9005 0.0001415 ***
>>  #  Residuals   55 124.46  2.2629
>>  #Coefficients:
>>  #            Estimate Std. Error t value Pr(>|t|)
>>  #(Intercept)  1.80250    0.43425   4.151 0.000116 ***
>>  #group2      -1.12750    0.61412  -1.836 0.071772 .
>>  #group3      -2.71500    0.61412  -4.421 4.67e-05 ***
>>  #group4      -1.25833    0.61412  -2.049 0.045245 *
>>  #group5       0.08667    0.61412   0.141 0.888288
>>
>>
>> # Use sum contrasts to compare each group against grand mean.
>> options(contrasts = c("contr.sum","contr.poly"))
>> model3 <- lm(dv ~ group)
>> summary(model3)
>>
>>  # Again, this is as expected. Intercept is grand mean and others are
>> deviatoions from that.
>>  #Coefficients:
>>  #              Estimate Std. Error t value Pr(>|t|)
>>  #  (Intercept)   0.7997     0.1942   4.118 0.000130 ***
>>  #  group1        1.0028     0.3884   2.582 0.012519 *
>>  #  group2       -0.1247     0.3884  -0.321 0.749449
>>  #  group3       -1.7122     0.3884  -4.408 4.88e-05 ***
>>  #  group4       -0.2555     0.3884  -0.658 0.513399
>>
>> #SO FAR, SO GOOD
>>
>> # IF I wanted polynomial contrasts BY HAND I would use
>> #    a(i) =  -2   -1   0   1   2   for linear contrast        (or some
>> linear function of this )
>> #    Effect = Sum(a(j)M(i))    # where M = mean
>> #    Effect(linear) = -2(1.805) -1(0.675) +0(-.912) +1(.544) +2(1.889) =
>> 0.043
>> #    SS(linear) = n*(Effect(linear)^2)/Sum((a(j)^2))  = 12(.043^2)/10 = .002
>> #    F(linear) = SS(linear)/MS(error) = .002/2.263 = .001
>> #    t(linear) = sqrt(.001) = .031
>>
>> # To do this in R I would use
>> order.group <- ordered(group)
>> model4 <- lm(dv~order.group)
>> summary(model4)
>> #  This gives:
>>    #Coefficients:
>> #                  Estimate Std. Error t value Pr(>|t|)
>> #    (Intercept)    0.79967    0.19420   4.118 0.000130 ***
>> #    order.group.L  0.01344    0.43425   0.031 0.975422
>> #    order.group.Q  2.13519    0.43425   4.917 8.32e-06 ***
>> #    order.group.C  0.11015    0.43425   0.254 0.800703
>> #    order.group^4 -0.79602    0.43425  -1.833 0.072202 .
>>
>> # The t value for linear is the same as I got (as are others) but I don't
>> understand
>> # the estimates. The intercept is the grand mean, but I don't see the
>> relationship
>> # of other estimates to that or to the ones I get by hand.
>> # My estimates are the sum of (coeff times means) i.e.  
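
A minimal sketch, reusing the dv and group objects above, of one way to
reconcile the two sets of numbers: contr.poly() rescales the familiar -2:2
weights to unit length, so for this balanced design the reported
order.group.L estimate is the hand-computed effect divided by sqrt(10).

   m <- tapply(dv, group, mean)          # group means
   a <- c(-2, -1, 0, 1, 2)               # hand-coded linear weights
   contr.poly(5)[, 1]                    # same weights scaled to unit length
   sum(a * m)                            # hand-computed effect, ~ 0.043
   sum(a * m) / sqrt(sum(a^2))           # ~ 0.0134, the order.group.L estimate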

Re: [R] Creating publication-quality plots for use in Microsoft Word

2010-09-15 Thread Max Kuhn
You might want to check out the Reproducible Research task view:

   http://cran.r-project.org/web/views/ReproducibleResearch.html

There is a section on Microsoft formats, as well as other formats that
can be converted.

Max



On Wed, Sep 15, 2010 at 11:49 AM, Thomas Lumley
 wrote:
> On Wed, 15 Sep 2010, dadrivr wrote:
>
>>
>> Thanks for your help, guys.  I'm looking to produce a high-quality plot
>> (no
>> jagged lines or other distortions) with a filetype that is accepted by
>> Microsoft Word on a PC and that most journals will accept.  That's why I'd
>> prefer to stick with JPEG, TIFF, PNG, or the like.  I'm not sure EPS would
>> fly.
>
> One simple approach, which I use when I have to create graphics for MS
> Office while on a non-Windows platform is to use PNG and set the resolution
> and file size large enough.  At 300dpi or so the physics of ink on paper
> does all the antialiasing you need.
>
> Work out how big you want the graph to be, and use PNG with enough pixels to
> get at least 300dpi at that final size. You'll need to set the pointsize
> argument and it will help to set the resolution argument.
>
>     -thomas
>
> Thomas Lumley
> Professor of Biostatistics
> University of Washington, Seattle
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
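
A minimal sketch of the approach Thomas describes, with a placeholder file
name and a nominal 6 x 4 inch target size:

   # enough pixels for 300 dpi at roughly 6 x 4 inches on the page
   png("figure1.png", width = 6, height = 4, units = "in",
       res = 300, pointsize = 10)
   plot(Sepal.Length ~ Petal.Length, data = iris)
   dev.off()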


Re: [R] createDataPartition

2010-09-09 Thread Max Kuhn
Trafim,

You'll get more answers if you adhere to the posting guide and tell us
your version information and other necessary details. For example, this
function is in the caret package (but nobody but me probably knows
that =]).

The first argument should be a vector of outcome values (not the
possible classes).

For the iris data, this means something like:

   createDataPartition(iris$Species)

if you were trying to predict the species. The function does
stratified splitting; the data are split into training and test sets
within each class, then the results are aggregated to get the entire
training set indicators. Setting a proportion per class won't do
anything.

Look at the man page or the (4) package vignettes for examples.
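
For instance, a minimal sketch of that stratified split on iris (the seed
is arbitrary):

   library(caret)
   set.seed(1)
   inTrain  <- createDataPartition(iris$Species, times = 10, p = 0.7,
                                   list = FALSE)
   training <- iris[ inTrain[, 1], ]
   testing  <- iris[-inTrain[, 1], ]
   table(training$Species)   # roughly 70% of each class ends up in training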

Max

On Thu, Sep 9, 2010 at 7:52 AM, Trafim Vanishek  wrote:
> Dear all,
>
> does anyone know how to define the structure of the required samples using
> function createDataPartition, meaning proportions of different types of
> variable in the partition?
> Smth like this for iris data:
>
> createDataPartition(y = c(setosa = .5, virginica = .3, versicolor = .2),
> times = 10, p = .7, list = FALSE)
>
> Thanks a lot for your help.
>
> Regards,
> Trafim
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reproducible research

2010-09-09 Thread Max Kuhn
A Reproducible Research CRAN task view was recently created:

   http://cran.r-project.org/web/views/ReproducibleResearch.html

I will be updating it with some of the information in this thread.

thanks,

Max



On Thu, Sep 9, 2010 at 11:41 AM, Matt Shotwell  wrote:
> Well, the attachment was a dud. Try this:
>
> http://biostatmatt.com/R/markup_0.0.tar.gz
>
> -Matt
>
> On Thu, 2010-09-09 at 10:54 -0400, Matt Shotwell wrote:
>> I have a little package I've been using to write template blog posts (in
>> HTML) with embedded R code. It's quite small but very flexible and
>> extensible, and aims to do something similar to Sweave and brew. In
>> fact, the package is heavily influenced by the brew package, though
>> implemented quite differently. It depends on the evaluate package,
>> available on CRAN. The tentatively titled 'markup' package is
>> attached. After it's installed, see ?markup and the few examples in the
>> inst/ directory, or just example(markup).
>>
>> -Matt
>>
>> On Thu, 2010-09-09 at 01:47 -0400, David Scott wrote:
>> > I am investigating some approaches to reproducible research. I need in
>> > the end to produce .html or .doc or .docx. I have used hwriter in the
>> > past but have had some problems with verbatim output from  R. Tables are
>> > also not particularly convenient.
>> >
>> > I am interested in R2HTML and R2wd in particular, and possibly odfWeave.
>> >
>> > Does anyone have sample documents using any of these approaches which
>> > they could let me have?
>> >
>> > David Scott
>> >
>> > _
>> >
>> > David Scott Department of Statistics
>> >             The University of Auckland, PB 92019
>> >             Auckland 1142,    NEW ZEALAND
>> > Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>> > Email:      d.sc...@auckland.ac.nz,  Fax: +64 9 373 7018
>> >
>> > Director of Consulting, Department of Statistics
>> >
>> > __
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide 
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Matthew S. Shotwell
> Graduate Student
> Division of Biostatistics and Epidemiology
> Medical University of South Carolina
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] several odfWeave questions

2010-08-25 Thread Max Kuhn
Ben,

>  1a. am I right in believing that odfWeave does not respect the
> 'keep.source' option?  Am I missing something obvious?

I believe it does, since this gets passed directly to Sweave.

>  1b. is there a way to set global options analogous to \SweaveOpts{}
> directives in Sweave? (I looked at odfWeaveControl, it doesn't seem to
> do it.)

Yes. There are examples of this in the 'examples' package directory.

>  2. I tried to write a Makefile directive to process files from the
> command line:
>
> %.odt: %_in.odt
>        $(RSCRIPT) -e "library(odfWeave);
> odfWeave(\"$*_in.odt\",\"$*.odt\");"
>
>  This works, *but* the resulting output file gives a warning ("The file
> 'odftest2.odt' is corrupt and therefore cannot be opened.
> OpenOffice.org can try to repair the file ...").  Based on looking at
> the contents, it seems that a spurious/unnecessary 'Rplots.pdf' file is 
> getting
> created and zipped in with the rest of the archive; when I unzip, delete
> the Rplots.pdf file and re-zip, the ODT file opens without a warning.
> Obviously I could post-process but it would be nicer to find a
> workaround within R ...

Get the latest version from R-Forge. I haven't gotten this fix onto
CRAN yet (I've been on a caret streak lately).

>  3. I find the requirement that all file paths be specified as absolute
> rather than relative paths somewhat annoying -- I understand the reason,
> but it goes against one practice that I try to encourage for
> reproducibility, which is *not* to use absolute file paths -- when
> moving a same set of data and analysis files across computers, it's hard
> to enforce them all ending up in the same absolute location, which then
> means that the recipient has to edit the ODT file.  It would be nice if
> there were hooks for read.table() and load() as there are for plotting
> and package/namespace loading -- then one could just copy them into the
> working directory on the fly.
>   has anyone experienced this/thought of any workarounds?
>  (I guess one solution is to zip any necessary source files into the archive 
> beforehand,
> as illustrated in the vignette.)

You can set the working directory with the (wait for it...) 'workDir'
argument. Using 'workDir = getwd()' will pack and unpack the files in
the current location and you wouldn't need to worry about setting the
path. I use the temp directory as the default because I kept over-writing files.
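
A minimal sketch, with placeholder file names:

   library(odfWeave)
   # unpack, weave and re-pack in the current working directory
   odfWeave("report_in.odt", "report.odt", workDir = getwd())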

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] odfWeave Issue.

2010-08-11 Thread Max Kuhn
> What does this mean?

It's impossible to tell. Read the posting guide and figure out all the
details that you left out. If we don't have more information, you
should have low expectations about the quality of any replies you might
get.

-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest - Strata

2010-07-27 Thread Max Kuhn
The index indicates which samples should go into the training set.
However, you are using out of bag sampling, so it would use the whole
training set and return the OOB error (instead of the error estimates
that would be produced by resampling via the index).

Which do you want? OOB estimates or other estimates? Based on your
previous email, I figured you would have an index list with three sets
of sample indices for sites A+B, sites A+C and sites B+C. In this way
you would do three resamples: the first fits using data from sites A
& B, then predicts on C (and so on). That way, the resampled error
estimates would be based on the average of the three hold-out sets
(actually hold-out sites). OOB error doesn't sound like what you want.
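
A minimal sketch of that index list, using a made-up site factor (the
variable names, sizes and the final train() call are placeholders):

   library(caret)
   # hypothetical site labels; in practice these come from the real data
   site <- factor(rep(c("A", "B", "C"), each = 20))

   # one resample per site: fit on the other two sites, hold this one out
   siteIndex <- lapply(levels(site), function(s) which(site != s))
   names(siteIndex) <- paste("heldout", levels(site), sep = "_")

   ctrl <- trainControl(index = siteIndex)
   ## rfFit <- train(y ~ ., data = dat, method = "rf", trControl = ctrl)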

Max

On Tue, Jul 27, 2010 at 2:46 PM, Coll  wrote:
>
> Thanks for all the help.
>
> I had tried using the "index" in caret to try to dictate which rows of the
> sample would be used in each of the tree building in RF. (e.g. use all data
> from A B site for training, hold out all data from C site for testing etc)
>
> However after running, when I cross-checked the "index" that goes to train
> function and the "inbag" in the resulting randomForest object, I found the
> two didn't match.
>
> Shown as below:
>
>> data(iris)
>> tmpIrisIndex <- createDataPartition(iris$Species, p=0.632, times = 10)
>> head(tmpIrisIndex,3)
> [[1]]
>  [1]   1   2   3   7  10  11  12  13  16  18  20  22  24  25  26  27  28  29
> 31
> [20]  34  35  36  37  38  39  40  41  43  46  47  48  50  52  53  55  56  57
> 58
> [39]  61  64  65  66  67  68  69  71  74  75  76  77  79  82  83  84  85  86
> 88
> [58]  90  91  92  94  96  98  99 102 103 104 106 108 109 111 112 113 114 115
> 116
> [77] 117 119 120 121 123 126 128 129 130 131 132 134 136 139 140 141 143 146
> 147
> [96] 150
>
> [[2]]
>  [1]   1   3   6   7   8  10  12  13  14  16  18  20  21  22  23  24  26  27
> 28
> [20]  29  30  32  34  35  36  38  42  44  46  47  48  50  51  53  54  55  58
> 60
> [39]  61  62  67  68  69  70  72  73  74  76  77  79  81  82  83  85  86  88
> 89
> [58]  90  92  93  95  97  99 100 103 104 105 107 108 109 111 112 113 114 117
> 119
> [77] 120 121 122 123 124 125 127 130 132 133 134 135 137 139 140 141 142 145
> 147
> [96] 149
>
> [[3]]
>  [1]   1   5   7   9  10  11  12  14  18  20  21  22  23  24  26  29  30  31
> 33
> [20]  34  35  36  37  38  39  40  44  45  46  47  48  49  51  52  53  54  56
> 58
> [39]  61  63  65  66  69  70  72  74  75  76  77  78  79  80  82  83  85  86
> 87
> [58]  90  91  92  93  94  98 100 102 103 105 106 107 109 110 113 114 115 116
> 117
> [77] 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141 142
> 146
> [96] 150
>
>> irisTrControl <- trainControl(method = "oob", index = tmpIrisIndex)
>> rf.iris.obj <-train(Species~., data= iris, method = "rf", ntree = 10,
>> keep.inbag = TRUE, trControl = irisTrControl)
> Fitting: mtry=2
> Fitting: mtry=3
> Fitting: mtry=4
>> head(rf.iris.obj$finalModel$inbag,20)
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
>  [1,]    1    0    1    0    0    0    1    0    1     1
>  [2,]    1    1    1    1    1    0    1    0    1     0
>  [3,]    1    1    1    0    0    1    1    0    0     0
>  [4,]    1    0    1    0    1    1    0    1    0     1
>  [5,]    0    1    1    1    1    1    0    1    0     1
>  [6,]    1    1    0    1    0    0    1    1    1     0
>  [7,]    1    1    0    0    1    1    0    0    0     0
>  [8,]    1    1    1    1    1    0    1    1    1     1
>  [9,]    1    1    0    1    0    1    0    1    1     0
> [10,]    1    1    1    0    1    1    0    0    0     1
> [11,]    1    1    1    1    1    1    1    0    1     0
> [12,]    1    1    1    1    1    0    1    0    1     1
> [13,]    1    0    1    1    1    1    1    1    0     1
> [14,]    0    1    1    1    0    1    0    0    0     0
> [15,]    1    1    1    1    1    1    1    1    1     0
> [16,]    1    1    0    0    0    0    1    0    1     1
> [17,]    1    0    1    0    0    0    1    1    0     1
> [18,]    1    0    1    1    1    1    1    1    1     1
> [19,]    1    0    1    0    1    1    1    0    1     1
> [20,]    1    0    1    0    1    1    1    0    1     0
>
> My understanding is the 1st tree in the RF should be built with
> tmpIrisIndex[1] i.e. "1   2   3   7  10  11  12  13  ..." ?
> But the Inbag in the resulting forest is showing it is using "1 2 3 4 6 7 8
> 9..." for inbag in 1st tree?
>
> Why the index passed to train does not match what got from inbag in the rf
> object? Or I had looked to the wrong place to check this?
>
> Any help / comments would be appreciated. Thanks a lot.
>
> Regards,
> Coll
>
>
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Random-Forest-Strata-tp2295731p2303958.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> 

Re: [R] UseR! 2010 - my impressions

2010-07-27 Thread Max Kuhn
Not to beat a dead horse...

I've found that I like the useR conferences more than most statistics
conferences. This isn't due to the difference in content, but the
difference in the audience and the environment.

For example, everyone is at useR because of their appreciation of R.
At most other conferences, there is a much wider range of topics and
less "group cohesion". Given this, I think that the environment is
more congenial. I've had many discussions with people who are in
completely different fields than myself (e.g. imaging, forestry,
physics, etc) that would be less likely to occur at other scientific
meetings.

Another difference between useR and the average (statistics)
conference is that the network effect is stronger. I believe that there is
a much higher likelihood that a random person is acquainted with a
different random attendee. This could be because we've used their
package, they run a local RUG or they are one of the principal people
who drive R (Uwe, Kurt, etc).

Anyway, well done.

Max


On Mon, Jul 26, 2010 at 11:49 AM, Tal Galili  wrote:
> Dear Ravi - I echo everything you wrote, useR2010 was an amazing experience
> (for me, and for many others with whom I have spoken about it).
> Many thanks should go to the wonderful people who put their efforts into
> making this conference a reality (and Kate is certainly one of them).
> Thank you for expressing feelings I had using your own words.
>
> Best,
> Tal
>
>
> Contact
> Details:---
> Contact me: tal.gal...@gmail.com |  972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
> --
>
>
>
>
> On Sat, Jul 24, 2010 at 2:50 AM, Ravi Varadhan  wrote:
>
>> Dear UseRs!,
>>
>> Everything about UseR! 2010 was terrific!  I really mean "everything" - the
>> tutorials, invited talks, kaleidoscope sessions, focus sessions, breakfast,
>> snacks, lunch, conference dinner, shuttle services, and the participants.
>> The organization was fabulous.  NIST were gracious hosts, and provided top
>> notch facilities.  The rousing speech by Antonio Possolo, who is the chief
>> of Statistical Engineering Division at NIST, set the tempo for the entire
>> conference.  Excellent invited lectures by Luke Tierney, Frank Harrell, Mark
>> Handcock, Diethelm Wurtz, Uwe Ligges, and Fritz Leisch.  All the sessions
>> that I attended had many interesting ideas and useful contributions.  During
>> the whole time that I was there, I could not help but get the feeling that I
>> am a part of something great.
>>
>> Before I end, let me add a few words about a special person.  This
>> conference would not have been as great as it was without the tireless
>> efforts of Kate Mullen.  The great thing about Kate is that she did so much
>> without ever hogging the limelight.  Thank you, Kate and thank you NIST!
>>
>> I cannot wait for UseR!2011!
>>
>> Best,
>> Ravi.
>>
>> 
>>
>> Ravi Varadhan, Ph.D.
>> Assistant Professor,
>> Division of Geriatric Medicine and Gerontology
>> School of Medicine
>> Johns Hopkins University
>>
>> Ph. (410) 502-2619
>> email: rvarad...@jhmi.edu
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 

Max

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

