Dear All,

I was writing a small wrapper to bootstrap a classification algorithm, but if we generate the indices in the "usual way" as:

bootindex <- sample(index, N, replace = TRUE)

there is a non-zero probability that all the samples belong to only one class, thus leading to problems in the fitting (or that some classes will end up with only one sample, which will be a problem for quadratic discriminant analysis).

I thought this situation should be frequent enough to be mentioned in the literature, but I have found almost no mention in the references I have available, except for Hirst (see below). If I've reread correctly, this issue is not mentioned in Efron & Tibshirani (1997; the .632+ paper), or in Efron & Gong (the TAS "leisurely look" paper), or the Efron & Tibshirani 1993 bootstrap book, or Chernick's "Bootstrap methods" book.

I've only seen some side mentions in Ripley's Pattern recognition (when talking about stratified cross-validation), and in Davison & Hinkley's bootstrap book when, on p. 304, they refer to some subsets having singular design matrices, and thus requiring stratification on covariates. McLachlan (in his discriminant analysis book), on p. 347, differentiates between mixture sampling and separate sampling, but I cannot find a mention of what to do when, under mixture sampling, we end up with all samples in only one group.

Only Hirst (1996, Technometrics, 38 (4): 389--399) says that each bootstrap sample should include at least one observation for each group, and at least enough different observations from each group to allow estimation of the covariance matrix (he is referring to discriminant analysis), and thus he uses essentially stratified bootstrap samples.

Interestingly, the help for the "boot" function (boot library) says "For nonparametric multi-sample problems stratified resampling is used.". As well, the help for predab.resample (Design library) says "group: a grouping variable used to stratify the sample upon bootstrapping. This allows one to handle k-sample problems, (...)".
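To make the contrast concrete, here is a minimal sketch of the ordinary bootstrap versus a stratified (within-class) bootstrap. The label vector `cl`, its class sizes, and the names `p_missing` and `strat_index` are assumptions made up for illustration; this is not the code used by boot or Design.

```r
# Hypothetical two-class labels (illustrative only, not data from this post)
cl <- factor(rep(c("a", "b"), times = c(3, 17)))
N <- length(cl)
index <- seq_len(N)

# Chance that an ordinary bootstrap sample contains no case from a given
# class: each of the N draws misses class k with probability 1 - n_k / N.
p_missing <- (1 - table(cl) / N)^N

# Ordinary ("mixture") bootstrap: class counts vary between samples
set.seed(1)
bootindex <- sample(index, N, replace = TRUE)

# Stratified bootstrap: resample within each class, so class counts are fixed.
# Indexing with sample.int() avoids sample()'s special case for a length-one
# vector, which would otherwise sample from 1:x.
strat_index <- unlist(lapply(split(index, cl), function(i) {
  i[sample.int(length(i), replace = TRUE)]
}), use.names = FALSE)

table(cl[bootindex])    # counts can differ from table(cl)
table(cl[strat_index])  # counts always equal table(cl)
```

Stratifying keeps every class present at its original size, but by itself it does not guarantee enough distinct observations within a class, which is the second condition Hirst mentions.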
That the authors of boot and Design are using stratified resampling indicates to me that this might be the obvious, unproblematic way to go, but I understood that stratified resampling was OK only when that was the sampling scheme that generated the data. What am I missing?

Thanks,

R.

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462 (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)