Dear useRs,

What is an efficient way to randomly sample from clustered data such that I get equal representation from each cluster? For example, let's say I want to randomly sample two cases from each cluster defined by the "id" variable in the following data frame:

> id<-c(rep("100", 4),rep("101", 3), rep("102", 6), rep("103", 7))
> sex<-sample(c("m","f"), 20, replace=TRUE)
> weight<-rnorm(n=20, mean=150, sd=3)
> attitude<-sample(1:7, 20, replace=TRUE)
> Dataf<-data.frame(id,sex,weight,attitude)
> Dataf
    id sex   weight attitude
1  100   m 146.5064        6
2  100   f 150.2317        4
3  100   f 149.3686        5
4  100   m 144.7218        7
5  101   m 147.9071        4
6  101   m 148.3802        6
7  101   m 154.4634        1
8  102   m 153.2719        5
9  102   m 148.9821        5
10 102   f 148.0656        1
11 102   f 148.8949        6
12 102   m 146.9963        4
13 102   m 153.0542        4
14 103   m 148.1558        1
15 103   f 148.0482        4
16 103   m 151.8044        2
17 103   f 155.4976        4
18 103   m 150.0423        1
19 103   f 146.0487        5
20 103   m 154.6651        7

Here's the R code I wrote, which obviously does not work:

    sapply(split(Dataf, Dataf$id), sample, size=2)

I would prefer a data frame (i.e., Dataf2) as the final output, and it should look something like this:

> Dataf2
   id sex   weight attitude
1 100   m 146.5064        6
2 100   m 144.7218        7
3 101   m 147.9071        4
4 101   m 154.4634        1
5 102   m 153.2719        5
6 102   m 148.9821        5
7 103   f 155.4976        4
8 103   f 146.0487        5

Thanks in advance for your assistance.

Tony

------------------------------------------------------------------
Tony N. Brown, Ph.D.
Associate Professor of Sociology
Faculty Head of Hank Ingram House, The Commons
Research Fellow, Vanderbilt Center for Nashville Studies
Vanderbilt University
(615) 322-7518
(615) 322-7505 fax
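
One possible approach, offered as a minimal sketch rather than the definitive answer: split the row numbers by id instead of splitting the data frame, draw two row numbers per cluster, and subset once. The attempt above likely fails because `sample()` applied to a data frame treats it as a list and so picks columns, not rows. The names `rows_by_id` and `picked` and the call to `set.seed()` are my own additions for illustration, and the sketch assumes every cluster has at least two rows.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Rebuild a data frame with the same structure as Dataf in the question.
id <- c(rep("100", 4), rep("101", 3), rep("102", 6), rep("103", 7))
Dataf <- data.frame(
  id       = id,
  sex      = sample(c("m", "f"), 20, replace = TRUE),
  weight   = rnorm(n = 20, mean = 150, sd = 3),
  attitude = sample(1:7, 20, replace = TRUE)
)

# Split the row numbers (not the data frame) by cluster.
rows_by_id <- split(seq_len(nrow(Dataf)), Dataf$id)

# Draw 2 row numbers per cluster without replacement.
# sample.int() on the length avoids the sample() surprise for length-1 vectors;
# it stops with an error if a cluster has fewer than 2 rows.
picked <- unlist(
  lapply(rows_by_id, function(rows) rows[sample.int(length(rows), 2)]),
  use.names = FALSE
)

# Subset once and reset the row names to 1..n.
Dataf2 <- Dataf[picked, ]
rownames(Dataf2) <- NULL
Dataf2
```

If clusters may hold fewer than two rows, replacing the `2` with `min(2, length(rows))` would keep whatever is available, at the cost of unequal representation.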