Hello R experts, I would appreciate your help in drawing a stratified random sample with a total size of 100, allocated to the strata in proportion to their size.
Below is my R code showing what I did and the error I get from sampling::strata.

# FIRST I summarized the count of records by the two variables I want to use as strata
library(RODBC)
library(sqldf)
library(sampling)

# After establishing the connection (ch), I query the data, sort it by the strata
# variables APPT_TYP_CD_LL and EMPL_TYPE, and store it in a data frame
CURRPOP <- sqlQuery(ch, paste(
  "SELECT APPT_TYP_CD_LL, EMPL_TYPE, ASOFDATE, EMPLID, NAME, DEPTID, JOBCODE,",
  "JOBTITLE, SAL_ADMIN_PLAN, RET_TYP_CD_LL",
  "FROM PS_EMPLOYEES_LL",
  "WHERE EMPL_STATUS NOT IN('R','T')",
  "ORDER BY APPT_TYP_CD_LL, EMPL_TYPE"))

# ROWID is a dummy ID I added and repositioned after the strata columns for later use
CURRPOP$ROWID <- seq(nrow(CURRPOP))
CURRPOP <- CURRPOP[, c(1:2, 11, 3:10)]

# My strata. stratp is how many records I want to sample from each stratum.
# NOTE THERE ARE SOME 0's, which just means I won't sample from that group.
stratum_cp <- sqldf("SELECT APPT_TYP_CD_LL, EMPL_TYPE, count(*) HC FROM CURRPOP GROUP BY APPT_TYP_CD_LL, EMPL_TYPE")
stratum_cp$stratp <- round(stratum_cp$HC / nrow(CURRPOP) * 100)

> stratum_cp
   APPT_TYP_CD_LL EMPL_TYPE   HC stratp
1              FA         S    1      0
2              FC         S    5      0
3              FP         S  173      3
4              FR         H  170      3
5              FX         H   49      1
6              FX         S   57      1
7              IN         H 1589     25
8              IN         S 3987     63
9              IP         H    7      0
10             IP         S   53      1
11             SA         H    8      0
12             SE         S   43      1
13             SF         H   14      0
14             SF         S    1      0
15             SG         S   10      0
16             ST         H  107      2
17             ST         S    6      0

# THEN I attempted to use sampling::strata following the instructions in that package and got an error.
# I use stratum_cp$stratp for my sizes.
> s <- strata(CURRPOP, c("APPT_TYP_CD_LL","EMPL_TYPE"), size=stratum_cp$stratp, method="srswor")
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 0, 1

> traceback()
5: stop("arguments imply differing number of rows: ", paste(unique(nrows),
       collapse = ", "))
4: data.frame(..., check.names = FALSE)
3: cbind(deparse.level, ...)
2: cbind(r, i)
1: strata(CURRPOP, c("APPT_TYP_CD_LL", "EMPL_TYPE"), size = stratum_cp$stratp,
       method = "srswor")

# In lieu of a reproducible sample, here is some info regarding most of my data
> dim(CURRPOP)
[1] 6280   11

# Columns with personal info have been removed from this output
> str(CURRPOP[,c(1:3,7:11)])
'data.frame':   6280 obs. of  8 variables:
 $ APPT_TYP_CD_LL: Factor w/ 12 levels "FA","FC","FP",..: 1 2 2 2 2 2 3 3 3 3 ...
 $ EMPL_TYPE     : Factor w/ 2 levels "H","S": 2 2 2 2 2 2 2 2 2 2 ...
 $ ROWID         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DEPTID        : int  9825 9613 9613 9852 9772 9852 9853 9853 9853 9854 ...
 $ JOBCODE       : Factor w/ 325 levels "055.2","055.3",..: 311 112 112 112 112 112 298 299 299 300 ...
 $ JOBTITLE      : Factor w/ 325 levels "Accounting Assistant",..: 227 192 192 192 192 192 190 191 191 153 ...
 $ SAL_ADMIN_PLAN: Factor w/ 40 levels "ADE","AME","ASE",..: 36 38 38 38 38 38 31 31 31 31 ...
 $ RET_TYP_CD_LL : Factor w/ 2 levels "TCP1","TCP2": 2 2 2 2 2 2 2 2 2 2 ...

Daniel Lopez
Workforce Analyst
HRIM - Workforce Analytics & Metrics
Strategic Human Resources Management
wf-analytics-metr...@lists.llnl.gov
(925) 422-0814
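Since the real table cannot be shared, here is a minimal, self-contained sketch of the result I am after, written in base R only. Everything in it is made up for illustration: the data frame `pop`, its three strata and their counts are hypothetical stand-ins for CURRPOP. It computes the same proportional allocation as stratp, then draws a simple random sample without replacement inside each stratum, explicitly skipping strata whose allocation rounds to 0. Whether those zero sizes are what trips strata() is only a guess on my part, not something I have confirmed.

```r
set.seed(42)  # make the draw reproducible

# Hypothetical population: three strata of very different sizes
pop <- data.frame(
  APPT_TYP_CD_LL = rep(c("FA", "FP", "IN"), times = c(1, 173, 3987)),
  EMPL_TYPE      = "S",
  stringsAsFactors = FALSE
)
pop$ROWID <- seq_len(nrow(pop))

# One stratum key per row, built from the two stratification variables
key <- interaction(pop$APPT_TYP_CD_LL, pop$EMPL_TYPE, drop = TRUE)

# Proportional allocation of a nominal total of 100 (same formula as stratp)
HC     <- table(key)
stratp <- round(as.vector(HC) / nrow(pop) * 100)
names(stratp) <- names(HC)

# Simple random sampling without replacement within each stratum
rows_by_stratum <- split(pop$ROWID, key)
picked <- unlist(lapply(names(rows_by_stratum), function(k) {
  ids <- rows_by_stratum[[k]]
  n   <- stratp[[k]]
  if (n == 0) return(integer(0))    # nothing is drawn from zero-size strata
  ids[sample.int(length(ids), n)]   # index via sample.int, safe for 1-row strata
}))

smp <- pop[pop$ROWID %in% picked, ]
table(smp$APPT_TYP_CD_LL)
```

Note that rounding each stratum separately does not guarantee that the sizes add up to exactly 100 in general; they happen to do so both here and in my real stratp column.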