Here's one way, worked out in lots of steps so you can see how each works:

> mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times=c(1000, 2000,
+ 30, 4))), something = runif(3034))
> str(mydata)
'data.frame':   3034 obs. of  2 variables:
 $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
 $ something: num  0.725 0.222 0.347 0.614 0.968 ...
>
> table(mydata$MyFactor)

   A    B    C    D
1000 2000   30    4
>
> important.levels <- table(mydata$MyFactor) / nrow(mydata)
> important.levels <- names(important.levels)[important.levels > .01]
> important.levels
[1] "A" "B"
>
> newdata <- mydata[mydata$MyFactor %in% important.levels, ]
> table(newdata$MyFactor)

   A    B    C    D
1000 2000    0    0
>
> newdata$MyFactor <- factor(newdata$MyFactor, levels=important.levels)
> table(newdata$MyFactor)

   A    B
1000 2000
>

On Wed, Jan 18, 2012 at 5:25 PM, Sam Steingold <s...@gnu.org> wrote:
> I have a data frame with some factor columns.
> I want to drop the rows with rare factor values
> (and remove the factor values from the factors).
> E.g., frame$MyFactor takes values
> A 1,000 times,
> B 2,000 times,
> C 30 times and
> D 4 times.
> I want to remove all rows which assume rare values (<1%), i.e., C and D.
> i.e.,
> frame <- frame[[! (frame$MyFactor %in% c("A","B"))]]
> except that I probably got the syntax wrong
> and I want c("A","B") to be generated automatically from frame$MyFactor
> and the number 0.01 (1%).
>
> Thanks!

-- 
Sarah Goslee
http://www.functionaldiversity.org

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
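
If you need this for several factor columns, the steps in the transcript can be folded into one small function. This is only a sketch: the name drop.rare.levels and its arguments are made up for illustration, it assumes the column is a factor, and it uses droplevels() in place of rebuilding the factor with an explicit levels= argument.

```r
# Hypothetical helper (not part of the original reply): keep only the rows
# whose factor level accounts for more than `min.prop` of all rows, then
# drop the levels that are no longer used.
drop.rare.levels <- function(df, column, min.prop = 0.01) {
  # Proportion of rows at each level of the factor
  props <- prop.table(table(df[[column]]))
  # Names of the levels that are frequent enough to keep
  keep <- names(props)[props > min.prop]
  # Subset the rows; drop = FALSE keeps a data frame even with one column
  out <- df[df[[column]] %in% keep, , drop = FALSE]
  # Remove the now-empty levels from the factor
  out[[column]] <- droplevels(out[[column]])
  out
}

# Same example data as in the transcript
mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4],
                                           times = c(1000, 2000, 30, 4))),
                     something = runif(3034))
newdata <- drop.rare.levels(mydata, "MyFactor")
table(newdata$MyFactor)
```

With the example data, C (30 of 3034 rows, just under 1%) and D are dropped, so only the A and B rows and levels should remain. Rows where the factor is NA are dropped as well, since %in% returns FALSE for them.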