Here's the code and results. The corpus is the text version of a single book. (R version 3.2)

> docs <- tm_map(docs, stemDocument)
> dtm <- DocumentTermMatrix(docs)
> freq <- colSums(as.matrix(dtm))
> ord <- order(freq)
> freq[tail(ord)]
   one experi   will    can  lucid  dream
   287    312    363    452   1018   2413
> freq[head(ord)]
  abbey abdomin    abdu abraham  absent    abus
      1       1       1       1       1       1
> dim(dtm)
[1]    1 5265
> dtms <- removeSparseTerms(dtm, 0.1)
> dim(dtms)
[1]    1 5265
> dtms <- removeSparseTerms(dtm, 0.001)
> dim(dtms)
[1]    1 5265
> dtms <- removeSparseTerms(dtm, 0.9)
> dim(dtms)
[1]    1 5265
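One possible reading of the unchanged dimensions, offered as a sketch rather than a diagnosis: as far as I understand it, removeSparseTerms() drops a term when the share of documents that do not contain it reaches the threshold. With a single document, every term in the matrix occurs in that one document, so its sparsity is 0 and no threshold between 0 and 1 removes anything. The base R below imitates that rule without using tm; keep_terms() is a hypothetical helper, and the counts are made-up toy numbers, not taken from the book.

```r
# Hypothetical helper (not part of tm): keep a term when its sparsity,
# i.e. the share of documents that do NOT contain it, is below `sparse`.
# This mirrors my understanding of removeSparseTerms(); treat it as an assumption.
keep_terms <- function(counts, sparse) {
  doc_freq <- colSums(counts > 0)          # number of documents containing each term
  sparsity <- 1 - doc_freq / nrow(counts)  # share of documents missing the term
  counts[, sparsity < sparse, drop = FALSE]
}

# Toy matrix with ONE document (made-up counts): every term has sparsity 0.
one_doc <- matrix(c(3, 2, 1), nrow = 1,
                  dimnames = list("book", c("dream", "lucid", "abbey")))
dim(keep_terms(one_doc, 0.001))  # 1 3
dim(keep_terms(one_doc, 0.9))    # 1 3

# Toy matrix with TWO documents (made-up counts): "abbey" is missing from
# one of them, so its sparsity is 0.5 and a threshold of 0.4 drops it.
two_docs <- rbind(chapter1 = c(dream = 2, lucid = 1, abbey = 1),
                  chapter2 = c(dream = 1, lucid = 1, abbey = 0))
dim(keep_terms(two_docs, 0.4))   # 2 2
```

If that reading is right, splitting the book into several documents (chapters, say) before building the DocumentTermMatrix would give removeSparseTerms() something to act on.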