On 09/12/2011 04:28 PM, vioravis wrote:
I am using the 'tm' package for text mining and am having trouble finding the
frequently occurring terms. From the definition it appears that findFreqTerms
and minDocFreq are equivalent commands and that both try to identify the
terms appearing more often than a specified threshold. However, I
am getting drastically different results with the two. I have given the results
from both commands below:

findFreqTerms identifies 3140 words that appear more than 5 times but
minDocFreq identifies only 659 terms. Can someone please explain the reason
for the difference, or whether I have misunderstood their definitions?

From the help page of termFreq:

‘minDocFreq’ An integer value. Words that appear less often
              in ‘doc’ than this number are discarded. Defaults to ‘1’
              (i.e., every token will be used).

The description for findFreqTerms states:

Find frequent terms in a term-document matrix.

So minDocFreq assesses how often a word appears in a document in order to 
decide if it should be included in the frequency vector of words for this 
document.

By contrast, findFreqTerms works on the document-term matrix and determines 
how often the word occurs in the whole matrix. So in fact the whole corpus is 
used to decide on the frequency and whether the word should be included or not.

Because one function uses frequency of words in a document, while the other 
uses frequency of words in the document-term matrix, they are obviously not 
equivalent commands. Your results indicate that 3140 words occur at least 5 
times in the whole corpus, i.e., when summing over all documents. By contrast, 
659 words occur at least 5 times within one single document.
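
To make the difference concrete, here is a minimal sketch in base R that 
does not use tm at all. The toy matrix, the counts, and the variable names 
(tdm, threshold, corpus_wide, per_document) are made up for illustration; 
the code only mimics the two criteria described above and is not how tm 
implements them internally.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are invented for illustration only.
tdm <- matrix(
  c(5, 0, 1,   # "alpha": 5 times in doc1, 6 times in total
    2, 2, 2,   # "beta":  never 5 times in one document, 6 times in total
    1, 0, 1),  # "gamma": 2 times in total
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("alpha", "beta", "gamma"),
                  Docs  = c("doc1", "doc2", "doc3"))
)

threshold <- 5

# Corpus-wide criterion (the idea behind findFreqTerms):
# sum each term's counts over all documents, then apply the threshold.
corpus_wide <- rownames(tdm)[rowSums(tdm) >= threshold]

# Per-document criterion (the idea behind minDocFreq):
# a count is kept only if it reaches the threshold within one document.
filtered <- tdm
filtered[filtered < threshold] <- 0
per_document <- rownames(filtered)[rowSums(filtered) > 0]

print(corpus_wide)   # "alpha" "beta"
print(per_document)  # "alpha"
```

Here "beta" passes the corpus-wide criterion because its counts sum to 6, 
but it fails the per-document criterion because it never reaches 5 within a 
single document. This is why the second set of terms can be much smaller 
than the first.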

HTH,
Bettina


--
-------------------------------------------------------------------
Bettina Grün
Institut für Angewandte Statistik / IFAS
Johannes Kepler Universität Linz
Altenbergerstraße 69
4040 Linz, Austria

Tel: +43 732 2468-6829
Fax: +43 732 2468-6800
E-Mail: bettina.gr...@jku.at
www.ifas.jku.at
