Re: Obtaining IDF values for the terms in a document set

Simon Willnauer Thu, 15 Dec 2011 11:44:17 -0800

On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <[email protected]> wrote:
> We have a large set of documents that we would like to index with a 
> customized stopword list. We have run tests by indexing a random set of about 
> 10% of the documents, and we'd like to generate a list of the terms in that 
> smaller set and their IDF values as a way to create a starter set of 
> stopwords for the larger document set by selecting the terms that have the 
> lowest IDF values. First of all, is this the best way to create a stopword 
> list? Second, is there a straightforward way to generate a list of terms and 
> their IDF values from a Lucene index?
> Thanks,
> Mike


Hey Mike,

I can certainly help you with generating the list of your top N terms.
Whether that is the best or right way to generate the stopword list I am
not sure, but maybe somebody else will step up.

To get the top N terms out of your index you can simply iterate over the
terms in a field and keep the N terms with the highest docFreq() on a
heap. The terms with the highest docFreq() are the ones with the lowest
IDF. Something like this:

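     // Holder for a term and its document frequency.
     static class TermAndDF {
       String term;
       int df;
       TermAndDF(String term, int df) {
         this.term = term;
         this.df = df;
       }
     }
     int queueSize = N;
     // top()/updateTop() are from org.apache.lucene.util.PriorityQueue (not
     // java.util); lessThan() should compare df so top() is the smallest df.
     PriorityQueue<TermAndDF> queue = ...

     final TermEnum termEnum = reader.terms(new Term(field));
      try {
        do {
          final Term term = termEnum.term();
          // stop once the enum runs out or moves on to another field
          if (term == null || !term.field().equals(field)) break;
          int docFreq = termEnum.docFreq();
          if (queue.size() < queueSize) {
             queue.add(new TermAndDF(term.text(), docFreq));
          } else if (queue.top().df < docFreq) {
             // reuse the smallest entry, then restore heap order
             TermAndDF tnFrq = queue.top();
             tnFrq.term = term.text();
             tnFrq.df = docFreq;
             queue.updateTop();
          }
        } while (termEnum.next());
      } finally {
        termEnum.close();
      }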
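
Since the question was about IDF values, here is a rough, self-contained
sketch of the same top-N selection that also turns each docFreq into an IDF.
It is only an illustration: the class name TopTermsByDocFreq and its methods
are made up, the docFreq values come from a plain Map instead of a TermEnum,
and it uses java.util.PriorityQueue rather than Lucene's queue. The idf()
formula is assumed to match DefaultSimilarity in Lucene 3.x,
1 + ln(numDocs / (docFreq + 1)); with a custom Similarity the numbers differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; not part of Lucene.
class TopTermsByDocFreq {

  // Assumed to mirror DefaultSimilarity.idf: 1 + ln(numDocs / (docFreq + 1)).
  static double idf(int docFreq, int numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  // Returns the n entries with the highest docFreq, most frequent first.
  static List<Map.Entry<String, Integer>> topN(Map<String, Integer> docFreqs, int n) {
    List<Map.Entry<String, Integer>> result = new ArrayList<Map.Entry<String, Integer>>();
    if (n <= 0) {
      return result;
    }
    // Min-heap on docFreq: the head is the weakest of the current candidates.
    PriorityQueue<Map.Entry<String, Integer>> heap =
        new PriorityQueue<Map.Entry<String, Integer>>(
            n, (a, b) -> Integer.compare(a.getValue(), b.getValue()));
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      if (heap.size() < n) {
        heap.add(e);
      } else if (heap.peek().getValue() < e.getValue()) {
        heap.poll();   // drop the weakest candidate
        heap.add(e);   // and keep the more frequent term instead
      }
    }
    result.addAll(heap);
    // Sort descending by docFreq for readable output.
    result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    return result;
  }

  public static void main(String[] args) {
    // Made-up docFreq values for a 100-document index.
    Map<String, Integer> df = new LinkedHashMap<String, Integer>();
    df.put("the", 98);
    df.put("lucene", 12);
    df.put("of", 95);
    df.put("index", 40);
    int numDocs = 100;
    for (Map.Entry<String, Integer> e : topN(df, 2)) {
      System.out.printf("%s df=%d idf=%.3f%n",
          e.getKey(), e.getValue(), idf(e.getValue(), numDocs));
    }
  }
}
```

In a real run, numDocs would come from reader.numDocs() and the map would be
filled from the TermEnum loop.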

Another way of doing it is to use index pruning and drop terms with a
docFreq above a threshold after you have indexed your doc set.
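
The threshold idea can be sketched without any pruning tool. This is not
index pruning itself, only the selection step: it picks the terms that occur
in more than a given fraction of the documents. The class StopwordCandidates,
its select() method and the 0.5 cutoff are all hypothetical, and the docFreq
values are assumed to have been collected from the index beforehand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene.
class StopwordCandidates {

  // Returns the terms whose docFreq / numDocs is above maxDocFreqRatio.
  static Set<String> select(Map<String, Integer> docFreqs, int numDocs,
                            double maxDocFreqRatio) {
    Set<String> candidates = new TreeSet<String>();
    if (numDocs <= 0) {
      return candidates; // empty index: nothing to select
    }
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      double ratio = e.getValue() / (double) numDocs;
      if (ratio > maxDocFreqRatio) {
        candidates.add(e.getKey());
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    // Made-up docFreq values for a 100-document sample index.
    Map<String, Integer> df = new LinkedHashMap<String, Integer>();
    df.put("the", 98);
    df.put("lucene", 12);
    df.put("of", 95);
    df.put("index", 40);
    // Terms that appear in more than half of the documents.
    System.out.println(select(df, 100, 0.5));
  }
}
```

The cutoff is a judgment call; it is worth eyeballing the resulting list
before using it as the stopword set for the full collection.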

simon

