On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <[email protected]> wrote:
> We have a large set of documents that we would like to index with a
> customized stopword list. We have run tests by indexing a random set of about
> 10% of the documents, and we'd like to generate a list of the terms in that
> smaller set and their IDF values as a way to create a starter set of
> stopwords for the larger document set by selecting the terms that have the
> lowest IDF values. First of all, is this the best way to create a stopword
> list? Second, is there a straightforward way to generate a list of terms and
> their IDF values from a Lucene index?
> Thanks,
> Mike
Hey Mike,
I can certainly help you with generating the list of your top N terms.
Whether that is the best or right way to generate the stopword list I am
not sure, but maybe somebody else will step up.
To get the top N terms out of your index, you can simply iterate the
terms in a field and keep the N terms with the highest docFreq() on a
heap. For a fixed number of documents, the terms with the highest
docFreq() are the ones with the lowest IDF. Something like this:
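static class TermAndDF {
  String term;
  int df;
  TermAndDF(String term, int df) {
    this.term = term;
    this.df = df;
  }
}
// Assumes org.apache.lucene.util.PriorityQueue (it has top()/updateTop()),
// ordered so that top() is the entry with the smallest docFreq.
static class DFQueue extends PriorityQueue<TermAndDF> {
  DFQueue(int size) { initialize(size); }
  @Override
  protected boolean lessThan(TermAndDF a, TermAndDF b) {
    return a.df < b.df;
  }
}
int queueSize = N;
PriorityQueue<TermAndDF> queue = new DFQueue(queueSize);
final TermEnum termEnum = reader.terms(new Term(field));
try {
  do {
    final Term term = termEnum.term();
    // Stop once the enum runs past the last term of this field.
    if (term == null || !term.field().equals(field)) break;
    int docFreq = termEnum.docFreq();
    if (queue.size() < queueSize) {
      queue.add(new TermAndDF(term.text(), docFreq));
    } else if (queue.top().df < docFreq) {
      // Reuse the smallest entry, then restore the heap order.
      TermAndDF smallest = queue.top();
      smallest.term = term.text();
      smallest.df = docFreq;
      queue.updateTop();
    }
  } while (termEnum.next());
} finally {
  termEnum.close();
}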
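Since the question was about IDF rather than docFreq: IDF falls as docFreq rises, so the highest-docFreq terms are the lowest-IDF terms. Here is a rough standalone sketch of that step in plain Java, without Lucene on the classpath. The names StopwordCandidates, idf and lowestIdfTerms are made up for illustration, the sample counts in main are invented, and the docFreq map is assumed to have been filled from a term enumeration like the one above. The formula is the one Lucene's DefaultSimilarity uses, 1 + ln(numDocs / (docFreq + 1)).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class StopwordCandidates {

    // Same shape as Lucene's DefaultSimilarity.idf(docFreq, numDocs).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Returns the n terms with the highest docFreq (lowest IDF),
    // most frequent first.
    static List<String> lowestIdfTerms(Map<String, Integer> docFreqs, int n) {
        // Min-heap on docFreq: the head is the weakest candidate kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
            Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue()));
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll(); // drop the entry with the smallest docFreq
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll().getKey()); // smallest comes out first
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample data: term -> number of documents containing it.
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("the", 990);
        df.put("of", 950);
        df.put("lucene", 120);
        df.put("heap", 15);
        int numDocs = 1000;
        for (String term : lowestIdfTerms(df, 2)) {
            System.out.printf("%s df=%d idf=%.3f%n",
                term, df.get(term), idf(df.get(term), numDocs));
        }
    }
}
```

Terms with equal docFreq come out in no particular order, so you would still want to eyeball the list before using it as a stopword set.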
Another way of doing it is to use index pruning and drop terms with a
docFreq above a threshold after you have indexed your doc set.
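The threshold selection itself can be sketched in plain Java. This is not the index pruning tool, only the step of picking which terms exceed the cutoff. DocFreqThreshold, termsAbove and maxDocRatio are hypothetical names, and the ratio is an assumption you would have to tune for your collection.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DocFreqThreshold {

    // Returns the terms that occur in more than maxDocRatio of all
    // documents, sorted alphabetically.
    static Set<String> termsAbove(Map<String, Integer> docFreqs,
                                  int numDocs, double maxDocRatio) {
        Set<String> result = new TreeSet<>();
        double cutoff = maxDocRatio * numDocs;
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() > cutoff) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Unlike the top N approach, the size of the resulting list is not fixed; it depends on how many terms cross the cutoff.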
simon