Proposal: Statistical Stopword elimination

Karsten Konrad Mon, 31 Mar 2003 08:28:41 -0800

Hi,

I am experimenting with long queries (parts of documents used as the search
query), and I would like to filter out all terms with high document frequencies
when searching, i.e., a kind of statistical, language-independent stop word
elimination at search time.


For this, I have introduced a frequency limit factor into
Similarity and test for excessively high document frequencies
in the TermQuery. The code looks somewhat like this:

  
>>
    public Scorer scorer(IndexReader reader) throws IOException {
      TermDocs termDocs = reader.termDocs(term);

      if (termDocs == null)
        return null;

      // Skip terms whose document frequency reaches the limit, a fraction
      // of the index size taken from the (new) Similarity limit factor.
      float limit = searcher.getSimilarity().getLimitFactor();
      int docFreq = searcher.docFreq(term);
      int max = searcher.maxDoc();
      if (docFreq >= (max + 1) * limit)
        return null;

      return new TermScorer(this, termDocs, searcher.getSimilarity(),
                            reader.norms(term.field()));
    }
>>

A limit factor of 0.2 will then remove all terms from the search that appear
in more than (approximately) 20% of the documents. For long queries, the search
time is reduced, by about a factor of 2 even on shorter text queries. A factor
of 1.0 or higher will give you results identical to the original version. Also,
highlighting often looks better, as only the less frequent terms are highlighted.
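
To make the threshold arithmetic concrete, here is a minimal standalone sketch
of the same test, outside of Lucene. The class FrequencyLimit, its two methods,
and the sample document frequencies are all made up for illustration; the map
is only a stand-in for per-term lookups such as searcher.docFreq(term).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FrequencyLimit {

    // Same comparison as in the scorer above: a term is dropped when its
    // document frequency reaches (maxDoc + 1) * limit.
    static boolean isTooFrequent(int docFreq, int maxDoc, float limit) {
        return docFreq >= (maxDoc + 1) * limit;
    }

    // Keeps only the terms below the limit, preserving their order.
    // docFreqs is a hypothetical stand-in for real index statistics.
    static List<String> keepRareTerms(Map<String, Integer> docFreqs,
                                      int maxDoc, float limit) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (!isTooFrequent(e.getValue(), maxDoc, limit)) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented numbers for an index with maxDoc = 1000.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 950);
        docFreqs.put("search", 210);
        docFreqs.put("lucene", 12);

        // Limit 0.2: the threshold is about 200.2 documents.
        System.out.println(keepRareTerms(docFreqs, 1000, 0.2f)); // [lucene]
        // Limit 1.0: the threshold is 1001, which no term can reach.
        System.out.println(keepRareTerms(docFreqs, 1000, 1.0f)); // [the, search, lucene]
    }
}
```

Because docFreq can never exceed maxDoc, the (max + 1) in the comparison is
what guarantees that a factor of 1.0 filters nothing.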

The removed terms stay in the index and can therefore still be searched, but
with this method we can speed up more complex searches.

My questions:

(1) Is there a more elegant way of doing this? E.g., the docFreq is looked up
again in the TermScorer, and I would like to remove this redundancy.

(2) Is this a worthwhile contribution to Lucene's features in your opinion?

Comments appreciated,

--

Dr.-Ing. Karsten Konrad
Head of Information Agent Engineering

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com





