I was also wondering about this, though I don't know the internals of Lucene well enough to give you any smart implementation feedback. In my opinion, it would be a very useful addition. I would just add that if the frequent term is the only term in the query, it should not be eliminated. I just tried Google and it behaves this way: very frequent terms ARE indexed, and they get removed only when they are part of a query with more than one term.
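To make the suggestion concrete, here is a rough sketch of the rule in plain Java, outside of Lucene. The class `StatisticalStopwordFilter`, its methods, and the `docFreqs` map are hypothetical names made up for illustration; only the threshold test `docFreq >= (maxDoc + 1) * limit` comes from the code in the original message below. It assumes the document frequencies have already been looked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: drops very frequent terms from multi-term queries. */
final class StatisticalStopwordFilter {

    /** Same threshold test as in the quoted scorer() code. */
    static boolean isTooFrequent(int docFreq, int maxDoc, float limit) {
        return docFreq >= (maxDoc + 1) * limit;
    }

    /**
     * Returns the terms that survive filtering.
     * docFreqs maps each term to the number of documents containing it (assumed precomputed).
     */
    static List<String> filter(List<String> terms, Map<String, Integer> docFreqs,
                               int maxDoc, float limit) {
        // A query with a single term is never filtered, however frequent the term is.
        if (terms.size() <= 1) {
            return new ArrayList<>(terms);
        }
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            int docFreq = docFreqs.getOrDefault(term, 0);
            if (!isTooFrequent(docFreq, maxDoc, limit)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up frequencies for an index of 1000 documents.
        Map<String, Integer> docFreqs = Map.of("the", 900, "lucene", 40, "scorer", 12);
        System.out.println(filter(List.of("the", "lucene", "scorer"), docFreqs, 1000, 0.2f));
        System.out.println(filter(List.of("the"), docFreqs, 1000, 0.2f));
    }
}
```

With a limit of 0.2 and 1000 documents, "the" is dropped from the three-term query but kept when it is the whole query. Inside Lucene the check would presumably have to happen where the whole query is visible, since a single TermQuery cannot tell whether it is the only clause.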
--
Alex Murzaku
___________________________________________
alex(at)lissus.com  http://www.lissus.com

-----Original Message-----
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Monday, March 31, 2003 11:30 AM
To: Lucene Developers List
Subject: Proposal: Statistical Stopword elimination

Hi,

I am experimenting with long queries (parts of documents as search query), and I would like to filter all terms with high document frequencies when searching. I.e., a kind of statistical, language independent stop word elimination while searching.

For this, I have introduced a frequency limit factor into Similarity and test for excessively high document frequencies in the TermQuery. The code looks somewhat like this:

>>
  public Scorer scorer(IndexReader reader) throws IOException {
    TermDocs termDocs = reader.termDocs(term);

    if (termDocs == null)
      return null;

    float limit = searcher.getSimilarity().getLimitFactor();
    int docFreq = searcher.docFreq(term);
    int max = searcher.maxDoc();

    if (docFreq >= (max+1)*limit)
      return null;

    return new TermScorer(this, termDocs, searcher.getSimilarity(),
                          reader.norms(term.field()));
  }
>>

A limit factor of 0.2 will then remove all terms from the search that appear in more than (approximately) 20% of the documents. For long queries, the search time is reduced - about factor 2 even on shorter text queries. A factor of 1.0 or higher will give you identical results to the original version.

Also, highlighting often looks better as only less frequent terms are highlighted.

While the terms removed stay in the index and therefore can still be searched, we can speed up more complex searches by this method.

My questions:

(1) Is there some more elegant way of doing this? E.g., access to the docFreq is done again in the TermScorer and I would like to remove this redundancy.

(2) Is this a worthwhile contribution to Lucene's features in your opinion?

Comments appreciated,

--
Dr.-Ing. Karsten Konrad
Head of Information Agent Engineering
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
