Lucene won't be aware that you've got duplicate documents, but scoring does take into account the number of documents in which search terms appear. See http://lucene.apache.org/java/3_5_0/scoring.html and the javadocs for org.apache.lucene.search.Similarity.
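
To get a feel for the size of the effect, here is a rough sketch in Python (not Lucene code) of the idf formula that DefaultSimilarity uses in Lucene 3.x, idf = 1 + ln(numDocs / (docFreq + 1)). The corpus size, the helper names, and the assumption that every article is indexed the same number of times are made up for illustration.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """idf as in Lucene 3.x DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_with_copies(doc_freq: int, num_docs: int, copies: int) -> float:
    """idf when every document in the index is stored `copies` times.

    Hypothetical helper: both docFreq and numDocs grow by the same factor.
    """
    return idf(doc_freq * copies, num_docs * copies)


# Assumed numbers, for illustration only.
DISTINCT_DOCS = 10_000   # distinct articles
COPIES = 36              # snapshots per article, as in the question

# Compare idf without and with uniform duplication for rare and common terms.
for df in (1, 100, 5000):
    print(df, idf(df, DISTINCT_DOCS), idf_with_copies(df, DISTINCT_DOCS, COPIES))

# Contrast: only the one matching article is duplicated, the rest are not.
print(idf(1, DISTINCT_DOCS), idf(COPIES, DISTINCT_DOCS + COPIES - 1))
```

When every article is duplicated the same number of times, docFreq and numDocs grow together, so the ratio inside the logarithm barely moves. The skew shows up when only some articles are duplicated, as in the last line. This covers idf only; term frequency and length normalization are computed per document and are not modelled here.
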
Only you can say whether or not you need to worry about it. If you do, you could provide your own implementation of Similarity, or change your indexing process to skip updates where only the timestamp changes.

--
Ian.

On Sun, Nov 27, 2011 at 10:42 PM, Stephen Thomas <stephen.warner.tho...@gmail.com> wrote:
> List,
>
> I am indexing a subset of Wikipedia. I have 4 years' worth of data, and
> have taken snapshots of each document at each month in the 4 year
> span. Thus, I have 4*12=36 versions of each document. (I keep track of
> the timestamp in a field.) I have noticed that in many cases, a
> Wikipedia document does not change very much between each version,
> sometimes not at all. I end up with duplicate documents; the only
> difference is the timestamp. Does this impact the term weighting used
> by Lucene?
>
> My intuition is that if a term only occurs in one document, but that
> document occurs 36 times, then the frequency of the term is
> "artificially" increased. Is this true? And if so, is this something I
> need to worry about?
>
> Thanks,
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org