List,

I am indexing a subset of Wikipedia. I have 4 years' worth of data and have taken a snapshot of each document every month over that 4-year span. Thus, I have 4*12=48 versions of each document. (I keep track of the timestamp in a field.) I have noticed that in many cases a Wikipedia document does not change very much between versions, and sometimes not at all. I end up with duplicate documents whose only difference is the timestamp. Does this impact the term weighting used by Lucene?

My intuition is that if a term occurs in only one document, but that document is indexed 48 times, then the document frequency of the term is "artificially" increased. Is this true? And if so, is this something I need to worry about?

Thanks,
Steve
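
To make the intuition above concrete, here is a rough sketch of how duplicated snapshots could shift an inverse document frequency weight. It assumes the classic TF-IDF form idf = 1 + ln(numDocs / (docFreq + 1)), which is what Lucene's older default similarity used; other similarities (BM25, for example) use a different formula. The article count, snapshot count, and function names are hypothetical and chosen only for illustration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Assumed classic TF-IDF weight: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


ARTICLES = 10_000   # hypothetical number of distinct articles
SNAPSHOTS = 12      # hypothetical number of stored versions per article

# Case 1: no duplication, the term appears in exactly one article.
rare_single = classic_idf(1, ARTICLES)

# Case 2: every article is stored SNAPSHOTS times, so docFreq and
# numDocs both grow by the same factor.
rare_uniform = classic_idf(1 * SNAPSHOTS, ARTICLES * SNAPSHOTS)
common_single = classic_idf(1_000, ARTICLES)
common_uniform = classic_idf(1_000 * SNAPSHOTS, ARTICLES * SNAPSHOTS)

# Case 3: only the one article containing the term is stored SNAPSHOTS
# times; every other article is stored once.
rare_skewed = classic_idf(SNAPSHOTS, ARTICLES + SNAPSHOTS - 1)

print(f"rare term, no duplicates:       {rare_single:.3f}")
print(f"rare term, uniform duplicates:  {rare_uniform:.3f}")
print(f"rare term, skewed duplicates:   {rare_skewed:.3f}")
print(f"common term, no duplicates:     {common_single:.3f}")
print(f"common term, uniform duplicates:{common_uniform:.3f}")
```

Under this assumed formula, uniform duplication leaves the ratio numDocs/docFreq nearly unchanged, so the weights move only through the +1 smoothing term. The weight drops noticeably only in the skewed case, where some documents are stored many more times than others.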