
I am indexing a subset of Wikipedia. I have 4 years' worth of data and
have taken a snapshot of each document every month over that 4-year
span. Thus, I have 4*12=48 versions of each document. (I keep track of
the timestamp in a field.) I have noticed that in many cases a
Wikipedia document changes very little between versions, and
sometimes not at all. I end up with duplicate documents whose only
difference is the timestamp. Does this impact the term weighting used
by Lucene?

My intuition is that if a term occurs in only one document, but that
document is indexed 48 times, then the document frequency of the term
is "artificially" increased. Is this true? And if so, is this
something I need to worry about?
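
To make the question concrete, here is a rough sketch of what I think
happens. It assumes the classic TF-IDF formula idf = 1 + ln(N / (df + 1)),
which I believe is what Lucene's older default similarity uses; other
similarities (e.g. BM25) use a different formula. The collection size
and the helper name classic_idf are made up for illustration.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Classic TF-IDF style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

UNIQUE_DOCS = 1000      # hypothetical number of distinct articles
SNAPSHOTS = 4 * 12      # one snapshot per month over 4 years

# Baseline: the term occurs in exactly one article, indexed once.
idf_single = classic_idf(1, UNIQUE_DOCS)

# Every article is indexed once per snapshot, so the document
# frequency and the collection size grow by the same factor.
idf_all_dup = classic_idf(SNAPSHOTS, UNIQUE_DOCS * SNAPSHOTS)

# Only the article containing the term is duplicated; all other
# articles are indexed once.
idf_one_dup = classic_idf(SNAPSHOTS, UNIQUE_DOCS + SNAPSHOTS - 1)

print(f"single copy:          {idf_single:.3f}")
print(f"all docs duplicated:  {idf_all_dup:.3f}")
print(f"one doc duplicated:   {idf_one_dup:.3f}")
```

If this sketch is right, duplicating every document uniformly leaves
the idf roughly where it was, while duplicating only some documents
makes their terms look much more common than they are.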

