Hey Solr people:
Suppose that we did not want to break up our document set into separate
indexes, but had certain cases where many versions of a document were not
relevant for certain searches.
I guess this could be thought of as an "authorization" class of problem,
however it is not that for us. We have a few other fields that determine
relevancy to the current query, based on what page the query is coming
from. It's kind of like authorization, but not really.
Anyway, I think the answer for how you would do it for authorization would
solve it for our case too.
So suppose you had 99 users and 100 documents. Document 1 is the same for
everybody, but each of the other 99 documents is a slightly different
version of one document, unique to one of the 99 users, but not "very"
unique. Suppose, for instance, that the only difference in the text of
those 99 documents is that each one is watermarked with its user's name.
Aren't you spamming your tf/idf at that point? Is there a way around this?
Is there a way to say, hey, group these 99 documents together and only
count 1 of them for tf/idf purposes?
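
To put rough numbers on the worry, here is a minimal sketch. It uses the textbook idf form ln(N / df), which is an assumption for illustration and not the exact formula Lucene/Solr applies; the function name and the scenario (a term that appears in the shared body of the 99 copies but not in Document 1) are also made up for the example.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency, ln(N / df).

    Illustrative only: Lucene's similarity classes use smoothed variants.
    """
    return math.log(num_docs / doc_freq)

# Index as stored: Document 1 plus 99 watermarked copies.
# A term from the shared body text occurs in all 99 copies.
idf_as_indexed = idf(num_docs=100, doc_freq=99)

# What a single user can actually see: Document 1 plus their own copy,
# so the same term occurs in 1 of 2 visible documents.
idf_per_user = idf(num_docs=2, doc_freq=1)

print(f"as indexed: {idf_as_indexed:.4f}")  # as indexed: 0.0101
print(f"per user:   {idf_per_user:.4f}")    # per user:   0.6931
```

Under these assumptions the term looks almost worthless index-wide, even though from any one user's point of view it distinguishes one of their two documents from the other.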
When doing queries, each user would only ever see 2 documents: Document 1,
plus whichever other document they specifically own.
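
For concreteness, the visibility restriction could be sketched as a filter query like the one below. The `owner` field, the `shared` marker value for Document 1, and the helper name are all hypothetical, and the user id is assumed to need no query escaping. As far as I understand it, `fq` only restricts which documents come back; the term statistics used for scoring are still taken from the whole index, which is why the tf/idf question above remains.

```python
from urllib.parse import urlencode

def build_select_params(user_query: str, user_id: str) -> dict:
    """Build Solr select parameters limiting results to one user's view.

    Assumes a hypothetical `owner` field holding either "shared"
    (Document 1) or the id of the user who owns the watermarked copy.
    """
    return {
        "q": user_query,
        # Filter query: restricts the result set without affecting scores.
        "fq": f"owner:(shared OR {user_id})",
    }

params = build_select_params("contract terms", "user42")
query_string = urlencode(params)
print(query_string)
```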
If there are web pages or book chapters I can read or re-read that address
this class of problem, those references would be great.
-Chris.