Vic Bancroft wrote on 10/17/2006 02:44 AM:
> In some of my group's usage of Lucene over large document collections,
> we have split the documents across several machines. This has led to
> a concern about whether the inverse document frequency is appropriate,
> since the score seems to be dependent on the partitioning of documents
> over indexing hosts. We have not formulated an experiment to
> determine whether it seriously affects our results, though it has been
> discussed.

What version of Lucene are you using? Are you using ParallelMultiSearcher to manage the distributed indexes, or have you implemented your own mechanism?

There was a bug a couple of years ago, in version 1.4.3 as I recall, where ParallelMultiSearcher was not computing df's correctly, but that has been fixed for a long time now. The df's are the sum of the df's from each distributed index and thus are independent of the partitioning.

Chuck
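
To illustrate why summed df's make the idf independent of the partitioning, here is a minimal sketch in plain Java. It does not call Lucene at all. The class and method names are hypothetical, the idf formula (log(numDocs / (docFreq + 1)) + 1) is assumed to match the classic DefaultSimilarity, and summing the per-index document counts alongside the df's is also an assumption made for the example.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene code.
class GlobalIdfSketch {

    // Assumed classic idf formula: log(numDocs / (docFreq + 1)) + 1
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Sum the per-index df's and per-index document counts,
    // then compute a single idf from the totals.
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        long df = Arrays.stream(docFreqs).sum();
        long n = Arrays.stream(numDocs).sum();
        return idf(df, n);
    }

    public static void main(String[] args) {
        // The same 1000 documents, 60 of which contain the term,
        // split across hosts in two different ways.
        double even = globalIdf(new long[] {30, 30}, new long[] {500, 500});
        double skewed = globalIdf(new long[] {55, 5, 0}, new long[] {200, 300, 500});

        // idf computed from a single partition's local statistics only.
        double localOnly = idf(55, 200);

        System.out.println("global idf, even split:   " + even);
        System.out.println("global idf, skewed split: " + skewed);
        System.out.println("identical: " + (even == skewed));
        System.out.println("local-only idf on one host: " + localOnly);
    }
}
```

Both splits yield the same global idf because only the totals enter the formula. An idf computed from one host's local counts does depend on how the documents were split, which is the behaviour the original question was worried about.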