> From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]] > > You cannot, in general, structure a Lucene query such that it > will yield > the same document rankings that Google would for that (query, document > set). The reason for this is that Google employs a scoring > algorithm that > includes information about the topology of the pages (i.e., how the > pages are linked together). (An overview of what Google does in this > regard may be found at http://www.google.com/technology/index.html .) > Thus, in order to get Lucene to do "what Google does", you'd have to > rewrite large chunks of it.
I don't agree with your conclusion: you would not have to re-write much of Lucene to incorporate this sort of information. To my understanding, Google uses linking information as a factor in scoring. Thus every document in the index has a factor computed from its links that is multiplied into its score. Lucene already keeps a factor per document that is multiplied into its score, but one that is computed from the document's length, not its links. Thus, once one has computed link scores, to add them to Lucene we just need to permit applications to affect this factor, with something like a Document.setBoost(float) method. The representation of the per-document factor would also need to change a little internally. It is currently stored as a single byte, and multiplying in an arbitrary factor would cause overflow. But enlarging it to 16 bits would be a small change. So adding such a capability would require re-writing only a very small chunk of Lucene. Computing a link-based factor would also take some code, but that's writing, not re-writing. Doug -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>