> From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]
> 
> You cannot, in general, structure a Lucene query such that it 
> will yield
> the same document rankings that Google would for that (query, document
> set).  The reason for this is that Google employs a scoring 
> algorithm that
> includes information about the topology of the pages (i.e., how the
> pages are linked together).  (An overview of what Google does in this
> regard may be found at http://www.google.com/technology/index.html .)
> Thus, in order to get Lucene to do "what Google does", you'd have to
> rewrite large chunks of it.

I don't agree with your conclusion: you would not have to re-write much of
Lucene to incorporate this sort of information.  To my understanding, Google
uses linking information as a factor in scoring.  Thus every document in the
index has a factor computed from its links that is multiplied into its
score.

Lucene already keeps a factor per document that is multiplied into its
score, but one that is computed from the document's length, not its links.
Thus, once one has computed link scores, to add them to Lucene we just need
to permit applications to affect this factor, with something like a
Document.setBoost(float) method.  The representation of the per-document
factor would also need to change a little internally.  It is currently
stored as a single byte, and multiplying in an arbitrary factor would cause
overflow.  But enlarging it to 16 bits would be a small change.

So adding such a capability would require re-writing only a very small chunk
of Lucene.  Computing a link-based factor would also take some code, but
that's writing, not re-writing.

Doug

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to