On Mon, 25 Feb 2002, Doug Cutting wrote:

> > From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]
> > 
> > You cannot, in general, structure a Lucene query such that it 
> > will yield
> > the same document rankings that Google would for that (query, document
> > set).  The reason for this is that Google employs a scoring 
> > algorithm that
> > includes information about the topology of the pages (i.e., how the
> > pages are linked together).  (An overview of what Google does in this
> > regard may be found at http://www.google.com/technology/index.html .)
> > Thus, in order to get Lucene to do "what Google does", you'd have to
> > rewrite large chunks of it.
> 
> I don't agree with your conclusion: you would not have to re-write
> much of Lucene to incorporate this sort of information.  To my
> understanding, Google uses linking information as a factor in scoring.  
> Thus every document in the index has a factor computed from its links
> that is multiplied into its score.

It's not quite as simple as you're making it sound.  Google's PageRank
algorithm is recursive: the "authority" of a page (a factor in determining
its rank) is determined in part by the authority of the pages that link to
it, and by the authority of the pages that link to each of the pages that
link to it, and so on.

> Lucene already keeps a factor per document that is multiplied into its
> score, but one that is computed from the document's length, not its
> links. Thus, once one has computed link scores, to add them to Lucene
> we just need to permit applications to affect this factor, with
> something like a Document.setBoost(float) method.  The representation
> of the per-document factor would also need to change a little
> internally.  It is currently stored as a single byte, and multiplying
> in an arbitrary factor would cause overflow.  But enlarging it to 16
> bits would be a small change.
> 
> So adding such a capability would require re-writing only a very small
> chunk of Lucene.  Computing a link-based factor would also take some
> code, but that's writing, not re-writing.

On reflection, I think you're right (although I am not certain that some
other aspect of the algorithm might not require more extensive re-writing;
I'd have to check).  On that basis, I retract my prefix.  :)

If the original poster were interested in exactly replicating Google's
results, though, he's probably out of luck, simply because while PageRank
is a published algorithm, there have been revisions to the Google
architecture that I am reasonably sure have not been published.

 [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
    Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

 


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to