Carl Cerecke wrote:
> Hi,
> 
> The docs for the OPICScoringFilter mention that the plugin implements a 
> variant of OPIC from Abiteboul et al.'s paper. What exactly is different? 
> How does the difference affect the scores?

As it is now, the implementation doesn't preserve the total "cash value" 
in the system, and also there is almost no smoothing between the 
iterations (Abiteboul's "history").

As a consequence, scores may (and do) vary dramatically between 
iterations, and they never converge to stable values: they only ever 
increase. For pages that receive many score contributions from other 
pages, this leads to explosive growth into the range of thousands or 
eventually millions. The scores produced by the OPIC plugin therefore 
exaggerate the differences between pages more and more with every 
iteration, even if the web graph that you crawl is stable.

In a sense, to follow the "cash" analogy, our implementation of OPIC 
illustrates a runaway economy - galloping inflation, rich get richer and 
poor get poorer ;)
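
To make the inflation concrete, here is a toy sketch in Python. It is not 
the Nutch plugin code: the three-page graph and the function name 
propagate_without_adjustment are made up for illustration, and the update 
rule is only an assumption about what "no adjustment" means (a page passes 
score to its outlinks but keeps its own score as well).

```python
def propagate_without_adjustment(graph, scores):
    """One pass: every page gives score/outdegree to each outlink,
    but (the assumed flaw) keeps its own score too."""
    new_scores = dict(scores)
    for page, outlinks in graph.items():
        if not outlinks:
            continue  # nothing to distribute from a page without outlinks
        share = scores[page] / len(outlinks)
        for target in outlinks:
            new_scores[target] += share
    return new_scores


# Hypothetical, completely static web graph: page -> list of outlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = {page: 1.0 for page in graph}

totals = []
for _ in range(20):
    scores = propagate_without_adjustment(graph, scores)
    totals.append(sum(scores.values()))  # total "cash" in the system
```

Because nothing is ever subtracted, the total in `totals` grows on every 
pass even though the graph never changes, which is the runaway behaviour 
described above.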


> Also, there's a comment in the code:
> 
> // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
> // XXX in the paper, where page "loses" its score if it's distributed to
> // XXX linked pages...
> 
> Is this something that will be looked at eventually or is the scoring 
> "good enough" at the moment without some "adjustment".

Yes, I'll start working on it when I get back from vacation. I did some 
simulations that show how to fix it (see the bottom of the page at 
http://wiki.apache.org/nutch/FixingOpicScoring).
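
For comparison, a minimal Python sketch of the cash/history bookkeeping as 
I read it from the paper, where a page does "lose" its cash when it 
distributes it. This is not necessarily the fix described on the wiki 
page, and it is not Nutch code: the graph, the name opic_visit, and the 
round-robin visiting order are assumptions made for illustration. It also 
assumes every page has at least one outlink.

```python
def opic_visit(graph, cash, history, page):
    """Visit one page: record its cash in history, then give all of it away."""
    amount = cash[page]
    history[page] += amount      # the smoothing "history" accumulates here
    cash[page] = 0.0             # the page loses what it distributes
    outlinks = graph[page]       # assumed non-empty for every page
    for target in outlinks:
        cash[target] += amount / len(outlinks)


# Hypothetical static web graph: page -> list of outlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cash = {page: 1.0 / len(graph) for page in graph}
history = {page: 0.0 for page in graph}

for _ in range(200):
    for page in graph:           # simple round-robin visiting order
        opic_visit(graph, cash, history, page)

# Importance estimate: each page's share of all cash ever recorded.
total_history = sum(history.values())
importance = {page: history[page] / total_history for page in graph}
```

Here the total cash stays constant, and the importance estimates are 
ratios of accumulated history, so they settle down instead of growing 
without bound.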

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers