Dear Doug,
Any news about the integration of OPIC in mapred? I have time to develop
OPIC on the Nutch mapred branch. Can you help me get started?
From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems
that the best way to integrate OPIC is in the old webdb; is that
approach also valid for the CrawlDb in mapred?
Thanks,
Massimo
Doug Cutting wrote:
Here's some interesting stuff about OPIC, an easy-to-calculate
link-based measure of page quality. I'm going to read the papers, and
if it is as good as it sounds, perhaps implement this in the mapred
branch. Does anyone have experience with OPIC?
Original Message
Subject: Fetch list priority
Date: Thu, 29 Sep 2005 10:57:31 +0200
From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
Organization: Universitat Pompeu Fabra
Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
during the OSWIR workshop.
I told you I would contact you about the crawler's priorities, and that
there are better strategies than using log(indegree). I suggested using
OPIC (on-line page importance computation).
OPIC is described here by Abiteboul et al.:
http://www.citeulike.org/user/ChaTo/article/240858
We ran experiments with OPIC on two collections of 2 million pages each,
and verified that these collections have the same power-law exponents as
the full web [I'm attaching a graph of Pagerank vs. pages downloaded].
Ordering pages by indegree is as bad as random:
http://www.citeulike.org/user/ChaTo/article/240824
http://www.citeulike.org/user/ChaTo/article/240898
Why? Because the crawler tends to focus on a few Web sites. See for
instance Boldi et al., "Do your worst to make the best":
http://www.citeulike.org/user/ChaTo/article/240866
===
Here is the general idea of OPIC: at the beginning, each page has the
same score. Let's call it 'opic':
for all initial pages i:
    opic[i] = 1;
Whenever you find a link:
    opic[destination] += opic[source] / outdegree[source];
This is it. Abiteboul's paper proves that this converges even in a
changing graph, and that it is a good estimator of quality. He also
suggests using the history of a page to keep its opic across crawls,
but even without the history we have seen that it works quite well.
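The update above can be sketched in plain Java. This is a minimal
in-memory illustration, not Nutch code; the class and map names are
mine, and pages discovered only through links start accumulating cash
from zero rather than from the initial score of 1.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the OPIC update described above.
// 'links' maps each source page to its list of outlinks.
public class OpicSketch {
    public static Map<String, Double> compute(Map<String, List<String>> links) {
        Map<String, Double> opic = new HashMap<>();
        // At the beginning, each known page has the same score.
        for (String page : links.keySet()) {
            opic.put(page, 1.0);
        }
        // Whenever we find a link, the source splits its current
        // score evenly among its outlinks.
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            String source = e.getKey();
            List<String> out = e.getValue();
            if (out.isEmpty()) continue;
            double share = opic.get(source) / out.size();
            for (String dest : out) {
                if (dest.equals(source)) continue; // avoid self-links
                // Destinations seen only as link targets start from 0 here.
                opic.merge(dest, share, Double::sum);
            }
        }
        return opic;
    }
}
```

Because the computation is on-line, the result depends on the order in
which pages are visited; using an order-preserving map makes a small
run reproducible.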
In your case, what you do in org.apache.nutch.tools.FetchListTool is:
...
String[] anchors = dbAnchors.getAnchors(page.getURL());
curScore.set(scoreByLinkCount ?
(float)Math.log(anchors.length+1) : page.getScore());
...
You need something different, because you will have to read the scores
of the pages pointing to your page. You can do it by (a) keeping or
reading the scores of the inlinks to each page, or (b) running this
cycle over the source pages in the other order:
for each page P in the webdb:
    for each outlink of page P:
        opic[destination] += opic[P] / outdegree[P];
Note that to make this more effective you must also update the 'opic' of
the pages you have already crawled, and I think you should also avoid
self-links.
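Since the thread is about the mapred branch, the "other order" pass
(b) can be sketched as a map/reduce-style computation: the map step
emits a (destination, contribution) pair for each outlink, and the
reduce step sums the contributions per destination. This is a plain-Java
sketch under my own names, not the Nutch CrawlDb or Hadoop API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one OPIC pass over the webdb in map/reduce style.
public class OpicPass {
    // A page record: its URL, current opic score, and outlinks.
    record Page(String url, double opic, List<String> outlinks) {}

    // Map step: each page splits its score evenly among its outlinks.
    static List<Map.Entry<String, Double>> map(Page p) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        int degree = p.outlinks().size();
        for (String dest : p.outlinks()) {
            if (dest.equals(p.url())) continue; // avoid self-links
            out.add(Map.entry(dest, p.opic() / degree));
        }
        return out;
    }

    // Reduce step: sum the contributions arriving at each destination.
    static Map<String, Double> reduce(List<Map.Entry<String, Double>> pairs) {
        Map<String, Double> sums = new HashMap<>();
        for (Map.Entry<String, Double> e : pairs) {
            sums.merge(e.getKey(), e.getValue(), Double::sum);
        }
        return sums;
    }
}
```

Note that, unlike the on-line version, every source contributes its
score from the start of the pass, which is exactly what iterating over
the source pages "in the other order" gives you.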
The 'opic' scores will also be statistically distributed according to a
power law, so it is sensible to use log(opic) when combining them with
other scores that have a different distribution, such as text similarity.
I hope this is useful for you.
All the best,