Re: [Nutch-dev] [Fwd: Fetch list priority]

2005-10-19 Thread Massimo Miccoli

Dear Doug,

Any news about the integration of OPIC in mapred? I have time to develop
OPIC on Nutch mapred. Can you help me get started?
From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems that
the best way to integrate OPIC is in the old webdb. Is this approach also
valid for the CrawlDb in mapred?

Thanks,

Massimo

Doug Cutting wrote:

Here's some interesting stuff about OPIC, an easy-to-calculate 
link-based measure of page quality.  I'm going to read the papers, and 
if it is as good as it sounds, perhaps implement this in the mapred 
branch.  Does anyone have experience with OPIC?


-------- Original Message --------
Subject: Fetch list priority
Date: Thu, 29 Sep 2005 10:57:31 +0200
From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
Organization: Universitat Pompeu Fabra

Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
during the OSWIR workshop.

I told you I would contact you about the priorities of the crawler, and
that there are better strategies than using log(indegree). I suggested
using OPIC (Online Page Importance Computation).

OPIC is described here by Abiteboul et al.:

http://www.citeulike.org/user/ChaTo/article/240858

We ran experiments with OPIC on two collections of 2 million pages each,
and we verified that these collections have the same power-law exponents
as the full web [I'm attaching a graph of PageRank vs. pages
downloaded]. Ordering pages by indegree is as bad as random:

http://www.citeulike.org/user/ChaTo/article/240824

http://www.citeulike.org/user/ChaTo/article/240898

Why? Because the crawler tends to focus on a few Web sites. See for
instance Boldi et al., "Do your worst to make the best":

http://www.citeulike.org/user/ChaTo/article/240866

===

Here is the general idea of OPIC: at the beginning, each page has the
same score. Let's call it 'opic':

  for all initial pages i:
    opic[i] = 1;

Whenever you find a link:

  opic[destination] += opic[source] / outdegree[source];

That is it. Abiteboul's paper proves that this converges even in a
changing graph, and that it is a good estimator of quality. He also
suggests using the history of a page to keep its opic across crawls,
but even without the history we have seen that it works quite well.

In your case, what you do in org.apache.nutch.tools.FetchListTool is:
...
String[] anchors = dbAnchors.getAnchors(page.getURL());
curScore.set(scoreByLinkCount ?
  (float)Math.log(anchors.length+1) : page.getScore());
...

You need something different, because you will have to read the scores
of the pages that point to your page. You can do it by (a) keeping or
reading the scores of the inlinks to each page, or (b) running the loop
in the other direction, over the source pages:

   for each page P in the webdb:
     for each outlink in page P:
       opic[destination] += opic[P] / outdegree[P];

Note that to make this more effective you must also update the 'opic' of
the pages you already crawled, and that I think you should avoid 
self-links.
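
The loop in (b) can be sketched in plain Java as follows. This is only a minimal illustration of the simplified update described in this message, not Nutch code: the class name OpicPass, the method distribute, and the use of in-memory maps in place of the webdb are all assumptions made for the example. Pages without a stored score are assumed to start at 1, and self-links are left out of both the outdegree and the update.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a pass over the webdb; not part of Nutch.
class OpicPass {
    /**
     * For each source page, adds opic[source] / outdegree[source] to every
     * destination. Scores are updated in place, so the result depends on the
     * order in which the source pages are visited.
     */
    static void distribute(Map<String, List<String>> outlinks, Map<String, Double> opic) {
        for (Map.Entry<String, List<String>> page : outlinks.entrySet()) {
            String source = page.getKey();
            // Count only the links that leave the page (self-links are ignored).
            int outdegree = 0;
            for (String target : page.getValue()) {
                if (!target.equals(source)) outdegree++;
            }
            if (outdegree == 0) continue;
            // Assumption: a source with no stored score counts as an initial page (1.0).
            double share = opic.getOrDefault(source, 1.0) / outdegree;
            for (String target : page.getValue()) {
                if (!target.equals(source)) {
                    // Newly discovered destinations start from this first share.
                    opic.merge(target, share, Double::sum);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new LinkedHashMap<>();
        outlinks.put("A", Arrays.asList("B", "C", "A")); // includes a self-link
        outlinks.put("B", Arrays.asList("C"));
        Map<String, Double> opic = new HashMap<>();
        opic.put("A", 1.0);
        opic.put("B", 1.0);
        distribute(outlinks, opic);
        System.out.println(opic);
    }
}
```

With pages visited in the order A, B, the scores end up as A = 1.0, B = 1.5 and C = 2.0: A gives 0.5 to each of B and C, and B then passes its full 1.5 to C.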


The 'opic' scores will also be statistically distributed according to a
power-law so it's sensible to use log(opic) when combining this with
other scores with a different distribution, such as text similarity.
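
As a rough illustration of that last point, the sketch below dampens the link score with a logarithm before blending it with a text-similarity value. Everything here is an assumption made for the example: the class and method names, the linear blend, the weight parameter, and the choice of log(1 + opic) instead of log(opic) so that the result stays finite and non-negative for a score of zero.

```java
// Hypothetical helper; the blend and its weight are illustrative assumptions.
class CombinedScore {
    /**
     * Blends a power-law distributed link score with a text similarity
     * that is assumed to lie in [0, 1].
     */
    static double combine(double opic, double textSimilarity, double linkWeight) {
        // log1p(x) = log(1 + x): compresses the heavy tail, and is 0 when opic is 0.
        double dampedOpic = Math.log1p(opic);
        return linkWeight * dampedOpic + (1.0 - linkWeight) * textSimilarity;
    }

    public static void main(String[] args) {
        // A page with 100 times the opic does not get 100 times the combined score.
        System.out.println(combine(10.0, 0.5, 0.5));
        System.out.println(combine(1000.0, 0.5, 0.5));
    }
}
```

The weight would have to be tuned against whatever other scores are in use; nothing in this thread prescribes a value.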



I hope this is useful for you.

All the best,



Re: [Nutch-dev] [Fwd: Fetch list priority]

2005-10-19 Thread Doug Cutting

Massimo Miccoli wrote:
Any news about the integration of OPIC in mapred? I have time to develop
OPIC on Nutch mapred. Can you help me get started?
From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems that
the best way to integrate OPIC is in the old webdb. Is this approach also
valid for the CrawlDb in mapred?


Yes.  I think the way to implement this in the mapred branch is:

1. In CrawlDatum.java, replace 'int linkCount' with 'float score'.  The 
default value of this should be 1.0f.  This will require changes to 
accessors, write, readFields, compareTo etc.  A constructor which 
specifies the score should be added.  The comparator should sort by 
decreasing score.


2. In crawl/Fetcher.java, add the score to the Content's metadata:

  public static final String SCORE_KEY = "org.apache.nutch.crawl.score";
  ...
  private void output(...) {
    ...
    // Metadata values are strings, so store the score in string form.
    content.getMetadata().setProperty(SCORE_KEY,
                                      Float.toString(datum.getScore()));
    ...
  }


3. In ParseOutputFormat.java, when writing the CrawlDatum for each 
outlink (line 77), set the score of the link CrawlDatum to be the score 
of the page divided by its number of outlinks:


   // Split the page's score evenly among its outlinks.
   float score =
     Float.parseFloat(parse.getData().get(Fetcher.SCORE_KEY));
   score /= links.length;
   for (int i = 0; i < links.length; ...) {
     ...
       new CrawlDatum(CrawlDatum.STATUS_LINKED,
                      interval, score);
     ...
   }

4. In CrawlDbReducer.java, remove linkCount calculations.  Replace these 
with something like:


  float scoreIncrement = 0.0f;
  while (values.next()) {
    ...
    switch (datum.getStatus()) {
    ...
    case CrawlDatum.STATUS_LINKED:
      // Each linked datum carries one inlink's share of its source's score.
      scoreIncrement += datum.getScore();
      break;
    ...
    }
  }
  ...
  result.setScore(result.getScore() + scoreIncrement);

I think that should do it, no?
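
The flow of scores in steps 3 and 4 can be tried out without any Nutch classes. The sketch below is only an assumption-laden stand-in: ScoreFlowSketch, emitLinkScores and reduce are made-up names, and plain lists replace the CrawlDatum records and the map/reduce machinery.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parse output and the CrawlDb reducer.
class ScoreFlowSketch {
    /** Step 3: a parsed page emits (outlink URL, page score / number of outlinks). */
    static List<Map.Entry<String, Float>> emitLinkScores(float pageScore, String[] links) {
        List<Map.Entry<String, Float>> out = new ArrayList<>();
        if (links.length == 0) return out; // nothing to distribute
        float share = pageScore / links.length;
        for (String link : links) {
            out.add(new AbstractMap.SimpleEntry<>(link, share));
        }
        return out;
    }

    /** Step 4: for one URL, add the sum of the incoming shares to its existing score. */
    static float reduce(float existingScore, List<Float> linkedScores) {
        float scoreIncrement = 0.0f;
        for (float s : linkedScores) {
            scoreIncrement += s;
        }
        return existingScore + scoreIncrement;
    }

    public static void main(String[] args) {
        // A page with the default score 1.0 and four outlinks gives 0.25 to each.
        System.out.println(emitLinkScores(1.0f, new String[] {"a", "b", "c", "d"}));
        // A URL with score 1.0 that receives shares of 0.25 and 0.5.
        System.out.println(reduce(1.0f, Arrays.asList(0.25f, 0.5f)));
    }
}
```

In this sketch a page with no outlinks simply emits nothing, which is one possible choice; the steps above do not say how such pages should be handled.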

Doug


Re: [Nutch-dev] [Fwd: Fetch list priority]

2005-10-19 Thread Ken Krugler

Massimo Miccoli wrote:
Any news about the integration of OPIC in mapred? I have time to
develop OPIC on Nutch mapred. Can you help me get started?
From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems
that the best way to integrate OPIC is in the old webdb. Is this
approach also valid for the CrawlDb in mapred?


Yes.  I think the way to implement this in the mapred branch is:


[snip]

Just for grins, I modified Nutch 0.7 to use OPIC. It was a quick 
hack, where I stuffed the OPIC score in a page's nextScore field, 
added to this value when processing a page's outlinks, and then used 
it when ranking links in the FetchListTool.


Seems to be working well, though without a well-constrained crawl 
environment it's hard to come up with quantitative results. At least 
we no longer spend a disproportionate amount of our crawl time on 
some sites (like about.com) that wind up with lots of in-bound links.


Note that our usage is also a bit non-standard in that we're doing a 
vertical crawl, and have a way of scoring page contents at crawl 
time. So we use this in combination with the OPIC score as the page 
score that we divide up among the outbound links.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200