Hi,

I am not sure I understand enough about this, so the following comment might be rubbish.

I assume that the PageDB is a part of the old WebDB?

2. PageDB

  The PageDB is used to crawl.  Initially it is empty.  Each round of
  fetching updates the list of known urls and their status.  The
  database is a directory of flat files.

pages: <url, <status, contentHash, lastFetchDate, numFailures> >

Is this list of storable fields extendable by plugins?

If not, I believe we should think about that.

For example, it might be interesting to monitor changes on websites and to prefer more up-to-date pages in ranking.

In that case I would, for example, add fields describing the content, so that changes can be computed when the page is fetched again. I would also store the calculated result: a value for the amount of change per unit of time.

I think the PageDB would be the right place to store information like this, so the list of fields should be extendable by plugins.
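
To make the idea a bit more concrete, here is a rough sketch in Java of what I have in mind. It is only an illustration under my own assumptions: the names PageRecord, ChangeRatePlugin, onFetch and changesPerDay are made up and are not existing Nutch classes or methods, and counting contentHash changes is just the simplest stand-in for "amount of change per time" that I could think of.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical page entry; NOT an existing Nutch class.
class PageRecord {
    // Fixed fields taken from the PageDB description.
    int status;
    String contentHash;
    long lastFetchDate;   // assumed to be milliseconds since the epoch
    int numFailures;

    // Open-ended key/value fields that plugins could read and write.
    private final Map<String, String> metadata = new HashMap<>();

    void setMeta(String key, String value) { metadata.put(key, value); }
    String getMeta(String key) { return metadata.get(key); }
}

// Hypothetical plugin that tracks how often a page changes.
class ChangeRatePlugin {
    static final String CHANGES = "changerate.changes";
    static final String FIRST_SEEN = "changerate.firstSeen";
    private static final double MILLIS_PER_DAY = 86_400_000.0;

    // Called after each fetch with the hash of the newly fetched content.
    static void onFetch(PageRecord page, String newHash, long fetchTime) {
        if (page.getMeta(FIRST_SEEN) == null) {
            // First fetch: start the counters.
            page.setMeta(FIRST_SEEN, Long.toString(fetchTime));
            page.setMeta(CHANGES, "0");
        } else if (!newHash.equals(page.contentHash)) {
            // Content differs from the previous fetch: count one change.
            int changes = Integer.parseInt(page.getMeta(CHANGES)) + 1;
            page.setMeta(CHANGES, Integer.toString(changes));
        }
        page.contentHash = newHash;
        page.lastFetchDate = fetchTime;
    }

    // Observed changes per day since the first fetch (0.0 if unknown).
    static double changesPerDay(PageRecord page) {
        String firstSeen = page.getMeta(FIRST_SEEN);
        if (firstSeen == null) return 0.0;
        long elapsed = page.lastFetchDate - Long.parseLong(firstSeen);
        if (elapsed <= 0) return 0.0;
        int changes = Integer.parseInt(page.getMeta(CHANGES));
        return changes / (elapsed / MILLIS_PER_DAY);
    }
}
```

The point is only that the fixed fields stay as they are, while a plugin keeps its own values under its own keys. A ranking plugin could then read changesPerDay without the PageDB having to know anything about it.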

I hope I did not talk too much rubbish.

Matthias

