Hello Everyone,

I currently have nutch set up doing a "whole-web" style crawl.  When I
need to index a new page or reindex an existing page immediately, I
start a process that waits until the webdb is not being used by the
normal crawl process, locks the webdb using the existence of a file as
a mutex, and performs a modified inject [which uses WebDBWriter's
addPageWithScore() instead of addPageIfNotPresent()] followed by a
generate, fetch, updatedb, analyze, index, and deletion of duplicates.
 The inject score is set to a very high value and I then specify that
value as the cutoff for the generate operation. This way I can very
quickly manually add new/refreshed pages to my searchable content.
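To make the "file as a mutex" trick concrete, here is a rough Python sketch of the idea (not Nutch code; the lock path and timing values are just placeholders):

```python
import os
import time

# Hypothetical lock file guarding the webdb; the path is an assumption.
LOCK_PATH = "webdb.lock"

def acquire_webdb_lock(poll_seconds=0.1, timeout=5.0):
    """Wait until the webdb is free, then lock it by atomically creating
    the lock file. Returns True on success, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so the
            # creation itself is an atomic test-and-set.
            fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            time.sleep(poll_seconds)  # webdb busy; wait and retry
    return False

def release_webdb_lock():
    os.remove(LOCK_PATH)
```

The atomicity of O_CREAT | O_EXCL is what makes a plain file usable as a mutex between the normal crawl process and the priority process.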

What I would like to do next is to do the same with an entire
domain, given only the root web page (being able to reindex everything
that matches a given regular expression would be even better). So
basically, I would like to inject a file of root pages, and
then crawl from each of these root pages until all of the
domain's content is refreshed.  This would of course only be used for
small to mid-sized domains of 100 pages or so. The closest I have
been able to get to this goal so far is:
  Inject a url with a high score as above
  Generate a fetchlist with that high score as the cutoff
  Update webdb
  (skip the analyze step - thus outlinks gathered from entirely new
content also receive the high score assigned to injected urls)
  Generate with a cutoff 
  ... 
  and do this to a prespecified depth.
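The loop above, modeled as toy Python (not Nutch code; the HIGH/CUTOFF values, the dict-as-webdb, and the outlink map are all stand-ins for illustration):

```python
HIGH = 1_000_000.0   # assumed score given to injected priority pages
CUTOFF = HIGH        # generate cutoff, so only priority pages are fetched

def priority_crawl(roots, outlinks, depth):
    """Simulate inject -> (generate/fetch/updatedb) * depth with the
    analyze step skipped, so new pages inherit the high inject score."""
    webdb = {url: HIGH for url in roots}          # modified inject
    fetched = set()
    for _ in range(depth):
        # generate: pages at or above the cutoff, not yet fetched
        fetchlist = [u for u, s in webdb.items()
                     if s >= CUTOFF and u not in fetched]
        if not fetchlist:
            break
        for url in fetchlist:                     # fetch
            fetched.add(url)
            for link in outlinks.get(url, []):    # updatedb, no analyze:
                webdb.setdefault(link, HIGH)      # new pages keep HIGH score
    return fetched
```

Note the setdefault: because analyze is skipped, only pages that are *new* to the webdb pick up the high score, which is exactly why an already-existing page blocks propagation (the first downside below).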

There are two downsides to this: it uses a prespecified depth, and if
the page injected already exists (meaning I would like the domain
refreshed instead) the high score and next fetch date are not
propagated to outlinks of the injected url (which of course is the
desired behavior during a normal crawl). The best thing I can think of
now is to write my own external UpdateDatabaseTool that propagates the
score and next fetch date to outlinks unconditionally. I would use
this tool only for such priority instances.  Does anyone know of a
better way to approach the implementation of such functionality?
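For clarity, here is the behavior such a tool would implement, again as a hypothetical Python sketch rather than the real UpdateDatabaseTool: the injected page's score and next fetch date are forced onto every outlink, even ones already present in the webdb.

```python
from dataclasses import dataclass

@dataclass
class Page:
    score: float
    next_fetch: int  # e.g. epoch seconds; field names are illustrative

def priority_update(webdb, url, outlinks):
    """After fetching `url`, propagate its score and next fetch date to
    every outlink unconditionally, overwriting existing webdb entries."""
    src = webdb[url]
    for link in outlinks:
        webdb[link] = Page(src.score, src.next_fetch)
    return webdb
```

The only difference from the normal update is the unconditional overwrite, so existing pages of the domain get pulled into the next priority fetchlist too.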

Also, I think that in my scenario using something like google's
sitemaps (https://www.google.com/webmasters/sitemaps/docs/en/protocol.html)
to help direct the crawl would be helpful. Are there any plans to
incorporate something of the sort into nutch sometime in the near
future?

Thanks in advance,
Kamil

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
