Re: [Nutch-general] Indexing an entire domain

Doug Cutting Fri, 13 Feb 2004 09:12:56 -0800

Thomas Johnson wrote:

Maybe this is a FAQ or in the docs somewhere; if so I apologize in
advance. I have two questions about Nutch:
1. Is there an easy way to index every reachable page in a domain,
without running the generate, fetch, updatedb, analyze cycle a hundred
times?

Each cycle traverses one level of links. So if you have pages whose shortest path is ten links from your starting URLs, then you will need ten cycles. Yes, this is inconvenient for intranet applications. It would be good to add a high-level interface that performs such cycling automatically.
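To make the "one level per cycle" point concrete, here is a minimal sketch in Python. It is not Nutch code: the `links` dictionary and the function name `discovery_cycle` are made up for illustration, and it only models the idea that each cycle follows the outlinks of the pages found in the previous cycle.

```python
def discovery_cycle(links, seeds):
    """Return {url: cycle in which the url is first discovered}.

    `links` maps each page to the pages it links to (a hypothetical
    in-memory link graph). Seed URLs are known at cycle 0.
    """
    depth = {url: 0 for url in seeds}
    frontier = list(seeds)
    cycle = 0
    while frontier:
        cycle += 1
        next_frontier = []
        # One cycle: follow the outlinks of every page in the current level.
        for page in frontier:
            for target in links.get(page, []):
                if target not in depth:
                    depth[target] = cycle
                    next_frontier.append(target)
        frontier = next_frontier
    return depth


if __name__ == "__main__":
    site = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
    print(discovery_cycle(site, ["a"]))
```

In this toy graph the page "e" is three links from the seed "a", so it only turns up in the third cycle, even though the whole site has just five pages.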

One caution, however: many sites have an infinite number of pages. Consider a calendar. Following the "next month" link will forever generate new pages. So, in general, you're probably better off running cycles until the number of new pages found drops substantially.
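A stopping rule along those lines could be sketched as follows. This is only an illustration under stated assumptions: `run_cycle` is a hypothetical callable that performs one generate/fetch/updatedb/analyze pass however you choose to script it and returns the number of new pages it found, and the `drop_ratio` threshold of 10% of the peak is an arbitrary choice, not a Nutch setting.

```python
def run_until_converged(run_cycle, max_cycles=50, drop_ratio=0.1):
    """Run cycles until the number of new pages drops substantially.

    run_cycle   -- hypothetical callable: runs one cycle, returns new page count
    max_cycles  -- hard limit, so an "infinite" site cannot loop forever
    drop_ratio  -- stop once a cycle finds fewer than this fraction of the
                   largest per-cycle count seen so far (assumed threshold)
    Returns the list of per-cycle new page counts.
    """
    counts = []
    peak = 0
    for _ in range(max_cycles):
        new_pages = run_cycle()
        counts.append(new_pages)
        peak = max(peak, new_pages)
        if new_pages == 0 or new_pages < drop_ratio * peak:
            break
    return counts


if __name__ == "__main__":
    # Simulated counts: a site whose calendar keeps yielding 30 pages per cycle.
    simulated = iter([10, 100, 400, 120, 30, 30, 30])
    print(run_until_converged(lambda: next(simulated)))
```

With the simulated counts above, the loop stops at the first cycle that yields only 30 new pages rather than following the calendar indefinitely; `max_cycles` is there as a backstop in case the count never drops.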

2. Is there a way to refetch, reupdate, and reanalyze all the pages that
are in the index?

You can use the -adddays option to generate, specifying a value greater than db.default.fetch.interval (30 by default). That property determines how often, in days, each URL is generated for re-fetch. An intranet site might reasonably reduce it to 7 or even 1. If it is set to 1, then a generate run daily should refetch everything. The -adddays option makes generate act as though it were run that many days in the future. So if your fetch interval is seven days and you specify '-adddays 8', then it is as though you're running generate eight days hence, and all pages will be due for re-fetching.
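The date arithmetic can be sketched like this. It is a simplified model of the behaviour described above, not Nutch's actual implementation: the function name `is_due` and its parameters are made up, and treating a page as due on exactly its next-fetch date (an inclusive comparison) is an assumption.

```python
from datetime import date, timedelta


def is_due(last_fetched, fetch_interval_days, today, add_days=0):
    """Return True if a page would be selected for re-fetch.

    Models -adddays as shifting "today" forward by add_days.
    All names here are illustrative, not Nutch identifiers.
    """
    next_fetch = last_fetched + timedelta(days=fetch_interval_days)
    effective_today = today + timedelta(days=add_days)
    return next_fetch <= effective_today


if __name__ == "__main__":
    today = date(2004, 2, 13)
    # A page fetched today, with a seven-day interval:
    print(is_due(today, 7, today))              # not yet due
    print(is_due(today, 7, today, add_days=8))  # due once "today" is shifted
```

Under this model, any add_days value larger than the fetch interval makes even a page fetched today come due, which is why '-adddays 8' with a seven-day interval selects everything.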

Doug


