Maybe this is a FAQ or in the docs somewhere; if so I apologize in advance. I have two questions about Nutch:
1. Is there an easy way to index every reachable page in a domain, without running the generate, fetch, updatedb, analyze cycle a hundred times?
Each cycle traverses one level of links. So if you have pages whose shortest path is ten links from your starting URLs then you will need ten cycles. Yes, this is inconvenient for intranet applications. It would be good to add a high-level interface which performs such cycling automatically.
One caution, however: many sites have an infinite number of pages. Consider a calendar. Following the "next month" link will forever generate new pages. So, in general, you're probably better off running cycles until the number of new pages found drops substantially.
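As a rough sketch of that stopping rule, the loop below keeps cycling until the count of newly discovered pages falls to a small fraction of the largest count seen so far. It is written in Python purely for illustration. `run_cycle` is a hypothetical callable standing in for one generate, fetch, updatedb, analyze pass that reports how many new pages it found; the thresholds `drop_ratio` and `max_cycles` are likewise assumptions, not Nutch settings.

```python
from typing import Callable, List


def crawl_until_converged(run_cycle: Callable[[], int],
                          max_cycles: int = 50,
                          drop_ratio: float = 0.1) -> List[int]:
    """Repeat crawl cycles until few new pages turn up.

    run_cycle is a hypothetical hook: it performs one full cycle and
    returns the number of new pages discovered in that cycle.
    Returns the per-cycle counts of new pages.
    """
    history: List[int] = []
    peak = 0
    for _ in range(max_cycles):
        new_pages = run_cycle()
        history.append(new_pages)
        peak = max(peak, new_pages)
        # Stop when nothing new was found, or when the count has
        # dropped to drop_ratio of the best cycle so far.
        if new_pages == 0 or new_pages <= drop_ratio * peak:
            break
    return history


# Simulated counts: growth, then a tail that never reaches zero
# (think of the calendar's endless "next month" link).
simulated = iter([5, 40, 200, 120, 15, 2, 2, 2])
history = crawl_until_converged(lambda: next(simulated))
print(history)
```

With the simulated counts, the loop stops at the fifth cycle, even though the "calendar" would keep producing a couple of new pages per cycle forever. The `max_cycles` cap is a second safeguard in case the count never drops.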
2. Is there a way to refetch, reupdate, and reanalyze all the pages that are in the index?
You can use the -adddays option to generate, specifying a value greater than db.default.fetch.interval (30, by default). That property determines how often, in days, each URL is generated for re-fetch. An intranet site might reasonably reduce it to 7 or even 1. If it is set to 1, then a generate run daily should refetch everything. The -adddays option makes generate act as though it were run that many days in the future. So if your fetch interval is seven days and you specify '-adddays 8', then it is as though you were running generate eight days hence, and all pages will be due for re-fetching.
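The date arithmetic can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not Nutch's actual code; the function name `is_due` and its parameters are made up for illustration.

```python
from datetime import date, timedelta


def is_due(last_fetched: date, fetch_interval_days: int,
           today: date, add_days: int = 0) -> bool:
    """Model of the re-fetch check (illustrative, not Nutch's code).

    A page is due once its fetch interval has elapsed, measured
    against a 'pretend' date shifted add_days into the future.
    """
    pretend_today = today + timedelta(days=add_days)
    next_fetch = last_fetched + timedelta(days=fetch_interval_days)
    return next_fetch <= pretend_today


today = date(2004, 6, 1)
# A page fetched today with a 7-day interval is not yet due...
print(is_due(today, 7, today))              # False
# ...but it is due if generate pretends to run 8 days from now.
print(is_due(today, 7, today, add_days=8))  # True
```

Because even a page fetched today is due under this model once add_days exceeds the interval, any value larger than the fetch interval makes every page eligible.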
Doug
