Andrzej Bialecki wrote:
Exactly my case - I'm using a limited-bandwidth connection, so it pains me to discard the hundreds of MBs I've already fetched... Since this error condition doesn't leave corrupted files behind, perhaps the fetcher should still create the fetcher.done file, plus another file indicating that it is only a partial fetch, with the error message. Oh, btw: the RequestScheduler.run() method body should be surrounded by a try/catch that performs some sensible action in case of such exceptions...
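Something along these lines is what I have in mind. It's just a sketch, not the actual RequestScheduler code; fetchPages(), segmentDir, and the fetcher.error file name are placeholders I made up for illustration:

  import java.io.File;
  import java.io.FileWriter;
  import java.io.IOException;
  import java.io.PrintWriter;

  public class RequestSchedulerSketch {

    // placeholder for the segment directory the fetcher writes into
    private final File segmentDir = new File("segment");

    public void run() {
      try {
        fetchPages();   // stand-in for whatever the real run() body does
        markDone();
      } catch (Throwable t) {
        // leave a note that this is only a partial fetch, with the error message
        try {
          PrintWriter out = new PrintWriter(
              new FileWriter(new File(segmentDir, "fetcher.error")));
          out.println("partial fetch: " + t);
          out.close();
        } catch (IOException ignored) {
          // nothing sensible left to do if we can't even write the marker
        }
        markDone();
      }
    }

    private void markDone() {
      try {
        new File(segmentDir, "fetcher.done").createNewFile();
      } catch (IOException e) {
        // ignore; the segment just won't be marked done
      }
    }

    private void fetchPages() throws IOException {
      // the real fetch loop goes here
    }
  }

The point is only that whatever blows up inside run(), the segment still ends up with markers saying how far the fetch got.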

The problem is that the fetcher frequently crashes in such a way (the process is killed, it runs out of memory, etc.) that Java code cannot do anything to catch the problem. Hopefully the fetcher will soon get more reliable and this will be less of an issue.


Browsing the archives at SF is painful. Is there another archive in mbox or some other downloadable format?

An mbox archive is at http://www.nutch.org/nutch-mboxes.tar.gz. As I mentioned earlier, the list is now also archived at http://www.mail-archive.com/.


The idea (undocumented) is that you discard segments older than 30 days (or whatever db.default.fetch.interval is set to). Keeping them even a bit longer would be safest, to make sure that all the pages have been refetched into a newer segment. Duplicate detection ensures that, if a page is in multiple segments, only the most recently fetched version is searched.
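For reference, that interval is an ordinary config property, so the 30-day default can be overridden. Something like this in the file where you keep your local overrides should do it (just a sketch, assuming the usual name/value property format; the value is in days):

  <property>
    <!-- number of days before a page is due to be refetched -->
    <name>db.default.fetch.interval</name>
    <value>30</value>
  </property>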


Hmm... Maybe I don't understand what the fetch.interval parameter is for. Does it affect the content of fetchlists produced with the "generate" command?

Yes. It determines how frequently a page will be added to a generated fetchlist.


Related question: is there any way to force an update, even if the pages are marked as already fetched and are not older than, say, 30 days? In other words, I'd like to force re-fetching some pages (by URL pattern or by "[not] older than" date).

The -addDays parameter to the generate command makes it act as if you were running it that many days in the future. Each URL has its own refresh interval, but there's no command yet which updates it for a particular URL. So, for now, the best way to do this is to lower the default update interval while you first inject the URLs you want updated more frequently, or while you update them; the second option is harder, though, since there will probably be a bunch of other URLs in the db by the time you update...
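For example, to generate a fetchlist as though it were a week from now, the invocation looks roughly like this (the db and segments directory arguments depend on your layout; check the tool's usage message for the exact option names):

  bin/nutch generate db segments -addDays 7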


And yet another question :-) - how do I remove URLs from the webdb?

There's no command yet to do that. If you put a negated regular expression in your urlfilter.regex.file, then matching URLs will never enter the db in the first place. Probably we should add a command which re-filters the URLs already in the db, removing any which are no longer permitted. That way, to remove URLs, you'd just add some regexps and run that command. Can you log a bug requesting something like this? Thanks.
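The filter file is just a list of regexps, each prefixed with '+' (accept) or '-' (reject) and checked top to bottom, so shutting out a whole site looks something like this (example.com is obviously a placeholder, and your copy of the file will already have a catch-all at the end):

  # never admit anything from this host
  -^http://([a-z0-9]*\.)*example.com/

  # accept everything else
  +.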


Doug



