Hmmm... Perhaps the segment data should be written atomically, so that it is consistent at all times, and some checkpointing data could be written from time to time, if it cannot already be reconstructed from the leftover data on the next run. Then the fetcher could be restarted on unfinished data.
I don't think this is a priority. We can already use the output of crashed fetcher runs, and, in general, the fetcher should not crash frequently. If it does, we should fix that.
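For reference, the atomic-write idea raised above could look roughly like the sketch below: write to a temporary sibling file, then rename it over the target, so a reader sees either the old checkpoint or the new one and never a partial file. This is not existing Nutch code. The class name, method name and ".tmp" suffix are made up for illustration, and it assumes a JDK that provides java.nio.file.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Nutch.
class AtomicSegmentWrite {

    // Writes data to a temporary sibling of target, then renames it into place.
    static void writeAtomically(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data); // a crash here leaves only the .tmp file behind
        try {
            // On POSIX file systems this is a rename(), which replaces an
            // existing target in a single step.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback for file systems without atomic rename: not atomic.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Note that this only protects against partially written files. It does not force the data to disk, so surviving a power failure would need an explicit sync as well.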
Looking at the logic in FetchListTool, I don't understand where the new "last modified" date is written back to the webdb. I must be blind or something; the whole thing wouldn't work otherwise... As soon as I understand this part, modifying the FetchListTool to do what I need should be straightforward, I think...
As I described, the fetch interval is not currently explicitly set by the code: it is always the default value, assigned by the Page constructor. To change it, one would call Page.setFetchInterval(int days). One could do this in UpdateDatabaseTool.pageContentsChanged(), for the page which was fetched and/or for its outgoing URLs that are added to the database. Ideally this would be done in an extensible way, so that different applications could specify different policies for update intervals. We'd welcome a contribution in this area.
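To make "extensible" a bit more concrete, one possible shape is a small policy interface that computes the next interval in days, with the result passed to Page.setFetchInterval(int days). Everything in the sketch below (FetchIntervalPolicy, FixedPolicy, AdaptivePolicy, and the halve/double rule) is a hypothetical illustration, not existing Nutch code.

```java
// Hypothetical policy types, not part of Nutch.
class FetchIntervalPolicies {

    // Decides how many days to wait before fetching a page again.
    interface FetchIntervalPolicy {
        int nextIntervalDays(int currentDays, boolean contentChanged);
    }

    // Always returns the same interval (mirrors the current default behavior).
    static final class FixedPolicy implements FetchIntervalPolicy {
        private final int days;

        FixedPolicy(int days) {
            this.days = days;
        }

        public int nextIntervalDays(int currentDays, boolean contentChanged) {
            return days;
        }
    }

    // Halves the interval when content changed, doubles it otherwise,
    // and keeps the result within [minDays, maxDays].
    static final class AdaptivePolicy implements FetchIntervalPolicy {
        private final int minDays;
        private final int maxDays;

        AdaptivePolicy(int minDays, int maxDays) {
            this.minDays = minDays;
            this.maxDays = maxDays;
        }

        public int nextIntervalDays(int currentDays, boolean contentChanged) {
            if (contentChanged) {
                return Math.max(minDays, currentDays / 2);
            }
            return (int) Math.min((long) maxDays, 2L * currentDays);
        }
    }
}
```

The value returned by a policy would then be handed to Page.setFetchInterval(int days) at the point where the page is updated, for example in UpdateDatabaseTool.pageContentsChanged().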
Oops. It seems to be there now - there are deletePage(url) / deleteLink(url) calls in WebDBWriter, and the corresponding argument parsing in main(); so it looks like this is only a deficiency in the docs :-).
What docs are you referring to?
There is no re-filter command, though...
Right. I suggested that you might log a bug report requesting this, or, if you're so inclined, write it yourself and contribute it.
Also, it looks like the method deleteLink() is missing from the IWebDBWriter.
Links are not individually deleted. Rather they're removed when their source page's content changes. In Nutch, links are from content (md5 of html) to URL, not URL-to-URL. If multiple pages share the same content, they also share a link in the db. So links are not removed until the last URL with that content is removed. These deletions are performed automatically by the db code.
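A toy in-memory model may make the sharing clearer. It is only an illustration of the bookkeeping described above, not the actual db code: the class ContentLinkStore and its methods are invented here, and content hashes are plain strings standing in for the md5 of the html.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: links are keyed by content hash, not by source URL.
class ContentLinkStore {
    private final Map<String, String> hashByUrl = new HashMap<>();
    private final Map<String, Integer> urlCountByHash = new HashMap<>();
    private final Map<String, Set<String>> linksByHash = new HashMap<>();

    // Adds a page or records that its content changed.
    void putPage(String url, String contentHash, Set<String> outlinks) {
        String old = hashByUrl.get(url);
        if (contentHash.equals(old)) {
            return; // content unchanged, nothing to do
        }
        if (old != null) {
            release(old); // this URL no longer has the old content
        }
        hashByUrl.put(url, contentHash);
        urlCountByHash.merge(contentHash, 1, Integer::sum);
        // Pages with identical content share one set of links.
        linksByHash.putIfAbsent(contentHash, new HashSet<>(outlinks));
    }

    void removePage(String url) {
        String old = hashByUrl.remove(url);
        if (old != null) {
            release(old);
        }
    }

    // Drops the links only when the last URL with that content is gone.
    private void release(String contentHash) {
        int remaining = urlCountByHash.merge(contentHash, -1, Integer::sum);
        if (remaining == 0) {
            urlCountByHash.remove(contentHash);
            linksByHash.remove(contentHash);
        }
    }

    Set<String> linksFrom(String contentHash) {
        return linksByHash.getOrDefault(contentHash, Collections.emptySet());
    }
}
```

In this model there is no deleteLink() at all: links disappear as a side effect of removePage() or of putPage() with new content, which is why the writer interface does not need such a method.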
Doug
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
