Hi Andrzej,

I am sorry for the late reply. I haven't had a chance to prepare these dump files for you yet, but I made one interesting observation that could shed some light on the problem. I turned on verbose logging for http and the fetcher, and it seems that all three of these URLs redirect the fetcher to the same page.
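To make the observation concrete, here is a minimal sketch of how redirects could be grouped by their target. It assumes the (source, target) pairs have been copied by hand from the verbose logs; the URLs and the function names `group_by_redirect_target` and `shared_targets` are hypothetical and are not part of Nutch.

```python
from collections import defaultdict


def group_by_redirect_target(redirects):
    """Map each redirect target to the list of source URLs that lead to it."""
    groups = defaultdict(list)
    for source, target in redirects:
        groups[target].append(source)
    return dict(groups)


def shared_targets(redirects, min_sources=2):
    """Keep only targets reached from at least `min_sources` distinct sources."""
    grouped = group_by_redirect_target(redirects)
    return {
        target: sources
        for target, sources in grouped.items()
        if len(set(sources)) >= min_sources
    }


# Hypothetical (source, target) pairs, standing in for what the logs show.
observed = [
    ("http://example.com/old/a.html", "http://example.com/"),
    ("http://example.com/old/b.html", "http://example.com/"),
    ("http://example.com/old/c.html", "http://example.com/"),
    ("http://example.com/moved.html", "http://example.com/new.html"),
]

if __name__ == "__main__":
    for target, sources in shared_targets(observed).items():
        print(target, "<-", ", ".join(sources))
```

Any target that shows up with several sources is a candidate for the "default page" case described in this message.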
I have a lot of unfetched URLs in the database, but many of them no longer point to a real document (the original document is gone), and the server redirects to a default page (the home page, a "can't find this page" page, etc.). Do you think this information could help us now?

Anyway, I'll try to prepare those dump files for you (I don't have much experience with the segread command so far). In the meantime, I tried the newest SVN nutch-0.8 today, with the same result.

My current settings for redirects:

<name>http.redirect.max</name>
<value>3</value>

Regards,
Lukas

On 5/17/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Lukas Vlcek wrote:
> Hi Andrzej,
>
> nutch-site.xml says:
> <name>db.default.fetch.interval</name>
> <value>15</value>
>
> I tried readdb -dump.
> I am not an expert in dump output, but to me it seems that the db is not
> updated.
> I have two dump outputs (pre and post), and diffing them I found the
> following differences:
> 1) Some score values were changed.
> 2) Only one fetch time for one document has been changed, but that is
> not any of those three fetched pages...
>
> I also checked these three pages and they are still unfetched.
>
> Wow, that seems very strange...
> Any idea?

Ok, this could indicate some bugs in either Generate or CrawlDbReducer (both of which have recently been changed in a couple of places). Could you please do the following:

* prepare a fragment of the crawldb dump with the data about these three pages.
* generate, so that you get these three pages in the fetchlist (easy to check with segread).
* fetch
* prepare a fragment of the segment dump (segread -dump) with the data about these pages
* run updatedb
* prepare a fragment of the crawldb dump after updating

And then package this data nicely and send it to me. Thanks!

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com