According to Conrad Schilbe:
> On Thursday, January 23, 2003, at 04:13 PM, Gilles Detillieux wrote:
> > According to Conrad Schilbe:
> >> At some point I stopped receiving the "Errors to take note of"
> >> information in my dig reports. I have done tests with the exact same
> >> configuration digging different sites, and the information was there.
> >> The dig I am doing is 3000+ documents; could this be a problem, or
> >> possibly something to do with the configuration of the Apache server?
> >>
> >> I have tried many changes to the htdig configuration with no luck,
> >> and again, the configuration I have works for a different, smaller
> >> site. Same installation, just digging a different site.
> >>
> >> I turned to the mailing list expecting to find similar problems
> >> reported, with no luck.
> >>
> >> Any suggestions would be greatly appreciated.
> >
> > Are you sure you're still running htdig with the -s option?
>
> Yes. That was my first thought too.
>
> I have since done some testing digging the same server but a different
> virtual host, and have found that it works there. So to me, it is either
> a volume issue or an Apache configuration issue. I imagine that htdig
> is in use for other sites with 3000+ active documents; can anyone
> confirm this?
>
> So that leaves only an Apache configuration issue... but I did note
> that even on sites that display no errors, the text `Errors to note:'
> is still in the report. That is not the case for this one particular
> site.
>
> Is it possible that somewhere along the way the htdig report is
> becoming too large and that information is dropped? I notice that htdig
> does not print anything while digging unless, of course, verbose is on.
> It then stores the data in memory and dumps it when done. This could be
> a volume issue, then.
Seems pretty unlikely to me. htdig maintains the "notFound" list as a
String class object, which has no inherent limit on size. When it needs
a bigger buffer, it reallocates it with a "new" call, which should only
fail when you run out of virtual memory. If that happens, you should
know about it, as a failure in "new" should trigger a core dump. The
String class would fail if a string exceeded 2 GB in length, but that's
not going to happen even if all 3000+ documents were "not found".

More likely, for whatever reason, htdig is no longer getting any 404
errors on that site. That would happen if you're doing update digs:
documents that aren't found are removed from the database, so until one
of the documents that's in the database or in the start_url is updated,
htdig won't even look at its links to rediscover the URLs of the
missing documents. Either update the documents that link to the
formerly missing URLs, or reindex the site from scratch.

If that doesn't help, have a look at how 404 errors are handled on that
site. It may be that htdig never sees that status code there, but
instead gets some other document (e.g. an error page), with a normal
status code, for any unresolvable URL on that site.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
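The last scenario above (a "soft 404", where the server answers every
unresolvable URL with status 200 and a human-readable error page, so
the crawler never sees a real 404) can be checked without htdig at all.
Below is a minimal, hypothetical sketch of such a check; it is not
htdig code, and the marker phrases are assumptions you would adjust to
match your own server's error page.

```python
def looks_like_soft_404(status_code, body):
    """Return True if a nominally successful response is really an
    error page (a "soft 404"), rather than a genuine HTTP 404.

    Marker phrases below are guesses; tailor them to the error page
    your Apache server actually serves.
    """
    if status_code == 404:
        return False  # a real 404; htdig would report this normally
    markers = ("not found", "no longer exists", "page cannot be found")
    text = body.lower()
    # Status 200 plus error-page wording suggests a soft 404.
    return status_code == 200 and any(m in text for m in markers)
```

To use it, fetch a URL on the problem site that you know does not
exist and pass the response's status code and body to this function.
If it returns True, htdig has no 404 to report, which would explain
the missing "Errors to take note of" section.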

