According to Conrad Schilbe:
> On Thursday, January 23, 2003, at 04:13  PM, Gilles Detillieux wrote:
> > According to Conrad Schilbe:
> >> At some point I stopped receiving the "Errors to take note of"
> >> information in my dig reports. I have done tests with the exact same
> >> configuration digging different sites and the information was there.
> >> The dig I am doing is 3000+ documents; could this be a problem, or
> >> possibly something to do with the configuration of the apache server?
> >>
> >> I have tried many changes to the configuration for htdig with no luck
> >> and again the configuration I have works for a different and smaller
> >> site. Same installation, just digging a different site.
> >>
> >> I turned to the mailing list expecting similar problems reported, with
> >> no luck.
> >>
> >> Any suggestions would be greatly appreciated.
> >
> > Are you sure you're still running htdig with the -s option?
> 
> Yes. That was my first thought too.
> 
> I have since done some testing digging the same server but a different 
> virtual host and have found that it works there. So to me, it is either 
> a volume issue or an apache configuration issue. I imagine that htdig 
> is in use for other sites with 3000+ active documents; can anyone 
> confirm this?
> 
> So that leaves only an apache configuration issue... but I did note 
> that even on sites that display no errors, the text `Errors to note:' 
> is still in the report... this is not the case for this one particular 
> site.
> 
> Is it possible that somewhere along the way the htdig report is 
> becoming too large and that information is dropped? I notice that htdig 
> does not print anything while digging unless of course verbose is on. 
> It then stores the data in memory and dumps when done. This could be a 
> volume issue then.

Seems pretty unlikely to me.  htdig maintains the "notFound" list as a
String class object, which has no inherent limit on size.  When it needs
a bigger buffer, it will reallocate it with a "new" call, which should
only fail when you run out of virtual memory.  If that happens, you
should know about it, as an unhandled allocation failure in "new" aborts
the program, typically leaving a core dump.
The String class would fail if a string exceeded 2 GB in length, but
that's not going to happen even if all 3000+ documents were "not found".

More likely, what's happening is that, for whatever reason, htdig is no
longer getting any 404 errors on that site.  That would happen if you're
doing update digs: documents that aren't found get removed from the
database, so until one of the documents in the database or in the
start_url that links to them is updated, htdig won't even look at its
links to rediscover the URLs of the missing documents.  Either update
the documents that link to the formerly missing URLs, or reindex the
site from scratch.

If that doesn't help, have a look at how 404 errors are handled on that
site.  It may be that htdig never sees a 404 status code there, but
instead gets back some other document (e.g. a custom error page) served
with a normal 200 status code for any unresolvable URL on that site.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
