I'm just exploring ht://Dig as a replacement for my own robot (VWbot) - a Perl nightmare that drives me crazy - and trying to figure out how to do some of the things my old robot was doing. htdig has much better indexing, a better database, and better support than mine.
Gzip handling

The TODO.html that comes with htdig-3.2.0b3 suggests that gzip and compress decoders are now implemented, but perhaps I misunderstood or am missing some config items. For example, my test file http://andrew.triumf.ca/test/latex.ps.gz returns

  Content-type: application/postscript
  Content-encoding: x-gzip

and htdig seems to ignore the Content-Encoding and tries to parse the compressed PostScript directly. I'd expect it to call gunzip and then pass the result to the appropriate MIME handler (rough sketch below). Apache, Netscape, and I think IE now all support this transparently, so I have some sites where I serve gzipped HTML (though often with content negotiation), and at TRIUMF we have gzipped PostScript, text, etc. that can be viewed directly in a browser but is currently missed by htdig.

Scoreboarding dead servers

At some point I decided to ignore dead servers for the rest of the run. I think htdig does this the first time it tries a server, but if the server becomes unreachable during a run it seems to wait out a timeout period for each URL - as happened, for example, when someone powered down a server while I was indexing it.

Grace period for dead links/servers

I gave URLs three or five chances, I think, to get repaired before they were pruned, on the off chance that the site was undergoing maintenance or was temporarily unavailable when spidered. Htdig, I think, either prunes all dead links or none, depending on the config file. Probably not worth messing with in the grand scheme of things. (A sketch combining this with the scoreboarding idea is below.)

Revisit interval

I set a revisit interval per URL, originally based on a submission entry and the meta "revisit-after" hint, and later depending on how often a page had changed in the past. Htdig, I think, revisits every page in its database every run, which seems a bit unnecessary. I thought for a moment that it wasn't even checking modification times, but found it was sending If-Modified-Since with a localtime date, which Apache doesn't understand (sketch of the GMT form below).

Round-Robin to servers

I was visiting different servers in turn - or rather, only revisiting servers where at least a server_time had elapsed since they were last visited - so that I wasn't waiting for some server when there were others I could visit in the meantime (sketch below). I think htdig is in fact doing this, except that it does max_connection_requests on an HTTP/1.1 server. Great.

Multithreaded requests

I never managed this, but other tools, e.g. "nmap", seem to do it. I don't think htdig does it yet, either. I was looking at the idea of forking subprocesses to index a number of URLs on different servers at the same time, then gathering the results back together to build the list of new URLs to be visited, but I gave up. I think it ought to be possible, but it was making my head spin. (A thread-based sketch is below.)

Harvesting Arbitrary Metadata

I was interested in harvesting metadata from documents, e.g. Author or Subject from HTML META or PDF pdfmark blocks, or Dublin Core Creator, Date, etc. I had hard-coded entries in my database for some of these to present in results, and I figured out how to do this in htdig in much the same way (sketch below). I'm not sure whether users really want this, seeing as it seems impossible to get authors to add metadata to documents, even their own name. However, some authoring systems and PostScript generators include the username automatically on Unix, and on other systems too provided the tool has been configured at installation (the so-called "hidden data" in Word documents), so maybe it's not entirely pointless. Author search seems a common kind of search in, e.g., libraries.
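To make the gzip point above concrete, here's a minimal Python sketch of what I mean (not htdig's actual C++ code path; parse_postscript and parse_html are hypothetical stand-ins for whatever parser the indexer would normally call):

import gzip
import urllib.request

def parse_postscript(data):
    # Hypothetical stand-in for a real PostScript/external parser.
    print("would index PostScript,", len(data), "bytes")

def parse_html(data):
    # Hypothetical stand-in for a real HTML parser.
    print("would index HTML,", len(data), "bytes")

def fetch_and_dispatch(url):
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
        encoding = (resp.headers.get("Content-Encoding") or "").lower()
        ctype = (resp.headers.get("Content-Type") or "").split(";")[0].strip()

    # Undo the content coding first ...
    if encoding in ("gzip", "x-gzip"):
        body = gzip.decompress(body)

    # ... then hand the decoded bytes to the handler for the MIME type.
    if ctype == "application/postscript":
        parse_postscript(body)
    elif ctype == "text/html":
        parse_html(body)

# e.g. fetch_and_dispatch("http://andrew.triumf.ca/test/latex.ps.gz")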
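The scoreboarding and grace-period ideas above might look something like this - again just an illustrative Python sketch with made-up names (MAX_FAILED_RUNS etc.), not a proposal for htdig's internals:

from urllib.parse import urlparse

MAX_FAILED_RUNS = 3      # grace period before a dead link is pruned

dead_hosts = set()       # scoreboard, reset each run
failure_counts = {}      # url -> consecutive failed runs (persisted between runs)

def should_try(url):
    # Skip every URL on a host already known to be down this run.
    return urlparse(url).hostname not in dead_hosts

def record_failure(url, server_unreachable=False):
    if server_unreachable:
        dead_hosts.add(urlparse(url).hostname)   # don't wait out the timeout again
    failure_counts[url] = failure_counts.get(url, 0) + 1
    return failure_counts[url] >= MAX_FAILED_RUNS   # True => prune it now

def record_success(url):
    failure_counts.pop(url, None)                # link repaired; reset its count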
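On the If-Modified-Since point: HTTP dates are supposed to be RFC 1123 dates in GMT, which is presumably why Apache ignores a localtime value. A minimal sketch of sending it correctly and treating 304 as "unchanged" (Python, purely illustrative):

import urllib.error
import urllib.request
from email.utils import formatdate

def changed_since(url, last_visit_epoch):
    req = urllib.request.Request(url)
    # formatdate(..., usegmt=True) gives e.g. "Sun, 06 Nov 1994 08:49:37 GMT"
    req.add_header("If-Modified-Since", formatdate(last_visit_epoch, usegmt=True))
    try:
        with urllib.request.urlopen(req) as resp:
            return True, resp.read()      # 200: changed since last visit
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return False, None            # 304 Not Modified: skip reindexing
        raise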
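The round-robin scheme above is roughly this: per-host queues, and a host only gets another request once a per-server delay has elapsed. Sketch only; SERVER_WAIT_TIME and the data structures are made up for illustration:

import time
from collections import deque
from urllib.parse import urlparse

SERVER_WAIT_TIME = 10.0      # seconds between hits on any one host

queues = {}                  # host -> deque of URLs still to fetch
last_hit = {}                # host -> time of last request

def add_url(url):
    queues.setdefault(urlparse(url).hostname, deque()).append(url)

def next_url():
    # Return a URL from some host that is "due", or None if none are.
    now = time.monotonic()
    for host, q in queues.items():
        if q and now - last_hit.get(host, 0.0) >= SERVER_WAIT_TIME:
            last_hit[host] = now
            return q.popleft()
    return None              # every host was hit too recently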
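For what it's worth, the "fork a subprocess per server" idea is easier to sketch with threads: fetch one URL per server concurrently and gather the results back into one list. Purely a sketch; fetch_one is a stand-in fetcher, not anything in htdig:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch_one(url):
    # Stand-in fetcher: returns (url, body) or (url, None) on failure.
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()
    except OSError:
        return url, None

def fetch_batch(urls_one_per_server):
    # One worker per server, so no single host gets hit twice at once.
    if not urls_one_per_server:
        return []
    with ThreadPoolExecutor(max_workers=len(urls_one_per_server)) as pool:
        return list(pool.map(fetch_one, urls_one_per_server))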
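And a sketch of the kind of META harvesting I mean, using Python's html.parser rather than htdig's own parser; the WANTED set is just an example:

from html.parser import HTMLParser

WANTED = {"author", "dc.creator", "dc.date", "subject"}   # example fields

class MetaHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in WANTED and attrs.get("content"):
            self.meta[name] = attrs["content"]

# Usage:
#   h = MetaHarvester()
#   h.feed('<meta name="Author" content="A. Daviel">')
#   h.meta  ->  {'author': 'A. Daviel'}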
Notifying Owners of broken links

I did this on one site (TRIUMF). Well, I didn't actually notify them by email, but built a flat file that could be searched by name or turned into a web page. Basically, every time there's a 404 or 60? (server name not resolved) I make an entry under the referring page, then list them at the end of the run tagged with an author name from meta Author, or the old rev=made mailto, or the last mailto found on the page (for in-house use, email harvesting is acceptable, and it's often the page author or maintainer). A rough sketch is below. When it comes down to it, most of the time I'm too lazy to fix broken links even on my own pages, but some orgs may try to enforce a "no broken links" policy, and if the org is diverse, with many authors in different divisions, then this feature may be useful - as in Fielding's original MOMspider - perhaps more so than the htdig-notify "alarm clock".

Indexing non-parsable objects (images & multimedia, etc.)

I never did this, but it seems like a neat idea. AltaVista does it. At TRIUMF we have many photo archives that it would be useful to index. I see that htdig harvests descriptions for pages from the hrefs in referring pages, so some of the framework exists. Basically, I think one could do a HEAD operation on a link, and if it came back as status 200 but non-parsable, tag it with the MIME type and index it under the description words from the hrefs (sketch below). That would require a switch on the search (as proposed for search-words-in-title or search-by-author) so that one could say something like "sasquatch AND mimetype=image/*" to find alleged pictures of the creature, rather than thousands of pages with stuff like "my cousin took a picture of Sasquatch but his film didn't come out". In-house, this might be useful to find the source files for, e.g., Word documents rather than printable derivatives.
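In case it helps, here is roughly what my flat-file broken-link report did, as a Python sketch; the owner heuristics (meta Author, rev=made, last mailto) are reduced to dictionary lookups with made-up keys:

from collections import defaultdict

broken = defaultdict(list)    # referring page -> list of (dead URL, status)

def record_broken(referrer, dead_url, status):
    broken[referrer].append((dead_url, status))

def owner_of(page_meta):
    # Best guess at a contact: meta Author, then rev=made, then last mailto.
    return (page_meta.get("author")
            or page_meta.get("rev_made")
            or page_meta.get("last_mailto")
            or "unknown")

def dump_report(path, meta_by_page):
    # One block per referring page, tagged with whoever looks responsible.
    with open(path, "w") as out:
        for referrer, links in sorted(broken.items()):
            out.write("%s\t%s\n" % (referrer, owner_of(meta_by_page.get(referrer, {}))))
            for dead_url, status in links:
                out.write("\t%s\t%s\n" % (status, dead_url))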
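And the HEAD-then-index-by-anchor-text idea might look like this, including the kind of mimetype-restricted query it would enable; everything here (the index list, search) is made up for illustration:

import fnmatch
import urllib.request

index = []   # each entry: {"url": ..., "mimetype": ..., "words": ...}

def index_object(url, anchor_texts):
    # HEAD the link; if it's alive, index it under its MIME type plus the
    # words from the anchors that point at it.
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            ctype = (resp.headers.get("Content-Type") or "").split(";")[0].strip()
    except OSError:
        return
    words = {w.lower() for text in anchor_texts for w in text.split()}
    index.append({"url": url, "mimetype": ctype, "words": words})

def search(word, mimetype_pattern="*"):
    # e.g. search("sasquatch", "image/*") finds only the alleged pictures.
    return [d["url"] for d in index
            if word.lower() in d["words"]
            and fnmatch.fnmatch(d["mimetype"], mimetype_pattern)]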
--
Andrew Daviel
Are you always losing things? - http://huzizit.com