I'm just exploring ht://Dig as a replacement for my own robot (VWbot) - a
Perl nightmare that drives me crazy - and trying to figure out how to
do some of the things I was doing in it. htdig has much better
indexing, a better database, and better support than mine.


Gzip handling:

The TODO HTML that comes with htdig-3.2.0b3 suggests that gzip
and compress decoders are now implemented, so perhaps I misunderstood or
am missing some config items. For example, my test file
http://andrew.triumf.ca/test/latex.ps.gz returns
  Content-type: application/postscript
  Content-encoding: x-gzip

htdig seems to ignore the Content-Encoding and tries to parse the
compressed PostScript directly. I'd expect it to call gunzip and then pass
the result to the appropriate MIME handler. Apache, Netscape, and I think
IE all support this transparently now, so I have some sites where I serve
gzipped HTML (though often with content negotiation), and at TRIUMF we
have gzipped PostScript, text, etc. that can be viewed directly in a
browser but is currently missed by htdig.
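
For what it's worth, here's a minimal sketch in Perl (the language VWbot
is in) of the handling I'd expect: fetch the document, and if the server
declares a gzip Content-Encoding, decode the body before handing it to
whatever parser handles the underlying MIME type. LWP and Compress::Zlib
are assumed to be available; this is just the idea, not how htdig is
structured internally.

  use LWP::UserAgent;
  use Compress::Zlib;                     # for memGunzip()

  my $ua  = LWP::UserAgent->new;
  my $url = 'http://andrew.triumf.ca/test/latex.ps.gz';
  my $res = $ua->get($url);

  my $body = $res->content;               # raw, still-compressed body
  my $enc  = $res->header('Content-Encoding') || '';
  if ($enc =~ /gzip/i) {                  # matches "gzip" and "x-gzip"
      $body = memGunzip($body)
          or die "gunzip failed for $url\n";
  }
  # $body is now plain application/postscript and can be handed to the
  # external parser configured for that MIME type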


Scoreboarding dead servers

At some point I decided to ignore dead servers for the rest of the run. I
think htdig does this the first time it tries a server, but if the server
becomes unreachable during a run it seems to wait out a timeout period for
each remaining URL, as happened to me when someone powered down a server
while I was indexing it.
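
Something like the following is all I mean by scoreboarding; the hash and
helper names are made up for illustration, keyed on host:port so the rest
of the run can skip a server after its first hard failure.

  use URI;

  my %dead_server;     # "host:port" => reason, kept for the rest of the run

  sub server_key {
      my $uri = URI->new($_[0]);
      return $uri->host . ':' . $uri->port;
  }

  sub mark_dead { $dead_server{ server_key($_[0]) } = $_[1] }
  sub is_dead   { $dead_server{ server_key($_[0]) } }

  # in the fetch loop:
  #   next if is_dead($url);
  #   mark_dead($url, 'connect timeout') when the connect fails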

Grace period for dead links/servers

I gave URLs, I think, 3 or 5 chances to get repaired before pruning them,
on the off chance that the site was undergoing maintenance or was
temporarily unavailable when spidered. Htdig, I think, either prunes all
dead links or none, depending on the config file. Probably not worth
messing with in the grand scheme of things.
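
In pseudo-Perl the grace period was just a per-URL strike counter; the
prune_url() call below is a stand-in for whatever actually removes the
record from the database.

  my $MAX_STRIKES = 3;       # or 5; failed runs allowed before pruning
  my %strikes;               # url => consecutive runs it has failed

  sub note_fetch_result {
      my ($url, $ok) = @_;
      if ($ok) {
          delete $strikes{$url};               # back to normal, reset
      } elsif (++$strikes{$url} >= $MAX_STRIKES) {
          prune_url($url);                     # stand-in for the real removal
          delete $strikes{$url};
      }
  }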

Revisit interval

I set a revisit interval per URL, originally based on a submission entry
and a meta "revisit-after" hint, and later on how often a page had changed
in the past. Htdig, I think, revisits every page in its database on every
run, which seems a bit unnecessary. I thought for a moment that it wasn't
even checking modification times, but found it was sending
If-Modified-Since with a local-time date, which Apache doesn't understand
(HTTP dates are supposed to be RFC 1123 format in GMT).
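
The fix on the client side is just to send the date in GMT in RFC 1123
format, which is what HTTP::Date's time2str() produces; roughly like this,
where $last_fetch_time stands in for whatever the database records per URL:

  use LWP::UserAgent;
  use HTTP::Date qw(time2str);

  my $ua  = LWP::UserAgent->new;
  my $res = $ua->get($url,
      'If-Modified-Since' => time2str($last_fetch_time),   # always GMT
  );
  if ($res->code == 304) {
      # not modified: keep the existing index entry and just bump the
      # next-revisit time instead of reparsing the document
  }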

Round-Robin to servers

I was visiting different servers in turn, or rather, only revisiting
servers where at least a server_time interval had elapsed since they were
last visited, so that I wasn't waiting on one slow server when there were
others I could visit in the meantime. I think htdig is in fact doing
this, except that it sends up to max_connection_requests in a row to an
HTTP/1.1 server. Great.
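
For reference, the round-robin in VWbot amounted to a per-host "earliest
next visit" table, roughly like this (names made up, URI module assumed):

  use URI;

  my $SERVER_WAIT = 10;      # seconds to leave a server alone between hits
  my %next_ok;               # host => earliest time we may hit it again

  sub pick_next_url {
      my ($queue) = @_;      # arrayref of pending URLs
      my $now = time;
      for my $i (0 .. $#$queue) {
          my $host = URI->new($queue->[$i])->host;
          next if ($next_ok{$host} || 0) > $now;    # still in its wait period
          $next_ok{$host} = $now + $SERVER_WAIT;
          return splice(@$queue, $i, 1);            # first eligible URL
      }
      return;                # every queued server was visited too recently
  }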


Multithreaded requests

I never managed this, but other tools, e.g. "nmap", seem to do it, and I
don't think htdig does it yet either. I was looking at the idea of forking
subprocesses to index a number of URLs on different servers at the same
time, then gathering the results back together to build the list of new
URLs to be visited, but I gave up. I think it ought to be possible, but it
was making my head spin.
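
The shape of what I was attempting, stripped right down, was one forked
child per server writing its results to a spool file for the parent to
merge afterwards; group_urls_by_host(), extract_links() and @pending here
are imaginary helpers and data, not anything in htdig.

  use LWP::UserAgent;

  my %by_host = group_urls_by_host(@pending);    # @pending = URLs still to visit

  my @kids;
  for my $host (keys %by_host) {
      my $pid = fork();
      die "fork failed: $!" unless defined $pid;
      if ($pid == 0) {                           # child: fetch this host's URLs
          my $ua = LWP::UserAgent->new;
          open my $out, '>', "spool.$host.$$" or exit 1;
          for my $url (@{ $by_host{$host} }) {
              my $res = $ua->get($url);
              print {$out} join("\n", extract_links($res)), "\n"
                  if $res->is_success;
          }
          close $out;
          exit 0;
      }
      push @kids, $pid;                          # parent keeps the pid
  }
  waitpid($_, 0) for @kids;    # then merge the spool files into the URL list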


Harvesting Arbitrary Metadata

I was interested in harvesting metadata from documents, e.g. Author or
Subject from HTML META or PDF pdfmark blocks, or Dublin Core Creator, Date,
etc. I had hard-coded entries in my database for some of these, to present
in results, and I figured out how to do much the same thing in htdig. I'm
not sure whether users really want this, since it seems impossible to get
authors to add metadata to documents, even their own name. However, some
authoring systems and PostScript generators include the username
automatically on Unix, and on other systems too if the tool was configured
at installation (the so-called "hidden data" in Word documents), so maybe
it's not entirely pointless. Author search seems to be a common kind of
search in, e.g., libraries.
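
On the harvesting side, at least for HTML, the head fields are easy to get
at; in Perl, HTML::HeadParser exposes <meta name="..."> entries as
X-Meta-* pseudo-headers, e.g. (with $html being the fetched page):

  use HTML::HeadParser;

  my $p = HTML::HeadParser->new;
  $p->parse($html);                  # only the <head> section is examined

  my $author  = $p->header('X-Meta-Author');       # <meta name="Author" ...>
  my $creator = $p->header('X-Meta-DC.Creator');   # Dublin Core creator
  # these could be stored with the document record and exposed as a
  # fielded search (author:daviel or similar) in the results page

PDF Info/pdfmark fields would need an external tool, but could feed the
same fields.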


Notifying Owners of broken links

I did this on one site (TRIUMF). Well, I didn't actually notify anyone by
email, but I built a flat file that could be searched by name or turned
into a web page. Basically, every time there's a 404 or a 60? (server name
not resolved) I make an entry under the referring page, then list them at
the end of the run, tagged with an author name from meta Author, or the
old rev=made mailto, or the last mailto found on the page (for in-house
use, email harvesting is acceptable, and it's often the page author or
maintainer).
When it comes down to it, most of the time I'm too lazy to fix broken
links even on my own pages, but some orgs may try to enforce a "no broken
links" policy, and if the org is diverse, with many authors in different
divisions, then this feature may be useful, as in Fielding's original
MOMspider, perhaps more so than the htdig-notify "alarm clock".
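
The bookkeeping itself is trivial; something along these lines, where
page_owner() stands in for the meta-Author / rev=made / last-mailto lookup
described above:

  my %broken;      # referring page => list of [ dead URL, status ]

  sub note_broken {
      my ($referrer, $url, $status) = @_;
      push @{ $broken{$referrer} }, [ $url, $status ];
  }

  # at the end of the run:
  for my $page (sort keys %broken) {
      my $owner = page_owner($page);    # stand-in: meta Author, rev=made,
                                        # or the last mailto on the page
      print "$page ($owner):\n";
      print "  $_->[1]  $_->[0]\n" for @{ $broken{$page} };
  }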


Indexing non-parsable objects (images & multimedia, etc.)

I never did this, but it seems like a neat idea; AltaVista does it.
At TRIUMF we have many photo archives that it would be useful to index.
I see that htdig harvests descriptions for pages from the hrefs in
referring pages, so some of the framework exists. Basically, I think
one could do a HEAD operation on a link, and if it came back as status
200 but non-parsable, tag it with the MIME type and index it under the
description words from the hrefs. That would require a switch on the
search (as proposed for search-words-in-title or search-by-author) so
that one could say something like "sasquatch AND mimetype=image/*" to
find alleged pictures of the creature rather than thousands of pages
with stuff like "my cousin took a picture of Sasquatch but his film
didn't come out". In-house, this might be useful for finding the source
for e.g. Word documents rather than printable derivatives.
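
As a sketch of the fetch side, using LWP again; is_parsable(), index_stub()
and @href_description_words are placeholders, not htdig internals:

  use LWP::UserAgent;

  my $ua  = LWP::UserAgent->new;
  my $res = $ua->head($url);

  if ($res->is_success) {
      my $type = $res->header('Content-Type') || '';
      $type =~ s/;.*$//;                       # drop any charset parameter
      unless (is_parsable($type)) {            # placeholder: text/html etc.
          # index the object under the words from the referring hrefs,
          # recording $type so a query can filter on mimetype=image/*
          index_stub($url, $type, @href_description_words);
      }
  }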


-- 
Andrew Daviel
Are you always losing things?  - http://huzizit.com






