Hi Richard,
I'm not a Droids expert, but I can answer some of the more general
questions below.
On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:
2) How best to do an incremental crawl? I'm going to want to do
If-Modified-Since checks as I crawl.
Most crawlers, including Nutch, don't rely on any last-modified
data returned by servers. Why? Because it's wrong much of the time.
So you wind up having to fetch the content and (ideally) generate
and compare a "signature hash" that tries to ignore meaningless
differences such as a "number of visitors" counter.
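A minimal sketch of such a signature, assuming the page is HTML text. This is illustrative only, not Nutch's actual signature implementation: it strips tags, drops digit runs (visitor counters, timestamps), collapses whitespace, and hashes what remains.

```python
import hashlib
import re

def page_signature(html: str) -> str:
    """Fuzzy content signature that ignores volatile fragments.

    Illustrative sketch (not Nutch's implementation): strip tags,
    drop digit runs such as visitor counters, normalize whitespace,
    then hash the remaining text.
    """
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag removal
    text = re.sub(r"\d+", "", text)           # drop counters/timestamps
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two fetches that differ only in a visitor counter hash identically:
a = page_signature("<p>Welcome! Visitors: 10234</p>")
b = page_signature("<p>Welcome! Visitors: 10981</p>")
```

With a signature like this stored per URL, the crawler reprocesses a page only when the signature changes between fetches.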
Good point. For the most part, we have fairly decent systems to
check the modified time against. My biggest concern is having to
reprocess all of the PDFs and docs. Those are mostly served by
HTTPD straight from the filesystem and should have a valid
last-modified time. Furthermore, they do have a stable signature,
so your suggestion would work as well. Is the ETag reliable when
present?
From what I've seen, yes - in that it's less likely to be bogus when
present, but it's still possible for the back-end system to generate a
false-positive identical ETag.
You could use ETags plus the Content-Length from the response headers
to further improve the reliability of the "no need to download" result.
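A sketch of that combined check, assuming you recorded the ETag and Content-Length on the previous crawl and issue a HEAD request before re-fetching. The helper name and dictionary shapes are illustrative, not part of Droids or Nutch:

```python
def need_download(stored: dict, response_headers: dict) -> bool:
    """Decide whether to re-fetch, using ETag plus Content-Length.

    Illustrative helper: `stored` holds what we recorded on the last
    crawl; `response_headers` comes from a HEAD request, with header
    names lower-cased. Skip the download only when BOTH values match.
    """
    etag = response_headers.get("etag")
    length = response_headers.get("content-length")
    if etag is None or length is None:
        return True                      # not enough info: fetch to be safe
    return not (etag == stored.get("etag")
                and length == stored.get("content-length"))

stored = {"etag": '"abc123"', "content-length": "5120"}
skip = not need_download(stored, {"etag": '"abc123"',
                                  "content-length": "5120"})
refetch = need_download(stored, {"etag": '"abc123"',
                                 "content-length": "6000"})
```

Requiring both headers to match narrows the window in which a bogus ETag alone would cause you to wrongly skip a changed document.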
If the main concern is processing time (versus download time), then I'd
just generate a hash from the content bytes and use that. You will
occasionally reprocess a file that is in fact logically unchanged,
where an ETag or Last-Modified value in the response headers would
have allowed you to skip that step, but that shouldn't be a common
case.
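A minimal sketch of that byte-hash approach; the URL and helper names are hypothetical. A plain SHA-256 over the raw bytes means any byte-level change triggers reprocessing, which is exactly the conservative trade-off described above:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 over the raw content bytes (illustrative sketch)."""
    return hashlib.sha256(data).hexdigest()

seen = {}  # url -> hash recorded on the last crawl

def should_reprocess(url: str, data: bytes) -> bool:
    """Reprocess only when the downloaded bytes have changed."""
    h = content_hash(data)
    if seen.get(url) == h:
        return False          # bytes unchanged: skip re-parsing the PDF
    seen[url] = h
    return True

first = should_reprocess("http://example.edu/a.pdf", b"%PDF-1.4 ...")
second = should_reprocess("http://example.edu/a.pdf", b"%PDF-1.4 ...")
```

The first fetch of a URL always reprocesses; identical bytes on a later fetch are skipped, at the cost of storing one hash per URL.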
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g