Hi Richard,
I'm not a Droids expert, but I can answer some of the more general
questions below.
On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:
2) How best to do an incremental crawl? I'm going to want to do
If-Modified-Since checks as I crawl.
Most crawlers, including Nutch, don't rely on any last-modified
data returned by servers. Why? Because it's wrong much of the time.
So you wind up having to fetch the content and (ideally) generate
and compare a "signature hash" that tries to ignore meaningless
differences such as a "number of visitors" counter.
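A minimal sketch of such a signature, assuming the page is HTML text. This is illustrative only, not Nutch's actual signature implementation: it strips tags, drops digit runs (visitor counters, timestamps), collapses whitespace, and hashes what remains.

```python
import hashlib
import re

def page_signature(html: str) -> str:
    """Fuzzy content signature that ignores volatile fragments.

    Illustrative sketch (not Nutch's implementation): strip tags,
    drop digit runs such as visitor counters, normalize whitespace,
    then hash the remaining text.
    """
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag removal
    text = re.sub(r"\d+", "", text)           # drop counters/timestamps
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two fetches that differ only in a visitor counter hash identically:
a = page_signature("<p>Welcome! Visitors: 10234</p>")
b = page_signature("<p>Welcome! Visitors: 10981</p>")
```

With a signature like this stored per URL, the crawler reprocesses a page only when the signature changes between fetches.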
Good point. For the most part, we have fairly decent systems to
check the modified time against. My biggest concern is having to
reprocess all of the PDFs and docs. Those are mostly served by
HTTPD straight from the filesystem and should have a valid
last-modified time. Furthermore, they do have a stable signature,
so your suggestion would work as well. Is the ETag reliable when
present?
From what I've seen, yes - in that it's less likely to be bogus when
present, but it's still possible for the back-end system to generate a
false-positive identical ETag.
You could use ETags plus the Content-Length from the response headers
to further improve the reliability of the "no need to download" result.
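A sketch of that combined check, assuming you recorded the ETag and Content-Length on the previous crawl and issue a HEAD request before re-fetching. The helper name and dictionary shapes are illustrative, not part of Droids or Nutch:

```python
def need_download(stored: dict, response_headers: dict) -> bool:
    """Decide whether to re-fetch, using ETag plus Content-Length.

    Illustrative helper: `stored` holds what we recorded on the last
    crawl; `response_headers` comes from a HEAD request, with header
    names lower-cased. Skip the download only when BOTH values match.
    """
    etag = response_headers.get("etag")
    length = response_headers.get("content-length")
    if etag is None or length is None:
        return True                      # not enough info: fetch to be safe
    return not (etag == stored.get("etag")
                and length == stored.get("content-length"))

stored = {"etag": '"abc123"', "content-length": "5120"}
skip = not need_download(stored, {"etag": '"abc123"',
                                  "content-length": "5120"})
refetch = need_download(stored, {"etag": '"abc123"',
                                 "content-length": "6000"})
```

Requiring both headers to match narrows the window in which a bogus ETag alone would cause you to wrongly skip a changed document.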
If the main concern is processing time (versus download time), then I'd
just generate a hash from the content bytes and use that. You will
occasionally reprocess a file that is in fact logically unchanged,
where an ETag or Last-Modified value in the response headers would
have allowed you to skip that step, but that shouldn't be a common
case.
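A minimal sketch of that byte-hash approach; the URL and helper names are hypothetical. A plain SHA-256 over the raw bytes means any byte-level change triggers reprocessing, which is exactly the conservative trade-off described above:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 over the raw content bytes (illustrative sketch)."""
    return hashlib.sha256(data).hexdigest()

seen = {}  # url -> hash recorded on the last crawl

def should_reprocess(url: str, data: bytes) -> bool:
    """Reprocess only when the downloaded bytes have changed."""
    h = content_hash(data)
    if seen.get(url) == h:
        return False          # bytes unchanged: skip re-parsing the PDF
    seen[url] = h
    return True

first = should_reprocess("http://example.edu/a.pdf", b"%PDF-1.4 ...")
second = should_reprocess("http://example.edu/a.pdf", b"%PDF-1.4 ...")
```

The first fetch of a URL always reprocesses; identical bytes on a later fetch are skipped, at the cost of storing one hash per URL.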
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g