Thanks Ken.

On 1/12/2010 5:17 PM, Ken Krugler wrote:
Hi Richard,

I'm not a Droids expert, but I can answer some of the more general questions below.

On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:


2) How best to do an incremental crawl? I'm going to want to do If-Modified-Since checks as I crawl.

Most crawlers, including Nutch, don't rely on any last modified data returned by servers. Why? Well, it's wrong much of the time. So you wind up having to fetch the content and (ideally) generate/compare a "signature hash" that tries to ignore meaningless differences such as a "number of visitors" counter.
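The "signature hash" idea can be illustrated with a minimal Python sketch. This is not Nutch's or Droids' implementation; the function name `content_signature` and the `VOLATILE_PATTERNS` list are hypothetical, and the single visitor-counter pattern is only an assumption about what counts as a meaningless difference on a given site.

```python
import hashlib
import re

# Hypothetical patterns for page fragments that change on every fetch.
# A real crawl would need to tune these per site; this one only covers
# a "1,024 visitors" style counter.
VOLATILE_PATTERNS = [
    re.compile(r"\b\d[\d,]*\s+visitors\b", re.IGNORECASE),
]


def content_signature(text: str) -> str:
    """Return a hash of the text that ignores volatile fragments."""
    # Drop the fragments we consider meaningless.
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse whitespace and case so trivial formatting changes do not
    # alter the signature.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    old = content_signature("Welcome! 1,024 visitors so far.")
    new = content_signature("Welcome! 1,025 visitors so far.")
    print("unchanged" if old == new else "changed")
```

A crawler would store the signature with each URL and skip reprocessing when the newly fetched content produces the same value.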

Good point. For the most part, we have fairly decent systems to check the modified time against. My biggest concern is having to reprocess all of the PDFs and docs. Those are mostly served by HTTPD straight from the filesystem and should have a valid Last-Modified. Furthermore, they do have a stable signature, so your suggestion would work as well. Is the ETag reliable when it is present?
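For servers whose validators can be trusted, the modified-time check described above maps onto an HTTP conditional GET. The following is a minimal sketch using only the Python standard library, not Droids code; the helper names `conditional_headers` and `fetch_if_changed` are hypothetical, and it assumes the crawler has stored the `ETag` and `Last-Modified` values from the previous fetch. It does not settle whether a particular server's ETag is reliable.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        # Send back the Last-Modified value exactly as the server gave it.
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged".
        if err.code == 304:
            return (None, etag, last_modified)
        raise
```

When the body comes back as `None`, the PDF or document does not need to be reprocessed. Combining this with a content signature on the fetched body covers servers that return a full response even though nothing changed.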
