chad savage wrote:
> Hello All,
>
> With FTP and file crawls you can check the date of the file and match
> it against yours. HTTP does not have that luxury.
> If this is an internal site of yours, generated by a CMS or even by
> hand, I'm sure you can create a list of pages that have been updated
> since the last crawl.
> As for generic web pages in the wild, no software (that I am aware of)
> can determine whether a page has been updated without actually
> downloading it and matching it against its history.

That's not quite the case: please see the HTTP spec for the
"Last-Modified" header. It's true, however, that for dynamic pages
clients often don't get this information, and then we do indeed have to
download the page and compare its signature with the previous signature.
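
To make the two cases concrete, here is a rough sketch in Python. It is
only an illustration, not how any particular crawler does it: the function
names, the lower-cased header dict and the choice of SHA-256 as the
"signature" are all my assumptions. The header check needs only the
response headers (e.g. from a HEAD request), while the signature check
needs the downloaded body.

```python
import hashlib
from email.utils import parsedate_to_datetime


def signature(body: bytes) -> str:
    """Content fingerprint (SHA-256 is an assumption; any stable hash works)."""
    return hashlib.sha256(body).hexdigest()


def modified_by_header(headers: dict, prev_last_modified):
    """Compare Last-Modified against the value stored at the last crawl.

    headers: response headers, names assumed lower-cased by the caller.
    Returns True or False when both dates are usable, None when the
    header is missing or unparseable (typical for dynamic pages).
    """
    current = headers.get("last-modified")
    if not current or not prev_last_modified:
        return None
    try:
        return parsedate_to_datetime(current) > parsedate_to_datetime(prev_last_modified)
    except (TypeError, ValueError):
        return None  # bad date format: caller must fall back to the body


def modified_by_signature(body: bytes, prev_signature) -> bool:
    """Fallback: download the page and compare fingerprints."""
    return signature(body) != prev_signature
```

A caller would try modified_by_header() first and download the page for
modified_by_signature() only when the answer is None.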

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
