chad savage wrote:
> Hello All,
>
> With ftp and file crawls you can check the date of the file and match
> the date against yours. Http does not have that luxury.
> If this is on an internal site of yours being generated by a cms or
> even by hand, I'm sure you can create a list of pages that have been
> updated since last crawl.
> As for generic web page in the wild, No software (that I am aware of)
> can determine if a page has been updated without actually downloading
> it and matching it against its history.

That's not quite the case. Please see the HTTP specification for the "Last-Modified" header. However, it's true that for dynamic pages clients often don't get this information, and then we do have to download the page and compare its signature to the previous signature.

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
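
To illustrate the two cases in the reply above, here is a minimal sketch in Python. It is not how Nutch implements this. The helper names (conditional_headers, signature, page_changed) are made up for the example, and SHA-256 is an arbitrary choice of signature. The idea: send the time of the last fetch in an "If-Modified-Since" request header; a server that tracks "Last-Modified" can answer 304 Not Modified without a body, and otherwise we fall back to comparing content signatures.

```python
import hashlib
from email.utils import formatdate
from typing import Dict, Optional, Tuple


def conditional_headers(last_fetch_epoch: Optional[float]) -> Dict[str, str]:
    """Request headers for a conditional GET (hypothetical helper).

    last_fetch_epoch is the Unix time of the previous fetch, or None
    if the page has never been fetched.
    """
    if last_fetch_epoch is None:
        return {}
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}


def signature(body: bytes) -> str:
    """Content signature; SHA-256 is an assumption, any stable hash works."""
    return hashlib.sha256(body).hexdigest()


def page_changed(
    status: int, body: bytes, previous_signature: Optional[str]
) -> Tuple[bool, Optional[str]]:
    """Decide whether a page changed, given the response to a conditional GET.

    Returns (changed, signature_to_store).
    """
    if status == 304:
        # The server confirmed the page is unmodified; nothing was downloaded.
        return False, previous_signature
    # No usable Last-Modified information (typical for dynamic pages):
    # the body was downloaded anyway, so compare it with the stored signature.
    new_signature = signature(body)
    return new_signature != previous_signature, new_signature
```

The network call itself is left out on purpose. With the standard library you would pass conditional_headers(...) to urllib.request.Request; note that urlopen reports a 304 as an HTTPError whose code is 304, so the status has to be read from the exception in that case.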
