Thanks for the reply. That makes things a lot clearer :)

Karen

----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, June 16, 2005 7:01 PM
Subject: Re: Track changes to pages between crawls?

> Karen Church wrote:
> > Hi all,
> >
> > I'm interested in tracking changes to pages between crawls. I want
> > to be able to log new pages added since the last crawl, updates to
> > existing pages as well as any pages that have been removed. I think
> > I can determine if a page has been updated by comparing the MD5 hash
> > of the two pages.
> >
> > In looking at the code, it appears that it's the 'Content' of the
> > Page that is hashed - so if I want to compare two pages using this
> > technique, I'm essentially comparing the 'Content' of the two pages.
> > My question is - does this mean that additional changes to a page
> > cannot be tracked. For example - changes to tags, meta-data, etc?
>
> They are all tracked, because for every change in the content the md5 is
> changed. But they are not tracked separately.
>
> > What exactly constitutes 'Content' within Nutch? I understand that
>
> I assume you ask about the semantics of byte[] Content.getContent(),
> right? This content is the protocol payload. In case of HTTP, this is
> the response body as stream. In case of FTP, this is the file content;
> etc...
>
> > I think I could compare other page attributes like the title or
> > meta-data of the page using the ParseData class but I'm a little
> > apprehensive that I'll still be missing out on other changes to the
> > page.
>
> Hence the md5 checksum, which is calculated from the whole byte[] content.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
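
For readers of the archive, the approach discussed in this thread (hash the whole byte[] payload, then compare hashes between two crawls) could be sketched roughly as below. This is only an illustration under stated assumptions: the class CrawlDiff and its methods are hypothetical names, not part of Nutch, and each crawl is assumed to be already available as an in-memory map from URL to the MD5 hex string of its content. How those maps are filled from Nutch's own data structures is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not a Nutch class.
public class CrawlDiff {

    // Hex-encoded MD5 of the raw payload bytes (e.g. an HTTP response body).
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // URLs present in the new crawl but not in the old one.
    static Set<String> added(Map<String, String> oldCrawl, Map<String, String> newCrawl) {
        Set<String> result = new TreeSet<>(newCrawl.keySet());
        result.removeAll(oldCrawl.keySet());
        return result;
    }

    // URLs present in the old crawl but missing from the new one.
    static Set<String> removed(Map<String, String> oldCrawl, Map<String, String> newCrawl) {
        Set<String> result = new TreeSet<>(oldCrawl.keySet());
        result.removeAll(newCrawl.keySet());
        return result;
    }

    // URLs present in both crawls whose content hash differs.
    static Set<String> updated(Map<String, String> oldCrawl, Map<String, String> newCrawl) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : newCrawl.entrySet()) {
            String oldHash = oldCrawl.get(e.getKey());
            if (oldHash != null && !oldHash.equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up URLs and page bodies, purely for demonstration.
        Map<String, String> oldCrawl = new HashMap<>();
        oldCrawl.put("http://example.com/a", md5Hex("<p>one</p>".getBytes(StandardCharsets.UTF_8)));
        oldCrawl.put("http://example.com/b", md5Hex("<p>two</p>".getBytes(StandardCharsets.UTF_8)));

        Map<String, String> newCrawl = new HashMap<>();
        newCrawl.put("http://example.com/a", md5Hex("<p>one, edited</p>".getBytes(StandardCharsets.UTF_8)));
        newCrawl.put("http://example.com/c", md5Hex("<p>three</p>".getBytes(StandardCharsets.UTF_8)));

        System.out.println("added:   " + added(oldCrawl, newCrawl));
        System.out.println("updated: " + updated(oldCrawl, newCrawl));
        System.out.println("removed: " + removed(oldCrawl, newCrawl));
    }
}
```

As noted in the reply above, a hash over the whole payload reports that something changed but not what changed; telling a title edit from a body edit would need a separate comparison of parsed fields.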
