Thanks for the reply.  That makes things a lot clearer : )

Karen

----- Original Message ----- 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, June 16, 2005 7:01 PM
Subject: Re: Track changes to pages between crawls?


> Karen Church wrote:
> > Hi all,
> >
> > I'm interested in tracking changes to pages between crawls.  I want
> > to be able to log new pages added since the last crawl, updates to
> > existing pages as well as any pages that have been removed.  I think
> > I can determine if a page has been updated by comparing the MD5 hash
> > of the two pages.
> >
> > In looking at the code, it appears that it's the 'Content' of the
> > Page that is hashed - so if I want to compare two pages using this
> > technique, I'm essentially comparing the 'Content' of the two pages.
> > My question is: does this mean that additional changes to a page
> > cannot be tracked?  For example, changes to tags, meta-data, etc.?
>
> They are all tracked, because any change in the content changes the
> md5. But they are not tracked separately.
>
> > What exactly constitutes 'Content' within Nutch?  I understand that
>
> I assume you are asking about the semantics of byte[]
> Content.getContent(), right? This content is the protocol payload. In
> the case of HTTP, this is the response body; in the case of FTP, this
> is the file content; etc...
>
> > I think I could compare other page attributes like the title or
> > meta-data of the page using the ParseData class but I'm a little
> > apprehensive that I'll still be missing out on other changes to the
> > page.
>
> Hence the md5 checksum, which is calculated from the whole byte[] content.
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
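
For reference, the comparison discussed in this thread can be sketched roughly
as follows. This is only an illustration under stated assumptions, not Nutch
code: the class CrawlDiff and its methods md5Hex and diff are hypothetical
names, and the sketch assumes each crawl has already been reduced to a map from
URL to the MD5 of that page's raw content bytes (the byte[] payload described
above). It reports pages that are new, changed, or gone, but, as noted above,
it cannot say which part of a page changed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch: compares two crawl snapshots,
// each given as a map from URL to the hex MD5 of the page's raw bytes.
final class CrawlDiff {
    final Set<String> added = new TreeSet<>();    // URLs only in the new crawl
    final Set<String> updated = new TreeSet<>();  // URLs in both, hash differs
    final Set<String> removed = new TreeSet<>();  // URLs only in the old crawl

    // Hex-encoded MD5 of the whole payload (e.g. an HTTP response body).
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Classifies every URL seen in either snapshot.
    static CrawlDiff diff(Map<String, String> previous, Map<String, String> current) {
        CrawlDiff d = new CrawlDiff();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String oldHash = previous.get(e.getKey());
            if (oldHash == null) {
                d.added.add(e.getKey());
            } else if (!oldHash.equals(e.getValue())) {
                d.updated.add(e.getKey());
            }
        }
        for (String url : previous.keySet()) {
            if (!current.containsKey(url)) {
                d.removed.add(url);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Made-up URLs and page bodies, purely for illustration.
        Map<String, String> previous = new TreeMap<>();
        previous.put("http://example.com/a", md5Hex("old a".getBytes(StandardCharsets.UTF_8)));
        previous.put("http://example.com/b", md5Hex("same b".getBytes(StandardCharsets.UTF_8)));

        Map<String, String> current = new TreeMap<>();
        current.put("http://example.com/b", md5Hex("same b".getBytes(StandardCharsets.UTF_8)));
        current.put("http://example.com/c", md5Hex("new c".getBytes(StandardCharsets.UTF_8)));

        CrawlDiff d = diff(previous, current);
        System.out.println("added:   " + d.added);
        System.out.println("updated: " + d.updated);
        System.out.println("removed: " + d.removed);
    }
}
```

Because the hash covers the whole payload, a change to a tag or to meta-data
marks the page as updated just like a change to the visible text would. Telling
those cases apart would need a separate comparison of the parsed fields.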



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
