One way to do that is to keep track (in a disk file, for example) of
already-seen URLs and content hashes, and to check every scraped item
against them in an item pipeline [1], dropping [2] any item that has
already been seen.

[1] http://doc.scrapy.org/en/latest/topics/item-pipeline.html
[2]
http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.DropItem
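
Here's a minimal sketch of such a pipeline. The item field names ('url',
'body') and the name of the on-disk state file are just assumptions for
illustration; adapt them to your own items:

import hashlib
import os

from scrapy.exceptions import DropItem


class SeenItemsPipeline(object):

    seen_file = 'seen_hashes.txt'  # hypothetical on-disk store of seen hashes

    def open_spider(self, spider):
        # load hashes persisted by previous runs, if any
        self.seen = set()
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.seen = set(line.strip() for line in f)

    def process_item(self, item, spider):
        # hash whatever identifies the document for you (assumed fields here)
        fingerprint = hashlib.sha1(
            (item['url'] + item['body']).encode('utf-8')).hexdigest()
        if fingerprint in self.seen:
            raise DropItem('already seen: %s' % item['url'])
        self.seen.add(fingerprint)
        with open(self.seen_file, 'a') as f:
            f.write(fingerprint + '\n')
        return item

Enable it through the ITEM_PIPELINES setting in your project's settings.py,
as described in [1].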


On Tue, Mar 11, 2014 at 6:53 AM, Sayth Renshaw <[email protected]> wrote:

> Hi
>
> Having completed and toyed with the tutorial, I have something I don't
> understand. What happens when my base URL features links and content that
> change daily?
> I don't want all the data, only specific documents when they are updated
> on the page.
>
> From the base URL, the path to the link for the page I want to scrape is
> body/div/div/div/div/table/tbody/tr/td/p/a.
> So I want to navigate down that path and get the State and location
> details when they update. Will Scrapy allow me to do that, or do I need to
> employ something like Mechanize (https://pypi.python.org/pypi/mechanize/)?
>
> Sayth
>
