One way to do that is to keep track (in a disk file, for example) of the URLs and content hashes you have already seen, and check every scraped item against them in an item pipeline [1], dropping [2] any item that was seen before.
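For example, a pipeline roughly along these lines (an untested sketch; it assumes your items have 'url' and 'body' fields and that a flat file such as 'seen_hashes.txt' is good enough for persistence, so adjust names and storage to your project):

    import hashlib

    from scrapy.exceptions import DropItem


    class DedupePipeline(object):
        """Drop items whose url+content hash was already seen in a previous run."""

        SEEN_FILE = 'seen_hashes.txt'  # hypothetical path used to persist hashes between runs

        def open_spider(self, spider):
            # load previously seen fingerprints, if any
            try:
                with open(self.SEEN_FILE) as f:
                    self.seen = set(line.strip() for line in f)
            except IOError:
                self.seen = set()

        def close_spider(self, spider):
            # persist the fingerprints for the next run
            with open(self.SEEN_FILE, 'w') as f:
                f.write('\n'.join(self.seen))

        def process_item(self, item, spider):
            # fingerprint the item by url + body; anything stable works here
            fingerprint = hashlib.sha1(
                (item['url'] + item['body']).encode('utf-8')).hexdigest()
            if fingerprint in self.seen:
                raise DropItem('Already seen: %s' % item['url'])
            self.seen.add(fingerprint)
            return item

You would then enable it through the ITEM_PIPELINES setting as usual, e.g. ITEM_PIPELINES = {'myproject.pipelines.DedupePipeline': 300} (with 'myproject.pipelines' standing in for wherever you put the class).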
[1] http://doc.scrapy.org/en/latest/topics/item-pipeline.html
[2] http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.DropItem

On Tue, Mar 11, 2014 at 6:53 AM, Sayth Renshaw <[email protected]> wrote:
> Hi
>
> Having completed and toyed with the tutorial I have something I don't
> understand. What happens when my base url features links and content that
> change daily?
> I don't want all the data, only specific documents when they update to the
> page.
>
> From the base url, the path to the link across to the page I want to scrape is
> body/div/div/div/div/table/tbody/tr/td/p/a.
> So I want to navigate down that path to get State and location details when
> they update. So will Scrapy allow me to do that, or do I need to employ
> something like Mechanize https://pypi.python.org/pypi/mechanize/?
>
> Sayth
