Hello,

This is quite a common requirement. One good approach is to use a 
lightweight, semi-persistent store, typically Redis. If your items are 
large, compute a hash of each one and store it with "SET <id> <hash>"; on 
the next run you compare the stored hash instead of the full record. With a 
Twisted asynchronous Redis client you pay only a slight increase in latency 
and no noticeable drop in throughput. You can use this code as a starting 
point: 
https://github.com/scalingexcellence/scrapybook/blob/master/ch09/properties/properties/pipelines/redis.py
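
In case it helps, here is a minimal sketch of such a pipeline, assuming the 
txredisapi Twisted client and that every item carries a unique 'id' field. 
REDIS_HOST and REDIS_PORT below are illustrative setting names of my own, 
not built-in Scrapy settings:

import hashlib
import json

import txredisapi
from twisted.internet import defer

from scrapy.exceptions import DropItem


class RedisDedupPipeline(object):
    """Drop items whose content hash matches what Redis already holds."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.connection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Illustrative settings; put them in settings.py or rely on defaults.
        return cls(crawler.settings.get('REDIS_HOST', 'localhost'),
                   crawler.settings.getint('REDIS_PORT', 6379))

    @defer.inlineCallbacks
    def open_spider(self, spider):
        # txredisapi.Connection() returns a deferred firing with the client.
        self.connection = yield txredisapi.Connection(self.host, self.port)

    def close_spider(self, spider):
        return self.connection.disconnect()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # Hash the serialized item so large records compare cheaply.
        digest = hashlib.sha1(
            json.dumps(dict(item), sort_keys=True).encode('utf-8')
        ).hexdigest()
        key = 'item:%s' % item['id']
        old = yield self.connection.get(key)
        if old == digest:
            raise DropItem('No update for item %s' % item['id'])
        yield self.connection.set(key, digest)
        defer.returnValue(item)

Raising DropItem inside inlineCallbacks fails the deferred, which Scrapy 
treats exactly like a synchronous drop, and because the GET/SET round trips 
yield back to the reactor, other items keep flowing while Redis answers.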
 
 
Cheers,
Dimitris

 
Date: Tue, 14 Jun 2016 01:41:59 -0700
From: [email protected]
To: [email protected]
Subject: Drop scraped data if it matches all previous data

Hello, 
I scrape sites and I want to drop the scraped data when there is no update. 
Fortunately I have a unique ID per scraped record, so I could use this ID 
field to check whether the data has changed. 
I run scrapy crawl from a crontab, so every scrape starts a fresh process; 
holding the previously scraped data in memory with Python code therefore 
wouldn't work. 
I don't think this is possible with item pipelines alone? One solution is to 
post everything to a database and then use an item pipeline to look up each 
record by its unique ID, compare it with the newly scraped data, and drop 
the item if nothing has changed. 
Thanks for the help, 
Cheers




