I am using Scrapy to crawl news articles (blogs) from a fixed URL. Each article is an item on the page, with fields such as date, heading, and description.
I am storing the items in a CSV file with the following command:

scrapy crawl NewsBlog -o Items_File.csv -t csv

The page is refreshed every few hours with new articles. If I rerun the command above, I get duplicates in the CSV, because the entire set of items currently on the page is appended to the file each time. I don't want duplicates in the CSV.

I am not using a pipeline to filter duplicates, since there are no duplicates within the blog page itself. I am also not deleting the CSV before each run, because old articles eventually disappear from the blog page. It is fair to assume that the combination of date and heading is unique across items.

Please suggest a viable way to store only the unique items in the CSV.
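To illustrate the kind of cross-run check I have in mind (not something I have working), here is a rough sketch of a pipeline that looks at the (date, heading) pairs already present in the CSV and drops items that were exported in an earlier run. The field names date and heading and the file name Items_File.csv are just taken from my setup above:

```python
import csv
import os

from scrapy.exceptions import DropItem


class DuplicateBlogFilterPipeline:
    """Drop items whose (date, heading) pair is already in the exported CSV."""

    csv_path = "Items_File.csv"

    def open_spider(self, spider):
        # Remember every (date, heading) pair exported in earlier runs.
        self.seen = set()
        if os.path.exists(self.csv_path):
            with open(self.csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    self.seen.add((row.get("date"), row.get("heading")))

    def process_item(self, item, spider):
        key = (item.get("date"), item.get("heading"))
        if key in self.seen:
            # Scrapy will not export items that raise DropItem.
            raise DropItem("already exported: %s" % (key,))
        self.seen.add(key)
        return item
```

This would be enabled via the ITEM_PIPELINES setting, e.g. {"myproject.pipelines.DuplicateBlogFilterPipeline": 300}, where myproject is just a placeholder for the actual project name. Is something along these lines the right approach, or is there a better way?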
