I am using Scrapy to crawl news articles (blogs) from a fixed URL. Each article is an item on the page, with fields such as date, heading, and description.
I am storing the items in a CSV file with the following command:

scrapy crawl NewsBlog -o Items_File.csv -t csv

The page is refreshed every few hours with new articles. If I rerun the command above, I get duplicates in the CSV, because the entire set of items currently on the page is appended to the file each time. I don't want duplicates in the CSV.

I am not using a pipeline to filter duplicates, since there are no duplicates within the blog page itself. I am also not deleting the CSV before each run, because old articles eventually disappear from the blog page. It is fair to assume that the combination of date and heading is unique across items.

Please suggest a viable way to store only the unique items in the CSV.
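To illustrate the kind of cross-run check I have in mind (not something I have working), here is a rough sketch of a pipeline that looks at the (date, heading) pairs already present in the CSV and drops items that were exported in an earlier run. The field names date and heading and the file name Items_File.csv are just taken from my setup above:

```python
import csv
import os

from scrapy.exceptions import DropItem


class DuplicateBlogFilterPipeline:
    """Drop items whose (date, heading) pair is already in the exported CSV."""

    csv_path = "Items_File.csv"

    def open_spider(self, spider):
        # Remember every (date, heading) pair exported in earlier runs.
        self.seen = set()
        if os.path.exists(self.csv_path):
            with open(self.csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    self.seen.add((row.get("date"), row.get("heading")))

    def process_item(self, item, spider):
        key = (item.get("date"), item.get("heading"))
        if key in self.seen:
            # Scrapy will not export items that raise DropItem.
            raise DropItem("already exported: %s" % (key,))
        self.seen.add(key)
        return item
```

This would be enabled via the ITEM_PIPELINES setting, e.g. {"myproject.pipelines.DuplicateBlogFilterPipeline": 300}, where myproject is just a placeholder for the actual project name. Is something along these lines the right approach, or is there a better way?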
