I have a separate file for the DB in the same folder as settings.py. The file handles the DB overhead and contains loose methods like 'saveItem' or 'itemExists' and whatnot. I call it from the pipeline: if not itemExists, saveItem.
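Roughly like this -- a minimal sketch, not my exact code (the sqlite3 backend, the table layout, and the 'url'/'title' fields are just placeholders):

    # db.py -- lives next to settings.py; owns the connection and the helpers
    import sqlite3

    _conn = sqlite3.connect("items.db")
    _conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)")
    _conn.commit()

    def itemExists(url):
        """True if this URL is already stored."""
        row = _conn.execute("SELECT 1 FROM items WHERE url = ?", (url,)).fetchone()
        return row is not None

    def saveItem(item):
        """Insert one scraped item."""
        _conn.execute("INSERT INTO items (url, title) VALUES (?, ?)",
                      (item["url"], item["title"]))
        _conn.commit()

    # pipelines.py -- the spider never touches the DB; this pipeline does
    from scrapy.exceptions import DropItem
    import db

    class DedupePipeline(object):
        def process_item(self, item, spider):
            if not db.itemExists(item["url"]):
                db.saveItem(item)
                return item
            raise DropItem("already stored: %s" % item["url"])

Enable it in settings.py with something like ITEM_PIPELINES = {"myproject.pipelines.DedupePipeline": 300} ("myproject" being your project name). That's also my answer to where the SQL insert should go: in the pipeline, not in the spider's parse function.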
This way the spider just happily crawls along and the pipeline deals with whatever items it's producing.

On Mon, May 5, 2014 at 1:12 PM, Jaspreet Singh <[email protected]> wrote:
> Thanks for your reply, Bill.
>
> I will go with the good option. Where should I place the SQL insert? I am
> thinking of placing it inside the parse function of the spider.
>
> On Monday, May 5, 2014 10:01:59 PM UTC+5:30, Bill Ebeling wrote:
>>
>> Good option: sounds like a case for a database.
>>
>> Very bad option: the only other option I can think of is storing a hash
>> of the URLs in a flat file, then reading that file back in and checking
>> whether a hash of the current URL is in that list; if not, save it and
>> add the URL to the list. This leads to many other problems.
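For reference, the flat-file variant from the quote above would be something like this (a sketch only; the file name and the md5 choice are illustrative), with comments pointing at a couple of the "many other problems":

    # NOT recommended -- the "very bad option" described above
    import hashlib

    SEEN_FILE = "seen_urls.txt"   # made-up name

    def url_seen(url):
        """True if this URL's hash is already in the flat file; otherwise record it."""
        h = hashlib.md5(url.encode("utf-8")).hexdigest()
        try:
            with open(SEEN_FILE) as f:
                for line in f:             # rescans the whole file on every check
                    if line.strip() == h:
                        return True
        except IOError:                    # first run: no file yet
            pass
        with open(SEEN_FILE, "a") as f:    # no locking; concurrent runs will clash
            f.write(h + "\n")
        return False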
