Just define your own start_requests. Since, as you describe the issue, you already know the URLs when the spider starts, a custom DupeFilter might also be what you need.
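
For the case described in the original post, here is a minimal sketch of that idea (the item_exists() helper is a placeholder standing in for your real database lookup, and the start URLs and item fields are made up for illustration):

import scrapy


class UpdateSpider(scrapy.Spider):
    """Sketch: stop following links from a page as soon as the item
    it yields turns out to already exist in the database."""
    name = "update"
    start_urls = ["http://example.com/page1", "http://example.com/page2"]

    def item_exists(self, item):
        # Placeholder for your real database lookup, e.g. a SELECT
        # against the table this spider populates.
        return False

    def start_requests(self):
        for url in self.start_urls:
            # If you already know which start URLs are finished,
            # filter them out here before any request is scheduled.
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        item = {"url": response.url,
                "title": response.css("title::text").get()}
        if self.item_exists(item):
            # Duplicate found: return without yielding anything, so
            # no further links from this URL are followed.
            return
        yield item
        # Only keep following links while the page still yields new items.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

The key point is the early return in parse: once the item is known to be a duplicate, no further Requests are yielded for that page, so the spider simply moves on to the remaining scheduled URLs.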
On Wednesday, August 20, 2014 at 07:50:50 UTC-3, tim feirg wrote:
>
> I'm crawling through some 20 webpages to keep my database updated. I want
> my spider to ignore a URL completely once it finds that the item it just
> returned already exists in the database, so that it doesn't follow any
> other links from this URL and just moves on to those which still contain
> new items.
>
> It seems fairly easy, but I haven't found any smart way to do it. Can
> anybody help? Thanks :)
