I have a simple crawler that makes a list of every link on my site. I have been feeding this list into a Perl script that purges the cache for each link, but I would like to have Scrapy do it instead. How can I modify this crawler so that it hits each link a second time?
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['nydvreports1.example.com']
    start_urls = ['http://nydvreports1.example.com/perl/globe_3xx.pl']

    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True),)

    def parse_url(self, response):
        print("Visiting %s" % response.url)
        item = MyItem()
        item['url'] = response.url
        return item
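
One approach I could imagine (an untested sketch, not something I have running yet): Scrapy's duplicate filter normally drops a URL that has already been scheduled, but a Request created with dont_filter=True bypasses that filter. So parse_url could yield the item and then re-request the same URL once, using a meta flag to stop after the second visit (the 'second_visit' name below is my own, not a Scrapy built-in):

    from scrapy import Request

    def parse_url(self, response):
        print("Visiting %s" % response.url)
        item = MyItem()
        item['url'] = response.url
        yield item
        # Only re-request on the first visit; dont_filter=True lets the
        # scheduler fetch the same URL again instead of discarding it.
        if not response.meta.get('second_visit'):
            yield Request(response.url,
                          callback=self.parse_url,
                          dont_filter=True,
                          meta={'second_visit': True})

If the second hit only needs to purge the cache and not parse the page again, the repeat request could presumably point at a separate callback that ignores the response body. Is something like this the right way to do it, or is there a better mechanism?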
