I have a Scrapy spider that keeps visiting and scraping a single link. It never stops; it just keeps scraping the same link over and over, which shouldn't happen, since the documentation says a duplicate-URL filter is already built in.
When I checked under scrapy/spider.py I could see that dont_filter was set to True. So I changed it to False, but it didn't help:

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=False)

My code is as follows. Where could I be going wrong? start_urls contains only one link, to a page a.html, and the spider keeps scraping a.html recursively.

================================

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from kt.items import DmozItem

    class DmozSpider(CrawlSpider):
        name = "dmoz"
        allowed_domains = ["datacaredubai.com"]
        start_urls = ["http://www.datacaredubai.com/aj/link.html"]
        rules = (
            # note: allow must be a tuple ('/aj',) -- ('/aj') is just a string,
            # and unique expects a boolean, not the string 'Yes'
            Rule(SgmlLinkExtractor(allow=('/aj',), unique=True), callback='parse_item'),
        )

        def parse_item(self, response):
            sel = Selector(response)
            sites = sel.xpath('//*')
            items = []
            for site in sites:
                item = DmozItem()
                item['overview'] = site.xpath('//*[@id="overview"]/div/div[1]/div/div/div/dl[1]/dd').extract()
                item['specs'] = site.xpath('//*[@id="specs"]/div/div[1]/div/div/dl/dd[1]').extract()
                item['title'] = site.xpath('/html/head/meta[3]').extract()
                item['full'] = site.xpath('//*[@id="overview"]//dd').extract()
                item['req_url'] = response.url
                items.append(item)
            return items
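For reference, Scrapy's built-in duplicate filter behaves roughly like a seen-set of request fingerprints, and dont_filter=True bypasses it entirely, so such a request can be fetched again and again. A minimal standalone sketch of that idea in plain Python (an illustration only, not Scrapy's actual implementation):

```python
# Illustration of duplicate-URL filtering, similar in spirit to Scrapy's
# built-in dupe filter. Not the real implementation; Scrapy fingerprints
# the whole request, not just the URL string.
seen = set()

def should_crawl(url, dont_filter=False):
    """Return True if the URL should be fetched.

    dont_filter=True skips the seen-set check, which is why a request
    created with dont_filter=True is never dropped as a duplicate.
    """
    if dont_filter:
        return True
    if url in seen:
        return False
    seen.add(url)
    return True

print(should_crawl("http://www.datacaredubai.com/aj/a.html"))                    # first visit: True
print(should_crawl("http://www.datacaredubai.com/aj/a.html"))                    # duplicate: False
print(should_crawl("http://www.datacaredubai.com/aj/a.html", dont_filter=True))  # bypassed: True
```

Note that patching the installed scrapy/spider.py is usually not necessary: dont_filter is a per-Request argument, so it can be set when constructing the Request in your own spider code instead.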