I have a Scrapy spider that keeps visiting and scraping a single link. It never stops; it just keeps scraping the same link over and over, which shouldn't happen, since the documentation says a duplicate-URL filter is already built in.
When I checked under scrapy/spider.py I could see that dont_filter was set to True. So I changed it to False, but it didn't help:

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=False)

My code is as follows. Where could I be going wrong? start_urls contains only one link, to a page a.html, and the spider keeps scraping a.html recursively.

================================

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from kt.items import DmozItem

    class DmozSpider(CrawlSpider):
        name = "dmoz"
        allowed_domains = ["datacaredubai.com"]
        start_urls = ["http://www.datacaredubai.com/aj/link.html"]
        rules = (
            # note: allow must be a tuple ('/aj',) -- ('/aj') is just a string,
            # and unique expects a boolean, not the string 'Yes'
            Rule(SgmlLinkExtractor(allow=('/aj',), unique=True), callback='parse_item'),
        )

        def parse_item(self, response):
            sel = Selector(response)
            sites = sel.xpath('//*')
            items = []
            for site in sites:
                item = DmozItem()
                item['overview'] = site.xpath('//*[@id="overview"]/div/div[1]/div/div/div/dl[1]/dd').extract()
                item['specs'] = site.xpath('//*[@id="specs"]/div/div[1]/div/div/dl/dd[1]').extract()
                item['title'] = site.xpath('/html/head/meta[3]').extract()
                item['full'] = site.xpath('//*[@id="overview"]//dd').extract()
                item['req_url'] = response.url
                items.append(item)
            return items
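For reference, Scrapy's built-in duplicate filter behaves roughly like a seen-set of request fingerprints, and dont_filter=True bypasses it entirely, so such a request can be fetched again and again. A minimal standalone sketch of that idea in plain Python (an illustration only, not Scrapy's actual implementation):

```python
# Illustration of duplicate-URL filtering, similar in spirit to Scrapy's
# built-in dupe filter. Not the real implementation; Scrapy fingerprints
# the whole request, not just the URL string.
seen = set()

def should_crawl(url, dont_filter=False):
    """Return True if the URL should be fetched.

    dont_filter=True skips the seen-set check, which is why a request
    created with dont_filter=True is never dropped as a duplicate.
    """
    if dont_filter:
        return True
    if url in seen:
        return False
    seen.add(url)
    return True

print(should_crawl("http://www.datacaredubai.com/aj/a.html"))                    # first visit: True
print(should_crawl("http://www.datacaredubai.com/aj/a.html"))                    # duplicate: False
print(should_crawl("http://www.datacaredubai.com/aj/a.html", dont_filter=True))  # bypassed: True
```

Note that patching the installed scrapy/spider.py is usually not necessary: dont_filter is a per-Request argument, so it can be set when constructing the Request in your own spider code instead.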