Hey,

I am working on parsing this site: <http://www.fabfoto.co.in/>.
I am using this rule: <http://pastebin.com/HiEVps8q>. The rule works fine and Scrapy crawls all the links. Sample: <http://pastebin.com/px3JHCFr>. However, on this website a single web page contains multiple items, and the id of each one is different, so I am not able to extract the image source, price and name of each product. If the id for one product is ctl00_ContentPlaceHolder1_ASPxDataView1_IT*6*_Label1, then the id for another product on the same page is ctl00_ContentPlaceHolder1_ASPxDataView1_IT*7*_Label1. I have attached my spider to this post. Please don't take the XPaths seriously. How can I extract the desired data from such websites?

Thanks,
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from fabfoto.items import FabfotoItem
import re


class FFSpider(CrawlSpider):
    name = "ff"
    allowed_domains = ["fabfoto.co.in"]
    start_urls = ["http://www.fabfoto.co.in/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r".*NewItems.*\.aspx", ), unique=False), callback='parse_item', follow=True),
        # Rule(SgmlLinkExtractor(allow=(r"scl-.*\.htm", )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        print(response.url)
        items = []
        sites = sel.xpath('//ul/li')
        for site in sites:
            item = FabfotoItem()
            # item['category'] = site.xpath('//div[@class="breadcrumb"]/a/text()').extract()
            item['source_website'] = "fabfoto.co.in"
            # Query relative to the current <li> (leading "."), not the whole page
            title = site.xpath('.//td/span/text()').extract()
            img_url = site.xpath('.//td/span/img/@src').extract()
            price = str(site.xpath('.//td/span/text()').extract())
            # price = re.findall(r'\d+', price)
            print(title)
            print(img_url)
            print(price)
            # Assumes FabfotoItem declares a "title" field; the check below reads it
            item['title'] = title
            item['price'] = price
            if item['title']:
                items.append(item)
        return items
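
One direction that might work, going only by the two ids quoted above: if the number after "IT" is the only part that changes, the ids can be matched by pattern and the fields grouped by that number. The sketch below uses only the Python 3 standard library, so it can be tried without Scrapy. The sample HTML, the "Label2" suffix, and the names ID_PATTERN, ProductIdParser and extract_products are all made up for illustration; only the "..._IT<n>_Label1" shape comes from the real page. Inside a Scrapy callback the same idea could probably be written as an XPath such as //span[contains(@id, "_ASPxDataView1_IT")], but I have not tested that against the site.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: ids look like "<prefix>_IT<number>_<field>", as in the two ids
# quoted in the post. Only "Label1" is known to exist on the real page.
ID_PATTERN = re.compile(r"^ctl00_ContentPlaceHolder1_ASPxDataView1_IT(\d+)_(\w+)$")

# Hypothetical markup for illustration; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><td>
    <span id="ctl00_ContentPlaceHolder1_ASPxDataView1_IT6_Label1">Camera A</span>
    <span id="ctl00_ContentPlaceHolder1_ASPxDataView1_IT6_Label2">Rs. 1200</span>
  </td></tr>
  <tr><td>
    <span id="ctl00_ContentPlaceHolder1_ASPxDataView1_IT7_Label1">Camera B</span>
    <span id="ctl00_ContentPlaceHolder1_ASPxDataView1_IT7_Label2">Rs. 3400</span>
  </td></tr>
</table>
"""


class ProductIdParser(HTMLParser):
    """Collects text and image sources, grouped by the index found in each id."""

    def __init__(self):
        super().__init__()
        self.products = defaultdict(dict)
        self._current = None  # (index, field) of the element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        match = ID_PATTERN.match(attrs.get("id") or "")
        if not match:
            return
        index, field = int(match.group(1)), match.group(2)
        if tag == "img":
            # Images carry their value in the src attribute, not in text
            self.products[index][field] = attrs.get("src")
        else:
            self._current = (index, field)

    def handle_data(self, data):
        # Store the first non-blank text found inside a matched element
        if self._current and data.strip():
            index, field = self._current
            self.products[index][field] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def extract_products(html):
    """Returns one dict per product, ordered by the index in the id."""
    parser = ProductIdParser()
    parser.feed(html)
    return [parser.products[i] for i in sorted(parser.products)]


if __name__ == "__main__":
    for product in extract_products(SAMPLE_HTML):
        print(product)
```

Grouping by the index rather than hard-coding each id means a page with any number of products is handled the same way, as long as the id pattern holds.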
