Hey,

I am working on parsing this site: <http://www.fabfoto.co.in/>.

I am using this rule: <http://pastebin.com/HiEVps8q>. The rule is working 
fine and Scrapy is crawling all the links. Sample: 
<http://pastebin.com/px3JHCFr>.

But on this website a single web page contains multiple items, and the id 
of each one is different, so I am not able to extract the image source, 
price and name of each item.
If the id for one product is 
ctl00_ContentPlaceHolder1_ASPxDataView1_IT*6*_Label1, 
then for another product on the same page the id is 
ctl00_ContentPlaceHolder1_ASPxDataView1_IT*7*_Label1. I have attached my 
spider to this post. Please don't take the XPaths seriously.
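
To make the id pattern concrete, here is a minimal sketch using only the 
Python standard library. It assumes the asterisks in the ids above are just 
emphasis, so the real ids look like ..._IT6_Label1. It runs on made-up 
markup, and SAMPLE, LABEL_ID and product_labels are hypothetical names that 
come neither from the site nor from my spider:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup that only imitates the id pattern; it is not the real page.
SAMPLE = """
<table>
  <tr><td><span id="ctl00_ContentPlaceHolder1_ASPxDataView1_IT6_Label1">Camera A</span></td></tr>
  <tr><td><span id="ctl00_ContentPlaceHolder1_ASPxDataView1_IT7_Label1">Camera B</span></td></tr>
  <tr><td><span id="ctl00_ContentPlaceHolder1_Footer">Not a product</span></td></tr>
</table>
"""

# Assumption: only the number after "IT" changes from one product to the next.
LABEL_ID = re.compile(r"^ctl00_ContentPlaceHolder1_ASPxDataView1_IT(\d+)_Label1$")


def product_labels(markup):
    """Return (index, text) for every element whose id fits the pattern."""
    root = ET.fromstring(markup.strip())
    found = []
    for element in root.iter():
        match = LABEL_ID.match(element.get("id", ""))
        if match:
            found.append((int(match.group(1)), (element.text or "").strip()))
    return found


print(product_labels(SAMPLE))
```

In a Scrapy selector the same match could presumably be written as an 
XPath 1.0 predicate such as 
//*[starts-with(@id, "ctl00_ContentPlaceHolder1_ASPxDataView1_IT") and contains(@id, "_Label1")], 
but I have not tested that against the real page.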

How can I extract the desired data from such websites?

Thanks,

from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from fabfoto.items import FabfotoItem
import re

class FFSpider (CrawlSpider) :
  name = "ff"
  allowed_domains = ["fabfoto.co.in"]
  start_urls = ["http://www.fabfoto.co.in/"]

  rules = (
       Rule(SgmlLinkExtractor(allow=(".*NewItems.*\.aspx", ), unique=False), callback='parse_item', follow= True),
      #       Rule(SgmlLinkExtractor(allow=("scl-.*\.htm", )), callback='parse_item',     follow= True),
  )

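  def parse_item(self, response) :
    sel = Selector (response)
    print(response.url)
    items = []
    # Each <li> is treated as one product; the XPaths below are placeholders.
    sites = sel.xpath ('//ul/li')
    for site in sites :
      item = FabfotoItem ()
      #item['category'] = site.xpath ('//div[@class="breadcrumb"]/a/text()').extract ()
      item['source_website'] = "fabfoto.co.in"
      # Query relative to the current node (leading "."), not the whole page.
      title = site.xpath ('.//td/span/text()').extract()
      img_url = site.xpath ('.//td/span/img/@src').extract()
      price = str(site.xpath ('.//td/span/text()').extract())
      #price = re.findall(r'\d+', price)
      print(title)
      print(img_url)
      print(price)
      # Set the title before it is checked below; it was never assigned before.
      item['title'] = title
      item['price'] = price

      # Keep only items for which a title was found.
      if item['title'] :
        items.append (item)

    return items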

