Thank you Svyatoslav. May I know how to print these sitename and links in the parse_item function?
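A minimal sketch of one way to do that, assuming the spider from the reply below: the websites list only exists after the second request has been downloaded, so the printing has to happen in getwebsitename (or in an item pipeline) rather than in parse_item itself. For example, using the spider's built-in log method:

    def getwebsitename(self, response):
        websitename = response.meta['sitename']
        hxs = HtmlXPathSelector(response)
        websites = hxs.select('//li[@class="web"]/a/@href').extract()
        # print to stdout, or log so the output lands in Scrapy's log
        print websitename, websites
        self.log("sitename: %r links: %r" % (websitename, websites))
        # ...then build and return the item as in the reply below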
On Fri, Apr 18, 2014 at 4:57 AM, Svyatoslav Sydorenko <[email protected]> wrote:

The following code will do the job. Hope it helps.

    import re

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    from pagitest.items import PagitestItem


    class InfojobsSpider(CrawlSpider):
        USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0"
        name = "info"
        allowed_domains = ["infosec.co.uk"]
        start_urls = [
            "http://www.infosec.co.uk/exhibitor-directory/",
        ]
        rules = (
            # note the escaped "?" so the query string is matched literally
            Rule(SgmlLinkExtractor(
                     allow=(r'/en/exhibitor-directory/\?startRecord=\d+&rpp=\d+',)),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            links = hxs.select('//div[@class="listItemDetail exhibitorDetail"]'
                               '/h3[@class="name"]/a')
            for lnk in links:
                # extract() returns a list, so take the first match
                link = lnk.select('@href').extract()[0]
                if not re.match('^http', link):
                    link = "http://www.infosec.co.uk" + link
                # carry the exhibitor name along via the request meta
                yield Request(url=link,
                              meta={"sitename": lnk.select('text()').extract()},
                              callback=self.getwebsitename)

        def getwebsitename(self, response):
            websitename = response.meta['sitename']
            hxs = HtmlXPathSelector(response)
            websites = hxs.select('//li[@class="web"]/a/@href').extract()
            # it's better to return PagitestItem instances instead of a dict:
            return {"sitename": websitename, "links": websites}

On Thursday, April 17, 2014 at 07:50:58 UTC+3, masroor javed wrote:

Yes, I know, but those links can be extracted with a simple XPath expression. I just want to hit all those links and get the website name, then come back to the first page to get the link name and stand name. That is: the first page has 12 links, so I have to extract each link name and stand name, then hit the links one by one and get the website name. Three values in total: titlename, standname and websitename. I have attached an image in which I marked the titlename and standname.

On Thu, Apr 17, 2014 at 3:04 AM, Svyatoslav Sydorenko <[email protected]> wrote:

Then just yield a new Request instead of returning the URL.

BTW, you should also avoid the double loop. It's possible to extract all the links with a single XPath expression:

    //div[@class="listItemDetail exhibitorDetail"]/h3[@class="name"]/a/@href

P.S. If I understand you right, you may also let Scrapy crawl all the links itself rather than implementing that by hand.

On Wednesday, April 16, 2014 at 12:34:19 UTC+3, masroor javed wrote:

Hi Svyatoslav, I just want to return all the website names from the getwebsitename function to yield Request(url=titleurls, callback=self.getwebsitename).

On Wed, Apr 16, 2014 at 2:22 PM, Svyatoslav Sydorenko <[email protected]> wrote:

    - yield Request(url=titleurls, callback=self.getwebsitename)
    + yield Request(url=titleurls,
    +               meta={"titlename": some_titlename,
    +                     "standnumber": some_standnumber},
    +               callback=self.getwebsitename)

and in getwebsitename you may just access the response.meta dict:
http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Response.meta
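To make that diff concrete, a rough sketch of the full round trip, assuming both methods live in the spider; the URL and the two values here are placeholders, not data from the real page:

    def parse_item(self, response):
        # placeholder values -- in the real spider these come from XPath
        some_titlename = "Example Exhibitor"
        some_standnumber = "A123"
        titleurls = "http://www.infosec.co.uk/some-exhibitor/"
        # the meta dict travels with the request and comes back on the response
        yield Request(url=titleurls,
                      meta={"titlename": some_titlename,
                            "standnumber": some_standnumber},
                      callback=self.getwebsitename)

    def getwebsitename(self, response):
        # the same dict is available here, untouched
        titlename = response.meta["titlename"]
        standnumber = response.meta["standnumber"]
        self.log("%s is at stand %s" % (titlename, standnumber))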
On Tuesday, April 15, 2014 at 14:14:32 UTC+3, masroor javed wrote:

Hi,

I am new to Scrapy. I just want to know how to call a function and return two or three values from it. I have a spider; please let me know how to solve this.

Steps:
1. I want to scrape all page links, with pagination, and the stand number.
2. Hit all the links and extract the website URL.
3. The result should be three values: titlename, standnumber and website URL.

My spider code is:

    import re

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    from pagitest.items import PagitestItem


    class InfojobsSpider(CrawlSpider):
        USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0"
        name = "info"
        allowed_domains = ["infosec.co.uk"]
        start_urls = [
            "http://www.infosec.co.uk/exhibitor-directory/",
        ]
        rules = (
            Rule(SgmlLinkExtractor(allow=(r'exhibitor\W+directory',),
                                   restrict_xpaths=('//li[@class="gButton"]/a',)),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            data = hxs.select('//div[@class="listItemDetail exhibitorDetail"]')
            for titlename in data:
                titleurl = titlename.select('h3[@class="name"]/a/@href').extract()
                for titleurls in titleurl:
                    if not re.match('^http', titleurls):
                        titleurls = "http://www.infosec.co.uk" + titleurls
                    yield Request(url=titleurls, callback=self.getwebsitename)

        def getwebsitename(self, response):
            hxs = HtmlXPathSelector(response)
            websites = hxs.select('//li[@class="web"]/a/@href').extract()
            for websitename in websites:
                return websites
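Svyatoslav's first reply suggests returning PagitestItem instances instead of a dict; Scrapy of this vintage (pre-1.0) only accepts Requests and Items (not plain dicts) from callbacks, so that suggestion matters. A minimal sketch of what that could look like, with hypothetical field names based on the thread:

    # pagitest/items.py -- field names are guesses based on the thread
    from scrapy.item import Item, Field

    class PagitestItem(Item):
        titlename = Field()
        standnumber = Field()
        website = Field()

    # in the spider's final callback:
    def getwebsitename(self, response):
        hxs = HtmlXPathSelector(response)
        item = PagitestItem()
        item["titlename"] = response.meta["titlename"]
        item["standnumber"] = response.meta["standnumber"]
        item["website"] = hxs.select('//li[@class="web"]/a/@href').extract()
        return item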
