I am having trouble crawling multiple pages: the spider currently scrapes 1000 jobs and then stops.
Please let me know how to solve this.

On Tuesday, September 2, 2014 2:37:56 PM UTC+5:30, james josh wrote:
>
> Yes, I got it, it is working as expected now.
>
> Thanks,
> james
>
>
> On Tue, Sep 2, 2014 at 2:18 PM, Lhassan Baazzi <[email protected]> wrote:
>
>> Hi,
>>
>> For this notice:
>>
>> scrapy_demo\spiders\test.py:43: ScrapyDeprecationWarning: Call to
>> deprecated function select. Use .xpath() instead.
>>
>> just replace select with xpath.
>>
>> Regards.
>> ---------
>> Lhassan Baazzi | Web Developer PHP / Python - Symfony - JS - Scrapy
>> Email/Gtalk: [email protected] - Skype: baazzilhassan - Twitter:
>> @baazzilhassan <http://twitter.com/baazzilhassan>
>> Blog: http://blog.jbinfo.io/
>>
>>
>> 2014-09-02 9:45 GMT+01:00 james josh <[email protected]>:
>>
>>> I am debugging this code and getting different job counts on each run,
>>> but it is not giving me all the jobs, as they are spread across
>>> multiple pages.
>>>
>>> I also get the following warning, but I'm not sure what to do about it:
>>>
>>> scrapy_demo\spiders\test.py:43: ScrapyDeprecationWarning: Call to
>>> deprecated function select. Use .xpath() instead.
>>>
>>> This is the pagination code I am trying:
>>>
>>> next_page = None
>>> if hxs.select('//div[@class="paggingNext"]/a[@class="blue"]/@href').extract():
>>>     next_page = hxs.select('//div[@class="paggingNext"]/a[@class="blue"]/@href').extract()[0]
>>> if next_page:
>>>     yield Request(urlparse.urljoin(response.url, next_page), self.parse)
>>>
>>> And here is the full spider:
>>>
>>> from scrapy.spider import BaseSpider
>>> from scrapy.selector import HtmlXPathSelector
>>> import urlparse
>>> from scrapy.http.request import Request
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> from scrapy.item import Item, Field
>>>
>>>
>>> class ScrapyDemoSpiderItem(Item):
>>>     link = Field()
>>>     title = Field()
>>>     city = Field()
>>>     salary = Field()
>>>     content = Field()
>>>
>>>
>>> class ScrapyDemoSpider(BaseSpider):
>>>     name = 'eujobs77'
>>>     allowed_domains = ['eujobs77.com']
>>>     start_urls = ['http://www.eujobs77.com/jobs']
>>>
>>>     def parse(self, response):
>>>         hxs = HtmlXPathSelector(response)
>>>         listings = hxs.select('//div[@class="jobSearchBrowse jobSearchBrowsev1"]')
>>>         links = []
>>>
>>>         # scrape the listings page to collect listing links
>>>         for listing in listings:
>>>             link = listing.select('//h2[@class="jobtitle"]/a[@class="blue"]/@href').extract()
>>>             links.extend(link)
>>>
>>>         # follow each listing URL to get the content of the listing page
>>>         for link in links:
>>>             item = ScrapyDemoSpiderItem()
>>>             item['link'] = link
>>>             yield Request(urlparse.urljoin(response.url, link),
>>>                           meta={'item': item}, callback=self.parse_listing_page)
>>>
>>>         # get next button link
>>>         next_page = None
>>>         if hxs.select('//div[@class="paggingNext"]/@href').extract():
>>>             next_page = hxs.select('//div[@class="paggingNext"]/@href').extract()
>>>         if next_page:
>>>             yield Request(urlparse.urljoin(response.url, next_page), self.parse)
>>>
>>>     # scrape the listing page to get its content
>>>     def parse_listing_page(self, response):
>>>         hxs = HtmlXPathSelector(response)
>>>         item = response.request.meta['item']
>>>         item['link'] = response.url
>>>         item['title'] = hxs.select("//h1[@id='share_jobtitle']/text()").extract()
>>>         item['city'] = hxs.select("//html/body/div[3]/div[3]/div[2]/div[1]/div[3]/ul/li[1]/div[2]/text()").extract()
>>>         item['salary'] = hxs.select("//html/body/div[3]/div[3]/div[2]/div[1]/div[3]/ul/li[3]/div[2]/text()").extract()
>>>         item['content'] = hxs.select("//div[@class='detailTxt deneL']/text()").extract()
>>>
>>>         yield item
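
To make the suggested fix concrete, here is a minimal before/after for the deprecation warning, assuming a Scrapy 0.24-era project, where Selector.xpath() is the supported replacement for HtmlXPathSelector.select(); the XPath expression itself does not change:

    # Before: deprecated API, triggers the ScrapyDeprecationWarning
    from scrapy.selector import HtmlXPathSelector
    hxs = HtmlXPathSelector(response)
    titles = hxs.select("//h1[@id='share_jobtitle']/text()").extract()

    # After: build a Selector and call .xpath() with the same expression
    from scrapy.selector import Selector
    sel = Selector(response)
    titles = sel.xpath("//h1[@id='share_jobtitle']/text()").extract()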

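As for the multiple-page problem itself: in the full spider above, the next-page XPath asks for @href on the div rather than on the link inside it, and the raw extract() list is passed to urljoin(), which expects a string. A sketch of the corrected pagination block inside parse(), mirroring the snippet quoted at the top of james's message:

    # get next button link: the @href lives on the <a class="blue">, not the <div>,
    # and extract() returns a list, so take the first element before joining
    next_links = hxs.select('//div[@class="paggingNext"]/a[@class="blue"]/@href').extract()
    if next_links:
        yield Request(urlparse.urljoin(response.url, next_links[0]), callback=self.parse)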