I am practicing with Scrapy and have a question:
https://eapplicant.northshore.org/psc/psapp/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL

*Please let me know how to write the CSS selectors for the title, the description and the next page link.*

Language: Python + scrapy + scrapinghub/splash

The website I want to scrape has a structure like this:

    <!-- This is the full path of the table. -->
    <div id="divgbHRS_CE_JO_EXT_I$0" style="width:613px;height:206px; ">
    <table id="gbHRS_CE_JO_EXT_I$0" width="613" cellspacing="0" cellpadding="0" border="0" dir="ltr" style="overflow:hidden;">
    <tbody>
    <tr>
    <td valign="top" style="width:613px;">
    <div id="divgbrHRS_CE_JO_EXT_I$0" onscroll="ptGridObj_win0.doOnScroll('HRS_CE_JO_EXT_I$0',1);" style="width:613px;height:206px; overflow-x:hidden;overflow-y:hidden;">
    <table id="tdgbrHRS_CE_JO_EXT_I$0" cellspacing="0" cellpadding="2" border="0" cols="5" dir="ltr" style="width:613px;">
    <tbody>
    <tr id="trHRS_CE_JO_EXT_I$0_row1" onmouseout="hoverLightTR('rgb(249,254,203)','',1,'trHRS_CE_JO_EXT_I$0_row1');" onmouseover="hoverLightTR('rgb(249,254,203)','',0,'trHRS_CE_JO_EXT_I$0_row1');" onclick="HighLightTR('rgb(212,219,217)','','trHRS_CE_JO_EXT_I$0_row1');">
      <td id="tdHRS_CE_JO_EXT_I$0#0" class="PSLEVEL1GRIDODDROW" width="20" nowrap="nowrap" height="54" align="center" style="">
        <div id="win0divSELECT$0">
          <input id="SELECT$chk$0" type="hidden" value="N" name="SELECT$chk$0">
          <input id="SELECT$0" class="PSCHECKBOX" type="checkbox" onclick="setupTimeout2(); this.form.SELECT$chk$0.value=(this.checked?'Y':'N');doFocus_win0(this,false,true);" value="Y" tabindex="99" name="SELECT$0">
        </div>
      </td>
      <td id="tdHRS_CE_JO_EXT_I$0#1" class="PSLEVEL1GRIDODDROW" width="83" align="left" style="">
      <td id="tdHRS_CE_JO_EXT_I$0#2" class="PSLEVEL1GRIDODDROW" width="182" align="left" style="">
        <!-- Here I need a CSS selector for this title. -->
        <div id="win0divPOSTINGTITLE$0" style="width:182px;">
      </td>
      <td id="tdHRS_CE_JO_EXT_I$0#3" class="PSLEVEL1GRIDODDROW" width="83" align="left" style="">
      <td id="tdHRS_CE_JO_EXT_I$0#4" class="PSLEVEL1GRIDODDROW" align="left" style="">
    </tr>

This is the next page HTML:

    <div id="win0divHRS_APPL_WRK_HRS_LST_NEXT">
      <span class="PSHYPERLINK" title="Next In List">
        <!-- Here I have to extract the next page of jobs using Splash. -->
        <a id="HRS_APPL_WRK_HRS_LST_NEXT" class="PSHYPERLINK" href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT');" tabindex="74" ptlinktgt="pt_replace" name="HRS_APPL_WRK_HRS_LST_NEXT">Next</a>
      </span>
    </div>
    </td>

Spider code:
============

    import re
    import urlparse

    from scrapy.http import Request
    from scrapy.selector import Selector

    def parse(self, response):
        selector = Selector(response)
        links = []
        # This selector is my attempt at the job links; it does not work.
        for link in selector.css('div.win0divHRS_CE_JO_EXT_I$0 div.trHRS_CE_JO_EXT_I$0_row1 > a.title.heading.trHRS_CE_JO_EXT_I$0_row1-title::attr(href)').extract():
            yield Request(urlparse.urljoin(response.url, link),
                          callback=self.parse_listing_page,
                          #meta={"use_splash": False}
                          )
        # This selector is my attempt at the next page link; it does not work.
        next_page_link = selector.css('div.pages > a:last-child:not(.disabled)')
        if next_page_link:
            def increment10(matchobj):
                # This does not work either.
                return "st=" + str(int(matchobj.group("pagenum")) + 10)
            next_page_url = re.sub('', increment10, response.url)
            print "next page:", next_page_url
            yield Request(next_page_url, self.parse,
                          #meta={"use_splash": True},
                          dont_filter=True)
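
A possible starting point for the selector question, offered as a sketch and not as a tested solution for this site: the ids in the HTML above contain `$`, which is not valid in an unescaped CSS identifier, and the spider addresses them with `.` (class syntax) although they are `id` attributes. Attribute selectors with a quoted value avoid the escaping problem. The helper names `id_equals` and `id_starts_with` are hypothetical, and the idea that every row's title div has an id starting with `win0divPOSTINGTITLE$` is an assumption based on the single row shown.

```python
def id_equals(element_id):
    """Build a CSS attribute selector that matches one exact id value.

    Quoting the value means characters such as '$' need no escaping.
    Assumes the id itself contains no double quote.
    """
    return '[id="{0}"]'.format(element_id)


def id_starts_with(prefix):
    """Build a CSS attribute selector that matches ids beginning with prefix."""
    return '[id^="{0}"]'.format(prefix)


# Assumption: the title divs are numbered win0divPOSTINGTITLE$0, $1, ...
TITLE_DIVS = id_starts_with("win0divPOSTINGTITLE$")

# The "Next" anchor id is taken from the HTML in the post.
NEXT_LINK = "a" + id_equals("HRS_APPL_WRK_HRS_LST_NEXT")

if __name__ == "__main__":
    print(TITLE_DIVS)
    print(NEXT_LINK)
```

Inside `parse`, these strings could be passed to `response.css(...)`, for example `response.css(TITLE_DIVS)`. Note that the `href` of the Next link is a `javascript:` action and not a URL, so joining it with `response.url` will not give a page to request; the click would have to happen in the rendered page, which is where Splash comes in.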
