I've noticed two things:

1. The presence of the 999 in the response output: <999 https://linkedin.com/job/jobs-in-san-fransisco-ca/page_num=1>
2. In the available Scrapy objects, there is no xpath in the sel object, which is defined as follows: <Selector xpath=None data=u'<html><head>\n<script type="text/javascri'>
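On point 1: the number in that repr is the HTTP status code of the response, and 999 is not a registered HTTP status at all, which suggests the site is refusing the request rather than serving the page. A minimal stdlib check of that claim (a sketch; `http.HTTPStatus` only knows codes in the standard registry):

```python
from http import HTTPStatus

def is_standard_status(code):
    """Return True if `code` is a registered HTTP status code."""
    try:
        HTTPStatus(code)
        return True
    except ValueError:
        return False

print(is_standard_status(200))  # True  - a normal page
print(is_standard_status(999))  # False - a site-specific refusal code
```

So a `<999 ...>` response body is unlikely to contain the job listings the XPath expects, whatever the selector looks like.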
According to the previous information, how can I address this problem?
Regards,
K.

2015-03-17 14:15 GMT+01:00 Morad Edwar <[email protected]>:
> It's the same problem. Try response.url and you will see that it's another link, because of the special chars.
>
> On Tuesday, March 17, 2015 at 2:54:21 PM UTC+2, DataScience wrote:
>> Yes, I saw the difference. In this sense, I changed the URL to the one you suggested, then to another one (https://www.linkedin.com/job/all-jobs/?sort=date) => I obtained the same output when I run print response.url, but I still get an empty list as the result of sel.xpath.
>> Please find a screenshot explaining the procedure I followed here:
>> http://fr.tinypic.com/view.php?pic=yi9dv&s=8
>>
>> Regards,
>> K.
>>
>> 2015-03-17 12:59 GMT+01:00 Morad Edwar <[email protected]>:
>>> Do you see the difference?
>>> scrapy shell didn't parse the full URL because of the special chars in the URL. Try the following:
>>>
>>> scrapy shell https://www.linkedin.com/job/jobs-in-san-francisco-ca/\?page_num\=1\&trk\=jserp_pagination_1
>>>
>>> On Tuesday, March 17, 2015 at 1:52:20 PM UTC+2, DataScience wrote:
>>>> I obtain the following output:
>>>> https://www.linkedin.com/job/jobs-in-san-fransisco-ca/?page_num=1
>>>>
>>>> Regards,
>>>> K.
>>>>
>>>> 2015-03-17 12:42 GMT+01:00 Morad Edwar <[email protected]>:
>>>>> Please do it again, but after step one run the following code:
>>>>> print response.url
>>>>> and give us the output.
>>>>>
>>>>> Morad Edwar,
>>>>> Software Developer | Bkam.com
>>>>> On Mar 17, 2015 1:13 PM, "Kais DAI" <[email protected]> wrote:
>>>>>> This is what I did:
>>>>>>
>>>>>> 1. I opened the command line in Windows and ran the following command:
>>>>>> scrapy shell https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1&trk=jserp_pagination_1
>>>>>> 2. Then, I ran this command:
>>>>>> sel.xpath('//div[@id="results-rail"]/ul[@class="jobs"]/li[1]/div[@class="content"]/span/a[@class="title"]/text()').extract()
>>>>>> In this case, an empty list is returned: []
>>>>>> The same thing happens with this XPath selection:
>>>>>> sel.xpath('html/body/div[3]/div/div[2]/div[2]/div[1]/ul/li[1]/div/span/a').extract()
>>>>>>
>>>>>> Did you obtain a result by following the same steps?
>>>>>> Thank you for your help.
>>>>>>
>>>>>> Regards,
>>>>>> K.
>>>>>>
>>>>>> 2015-03-17 11:34 GMT+01:00 Morad Edwar <[email protected]>:
>>>>>>> I used scrapy shell and your XPath worked fine!
>>>>>>> And when I changed li[1] to li, it scraped all the job titles.
>>>>>>>
>>>>>>> On Monday, March 16, 2015 at 6:19:01 PM UTC+2, DataScience wrote:
>>>>>>>> Actually, I've checked the response.body and it doesn't match the content that I have in the webpage.
>>>>>>>> I am really confused; what can I do in this case?
>>>>>>>>
>>>>>>>> On Monday, March 16, 2015 at 17:15:14 UTC+1, Travis Leleu wrote:
>>>>>>>>> It doesn't look to me like it's writing the HTML to the DOM with JS, as you noted.
>>>>>>>>>
>>>>>>>>> The big concern I have is that you are assuming the HTML content in your browser is the same as in your code. How have you asserted this?
>>>>>>>>>
>>>>>>>>> On Mon, Mar 16, 2015 at 9:02 AM, DataScience <[email protected]> wrote:
>>>>>>>>>> Thank you Travis for your quick feedback.
>>>>>>>>>>
>>>>>>>>>> I am testing Scrapy on this specific webpage, trying to get the job offers (and not profiles).
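[The empty result in step 2 above fits Morad's diagnosis: on an unquoted command line, the `&` ends the command, so scrapy shell only ever fetches the URL up to page_num=1. A quick stdlib check of what each variant carries, independent of Scrapy:]

```python
from urllib.parse import urlsplit, parse_qs

full = "https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1&trk=jserp_pagination_1"
# What the shell actually passes along when the & is not escaped or quoted:
truncated = full.split("&")[0]

print(parse_qs(urlsplit(full).query))       # both parameters present
print(parse_qs(urlsplit(truncated).query))  # 'trk' is lost
```

[Escaping each special character with a backslash, as suggested below, works; so does wrapping the whole URL in quotes: scrapy shell "https://...&trk=...".]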
>>>>>>>>>> I read in some forums that it may be because the website uses JavaScript to build most of the page, so the elements I want do not appear in the HTML source of the page. I've checked by disabling JavaScript and reloading the page, but the result was still displayed on the page (I've also checked the network in Firebug by filtering XHR and looked into the POST...and nothing).
>>>>>>>>>>
>>>>>>>>>> Any help would be more than welcome.
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> On Monday, March 16, 2015 at 16:26:41 UTC+1, Travis Leleu wrote:
>>>>>>>>>>> LinkedIn can be a tough site to scrape, as they generally don't want their data in other people's hands. You will need to use a user-agent switcher (you don't mention what UA you are sending), and most likely a proxy in addition.
>>>>>>>>>>>
>>>>>>>>>>> If you are looking to scrape the entirety of LinkedIn, it's > 30 million profiles. I've found it more economical to purchase a LinkedIn data dump from scrapinghub.com than to scrape it myself.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 16, 2015 at 8:05 AM, DataScience <[email protected]> wrote:
>>>>>>>>>>>> Hi Scrapy Guys,
>>>>>>>>>>>>
>>>>>>>>>>>> Scrapy returns an empty list while I'm using the shell to pick a simple "title" field from this web page: http://goo.gl/dBR8P4
>>>>>>>>>>>> I've used:
>>>>>>>>>>>>
>>>>>>>>>>>> - sel.xpath('//div[@id="results-rail"]/ul[@class="jobs"]/li[1]/div[@class="content"]/span/a[@class="title"]/text()').extract()
>>>>>>>>>>>> - sel.xpath('html/body/div[3]/div/div[2]/div[2]/div[1]/ul/li[1]/div/span/a').extract()
>>>>>>>>>>>> - ...
>>>>>>>>>>>>
>>>>>>>>>>>> I verified the issue of the POST with XHR using Firebug, and I think there is no relationship with information generated by JS code (what do you think?).
>>>>>>>>>>>>
>>>>>>>>>>>> Can you please help me figure out this problem?
>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> K.

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
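[For completeness: the first XPath discussed in this thread can be exercised against a small, hypothetical HTML fragment whose structure is inferred from the selector itself, not copied from LinkedIn. ElementTree's limited XPath subset is enough to show the idea, with text() expressed as .text:]

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure the thread's XPath assumes.
html = """
<html><body>
  <div id="results-rail">
    <ul class="jobs">
      <li><div class="content"><span><a class="title">Data Scientist</a></span></div></li>
      <li><div class="content"><span><a class="title">Software Engineer</a></span></div></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# Same path shape as in the thread, with li (not li[1]) to collect every job,
# as Morad suggested; text() becomes the .text attribute of each match.
titles = [a.text for a in root.findall(
    ".//div[@id='results-rail']/ul[@class='jobs']/li/div[@class='content']/span/a[@class='title']")]
print(titles)  # ['Data Scientist', 'Software Engineer']
```

[Against a fragment like this the selector works, which supports the conclusion above: the XPath is fine, and the empty list comes from the response body not containing the page the browser shows.]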
