I've noticed two things:

1. The presence of the 999 in the response output: <999 https://linkedin.com/job/jobs-in-san-fransisco-ca/page_num=1>
2. In the available Scrapy objects, there is no xpath in the sel object, which is defined as follows: <Selector xpath=None data=u'<html><head>\n<script type="text/javascri'>
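On point 1: the number in that repr is the HTTP status code of the response, and 999 is not a registered HTTP status at all, which suggests the site is refusing the request rather than serving the page. A minimal stdlib check of that claim (a sketch; `http.HTTPStatus` only knows codes in the standard registry):

```python
from http import HTTPStatus

def is_standard_status(code):
    """Return True if `code` is a registered HTTP status code."""
    try:
        HTTPStatus(code)
        return True
    except ValueError:
        return False

print(is_standard_status(200))  # True  - a normal page
print(is_standard_status(999))  # False - a site-specific refusal code
```

So a `<999 ...>` response body is unlikely to contain the job listings the XPath expects, whatever the selector looks like.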
According to the previous information, how can I address this problem?
Regards,
K.

2015-03-17 14:15 GMT+01:00 Morad Edwar <[email protected]>:
> It's the same problem. Try response.url and you will see that it's another link, because of the special chars.
>
> On Tuesday, March 17, 2015 at 2:54:21 PM UTC+2, DataScience wrote:
>> Yes, I saw the difference. In this sense, I changed the URL to the one you suggested, then to another one (https://www.linkedin.com/job/all-jobs/?sort=date) => I obtained the same output when I run print response.url, but I still get an empty list as the result of sel.xpath.
>> Please find a screenshot explaining the procedure I followed here:
>> http://fr.tinypic.com/view.php?pic=yi9dv&s=8
>>
>> Regards,
>> K.
>>
>> 2015-03-17 12:59 GMT+01:00 Morad Edwar <[email protected]>:
>>> Do you see the difference?
>>> scrapy shell didn't parse the full URL because of the special chars in the URL. Try the following:
>>>
>>> scrapy shell https://www.linkedin.com/job/jobs-in-san-francisco-ca/\?page_num\=1\&trk\=jserp_pagination_1
>>>
>>> On Tuesday, March 17, 2015 at 1:52:20 PM UTC+2, DataScience wrote:
>>>> I obtain the following output:
>>>> https://www.linkedin.com/job/jobs-in-san-fransisco-ca/?page_num=1
>>>>
>>>> Regards,
>>>> K.
>>>>
>>>> 2015-03-17 12:42 GMT+01:00 Morad Edwar <[email protected]>:
>>>>> Please do it again, but after step one run the following code:
>>>>> print response.url
>>>>> and give us the output.
>>>>>
>>>>> Morad Edwar,
>>>>> Software Developer | Bkam.com
>>>>> On Mar 17, 2015 1:13 PM, "Kais DAI" <[email protected]> wrote:
>>>>>> This is what I did:
>>>>>>
>>>>>> 1. I opened the command line in Windows and ran the following command:
>>>>>> scrapy shell https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1&trk=jserp_pagination_1
>>>>>> 2. Then, I ran this command:
>>>>>> sel.xpath('//div[@id="results-rail"]/ul[@class="jobs"]/li[1]/div[@class="content"]/span/a[@class="title"]/text()').extract()
>>>>>> In this case, an empty list is returned: []
>>>>>> The same thing happens with this XPath selection:
>>>>>> sel.xpath('html/body/div[3]/div/div[2]/div[2]/div[1]/ul/li[1]/div/span/a').extract()
>>>>>>
>>>>>> Did you obtain a result by following the same steps?
>>>>>> Thank you for your help.
>>>>>>
>>>>>> Regards,
>>>>>> K.
>>>>>>
>>>>>> 2015-03-17 11:34 GMT+01:00 Morad Edwar <[email protected]>:
>>>>>>> I used scrapy shell and your XPath worked fine!
>>>>>>> And when I changed li[1] to li, it scraped all the job titles.
>>>>>>>
>>>>>>> On Monday, March 16, 2015 at 6:19:01 PM UTC+2, DataScience wrote:
>>>>>>>> Actually, I've checked the response.body and it doesn't match the content that I have in the webpage.
>>>>>>>> I am really confused; what can I do in this case?
>>>>>>>>
>>>>>>>> On Monday, March 16, 2015 at 17:15:14 UTC+1, Travis Leleu wrote:
>>>>>>>>> It doesn't look to me like it's writing the HTML to the DOM with JS, as you noted.
>>>>>>>>>
>>>>>>>>> The big concern I have is that you are assuming the HTML content in your browser is the same as in your code. How have you asserted this?
>>>>>>>>>
>>>>>>>>> On Mon, Mar 16, 2015 at 9:02 AM, DataScience <[email protected]> wrote:
>>>>>>>>>> Thank you Travis for your quick feedback.
>>>>>>>>>>
>>>>>>>>>> I am testing Scrapy on this specific webpage, trying to get the job offers (and not profiles).
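[The empty result in step 2 above fits Morad's diagnosis: on an unquoted command line, the `&` ends the command, so scrapy shell only ever fetches the URL up to page_num=1. A quick stdlib check of what each variant carries, independent of Scrapy:]

```python
from urllib.parse import urlsplit, parse_qs

full = "https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1&trk=jserp_pagination_1"
# What the shell actually passes along when the & is not escaped or quoted:
truncated = full.split("&")[0]

print(parse_qs(urlsplit(full).query))       # both parameters present
print(parse_qs(urlsplit(truncated).query))  # 'trk' is lost
```

[Escaping each special character with a backslash, as suggested below, works; so does wrapping the whole URL in quotes: scrapy shell "https://...&trk=...".]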
>>>>>>>>>> I read in some forums that it may be because the website uses JavaScript to build most of the page, so the elements I want do not appear in the HTML source of the page. I've checked by disabling JavaScript and reloading the page, but the result was still displayed on the page (I've also checked the network in Firebug by filtering XHR and looked into the POST...and nothing).
>>>>>>>>>>
>>>>>>>>>> Any help would be more than welcome.
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> On Monday, March 16, 2015 at 16:26:41 UTC+1, Travis Leleu wrote:
>>>>>>>>>>> LinkedIn can be a tough site to scrape, as they generally don't want their data in other people's hands. You will need to use a user-agent switcher (you don't mention what UA you are sending), and most likely a proxy in addition.
>>>>>>>>>>>
>>>>>>>>>>> If you are looking to scrape the entirety of LinkedIn, it's > 30 million profiles. I've found it more economical to purchase a LinkedIn data dump from scrapinghub.com than to scrape it myself.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 16, 2015 at 8:05 AM, DataScience <[email protected]> wrote:
>>>>>>>>>>>> Hi Scrapy Guys,
>>>>>>>>>>>>
>>>>>>>>>>>> Scrapy returns an empty list while I'm using the shell to pick a simple "title" field from this web page: http://goo.gl/dBR8P4
>>>>>>>>>>>> I've used:
>>>>>>>>>>>>
>>>>>>>>>>>> - sel.xpath('//div[@id="results-rail"]/ul[@class="jobs"]/li[1]/div[@class="content"]/span/a[@class="title"]/text()').extract()
>>>>>>>>>>>> - sel.xpath('html/body/div[3]/div/div[2]/div[2]/div[1]/ul/li[1]/div/span/a').extract()
>>>>>>>>>>>> - ...
>>>>>>>>>>>>
>>>>>>>>>>>> I verified the issue of the POST with XHR using Firebug, and I think there is no relationship with information generated by JS code (what do you think?).
>>>>>>>>>>>>
>>>>>>>>>>>> Can you please help me figure out this problem?
>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> K.

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
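[For completeness: the first XPath discussed in this thread can be exercised against a small, hypothetical HTML fragment whose structure is inferred from the selector itself, not copied from LinkedIn. ElementTree's limited XPath subset is enough to show the idea, with text() expressed as .text:]

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure the thread's XPath assumes.
html = """
<html><body>
  <div id="results-rail">
    <ul class="jobs">
      <li><div class="content"><span><a class="title">Data Scientist</a></span></div></li>
      <li><div class="content"><span><a class="title">Software Engineer</a></span></div></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# Same path shape as in the thread, with li (not li[1]) to collect every job,
# as Morad suggested; text() becomes the .text attribute of each match.
titles = [a.text for a in root.findall(
    ".//div[@id='results-rail']/ul[@class='jobs']/li/div[@class='content']/span/a[@class='title']")]
print(titles)  # ['Data Scientist', 'Software Engineer']
```

[Against a fragment like this the selector works, which supports the conclusion above: the XPath is fine, and the empty list comes from the response body not containing the page the browser shows.]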
