Re: Incorrect xpath values when spider crawls website

netcrime Sun, 30 Aug 2015 05:17:17 -0700

Hi,

In my settings.py file I was using:


USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'

It appears that Scrapy simply did not see the class attributes on the <li> elements.

My workaround was:

//ol[@class="breadcrumb container"]/li[position() > 1 and position() < last(
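For anyone hitting the same thing, here is a minimal sketch of why selecting by position is more robust than selecting by class. It uses only the standard library's ElementTree rather than Scrapy, and a Python slice stands in for the position() test, because ElementTree's XPath subset has no position comparisons. The function names by_class and by_position, the shortened "Bookname" text, and the WITHOUT_CLASSES variant are my own assumptions for illustration, not something taken from the real site.

```python
import xml.etree.ElementTree as ET

# Breadcrumb markup modelled on the question (book title shortened to "Bookname").
WITH_CLASSES = """
<ol class="breadcrumb container">
  <li class="first"><a href="http://xxxx.com/index.php?route=common/home"><span>Home</span></a></li>
  <li><a href="http://xxxx.com/books"><span>Books</span></a></li>
  <li class="last"><a href="http://xxxxx.com/books?product_id=193"><span>Bookname</span></a></li>
</ol>
""".strip()

# Hypothetical variant: the same markup with the <li> class attributes missing,
# which is what the spider seemed to receive.
WITHOUT_CLASSES = WITH_CLASSES.replace(' class="first"', "").replace(' class="last"', "")


def by_class(markup):
    """Keep <li> items whose class contains neither 'first' nor 'last'."""
    ol = ET.fromstring(markup)
    return [
        li.findtext("a/span")
        for li in ol.findall("li")
        if "first" not in li.get("class", "") and "last" not in li.get("class", "")
    ]


def by_position(markup):
    """Keep every <li> except the first and the last, ignoring classes."""
    ol = ET.fromstring(markup)
    return [li.findtext("a/span") for li in ol.findall("li")[1:-1]]


if __name__ == "__main__":
    for name, markup in (("with classes", WITH_CLASSES), ("without classes", WITHOUT_CLASSES)):
        print(name, by_class(markup), by_position(markup))
```

With the class attributes present, both functions agree. Without them, the class filter lets all three items through, while the positional version still returns only the middle one.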



On Saturday, August 29, 2015 at 10:23:42 PM UTC+3, Ashish Meena wrote:
>
> Hi,
>
> It is possible that a different page is served to different browsers. What 
> value are you using for the USER_AGENT setting in Scrapy? Could you try 
> setting it to the same value your web browser sends?
>
> Regards,
> Ashish
>
> On Sat, Aug 29, 2015 at 4:49 PM, netcrime <[email protected]> wrote:
>
>> Hello,
>>
>> Background: I need to get the product category from the breadcrumbs. For 
>> example, from the breadcrumb Home > Books > Bookname I need only Books.
>>
>> HTML code:
>> <ol class="breadcrumb container">
>>         <li class="first"><a href="http://xxxx.com/index.php?route=common/home"><span>Home</span></a></li>
>>         <li><a href="http://xxxx.com/books"><span>Books</span></a></li>
>>         <li class="last"><a href="http://xxxxx.com/books?product_id=193" class="last"><span>My Vision : Challenges in the Race for Excellence - Mohammed Bin Rashid Al Maktoum</span></a></li>
>>     </ol>
>>
>> The xpath I use in the browser console, which returns the correct value "Books":
>>
>> //ol[@class="breadcrumb container"]/li[not(contains(@class,"first")) and 
>> not(contains(@class,"last"))]/a/span/text()
>>
>> My Python code:
>>
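>> # Build the expression from adjacent literals so the string is not split mid-line
>> xpath = ('//ol[@class="breadcrumb container"]'
>>          '/li[not(contains(@class,"first")) and not(contains(@class,"last"))]'
>>          '/a/span/text()')
>> for cat in sel.xpath(xpath).extract():
>>     categories[catIndex] = cat
>>     catIndex += 1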
>>
>> When I run my Scrapy spider, it returns all three li elements, including 
>> Home (with class first) and the book name (with class last).
>>
>> I tried running scrapy view http://xxx.com to see the page as the spider 
>> sees it, and the xpath works correctly there.
>>
>> http://prntscr.com/8a7a4u
>>
>> But when I run scrapy shell and try the xpath there, it returns 
>> all three li elements.
>>
>> http://prntscr.com/8a77xe
>>
>>
>> Does anyone have an idea what the problem might be?
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

