XPath can only select elements; it cannot apply any tree transformations to 
them. For that, take a look at XSLT or XQuery:
http://lxml.de/xpathxslt.html#xslt
http://www.dpawson.co.uk/xsl/sect2/identity.html#d5916e103
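As a minimal sketch of the identity-transform approach from the second link: copy every node by default, then add empty templates for the subtrees you want dropped. The `div` selectors below are taken from the sample HTML in your question, and the input here is a trimmed well-formed stand-in (real scraped pages would need lxml's HTML parser first):

```python
# Sketch of an XSLT identity transform with lxml: the first template
# copies everything unchanged; the empty templates drop the unwanted divs.
import lxml.etree

XSLT_DOC = lxml.etree.XML("""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- overriding empty templates: these subtrees are dropped -->
  <xsl:template match="div[@class='reln']"/>
  <xsl:template match="div[@align='left']"/>
</xsl:stylesheet>
""")
transform = lxml.etree.XSLT(XSLT_DOC)

# trimmed, well-formed stand-in for the sample HTML in the question
doc = lxml.etree.XML(
    "<div class='article'>"
    "<div class='quote'>wanted links</div>"
    "<div align='left'><div class='rating'>rating widget</div></div>"
    "<div class='reln'>related news</div>"
    "</div>")
cleaned = str(transform(doc))
```

`cleaned` keeps the quote div but loses the rating and related-news subtrees.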


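Alternatively, since Scrapy's selectors are built on lxml anyway, you can skip XSLT: parse the page with lxml.html, select the unwanted divs with the same XPath you would use in Scrapy, and call drop_tree() on them. A sketch on a trimmed version of your sample (in the spider you would feed response.body to lxml.html.fromstring instead):

```python
# Sketch: XPath only *finds* the unwanted nodes; lxml's drop_tree()
# performs the actual tree transformation before serializing.
import lxml.html

# trimmed version of the sample HTML from the question
HTML = """
<div align="center" class="article">
  <div style="text-align:justify">
    Sample Text
    <div class="quote">wanted links</div>
    <div align="left"><div class="rating">rating widget</div></div>
    <div class="reln"><h4>Related News:</h4></div>
  </div>
</div>
"""

article = lxml.html.fromstring(HTML)
for unwanted in article.xpath('.//div[@class="reln"] | .//div[@align="left"]'):
    unwanted.drop_tree()  # removes the element and its whole subtree

cleaned = lxml.html.tostring(article).decode()
```

In parse_blogpost you could then assign `cleaned` to item['article_html'] instead of extracting the selector directly.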
On Saturday, 10 May 2014 11:07:02 UTC+3, VR Tech wrote:
>
> I'm using Scrapy to crawl a site with some odd formatting conventions. The 
> basic idea is that I want all the text and sub-elements of a certain div, 
> EXCEPT a few divs in the middle. Here is the HTML below:
>
> <div align="center" class="article"><!--wanted-->
>     <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>
>     <div style="text-align:justify"><!--wanted-->
>         Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
>         <div class="quote"><!--wanted-->
>             http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
>         </div>
>         <br>
>         <div align="left"><!--not wanted-->
>             <div id="ratig-layer-2249"><!--not wanted-->
>                 <div class="rating"><!--not wanted-->
>                     <ul class="unit-rating">
>                         <li class="current-rating" style="width:80%;">80</li>
>                         <li><a href="#" title="Bad" class="r1-unit" onclick="doRate('1', '2249'); return false;">1</a></li>
>                         <li><a href="#" title="Poor" class="r2-unit" onclick="doRate('2', '2249'); return false;">2</a></li>
>                         <li><a href="#" title="Fair" class="r3-unit" onclick="doRate('3', '2249'); return false;">3</a></li>
>                         <li><a href="#" title="Good" class="r4-unit" onclick="doRate('4', '2249'); return false;">4</a></li>
>                         <li><a href="#" title="Excellent" class="r5-unit" onclick="doRate('5', '2249'); return false;">5</a></li>
>                     </ul>
>                 </div>
>                 (votes: <span id="vote-num-id-2249">3</span>)
>             </div>
>         </div>
>         <div class="reln"><!--not wanted-->
>             <strong>
>                 <h4>Related News:</h4>
>             </strong>
>             <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">1</a></li>
>             <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d">2</a></li>
>             <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">3</a></li>
>             <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">4</a></li>
>             <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">5</a></li>
>         </div>
>     </div>
> </div>
>
>
> The final output should look like:
>
> <div align="center" class="article"><!--wanted-->
>     <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>
>     <div style="text-align:justify"><!--wanted-->
>         Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
>         <div class="quote"><!--wanted-->
>             http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
>         </div>
>         <br>
>     </div>
> </div>
>
>
> Here is my Scrapy code. Please suggest what I should add to this script:
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import HtmlXPathSelector
> from isbullshit.items import IsBullshitItem
>
>
> class IsBullshitSpider(CrawlSpider):
>     """ General configuration of the Crawl Spider """
>     name = 'isbullshitwp'
>     start_urls = ['http://example.com/themes']  # urls from which the spider will start crawling
>     rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
>              # r'page/\d+' : regular expression for http://example.com/page/X URLs
>              Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
>              # r'\w+' : regular expression for the remaining article URLs
>
>     def parse_blogpost(self, response):
>         hxs = HtmlXPathSelector(response)
>         item = IsBullshitItem()
>         item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]
>         item['article_html'] = hxs.select("//div[@class='article']").extract()[0]
>
>         return item
>
>
> Thanks in advance...
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
