You can select elements with XPath, but you can't apply any tree transformations with it. For that, take a look at XSLT or XQuery:
http://lxml.de/xpathxslt.html#xslt
http://www.dpawson.co.uk/xsl/sect2/identity.html#d5916e103
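For illustration, here's a minimal sketch of that identity-transform route with lxml, assuming the unwanted blocks are always div[@align='left'] and div[@class='reln'] as in your sample (clean_article is just a name I made up):

from lxml import etree, html

# Identity transform: copy every attribute and node verbatim, except the
# empty templates below, which match the unwanted divs and emit nothing.
IDENTITY_MINUS_JUNK = etree.XSLT(etree.XML(b"""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- drop the rating block and the related-news block -->
  <xsl:template match="div[@align='left']"/>
  <xsl:template match="div[@class='reln']"/>
</xsl:stylesheet>
"""))

def clean_article(raw_html):
    # lxml.html tolerates the unclosed tags that strict XML parsing would reject
    doc = html.fromstring(raw_html)
    article = doc.xpath("//div[@class='article']")[0]
    return etree.tostring(IDENTITY_MINUS_JUNK(article), pretty_print=True)

If a full stylesheet feels like overkill, lxml can also just delete the unwanted nodes in place; see the sketch at the bottom of this message.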
On Saturday, 10 May 2014 11:07:02 UTC+3, VR Tech wrote:
>
> I'm using Scrapy to crawl a site with some odd formatting conventions. The
> basic idea is that I want all the text and sub-elements of a certain div,
> EXCEPT a few divs in the middle. Here is the markup:
>
> <div align="center" class="article"><!--wanted-->
>   <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>
>   <div style="text-align:justify"><!--wanted-->
>     Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
>     <div class="quote"><!--wanted-->
>       http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
>     </div>
>     <br>
>     <div align="left"><!--not wanted-->
>       <div id="ratig-layer-2249"><!--not wanted-->
>         <div class="rating"><!--not wanted-->
>           <ul class="unit-rating">
>             <li class="current-rating" style="width:80%;">80</li>
>             <li><a href="#" title="Bad" class="r1-unit" onclick="doRate('1', '2249'); return false;">1</a></li>
>             <li><a href="#" title="Poor" class="r2-unit" onclick="doRate('2', '2249'); return false;">2</a></li>
>             <li><a href="#" title="Fair" class="r3-unit" onclick="doRate('3', '2249'); return false;">3</a></li>
>             <li><a href="#" title="Good" class="r4-unit" onclick="doRate('4', '2249'); return false;">4</a></li>
>             <li><a href="#" title="Excellent" class="r5-unit" onclick="doRate('5', '2249'); return false;">5</a></li>
>           </ul>
>         </div>
>         (votes: <span id="vote-num-id-2249">3</span>)
>       </div>
>     </div>
>     <div class="reln"><!--not wanted-->
>       <strong>
>         <h4>Related News:</h4>
>       </strong>
>       <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">1</a></li>
>       <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d">2</a></li>
>       <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">3</a></li>
>       <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">4</a></li>
>       <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">5</a></li>
>     </div>
>   </div>
> </div>
>
> The final output should look like:
>
> <div align="center" class="article"><!--wanted-->
>   <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>
>   <div style="text-align:justify"><!--wanted-->
>     Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
>     <div class="quote"><!--wanted-->
>       http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
>     </div>
>     <br>
>   </div>
> </div>
>
> Here is the piece of my Scrapy code.
> Please suggest the addition to this script:
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import HtmlXPathSelector
> from isbullshit.items import IsBullshitItem
>
>
> class IsBullshitSpider(CrawlSpider):
>     """ General configuration of the Crawl Spider """
>     name = 'isbullshitwp'
>     # urls from which the spider will start crawling
>     start_urls = ['http://example.com/themes']
>     rules = [
>         # r'page/\d+' : regular expression for http://example.com/page/X URLs
>         Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
>         # r'\d{4}/\d{2}/\w+' : regular expression for http://example.com/YYYY/MM/title URLs
>         Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
>
>     def parse_blogpost(self, response):
>         hxs = HtmlXPathSelector(response)
>         item = IsBullshitItem()
>         item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]
>         item['article_html'] = hxs.select("//div[@class='article']").extract()[0]
>
>         return item
>
> Thanks in advance...
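To tie it back to your spider: untested, but something along these lines should work, again assuming the junk blocks are always div[@align='left'] and div[@class='reln'] as in your sample. It re-parses the extracted article with lxml.html (which Scrapy already depends on) and drops the unwanted subtrees before serializing; only parse_blogpost changes, plus one import at the top of the file:

from lxml import html

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]

        # Re-parse the article fragment with lxml.html so we can mutate the tree.
        article = html.fromstring(hxs.select("//div[@class='article']").extract()[0])
        for junk in article.xpath(".//div[@align='left'] | .//div[@class='reln']"):
            junk.drop_tree()  # removes the element and its subtree, keeps tail text
        item['article_html'] = html.tostring(article)

        return item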
