Hi,
you have a couple of options here (at least):
- select the descendant text nodes of the body element, and join this list
of strings with u"" (or a newline character u"\n"):
play['body'] = u''.join(sel.xpath('//body//text()').extract()).strip()
If you want to skip text nodes in <script> elements (JavaScript
code that you probably don't want), you can use:
play['body'] = u''.join(sel.xpath(
    '//body/descendant-or-self::*[not(self::script)]/text()').extract()).strip()
- alternatively, if you don't want to deal with XPath expressions, use
w3lib's remove_tags()
(http://w3lib.readthedocs.org/en/latest/w3lib.html#w3lib.html.remove_tags):
import w3lib.html
...
play['body'] = w3lib.html.remove_tags(sel.xpath('//body').extract()[0])
and to remove text from <script> before stripping tags, you can remove
<script> elements altogether (tags and content), and then strip the
remaining tags, keeping their text content:
play['body'] = w3lib.html.remove_tags(
    w3lib.html.remove_tags_with_content(
        sel.xpath('//body').extract()[0],
        which_ones=('script',)
    )
)
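As a side note, str.strip() only trims leading and trailing whitespace from
a string, which is why the tags were still in your output. If you want to
see the "keep text, drop tags and <script> content" idea without Scrapy or
w3lib, here is a rough sketch using only the standard library. It assumes
Python 3 (html.parser), and the names TextExtractor and html_to_text are
made up for this example; it is not what Scrapy or w3lib do internally:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.script_depth = 0  # > 0 while we are inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self.script_depth:
            self.script_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a <script> element.
        if not self.script_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, without tags or scripts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical sample input, just for illustration.
sample = ("<body><h1>Title</h1><script>var x = 1;</script>"
          "<p>Some <b>bold</b> text</p></body>")
text = html_to_text(sample)
print(text)
```

For real crawls the XPath or w3lib versions above are probably the better
fit, since they work directly on what the selector returns.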
Hope this helps
/Paul.
On Monday, March 3, 2014 4:47:31 PM UTC+1, [email protected] wrote:
>
> This is my Scrapy configuration.
>
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import Selector
>
> from play.items import PlayItem
>
> class PlaySpider(CrawlSpider):
>     name = 'play'
>     allowed_domains = ['lo.lesko.pl']
>     start_urls = ['http://www.lo.lesko.pl/']
>     rules = [Rule(SgmlLinkExtractor(allow=[]), follow=True,
>                   callback='parse_play')]
>
>     def parse_play(self, response):
>         sel = Selector(response)
>         play = PlayItem()
>         play['url'] = response.url[0].strip()
>         # play['title'] = sel.xpath("//title/text()").extract()
>         play['body'] = sel.select("//body").extract()[0].strip()
>         return play
>
>
> I use the strip function because I would like to get the text without
> HTML tags, but I must be doing something wrong: there are still HTML tags
> in my XML file.
>
>