If you want to get all categories in "tag", remove the "take-first"
predicate [1] so the XPath matches every <li>, and use extract() rather
than taking only the first result. If you want to drop all markup between
two (comment) tags, you are better off doing that in Python than in XPath.
Also, CrawlSpider was moved out of "contrib" in Scrapy; same with the link
extractors.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://sample.com']
    rules = (
        # No callback here, so pagination links are followed by default.
        Rule(LinkExtractor(allow=r'page/\d+')),
        Rule(LinkExtractor(allow=r'\w+'), callback='parse_blogpost'),
    )

    def parse_blogpost(self, response):
        item = IsBullshitItem()
        item['title'] = response.xpath(
            '//h2[@class="post-title entry-title"]/text()').extract_first()
        # No [1] predicate: collect every category and join them.
        item['tag'] = ','.join(response.xpath(
            '//ul[@class="post-categories"]/li/a/text()').extract())
        item['article_html'] = response.xpath(
            "//div[@class='entry clearfix']").extract_first()
        return item

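
For the "do it with Python" part, here is a minimal sketch. It assumes the two markers are literally <!--start comment--> and <!--end comment--> as in your sample HTML. The helper names strip_between_comments and join_categories are made up for illustration and are not part of Scrapy. A regex is used only for brevity; an HTML parser would be more robust on messy pages.

```python
import re

# Marker strings copied from the sample HTML in the question (an assumption:
# real pages may use different comment text).
START = "<!--start comment-->"
END = "<!--end comment-->"


def strip_between_comments(html, start=START, end=END):
    """Drop everything between the two markers, keeping the markers."""
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    # A function is used as the replacement so the markers are inserted
    # literally, without any backslash processing.
    return pattern.sub(lambda match: start + "\n" + end, html)


def join_categories(categories):
    """Join extracted category strings with commas, skipping blank ones."""
    return ",".join(c.strip() for c in categories if c.strip())
```

In the spider you could then pass the extracted article HTML through strip_between_comments before storing it in item['article_html'] (guarding against extract_first() returning None), and pass the list from extract() to join_categories for item['tag'].
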
On Wednesday, December 9, 2015 at 3:24:18 AM UTC-7, VR Tech wrote:
>
> Below is a sample piece of HTML code that I want to scrape with scrapy.
>
>
> <body><h2 class="post-title entry-title">Sample Header</h2>
> <div class="entry clearfix">
> <div class="sample1">
> <p>Hello</p>
> </div>
> <!--start comment-->
> <div class="sample2">
> <p>World</p>
> </div>
> <!--end comment-->
> </div><ul class="post-categories"><li><a
> href="123.html">Category1</a></li><li><a
> href="456.html">Category2</a></li><li><a
> href="789.html">Category3</a></li></ul></body>
>
>
>
> Right now I am using the below working scrapy code:
>
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import HtmlXPathSelector
> from isbullshit.items import IsBullshitItem
>
> class IsBullshitSpider(CrawlSpider):
>     name = 'isbullshit'
>     start_urls = ['http://sample.com']
>     rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
>              Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
>
>     def parse_blogpost(self, response):
>         hxs = HtmlXPathSelector(response)
>         item = IsBullshitItem()
>         item['title'] = hxs.select(
>             '//h2[@class="post-title entry-title"]/text()').extract()[0]
>         item['tag'] = hxs.select(
>             '//ul[@class="post-categories"]/li[1]/a/text()').extract()[0]
>         item['article_html'] = hxs.select(
>             "//div[@class='entry clearfix']").extract()[0]
>         return item
>
>
>
> It gives me the following xml output:
>
>
> <?xml version="1.0" encoding="utf-8"?><items>
> <item>
>
> <article_html>
> <div class="entry clearfix">
> <div class="sample1">
> <p>Hello</p>
> </div>
> <!--start comment-->
> <div class="sample2">
> <p>World</p>
> </div>
> <!--end comment-->
> </div>
> </article_html>
>
> <tag>
> Category1
> </tag>
>
> <title>
> Sample Header
> </title>
>
> </item></items>
>
>
>
> I want to know how to achieve the following output:
>
>
> <?xml version="1.0" encoding="utf-8"?><items>
> <item>
>
> <article_html>
> <div class="entry clearfix">
> <div class="sample1">
> <p>Hello</p>
> </div>
> <!--start comment-->
> <!--end comment-->
> </div>
> </article_html>
>
> <tag>
> Category1,Category2,Category3
> </tag>
>
> <title>
> Sample Header
> </title>
>
> </item></items>
>
>
> Note: The number of categories depends on the post. In the above example,
> there are 3 categories. There could be more or less.
>
> Help would be much appreciated. Cheers.
>