Really, thanks!
But I think I have another problem.
My start_urls also need to be rendered. In which function should I do that?
start_requests? Like this?
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })
What callback method should I put in the request?
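
Would something like this be right? (Just a sketch of my guess: if no
callback is given, Scrapy falls back to the spider's parse method, and on
a CrawlSpider that built-in parse is what applies the rules, so I would
name it explicitly:)

def start_requests(self):
    for url in self.start_urls:
        # CrawlSpider's built-in parse runs the rules against the
        # (rendered) response, so rule callbacks like parse_link
        # should still fire as usual.
        yield scrapy.Request(url, callback=self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })
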
On Monday, November 2, 2015 at 8:34:13 PM UTC+8, Paul Tremberth wrote:
>
> Hello,
>
> You probably want to use Splash for the Requests that CrawlSpider
> generates from its rules.
> See the `process_request` argument when defining CrawlSpider Rules:
> http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
>
> Something like this:
>
> rules = [
>     Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
>          follow=False,
>          process_request="use_splash"),
>
>     Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')),
>          callback="parse_link",
>          process_request="use_splash"),
> ]
>
> def use_splash(self, request):
>     request.meta['splash'] = {
>         'endpoint': 'render.html',
>         'args': {
>             'wait': 0.5,
>         }
>     }
>     return request
> ...
>
>
> See
> https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/crawl.py#L64
> for the implementation details
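>
> Roughly, for each rule CrawlSpider builds follow-up requests like this
> (a simplified sketch, not the exact source; link dedup and the
> process_links hook are left out):
>
> def _requests_to_follow(self, response):
>     for rule in self._rules:
>         for link in rule.link_extractor.extract_links(response):
>             r = Request(url=link.url, callback=self._response_downloaded)
>             # the Rule's process_request hook sees every request last;
>             # that is where use_splash attaches the splash meta
>             yield rule.process_request(r)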
>
>
> Also note that SgmlLinkExtractor is no longer the recommended link
> extractor:
>
> http://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors
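>
> For example, the first rule above would look like this with the
> recommended LinkExtractor (a sketch; the allow pattern is carried over
> unchanged):
>
> from scrapy.linkextractors import LinkExtractor
> from scrapy.spiders import Rule
>
> rules = [
>     Rule(LinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
>          follow=False,
>          process_request="use_splash"),
> ]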
>
> Hope this helps.
>
> Paul.
>
> On Monday, November 2, 2015 at 12:00:13 PM UTC+1, Raymond Guo wrote:
>>
>> Hi:
>> Sorry, I'm not really familiar with scrapy, but I have to use
>> scrapyJs to get rendered content.
>> I noticed that you have a scrapy Spider example, but I want to use
>> CrawlSpider. So I wrote this:
>>
>>
>> class JhsSpider(CrawlSpider):
>>     name = "jhsspy"
>>     allowed_domains = ["taobao.com"]
>>     start_urls = ["https://ju.taobao.com/"]
>>     rules = [
>>         Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
>>              follow=False),
>>
>>         Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')),
>>              callback="parse_link"),
>>     ]
>>
>>     def parse_link(self, response):
>>         le = SgmlLinkExtractor()
>>         for link in le.extract_links(response):
>>             yield scrapy.Request(link.url, callback=self.parse_item, meta={
>>                 'splash': {
>>                     'endpoint': 'render.html',
>>                     'args': {
>>                         'wait': 0.5,
>>                     }
>>                 }
>>             })
>>
>>     def parse_item(self, response):
>>         ...  # get items from the response
>>
>>
>>
>> But I ran into some problems and I'm not sure what caused them. So I
>> want to know: is this the right way to yield requests, like I did
>> above?
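>>
>> (For reference, here is roughly how I enabled the scrapyjs middleware
>> in settings.py, following its README; the Splash URL below assumes a
>> local instance, not my actual setup:)
>>
>> SPLASH_URL = 'http://localhost:8050'  # assumes Splash runs locally
>> DOWNLOADER_MIDDLEWARES = {
>>     'scrapyjs.SplashMiddleware': 725,
>> }
>> DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'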
>>
>