Really, thanks!
But I think I have another problem.
My start_urls also need to be rendered. In which function should I do that?
start_requests? Like this?
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })
What callback method should I put in the request?
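
Would something like this be right? (Just a sketch of my guess: if no
callback is given, Scrapy falls back to the spider's parse method, and on
a CrawlSpider that built-in parse is what applies the rules, so I would
name it explicitly:)

def start_requests(self):
    for url in self.start_urls:
        # CrawlSpider's built-in parse runs the rules against the
        # (rendered) response, so rule callbacks like parse_link
        # should still fire as usual.
        yield scrapy.Request(url, callback=self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })
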
On Monday, November 2, 2015 at 8:34:13 PM UTC+8, Paul Tremberth wrote:
>
> Hello,
>
> You probably want to use Splash for the Requests that CrawlSpider
> generates from its rules.
> See the `process_request` argument when defining CrawlSpider Rules:
> http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
>
> Something like this:
>
> rules = [
>     Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
>          follow=False,
>          process_request="use_splash"),
>
>     Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')),
>          callback="parse_link",
>          process_request="use_splash"),
> ]
>
> def use_splash(self, request):
>     request.meta['splash'] = {
>         'endpoint': 'render.html',
>         'args': {
>             'wait': 0.5,
>         }
>     }
>     return request
> ...
>
>
> See
> https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/crawl.py#L64
> for the implementation details
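>
> Roughly, for each rule CrawlSpider builds follow-up requests like this
> (a simplified sketch, not the exact source; link dedup and the
> process_links hook are left out):
>
> def _requests_to_follow(self, response):
>     for rule in self._rules:
>         for link in rule.link_extractor.extract_links(response):
>             r = Request(url=link.url, callback=self._response_downloaded)
>             # the Rule's process_request hook sees every request last;
>             # that is where use_splash attaches the splash meta
>             yield rule.process_request(r)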
>
>
> Also note that SgmlLinkExtractor is no longer the recommended link
> extractor:
>
> http://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors
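>
> For example, the first rule above would look like this with the
> recommended LinkExtractor (a sketch; the allow pattern is carried over
> unchanged):
>
> from scrapy.linkextractors import LinkExtractor
> from scrapy.spiders import Rule
>
> rules = [
>     Rule(LinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
>          follow=False,
>          process_request="use_splash"),
> ]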
>
> Hope this helps.
>
> Paul.
>
> On Monday, November 2, 2015 at 12:00:13 PM UTC+1, Raymond Guo wrote:
>>
>> Hi:
>> Sorry, I'm not really familiar with scrapy, but I have to use
>> scrapyJs to get rendered content.
>> I noticed that you have a scrapy Spider example, but I want to use
>> CrawlSpider. So I wrote this:
>>
>>
>> class JhsSpider(CrawlSpider):
>>     name = "jhsspy"
>>     allowed_domains = ["taobao.com"]
>>     start_urls = ["https://ju.taobao.com/"]
>>     rules = [
>>         Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
>>              follow=False),
>>
>>         Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')),
>>              callback="parse_link"),
>>     ]
>>
>>     def parse_link(self, response):
>>         le = SgmlLinkExtractor()
>>         for link in le.extract_links(response):
>>             yield scrapy.Request(link.url, callback=self.parse_item, meta={
>>                 'splash': {
>>                     'endpoint': 'render.html',
>>                     'args': {
>>                         'wait': 0.5,
>>                     }
>>                 }
>>             })
>>
>>     def parse_item(self, response):
>>         ...  # get items from the response
>>
>>
>>
>> But I ran into some problems and I'm not sure what caused them. So I
>> want to know: is this the right way to yield requests, like I did
>> above?
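>>
>> (For reference, here is roughly how I enabled the scrapyjs middleware
>> in settings.py, following its README; the Splash URL below assumes a
>> local instance, not my actual setup:)
>>
>> SPLASH_URL = 'http://localhost:8050'  # assumes Splash runs locally
>> DOWNLOADER_MIDDLEWARES = {
>>     'scrapyjs.SplashMiddleware': 725,
>> }
>> DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'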
>>
>