Hi,
I'm new to Scrapy.
Recently I have been trying to use ScrapyJS to get items rendered by
JavaScript.
I'm not really familiar with Scrapy or ScrapyJS.
I found that my spider crawls very slowly and returns duplicate results.
This is how I wrote my spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from jhsspy.items import JhsspyItem


class JhsSpider(CrawlSpider):
    name = "jhsspy"
    # tmall.com is listed as well because the second rule targets detail.tmall.com
    allowed_domains = ["taobao.com", "tmall.com"]
    # start_urls = ["http://sale.yohobuy.com/?specialsale_id=7&gender=1,3"]
    start_urls = ["https://ju.taobao.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'https://detail.ju.taobao.com/.*')),
             follow=False),
        Rule(SgmlLinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')),
             callback="parse_link"),
    ]

    def parse_link(self, response):
        # Extract every link on the page and render each one through Splash
        le = SgmlLinkExtractor()
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, self.parse_item, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {
                        'wait': 0.5,
                    }
                }
            })

    def parse_item(self, response):
        sel = Selector(response)
        item = JhsspyItem()
        ...get items ......
I want to know whether yielding requests from the rule's callback is the right
way to do this.
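
For the duplicate results, here is a minimal sketch of one possible way to drop repeated links inside the callback before yielding requests. It assumes the duplicates come from the same URL being extracted more than once from a page (possibly differing only by a #fragment), which may not be the actual cause. The helper name unique_urls and the sample URLs are hypothetical, not part of Scrapy or ScrapyJS, and the import assumes Python 3.

```python
from urllib.parse import urldefrag


def unique_urls(urls):
    """Yield each URL once, in first-seen order, ignoring #fragments."""
    seen = set()
    for url in urls:
        # Strip the fragment so "page#a" and "page#b" count as the same page.
        key = urldefrag(url).url
        if key not in seen:
            seen.add(key)
            yield key


# Hypothetical sample links, standing in for what a link extractor might return.
links = [
    "https://detail.tmall.com/item.htm?id=1",
    "https://detail.tmall.com/item.htm?id=1#detail",
    "https://detail.tmall.com/item.htm?id=2",
]
print(list(unique_urls(links)))
```

In parse_link, the loop could then iterate over unique_urls(link.url for link in le.extract_links(response)) and yield one request per URL. This only removes repeats within a single response; it keeps no state across pages.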