Hey guys, I have a small problem when trying to crawl 10+ pages. Here's the
code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ItemspiderSpider(CrawlSpider):
    name = "itemspider"
    allowed_domains = ["openstacksummitnovember2014paris.sched.org"]
    start_urls = [
        'http://openstacksummitnovember2014paris.sched.org/directory/attendees/',
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/directory/attendees/\d+'),
             callback='parse', follow=True),
    )
The problem is that when I run this code I only get results from the first
page, not the others. I tried modifying start_urls to list the pages
explicitly, like this, and it worked fine:
start_urls = [
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/1',
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/2',
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/3',
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/4',
    # etc.
]
I'm guessing I messed up the allow part; my regex is probably not right.
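As a sanity check on the regex, here is a quick standalone test with Python's re module. The sample URLs are just the ones from the start_urls experiment above, so this is only a guess at what the real pagination links look like:

```python
import re

# The pattern passed to the Rule's allow= argument.
pattern = re.compile(r'/directory/attendees/\d+')

urls = [
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/',
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/2',
    'http://openstacksummitnovember2014paris.sched.org/directory/attendees/10',
]

# The paginated URLs (trailing number) match; the bare directory page does not.
matches = [bool(pattern.search(u)) for u in urls]
print(matches)  # -> [False, True, True]
```

So if the pagination links really do look like the numbered URLs above, the pattern itself should be fine. One other thing worth checking: the Scrapy docs warn against using 'parse' as a Rule callback, because CrawlSpider implements its own parse method to drive the rules. Renaming the callback (e.g. to 'parse_item') and defining that method on the spider may be what actually fixes the pagination.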
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.