Thanks Paul for your help, I got it to work using this:

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/directory/attendees/\d+',)),
             callback='parse_page', follow=True),
    )
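
For anyone who finds this later, here is a minimal sketch of the full spider that rule sits in, with the callback renamed to parse_page() as Paul suggested. The imports match the Scrapy 0.24-era module paths; AttendeeItem and the XPath inside parse_page() are placeholders I made up for the sketch, not the real extraction from the pastebin script quoted below.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.item import Item, Field


    class AttendeeItem(Item):
        # placeholder field; the real script defines its own item
        name = Field()


    class ItemspiderSpider(CrawlSpider):
        name = "itemspider"
        allowed_domains = ["openstacksummitnovember2014paris.sched.org"]
        start_urls = [
            'http://openstacksummitnovember2014paris.sched.org/directory/attendees/',
        ]

        rules = (
            # Follow pagination links like /directory/attendees/2 and hand each
            # matched page to parse_page. The callback must not be named parse():
            # CrawlSpider uses parse() internally to apply the rules, so
            # overriding it silently disables link following.
            Rule(SgmlLinkExtractor(allow=(r'/directory/attendees/\d+',)),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Placeholder extraction with an assumed XPath -- swap in the real
            # selectors from the original script.
            for sel in response.xpath('//div[@class="sched-person"]'):
                item = AttendeeItem()
                item['name'] = sel.xpath('.//h2/a/text()').extract()
                yield item

The only functional change from my first attempt is the callback name; the allow regex was fine all along.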

On Monday, 2 March 2015 at 23:07:30 UTC+1, Paul Tremberth wrote:
>
> That's weird,
> I get nearly 2000 items running your spider (with a custom ItemloadItem):
>
> https://gist.github.com/redapple/aa274c729ee912de46ce
>
>
> On Saturday, February 28, 2015 at 8:10:36 PM UTC+1, JEBI93 wrote:
>>
>> Here's the full script: http://pastebin.com/13eNky9W. After I change from 
>> parse to parse_page, I don't get anything scraped.
>>
>> On Saturday, 28 February 2015 at 16:51:10 UTC+1, Paul Tremberth wrote:
>>>
>>> Hi,
>>>
>>> CrawlSpider and a custom parse() method do not play well together. See 
>>> the warning a bit below 
>>> http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
>>> It's easy to miss.
>>>
>>> Try renaming your parse() method to something like parse_page(), and 
>>> reference this new callback name in your rule.
>>>  On 28 February 2015 at 16:17, "JEBI93" <[email protected]> wrote:
>>>
>>>> Hey guys, I have a small problem when trying to crawl 10+ pages. Here's 
>>>> the code:
>>>>
>>>> class ItemspiderSpider(CrawlSpider):
>>>>     name = "itemspider"
>>>>     allowed_domains = ["openstacksummitnovember2014paris.sched.org"]
>>>>     start_urls = ['http://openstacksummitnovember2014paris.sched.org/directory/attendees/']
>>>>     
>>>>     rules = (
>>>>         Rule(SgmlLinkExtractor(allow=r'/directory/attendees/\d+'), 
>>>> callback='parse', follow=True),
>>>>     )    
>>>>
>>>> The problem is that when I run this code I only get the results of the 
>>>> first page, not the others. I tried to modify start_urls to something 
>>>> like this and it worked fine:
>>>>
>>>> start_urls = [
>>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/1',
>>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/2',
>>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/3',
>>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/4',
>>>>     # etc.
>>>> ]
>>>>
>>>> I'm guessing I messed up the allow part; probably my regex isn't proper.
>>>>
>>>
