Dear Dimitris, dear all,

I am working with Massimo (the OP) on this scraping project and would like to elaborate a bit on the problem description. What we want to do is write a spider that mirrors an entire phpBB board and collects information from some of its pages. To do so we use a set of rules: some identify the pages we want to extract information from, and one matches all remaining pages, so that they can also be fetched and stored locally. The rule set is the following:
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"forumtitle")]'), callback='parse_forum', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"topictitle")]'), callback='parse_topic', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"memberlist")]'), callback='parse_standard', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"mode=viewprofile")]'), callback='parse_members', follow=True),
    Rule(LinkExtractor(), callback='parse_standard', follow=True)

The first four rules match the links pointing to pages containing information we want to scrape, while the last one is meant to catch every remaining link. Our understanding (from the Scrapy documentation) is that when a link matches more than one rule, the first matching rule is used; so the pages containing information should be processed according to the first four rules, while all other pages (not matching the restrict_xpaths clauses) should be processed according to the last one.

The pages targeted by the first four rules require authentication, so we log in with the technique Massimo reported, expecting Scrapy to follow the rules only after the authentication step has completed. What we observe instead is that this is not the case: the phpBB board replies that authorization is required before it will serve those pages.

Strangely, if we remove the last rule and keep only the first four, namely:

    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"forumtitle")]'), callback='parse_forum', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"topictitle")]'), callback='parse_topic', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"memberlist")]'), callback='parse_standard', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"mode=viewprofile")]'), callback='parse_members', follow=True)

everything works as expected, i.e. we do get the pages that are restricted to authenticated users.

We suspect that in the first case (with all FIVE rules) Scrapy starts following links AFTER the FormRequest.from_response has been sent to the server, but BEFORE the corresponding response (carrying the session cookie, or whatever other authentication info) has been received and processed by Scrapy. Could this be the case? If so, how can we make rule matching start only after the authentication response has been received and processed? Otherwise, what else could be going wrong?
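If the ordering of the login and the rule-driven requests is really the problem, one thing we thought of trying (not verified yet) is to keep the rules but start the crawl only from the callback that handles the authenticated response, along these lines. This is just a simplified sketch: the class name, domain, URLs, credentials, form field names and the "Logout" success check are placeholders, not our real ones.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class BoardSpider(CrawlSpider):
        name = 'board'
        allowed_domains = ['example.com']                       # placeholder
        login_url = 'http://example.com/ucp.php?mode=login'     # placeholder
        start_urls = ['http://example.com/index.php']           # placeholder

        rules = (
            Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"forumtitle")]'),
                 callback='parse_forum', follow=True),
            # ... the other three restricted rules go here ...
            Rule(LinkExtractor(), callback='parse_standard', follow=True),
        )

        def start_requests(self):
            # Do not start from start_urls directly: fetch the login page first.
            yield scrapy.Request(self.login_url, callback=self.login)

        def login(self, response):
            # Placeholder form field names; the real phpBB form has its own.
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'user', 'password': 'secret'},
                callback=self.after_login)

        def after_login(self, response):
            # Crude placeholder check for a successful login.
            if b'Logout' not in response.body:
                self.logger.error('Login failed')
                return
            # Only now hand the start URLs over to the CrawlSpider machinery:
            # a Request with no explicit callback is handled by CrawlSpider.parse,
            # which applies the rules.
            for url in self.start_urls:
                yield scrapy.Request(url)

        def parse_forum(self, response):
            # extraction code for forum pages (omitted in this sketch)
            pass

        def parse_standard(self, response):
            # store the page locally (omitted in this sketch)
            pass

Would this be the right way to make sure the rules only kick in once the session cookie is in place, or is there a more idiomatic mechanism for it?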
Thank you very much in advance for any help you can provide.

Cosimo