1. Hack the priority of the requests as you yield them.
http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
I already developed a spider that yields an item for a page whose response is
not successful (status != 200):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DomainSpider(CrawlSpider):  # renamed so it does not shadow CrawlSpider
    name = 'domain'

    start_urls = []
    with open('list_of_sites.txt', 'r') as file:
        for url in file:
            start_urls.append(url.strip())

    rules = (Rule(LinkExtractor(), follow=True, callback='parse_page'),)

    # non-200 responses only reach this callback if allowed, e.g. via handle_httpstatus_list
    def parse_page(self, response):
        if response.status != 200:
            yield {
                'Url': response.url,
                'Status Code': response.status}
Should I rewrite it altogether, or is there a way to use a priority key
somewhere in the rules variable?
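
If it helps, here is a rough, untested sketch of what I mean by a priority key:
raising the priority of same-site requests via the Rule's process_request hook.
All the names here are made up, and note that newer Scrapy versions pass both
the request and the response to process_request, while older ones pass only the
request (hence the default argument):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def boost_internal(request, response=None):
    # Hypothetical helper: schedule same-site links before external ones.
    # Scrapy 2.0+ calls process_request with (request, response);
    # older versions pass only the request.
    if request.url.startswith('http://example.com'):
        return request.replace(priority=10)  # higher priority = scheduled earlier
    return request

class PrioritizedSpider(CrawlSpider):  # hypothetical name
    name = 'domain_prioritized'
    start_urls = ['http://example.com']
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_page',
             process_request=boost_internal),
    )

    def parse_page(self, response):
        if response.status != 200:
            yield {'Url': response.url, 'Status Code': response.status}
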
On Wednesday, March 9, 2016 at 11:17:55 PM UTC+1, Dimitris Kouzis - Loukas
wrote:
> Many solutions here. Here are just two of many possible approaches:
>
> 1. Two phases: a) crawl the site limiting URLs only to that site. After
> you're done, aggregate all the external links and do another crawling job
> 2. Hack the priority of the requests as you yield them.
> http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
>
> On Monday, March 7, 2016 at 10:37:55 AM UTC, Mario wrote:
>>
>> Hi guys, I wrote a simple domain web scraper. Is it possible to first crawl
>> the initial site entirely and then move on to the next domain/domains?
>>
>
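
And for reference, my rough understanding of option 1 (the two-phase idea) is
something like the untested sketch below: a first spider stays on the starting
site and dumps every external link to a file, and a second run then reads that
file into start_urls, much like my list_of_sites.txt spider above. Again, all
names here are made up:

from urllib.parse import urlparse
import scrapy

class PhaseOneSpider(scrapy.Spider):  # hypothetical phase-1 spider
    name = 'phase_one'
    allowed_domains = ['example.com']  # keeps the crawl on the first site
    start_urls = ['http://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            if urlparse(url).netloc.endswith('example.com'):
                yield scrapy.Request(url, callback=self.parse)
            else:
                # external link: save it for the second crawling job
                with open('external_links.txt', 'a') as f:
                    f.write(url + '\n')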