1. Hack the priority of the requests as you yield them.
http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
I already developed a spider that yields an item for a page whose response is
not successful (status != 200):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DomainSpider(CrawlSpider):  # renamed so it does not shadow CrawlSpider
    name = 'domain'

    start_urls = []
    with open('list_of_sites.txt', 'r') as file:
        for url in file:
            start_urls.append(url.strip())

    rules = (Rule(LinkExtractor(), follow=True, callback='parse_page'),)

    # non-200 responses only reach this callback if allowed, e.g. via handle_httpstatus_list
    def parse_page(self, response):
        if response.status != 200:
            yield {
                'Url': response.url,
                'Status Code': response.status}
Should I rewrite it altogether, or is there a way to use a priority key
somewhere in the rules variable?
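
If it helps, here is a rough, untested sketch of what I mean by a priority key:
raising the priority of same-site requests via the Rule's process_request hook.
All the names here are made up, and note that newer Scrapy versions pass both
the request and the response to process_request, while older ones pass only the
request (hence the default argument):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def boost_internal(request, response=None):
    # Hypothetical helper: schedule same-site links before external ones.
    # Scrapy 2.0+ calls process_request with (request, response);
    # older versions pass only the request.
    if request.url.startswith('http://example.com'):
        return request.replace(priority=10)  # higher priority = scheduled earlier
    return request

class PrioritizedSpider(CrawlSpider):  # hypothetical name
    name = 'domain_prioritized'
    start_urls = ['http://example.com']
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_page',
             process_request=boost_internal),
    )

    def parse_page(self, response):
        if response.status != 200:
            yield {'Url': response.url, 'Status Code': response.status}
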
On Wednesday, March 9, 2016 at 11:17:55 PM UTC+1, Dimitris Kouzis - Loukas
wrote:
> Many solutions here. Here are just two of many possible approaches:
>
> 1. Two phases: a) crawl the site limiting URLs only to that site. After
> you're done, aggregate all the external links and do another crawling job
> 2. Hack the priority of the requests as you yield them.
> http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
>
> On Monday, March 7, 2016 at 10:37:55 AM UTC, Mario wrote:
>>
>> Hi guys, I wrote a simple domain web scraper. Is it possible to first crawl
>> the initial site entirely and then move on to the next domain/domains?
>>
>
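
And for reference, my rough understanding of option 1 (the two-phase idea) is
something like the untested sketch below: a first spider stays on the starting
site and dumps every external link to a file, and a second run then reads that
file into start_urls, much like my list_of_sites.txt spider above. Again, all
names here are made up:

from urllib.parse import urlparse
import scrapy

class PhaseOneSpider(scrapy.Spider):  # hypothetical phase-1 spider
    name = 'phase_one'
    allowed_domains = ['example.com']  # keeps the crawl on the first site
    start_urls = ['http://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            if urlparse(url).netloc.endswith('example.com'):
                yield scrapy.Request(url, callback=self.parse)
            else:
                # external link: save it for the second crawling job
                with open('external_links.txt', 'a') as f:
                    f.write(url + '\n')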