Re: Why engine fetch requests from scheduler first other than the start_urls generated ones?

Jianhao Chen Mon, 04 Apr 2016 22:38:01 -0700

Yes. However, the Scrapy engine gets requests from the scheduler first, not 
from start_urls.


On Saturday, April 2, 2016 at 7:59:53 PM UTC+8, Dimitris Kouzis - Loukas 
wrote:
>
> Are you asking about 
> http://doc.scrapy.org/en/latest/topics/broad-crawls.html ? That is, finishing 
> all the start_urls before going wide?
>
> On Wednesday, March 30, 2016 at 10:52:13 AM UTC+1, Jianhao Chen wrote:
>>
>> From this code 
>> <https://github.com/scrapy/scrapy/blob/master/scrapy/core/engine.py#L121> I 
>> found that the Scrapy engine fetches requests from the scheduler before the 
>> ones generated from start_urls.
>>
>>
>> In my case, I enqueued thousands of start URLs (from various domains) to 
>> the queue, and the crawl is not very fast (possibly because of network 
>> issues). The problem I ran into is that once the spider extracts links and 
>> follows them, Scrapy fetches those requests from the scheduler first. 
>> This lowers the concurrency.
>>
>>
>> I would like to learn about the design purpose of this mechanism.
>> BRs.
>>
>
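
The ordering discussed in this thread can be illustrated with a toy model. 
This is a minimal sketch, not Scrapy's actual code: the names engine_tick, 
simulate, in_flight and max_concurrent are hypothetical, and it assumes the 
engine drains the scheduler until the downloader is full and only then pulls 
a single start request per pass.

```python
from collections import deque


def engine_tick(scheduler, start_requests, in_flight, max_concurrent):
    """One simplified pass of an engine loop (toy model, NOT Scrapy's code)."""
    # Step 1: take requests from the scheduler first, until the
    # (hypothetical) downloader has no free slots left.
    while len(in_flight) < max_concurrent and scheduler:
        in_flight.append(scheduler.popleft())

    # Step 2: only if there is still capacity, pull at most ONE start
    # request and hand it to the scheduler for a later pass.
    if len(in_flight) < max_concurrent:
        request = next(start_requests, None)
        if request is not None:
            scheduler.append(request)


def simulate():
    # Two followed links are already queued; three start URLs are pending.
    scheduler = deque(["followed-1", "followed-2"])
    start_requests = iter(["start-1", "start-2", "start-3"])
    in_flight = []
    engine_tick(scheduler, start_requests, in_flight, max_concurrent=4)
    return in_flight, list(scheduler)


if __name__ == "__main__":
    print(simulate())
```

In this model the followed links are dispatched before any start URL, and 
start URLs enter the queue one at a time, which matches the behaviour 
described above under the stated assumptions.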

