Thanks a lot @Tsouras :) indeed it might be useful here!

Here are some quick tips... 178 requests/second isn't bad at all. Yes, it 
might well be CPU. The GIL itself is not the problem, because Scrapy is 
single-threaded, BUT the fact that it's single-threaded might be the 
problem :) Your machine likely has other cores (e.g. 3 more) that you're not 
using. Try running 4 Scrapy processes in parallel, each with 1/4 of the 
URLs. Then you will use more cores. Does it finish faster?
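
One way to split the work, as a rough sketch: deal the URL list round-robin into one chunk per process and give each chunk to its own Scrapy process. The name `split_urls` and the example.com URLs are made up for illustration; how you pass a chunk to each spider (a file, a spider argument, ...) is up to you.

```python
def split_urls(urls, parts=4):
    """Deal urls round-robin into `parts` roughly equal chunks."""
    return [urls[i::parts] for i in range(parts)]


# Hypothetical input: 1600 pages split across 4 processes.
urls = ["http://example.com/page/%d" % i for i in range(1600)]
chunks = split_urls(urls, parts=4)

for number, chunk in enumerate(chunks):
    print("process", number, "gets", len(chunk), "URLs")
```

Round-robin keeps the chunks the same size even if the list is sorted by site, so no single process ends up with all the slow pages.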

Regarding AWS and bandwidth, that number doesn't say much. The latency of 
the remote servers tells you much more.

Measure the average latency of your target pages (e.g. by using `time curl` 
with a few URLs). If it's e.g. 0.2s, multiply it by 1600. That means that 
your job is "worth" 320 seconds of sequential requests.
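
If you'd rather measure this from Python than with `time curl`, something like the following might do. It's only a sketch using the standard library; `fetch`, `average_latency` and `serial_cost` are names I made up, and the 10-second timeout is an arbitrary assumption.

```python
import time
import urllib.request


def fetch(url):
    """Download one page and discard the body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def average_latency(urls, fetcher=fetch):
    """Average wall-clock seconds per request, fetching one URL at a time."""
    start = time.perf_counter()
    for url in urls:
        fetcher(url)
    return (time.perf_counter() - start) / len(urls)


def serial_cost(avg_latency, pages=1600):
    """Seconds the whole job would take with a single request in flight."""
    return avg_latency * pages
```

Call `serial_cost(average_latency(sample_urls))` with a handful of your real target URLs; an average of 0.2s gives the 320 seconds mentioned above.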

If you set:

CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_IP = 10

you would get the job done in 320/10 = 32 sec. Obviously you've already 
tuned those values, since you're getting a much higher throughput (or your 
avg. latency is smaller than 0.2s).

If you set:

CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_IP = 100

you should get your answer in more or less 3.2 seconds + some startup time.
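
The arithmetic above fits in a one-line helper. This is just the back-of-the-envelope model (pages × latency / concurrency); it ignores CPU, bandwidth and startup time, and `ideal_seconds` is a hypothetical name, not anything from Scrapy.

```python
def ideal_seconds(pages, avg_latency, concurrency):
    """Best-case crawl time: total latency spread over concurrent requests."""
    return pages * avg_latency / concurrency


# The two settings discussed above, assuming 1600 pages at 0.2s each.
for concurrency in (10, 100):
    print(concurrency, "concurrent requests:",
          ideal_seconds(1600, 0.2, concurrency), "seconds")
```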

Try intermediate values... where does reality start to diverge from the 
ideal?



On Saturday, March 5, 2016 at 9:06:53 AM UTC, Tsouras wrote:
>
> Maybe the book of Dimitrios Kouzis-Loukas 
> https://www.packtpub.com/big-data-and-business-intelligence/learning-scrapy 
> will help you. It has a chapter about performance.
>
>
> On Thursday, March 3, 2016 at 10:13:01 AM UTC+2, Berkant AYDIN wrote:
>>
>> Hi everyone,
>>
>> I have to do realtime scraping. I tried the optimization options in the 
>> documentation, but it is still slow: crawling 1600 pages takes 9 seconds. 
>> Yes, that's fast, but still not fast enough. It's an 860 mb/s AWS machine. 
>> How can I increase performance? Do I have to use distributed options? If 
>> yes, which one? Is it a GIL problem? Should I continue with PyPy?
>>
>>
