Thanks a lot @Tsouras :) Indeed, it might be useful here! Here are some quick tips... 178 requests/second isn't bad at all. Yes, it might well be CPU. Indeed, the GIL is not a problem because it's single-threaded, BUT the fact that it's single-threaded might be the problem :) Your machine likely has other cores (e.g. 3 more) that you're not using. Try running 4 Scrapy processes in parallel, each with 1/4 of the problem. Then you will use more cores. Does it finish faster?
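
To make the "4 Scrapy processes, 1/4 of the problem each" idea concrete, here is a rough sketch. It assumes you have your URLs in a list, that your spider is called `myspider`, and that it accepts a `urls_file` argument and reads its start URLs from that file. Both of those names are made up for illustration, so adapt them to your project.

```python
import subprocess
from pathlib import Path


def split_round_robin(urls, parts):
    """Deal the URLs into `parts` lists of roughly equal size."""
    return [urls[i::parts] for i in range(parts)]


def write_chunks(urls, parts, out_dir):
    """Write one URL file per process and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_round_robin(urls, parts)):
        path = out_dir / f"urls_{i}.txt"
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


def launch(paths, spider="myspider"):
    """Start one `scrapy crawl` per chunk and wait for all of them.

    `myspider` and the `urls_file` spider argument are hypothetical;
    the spider itself has to read that file in start_requests().
    """
    procs = [
        subprocess.Popen(["scrapy", "crawl", spider, "-a", f"urls_file={p}"])
        for p in paths
    ]
    return [proc.wait() for proc in procs]


# Typical use, from inside the Scrapy project directory:
#   paths = write_chunks(all_urls, 4, "chunks")
#   exit_codes = launch(paths)
```

Time the whole thing against your single-process run. If 4 processes finish clearly faster, you were CPU-bound on one core.
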
Regarding AWS and bandwidth, that number doesn't say much. The latency the remote servers give you says much more. Measure the average latency of your target pages (e.g. by using `time curl` with a few URLs). If it's e.g. 0.2s, multiply it by 1600. That means that your job is "worth" 320 seconds. If you set CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_IP = 10, you would get the job done in 320/10 = 32 sec. Obviously you've already hacked those values, since you're getting much higher throughput than that (or your avg. latency is smaller than 0.2s). If you set CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_IP = 100, you should get your answer in more or less 3.2 seconds + some startup time. Try intermediate values... where does reality start to diverge from the ideal?

On Saturday, March 5, 2016 at 9:06:53 AM UTC, Tsouras wrote:
> Maybe the book of Dimitrios Kouzis-Loukas
> https://www.packtpub.com/big-data-and-business-intelligence/learning-scrapy
> will help you. It has a chapter about performance.
>
> On Thursday, March 3, 2016 at 10:13:01 AM UTC+2, Berkant AYDIN wrote:
>> Hi everyone,
>>
>> I have to do realtime scraping. I try optimization options on
>> documentation but still slowly. 1600 page crawling only 9 seconds. Yea its
>> very speedy but still not enough. 860 mb/s AWS machine. How can increase
>> performance ? I have to use distributed options ? If yes, which one ? It's
>> a GIL problem ? I have to continue with PyPy ?
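
P.S. The back-of-the-envelope estimate from my reply above, as a small sketch. The 0.2s latency is only the example figure I used, so plug in whatever `time curl` gives you. The function names are mine, and the model ignores startup time, CPU and throttling, so treat the numbers as a best case.

```python
def ideal_crawl_seconds(pages, avg_latency_s, concurrency):
    """Best-case crawl time: total latency "worth" divided by concurrency."""
    return pages * avg_latency_s / concurrency


def implied_concurrency(pages, avg_latency_s, observed_seconds):
    """How many requests must have been in flight to hit the observed time."""
    return pages * avg_latency_s / observed_seconds


if __name__ == "__main__":
    pages = 1600
    latency = 0.2  # assumed example value, measure your own

    # Ideal times for a few settings of
    # CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_IP
    for concurrency in (10, 25, 50, 100):
        seconds = ideal_crawl_seconds(pages, latency, concurrency)
        print(f"concurrency={concurrency:3d}  ideal={seconds:.1f}s")

    # The 9 seconds you reported, read backwards
    print(f"implied concurrency: {implied_concurrency(pages, latency, 9):.1f}")
```

Compare these ideal times with what you actually measure at each setting. The point where the measured time stops following the ideal one is where something other than latency (most likely CPU) becomes the bottleneck.
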
