On Jan 5, 2013, at 2:11pm, sebb wrote:

> On 5 January 2013 21:33, vigna <[email protected]> wrote:
>>> But why would you want a web crawler to have 10-20K simultaneously
>>> opened connections in the first place?
>> 
>> (I thought I answered this, but it's not on the archive. Who knows.)
>> 
>> Having a few thousand connections open is the only way to retrieve data
>> while respecting politeness (e.g., not banging the same site too often).
> 
> Huh?
> There are surely other ways to achieve that goal.

For a beefy server, having a few thousand open connections (one per domain or 
IP address) is a standard solution for web crawling.
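
To make that concrete, here's a minimal sketch of how the connection side of
this might be set up with HttpClient 4.2 (the version and the numbers are my
assumptions, not tuned values): a pooling connection manager caps the total
pool in the thousands but allows only one connection per route, so no single
site gets hit with parallel requests.

    import org.apache.http.client.HttpClient;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.impl.conn.PoolingClientConnectionManager;

    public class PoliteClientFactory {
        // Illustrative numbers: thousands of connections in total, but
        // never more than one open connection to any single host/route.
        public static HttpClient create() {
            PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
            cm.setMaxTotal(5000);          // total connections across all hosts
            cm.setDefaultMaxPerRoute(1);   // at most one connection per route
            return new DefaultHttpClient(cm);
        }
    }

Note that a per-route cap of one only limits concurrency; the crawler still
has to schedule a delay between successive requests to the same host to
actually be polite.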

There may well be better solutions, but from personal experience even on a 
"small" server (e.g. Amazon m1.large, so roughly 4 wimpy cores) you can 
effectively use 500+ threads.

So on a large box (e.g. 24 more powerful cores) I could see upward of 10K 
threads being the optimal number.
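
As a rough sketch of what that looks like (the thread counts are illustrative,
not benchmarked numbers): because each fetcher thread spends nearly all of its
time blocked on network I/O, the pool can be sized far beyond the core count.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class FetcherPool {
        // Illustrative sizing: fetcher threads are almost always blocked on
        // network I/O, so 500 threads on a small box (or thousands on a big
        // one) is reasonable even with only a handful of cores.
        private static final int FETCHER_THREADS = 500;

        public static void fetchAll(List<String> urls) {
            ExecutorService pool = Executors.newFixedThreadPool(FETCHER_THREADS);
            for (final String url : urls) {
                pool.submit(new Runnable() {
                    public void run() {
                        fetchOne(url);
                    }
                });
            }
            pool.shutdown();
        }

        private static void fetchOne(String url) {
            // Hypothetical helper: issue one GET with the shared,
            // politeness-limited HttpClient and hand the result off for parsing.
        }
    }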

This assumes you've got a big pipe, and a pretty good DNS system/cache.

This also assumes that you're not trying to parse the downloaded files at the 
same time, as otherwise available CPU will be the limiting factor.
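
One way to keep fetching and parsing separate is a bounded hand-off queue: the
thousands of fetcher threads put raw responses on the queue, and a much
smaller, CPU-bound parser pool (roughly one thread per core) drains it. This
is just a sketch of that idea, not anything HttpClient provides.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class FetchParseHandoff {
        // Hypothetical container for a fetched page (URL plus raw bytes).
        public static class FetchedPage {
            final String url;
            final byte[] content;
            FetchedPage(String url, byte[] content) { this.url = url; this.content = content; }
        }

        // Bounded queue: fetchers block when parsers fall behind, so the
        // crawl throttles itself instead of exhausting memory.
        private static final BlockingQueue<FetchedPage> PARSE_QUEUE =
                new ArrayBlockingQueue<FetchedPage>(10000);

        // Fetcher side: hand the raw response off and go back to fetching.
        public static void handoff(String url, byte[] content) throws InterruptedException {
            PARSE_QUEUE.put(new FetchedPage(url, content));
        }

        // Parser side: called by a small pool of parser threads.
        public static FetchedPage nextToParse() throws InterruptedException {
            return PARSE_QUEUE.take();
        }
    }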

Just FYI, about two years ago we were using big servers with lots of threads 
during a large-scale web crawl, and we ran into some interesting bottlenecks 
in HttpClient 4.0.1 (?) with lots of simultaneous threads. I haven't had to 
revisit those issues with a recent release, so maybe they have been resolved.

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
