On Fri, Oct 02, 2009 at 11:19:14AM -0700, Nick Gerner wrote:
> I'm curious if anyone has any tips about performance of libcurl at scale. I
> have some pretty good crawling code that I'm always trying to tune. I'm
> running curl_multi with poll and between 500 and 1000 curl handles.

Just curious, but what led you to decide to use that many easy handles in
your multi-handle? I recently wrote an application that has to download
lots of different files. Instead of instituting a 1:1 mapping between URL
and easy-handle, I built a queue for the requests and configured the system
to use a fixed number of easy-handles in the multi-handle (about 20, I
think). Once a transaction is finished, the easy-handle gets reconfigured
to service the next request. I've had good performance with such a design.

> And more interestingly:
>
> 977123  38.6471  url.c:0                    ConnectionExists
> 477781  18.8972  (no location information)  Curl_raw_equal
> 344057  13.6081  hostip.c:0                 hostcache_timestamp_remove
> 230962   9.1350  rawstr.c:0                 my_toupper
> 184067   7.2802  (no location information)  Curl_hash_clean_with_criterium
>  67392   2.6655  (no location information)  curl_multi_remove_handle
>  65826   2.6035  (no location information)  Curl_hash_pick
>  35846   1.4178  (no location information)  Curl_hash_add
>
> That ConnectionExists call seems to take a lot of time! Looking at the
> code, it looks like ConnectionExists should not get called if I set
> curl_easy_setopt(curl[i]->curl, CURLOPT_FRESH_CONNECT, (long)1);
>
> So I did that and got much better performance. But I still see basically
> the same oprofile report (basically 40% of my CPU time is in libcurl and 40%
> of libcurl's time is spent in ConnectionExists). So... any thoughts on:
>
> 1) why ConnectionExists takes so long? (I'm guessing it does an expensive
> traversal of a really big list of maybe 4k cached connections)

It looks like the code in ConnectionExists walks the entries in the
connection cache when it looks for a match. If it can't find a matching
connection, it looks like you'll end up making a linear scan of the entire
table. The connection cache is kept in the multi-handle when the multi
interface is used.

> 2) why I'm still getting all this time spent in ConnectionExists

Not sure.
Do you have a test program that demonstrates this behavior?

> 3) any other general perf tips (e.g. other curl_easy_setopt or
> curl_multi_setopt settings, or maybe compile time options)

You might try setting a different value for CURLMOPT_MAXCONNECTS, but this
would limit your ability to re-use cached connections. The default behavior
is to cache 10 connections, but increase the size of the cache by (n * 4),
where n is the number of easy handles in the multi-handle.
(http://curl.haxx.se/libcurl/c/curl_multi_setopt.html)

That said, connection caching provides a substantial performance benefit
if you expect your transactions to connect to the same host multiple times.

I don't know what problem you're trying to solve, so my advice is more
generic, and less useful, than it would be with more detail. However, if
you're interested in scaling your application up to multiple cpus/threads,
you might want to consider the following approaches.

1. Multiple threads, each with an easy-handle, where the work is pulled
   from a queue.
2. A queue with a multi-handle, where work is processed by a fixed number
   of easy-handles.
3. One or more queues, multiple threads, each thread with a multi-handle
   and a fixed number of easy-handles, where each thread schedules work
   from the queue and runs its multi-handle.

I'm not sure if any of these are ideal for your project, but they might be
a worthwhile starting point.

-j
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette:  http://curl.haxx.se/mail/etiquette.html
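
P.S. To put rough numbers on the connection cache sizing mentioned above,
here is a minimal sketch of the arithmetic. It assumes the default works
out to "at least 10 entries, growing to fit 4 * n", which is my reading of
the curl_multi_setopt page and not something I have checked against the
source. estimated_conn_cache_size() and print_estimates() are hypothetical
helpers for illustration, not libcurl functions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, NOT part of libcurl.
 * Assumption: the default connection cache holds 10 entries and is
 * enlarged to fit 4 * n, where n is the number of easy handles added
 * to the multi-handle.
 */
long estimated_conn_cache_size(long easy_handles)
{
    const long base = 10;   /* default cache size named in the thread */
    long grown;

    assert(easy_handles >= 0);
    grown = 4 * easy_handles;

    /* The cache never shrinks below the base size in this model. */
    return grown > base ? grown : base;
}

/* Print the estimate for the handle counts mentioned in this thread. */
void print_estimates(void)
{
    const long counts[] = { 20, 500, 1000 };
    size_t i;

    for (i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("%ld easy handles -> up to %ld cached connections\n",
               counts[i], estimated_conn_cache_size(counts[i]));
}
```

Under that assumption, 1000 easy handles allow up to 4000 cached
connections, which lines up with the "maybe 4k" guess, while a pool of 20
allows only 80. If a failed lookup really does scan the whole table, that
is a much shorter walk per request. The cap can also be set explicitly
with curl_multi_setopt(multi, CURLMOPT_MAXCONNECTS, 80L), at the cost of
fewer connections available for re-use.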
