On Fri, Oct 02, 2009 at 11:19:14AM -0700, Nick Gerner wrote:
> I'm curious if anyone has any tips about performance of libcurl at scale.  I
> have some pretty good crawling code that I'm always trying to tune.  I'm
> running curl_multi with poll and between 500 and 1000 curl handles.

Just curious, but what led you to decide to use that many easy handles
in your multi-handle?  I recently wrote an application that has to
download lots of different files.  Instead of instituting a 1:1 mapping
between URL and easy-handle, I built a queue for the requests and
configured the system to use a fixed number of easy-handles in the
multi-handle (about 20, I think).  Once a transaction is finished, the
easy-handle gets reconfigured to service the new request.  I've had good
performance with such a design.

> And more interestingly:
> 
> 977123   38.6471  url.c:0                     ConnectionExists
> 477781   18.8972  (no location information)   Curl_raw_equal
> 344057   13.6081  hostip.c:0                  hostcache_timestamp_remove
> 230962    9.1350  rawstr.c:0                  my_toupper
> 184067    7.2802  (no location information)   Curl_hash_clean_with_criterium
> 67392     2.6655  (no location information)   curl_multi_remove_handle
> 65826     2.6035  (no location information)   Curl_hash_pick
> 35846     1.4178  (no location information)   Curl_hash_add
> 
> That ConnectionExists call seems to take a lot of time!  Looking at the
> code, it looks like ConnectionExists should not get called if I set
> curl_easy_setopt(curl[i]->curl, CURLOPT_FRESH_CONNECT, (long)1);
> 
> So I did that and got much better performance.  But I still see basically
> the same oprofile report (basically 40% of my CPU time is in libcurl and 40%
> of libcurl's time is spent in ConnectionExists).  So... any thoughts on:
> 
> 1) why ConnectionExists takes so long? (I'm guessing it does an expensive
> traversal of a really big list of maybe 4k cached connections)

It looks like the code in ConnectionExists walks the entries in the
connection cache when it looks for a match.  If there is no matching
connection, each lookup ends up as a linear scan of the entire table.
The connection cache is kept in the multi-handle when the multi
interface is used.
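
To illustrate why a miss is the expensive case, here is a toy model of
a linear cache lookup.  It is not libcurl's code: struct toy_conn,
equal_nocase and find_conn are invented for this sketch, and matching
on a case-insensitive host name plus port is an assumption about what a
lookup roughly has to compare.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for one cached connection. */
struct toy_conn {
    const char *host;
    int port;
};

/* Case-insensitive string comparison, one character at a time. */
static int equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Walk the cache front to back.  Returns the index of the first match
   or -1, and reports how many entries were visited. */
long find_conn(const struct toy_conn *cache, size_t n,
               const char *host, int port, size_t *visited)
{
    size_t i;

    *visited = 0;
    for (i = 0; i < n; i++) {
        (*visited)++;
        if (cache[i].port == port && equal_nocase(cache[i].host, host))
            return (long)i;
    }
    return -1;         /* a miss always visits all n entries */
}
```

In this model a crawler that mostly contacts hosts it has not seen
before pays for a full pass over the cache on nearly every lookup.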

> 2) why I'm still getting all this time spent in ConnectionExists

Not sure.  Do you have a test program that demonstrates this behavior?

> 3) any other general perf tips (e.g. other curl_easy_setopt or
> curl_multi_setopt settings, or maybe compile time options)

You might try setting a different value for CURLMOPT_MAXCONNECTS, but
a smaller value would limit your ability to re-use cached connections.
The default cache holds 10 connections, and libcurl enlarges it as easy
handles are added so that it fits (n * 4) connections, where n is the
number of easy handles in the multi-handle.  With 1000 easy handles that
works out to about 4000 cache entries.
(http://curl.haxx.se/libcurl/c/curl_multi_setopt.html)

That said, connection caching provides a substantial performance
benefit, if you expect your transactions to connect to the same host
multiple times.

I don't know what problem you're trying to solve.  This means my advice
is more generic, and less useful, than it would be with more detail;
however, if you're interested in scaling your application up to multiple
cpus/threads, you might want to consider the following different
approaches.

1. Multiple threads, each with an easy-handle, where the work is pulled
from a queue.

2. A queue with a multi-handle, where work is processed by a fixed
number of easy-handles.

3. One or more queues, multiple threads, each thread with a multi-handle
and a fixed number of easy-handles, where each thread schedules work
from the queue and runs its multi-handle.

I'm not sure if any of these are ideal for your project, but they might
be worthwhile starting points.

-j
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette:  http://curl.haxx.se/mail/etiquette.html
