I'm curious if anyone has any tips about performance of libcurl at scale.  I
have some pretty good crawling code that I'm always trying to tune.  I'm
running curl_multi with poll and between 500 and 1000 curl handles.
Most recently I grabbed an oprofile snapshot while one core was pegged (the
actual crawling code runs in a single thread) and found something
interesting:
  samples        %  image name
  2528322  45.0216  libcurl.so.4.1.1 (is actually being called from two apps, but mostly it's the crawling app below)
  1128251  20.0907  libc-2.7.so
   760063  13.5344  no-vmlinux
   635400  11.3145  MY_PARSER_APP_HERE (runs in a separate process)
   161909   2.8831  MY_CRAWLING_DRIVING_APP_HERE (runs in the same process/thread as libcurl listed above)
   106982   1.9050  libz.so.1.2.3.3
    98056   1.7461  liblzo2.so.2.0.0 #the input for my crawl is lzo compressed
    51165   0.9111  libcares.so.2.0.0
    48928   0.8713  pdns_recursor #I'm using pdns recursor locally to do dns

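More interestingly, here is the per-symbol breakdown within libcurl.so
(percentages are of libcurl's samples, not of the total):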

977123   38.6471  url.c:0                     ConnectionExists
477781   18.8972  (no location information)   Curl_raw_equal
344057   13.6081  hostip.c:0                  hostcache_timestamp_remove
230962    9.1350  rawstr.c:0                  my_toupper
184067    7.2802  (no location information)   Curl_hash_clean_with_criterium
67392     2.6655  (no location information)   curl_multi_remove_handle
65826     2.6035  (no location information)   Curl_hash_pick
35846     1.4178  (no location information)   Curl_hash_add

That ConnectionExists call seems to take a lot of time!  From reading the
code, it looks like ConnectionExists should not get called if I set
curl_easy_setopt(curl[i]->curl, CURLOPT_FRESH_CONNECT, (long)1);

So I did that and got much better performance.  But the oprofile report
still looks essentially the same: roughly 40% of my CPU time is in libcurl,
and roughly 40% of libcurl's time is spent in ConnectionExists.  So... any
thoughts on:

1) why ConnectionExists takes so long? (I'm guessing it does an expensive
traversal of a really big list of maybe 4k cached connections)
2) why I'm still seeing all this time spent in ConnectionExists even with
CURLOPT_FRESH_CONNECT set?
3) any other general perf tips? (e.g. other curl_easy_setopt or
curl_multi_setopt settings, or maybe compile-time options)
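
To make the guess in question 1 concrete, here is a toy model of a linear
cache scan with a case-insensitive host comparison. This is not libcurl's
code: struct cached_conn, host_equal, and find_conn are invented names, and
the real ConnectionExists checks much more than host and port. It only
illustrates why a per-request walk over a large cache would show up next to
a string-compare routine and a toupper helper in a profile.

```c
#include <ctype.h>
#include <stddef.h>

/* Toy stand-in for a cached connection (hypothetical, not libcurl's struct). */
struct cached_conn {
    const char *host;
    int port;
};

/* Case-insensitive string compare that counts per-character comparisons
 * in *char_ops so the cost of a lookup can be observed. */
static int host_equal(const char *a, const char *b, unsigned long *char_ops)
{
    while (*a && *b) {
        ++*char_ops;
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Linear scan over the cache: returns the index of the first entry with a
 * matching port and host, or -1 if there is none. A miss visits every
 * entry, so the cost grows with the cache size. */
static long find_conn(const struct cached_conn *cache, size_t n,
                      const char *host, int port, unsigned long *char_ops)
{
    for (size_t i = 0; i < n; ++i) {
        if (cache[i].port == port &&
            host_equal(cache[i].host, host, char_ops))
            return (long)i;
    }
    return -1;
}
```

In a crawl where most requests go to hosts not seen recently, nearly every
lookup in such a model is a miss that touches the whole cache, so a cache
of several thousand entries would be scanned once per request.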

Some useful info:
$ curl-config --version
libcurl 7.19.3

I know, I should upgrade, but we had some stability issues with a slightly
newer version than this and rolling back fixed it.

$ curl-config --features
SSL
IPv6
libz
AsynchDNS
NTLM

Thanks a million!

--Nick
-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette:  http://curl.haxx.se/mail/etiquette.html