I've been fiddling with oprofile some more, and here's some interesting stuff. Comments/questions/etc are welcome.
Here's a series of traces from my PIII-633 Celeron test box. Faster boxes
have different-looking traces, but the main culprits are the same.

CPU: PIII, speed 634.478 MHz (estimated)
Counted IFU_MEM_STALL events (cycles instruction fetch pipe is stalled)
with a unit mask of 0x00 (No unit mask) count 7500

samples  %       image name     symbol name
270985   3.6905  libc-2.3.6.so  vfprintf
175671   2.3924  libc-2.3.6.so  re_search_internal
161464   2.1990  libc-2.3.6.so  _int_malloc
111337   1.5163  squid          comm_select

This is "where the instruction pipeline seems to stall the most", and I
believe that translates to "because it's waiting for data to come along
to satisfy what the instructions need."

Compare that to:

CPU: PIII, speed 634.478 MHz (estimated)
Counted L2_LD events (number of L2 data loads) with a unit mask of 0x0f
(All cache states) count 10000

samples  %        image name     symbol name
19592    25.1315  libc-2.3.6.so  _int_malloc
10603    13.6009  libc-2.3.6.so  memcpy
 3051     3.9136  libc-2.3.6.so  malloc_consolidate
 2201     2.8233  squid          memPoolAlloc
 1856     2.3808  squid          httpHeaderIdByName

Now, a large part of that _int_malloc is because I'm malloc'ing twice for
every new string, rather than once - but I'm also not malloc'ing at all
for strings that are cloned (from, say, HTTP headers), and quite a bit of
that is going on now in my non-disk tests (from mgr:strings):

Create: 50227639
Dup: 33111752
Copy-On-Modify: 0

Next: what about L2 data stores?

CPU: PIII, speed 634.478 MHz (estimated)
Counted L2_ST events (number of L2 data stores) with a unit mask of 0x0f
(All cache states) count 10000

samples  %        image name     symbol name
8890     29.3206  libc-2.3.6.so  memcpy
2342      7.7243  libc-2.3.6.so  _int_malloc
1183      3.9017  libc-2.3.6.so  vfprintf

Well, we still see the same culprits chewing CPU:

* vfprintf
* memcpy
* malloc

Now, memcpy comes up in two main places:

  908     6.6099  squid          memBufGrow
 1032     7.5126  squid          httpHeaderEntryPackInto
 4737    34.4835  squid          httpProcessReplyHeader
 6165    44.8788  squid          storeAppend
13737    13.6146  libc-2.3.6.so  memcpy

The big two users are in http.c - the reply header copying and the path
into the storage manager.

Next: malloc, which goes via xmalloc:

  634     3.2385  squid  storeSwapTLVAdd
 1252     6.3953  squid  storeSwapMetaPack
 2748    14.0369  squid  memReallocBuf
14214    72.6056  squid  stmemAppend
   35     0.0347  squid  xmalloc

Since there's plenty of stmem churn - especially when Squid is proxy-only,
as in my workload - I'm guessing this is the MEM_MEM_NODE allocation.

vfprintf and friends are a bit convoluted to find, but basically:

* lots of library functions end up there - like inet_ntoa()
* it's in the critical path - httpHeaderPutStrf() and urlCanonical() are
  two good examples

So, summarising:

* vfprintf, malloc and memcpy are three ideal candidates to try and
  eliminate where possible;
* but in a lot of cases (now!) there's no single glaringly inefficient
  routine - it's systemic throughout Squid;
* of course, eliminating things which show up with high percentages won't
  have -that- much of an impact (right now). For example, some of the L2
  loads via memcpy are priming the cache with data that's about to be used
  elsewhere "real soon now", so eliminating the memcpy will just shift the
  stalling elsewhere. It will reduce the amount of memory bandwidth being
  used, though!

Adrian

--
- Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
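To illustrate the "malloc'ing twice for every new string" point: the usual
two-allocation pattern is a string header plus a separate character buffer.
A hedged sketch of collapsing them into one allocation with a C99 flexible
array member follows - the type and function names here are made up for
illustration, not Squid's actual String implementation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical single-allocation string: the character data lives in the
 * same malloc'ed block as the header, so creating a string costs one
 * _int_malloc call instead of two.
 */
typedef struct {
    size_t len;
    char buf[];     /* C99 flexible array member; data follows the header */
} OneString;

static OneString *
oneStringCreate(const char *src, size_t len)
{
    /* one malloc covers header + data + NUL, instead of two mallocs */
    OneString *s = malloc(sizeof(OneString) + len + 1);
    if (!s)
        return NULL;
    s->len = len;
    memcpy(s->buf, src, len);
    s->buf[len] = '\0';
    return s;
}
```

A single free() then releases both header and data, which should also cut
the malloc_consolidate traffic showing up in the L2_LD trace above.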

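On the vfprintf side: for a fixed format like the "%s: %s\r\n" that a
header packer emits, the format-string interpreter can be bypassed with
direct appends. A hedged sketch of the idea - packHeader() is an invented
name, not Squid's httpHeaderEntryPackInto(), and it assumes the caller has
already sized the destination buffer:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical replacement for a vfprintf-style "%s: %s\r\n" in a hot
 * path: copy the pieces directly instead of interpreting a format string.
 * Caller must guarantee dst has room for strlen(name) + strlen(value) + 4.
 */
static size_t
packHeader(char *dst, const char *name, const char *value)
{
    size_t n = strlen(name);
    size_t v = strlen(value);

    memcpy(dst, name, n);           /* "Name"            */
    dst[n] = ':';                   /* ": " separator    */
    dst[n + 1] = ' ';
    memcpy(dst + n + 2, value, v);  /* "value"           */
    dst[n + 2 + v] = '\r';          /* CRLF terminator   */
    dst[n + 3 + v] = '\n';
    return n + v + 4;               /* bytes written     */
}
```

This trades vfprintf's interpreter overhead for a couple of memcpy calls,
so per the caveat above it mostly shifts work rather than removing it - but
it takes the formatting machinery out of the httpHeaderPutStrf()-style
critical path.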