I've been fiddling with oprofile some more, and here's some interesting stuff.
Comments/questions/etc are welcome.

Here's a series of traces from my PIII-633 Celeron test box. Faster boxes have
different looking traces but the main culprits are the same.

CPU: PIII, speed 634.478 MHz (estimated)
Counted IFU_MEM_STALL events (cycles instruction fetch pipe is stalled) with a unit mask of 0x00 (No unit mask) count 7500
samples  %        image name               symbol name
270985    3.6905  libc-2.3.6.so            vfprintf
175671    2.3924  libc-2.3.6.so            re_search_internal
161464    2.1990  libc-2.3.6.so            _int_malloc
111337    1.5163  squid                    comm_select

This is "where the instruction pipeline seems to stall the most", and I believe
that translates to "because its waiting for data to come along to satisfy
what the instructions need."

Compare that to:

CPU: PIII, speed 634.478 MHz (estimated)
Counted L2_LD events (number of L2 data loads) with a unit mask of 0x0f (All cache states) count 10000
samples  %        image name               symbol name
19592    25.1315  libc-2.3.6.so            _int_malloc
10603    13.6009  libc-2.3.6.so            memcpy
3051      3.9136  libc-2.3.6.so            malloc_consolidate
2201      2.8233  squid                    memPoolAlloc
1856      2.3808  squid                    httpHeaderIdByName

Now, a large part of that _int_malloc is because I'm malloc'ing twice for every
new string rather than once - but I'm also not malloc'ing at all for strings
that are cloned (from, say, HTTP headers), and quite a bit of that cloning is
going on now in my non-disk tests:

(from mgr:strings) :

Create: 50227639
Dup: 33111752
Copy-On-Modify: 0
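
To make the Create/Dup split concrete: a Dup on a refcounted string is just a
counter bump - no malloc, no memcpy. A minimal sketch of the idea (RefStr and
friends are hypothetical names, not Squid's actual string API, and error
handling is omitted):

#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted string - illustrative, not Squid's API. */
typedef struct {
    int refcount;
    size_t len;
    char *buf;
} RefStr;

RefStr *refstr_create(const char *s, size_t len)
{
    RefStr *r = malloc(sizeof(*r));     /* malloc #1: the header */
    r->buf = malloc(len + 1);           /* malloc #2: the bytes */
    memcpy(r->buf, s, len);
    r->buf[len] = '\0';
    r->len = len;
    r->refcount = 1;
    return r;
}

/* Dup: bump the count and hand back the same string. */
RefStr *refstr_dup(RefStr *r)
{
    r->refcount++;
    return r;
}

void refstr_unref(RefStr *r)
{
    if (--r->refcount == 0) {
        free(r->buf);
        free(r);
    }
}

Merging the header and the bytes into a single allocation would halve the
malloc traffic for Create - that's exactly the "twice rather than once"
above.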

Next: what about L2 data stores?

CPU: PIII, speed 634.478 MHz (estimated)
Counted L2_ST events (number of L2 data stores) with a unit mask of 0x0f (All cache states) count 10000
samples  %        image name               symbol name
8890     29.3206  libc-2.3.6.so            memcpy
2342      7.7243  libc-2.3.6.so            _int_malloc
1183      3.9017  libc-2.3.6.so            vfprintf

Well, we still see the same culprits chewing CPU:

* vfprintf
* memcpy
* malloc

Now, memcpy comes up in two main places:
  908       6.6099  squid                    memBufGrow
  1032      7.5126  squid                    httpHeaderEntryPackInto
  4737     34.4835  squid                    httpProcessReplyHeader
  6165     44.8788  squid                    storeAppend
13737    13.6146  libc-2.3.6.so            memcpy

The big two users are in http.c - the reply header copying, and the path into
the storage manager.
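
One standard way to attack the storeAppend-style copy is to not copy at all:
hand ownership of the already-filled network buffer to the store and link it
into a chain. This is only a sketch of the idea - BufNode and
store_append_nocopy are made-up names, not Squid structures:

#include <stdlib.h>

/* Hypothetical zero-copy append: instead of memcpy'ing the read buffer
 * into store memory, the store keeps a linked chain of caller-supplied
 * buffers.  Error handling omitted. */
typedef struct BufNode {
    char *data;
    size_t len;
    struct BufNode *next;
} BufNode;

/* Takes ownership of buf; assumes the chain is non-empty, so *tail is
 * a valid node. */
void store_append_nocopy(BufNode **tail, char *buf, size_t len)
{
    BufNode *n = malloc(sizeof(*n));
    n->data = buf;          /* keep the pointer - no memcpy */
    n->len = len;
    n->next = NULL;
    (*tail)->next = n;
    *tail = n;
}

The trade-off is that the network code can no longer reuse its read buffer,
so this shuffles allocation costs around rather than removing them outright -
see the caveat in the summary below.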

Next: malloc, which goes via xmalloc:

  634       3.2385  squid                    storeSwapTLVAdd
  1252      6.3953  squid                    storeSwapMetaPack
  2748     14.0369  squid                    memReallocBuf
  14214    72.6056  squid                    stmemAppend
35        0.0347  squid                    xmalloc

Since there's plenty of stmem churn - especially when Squid is proxy-only, as
in my workload - I'm guessing this is the MEM_MEM_NODE allocation.
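
memPoolAlloc showing up in the L2_LD trace above says Squid already pools
these; for reference, the core of a fixed-size pool is just a free list, so
a warm alloc/free pair never touches _int_malloc at all. A from-scratch
sketch (not Squid's memPool implementation):

#include <stdlib.h>

/* Free objects are reused as free-list links, so the pool's object
 * size must be at least sizeof(FreeObj). */
typedef struct FreeObj {
    struct FreeObj *next;
} FreeObj;

typedef struct {
    size_t obj_size;
    FreeObj *free_list;
} Pool;

void *pool_alloc(Pool *p)
{
    if (p->free_list) {
        FreeObj *o = p->free_list;  /* hot path: pop a recycled object */
        p->free_list = o->next;
        return o;
    }
    return malloc(p->obj_size);     /* cold path: grow via malloc */
}

void pool_free(Pool *p, void *obj)
{
    FreeObj *o = obj;               /* push back for reuse, no free() */
    o->next = p->free_list;
    p->free_list = o;
}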

vfprintf and friends are a bit convoluted to track down, but basically:

* lots of library functions end up there - inet_ntoa(), for example
* it's in the critical path - httpHeaderPutStrf() and urlCanonical() are two
  good examples (the cost is sketched below)
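
The cost difference is easy to see side by side. Both functions below are
illustrative, not Squid code: one formats a header line through snprintf -
and thus glibc's vfprintf machinery - while the other does the same job with
a strlen and three memcpy's:

#include <stdio.h>
#include <string.h>

/* Formats "Host: <host>\r\n" via the printf engine. */
size_t put_host_printf(char *dst, size_t cap, const char *host)
{
    int n = snprintf(dst, cap, "Host: %s\r\n", host);
    return n < 0 ? 0 : (size_t)n;
}

/* Same output, no format-string interpretation. */
size_t put_host_direct(char *dst, size_t cap, const char *host)
{
    size_t hlen = strlen(host);
    size_t need = 6 + hlen + 2;     /* "Host: " + host + CRLF */
    if (need >= cap)
        return 0;                   /* sketch: caller handles overflow */
    memcpy(dst, "Host: ", 6);
    memcpy(dst + 6, host, hlen);
    memcpy(dst + 6 + hlen, "\r\n", 2);
    dst[need] = '\0';
    return need;
}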

So, summarising:

* vfprintf, malloc and memcpy are three ideal candidates to try and
  eliminate where possible;
* but in a lot of cases (now!) there's no single glaringly inefficient
  routine - it's systemic throughout Squid;
* of course, eliminating things which show up with high percentages won't
  have -that- much of an impact (right now). Some of the L2 loads via
  memcpy, for example, are priming the cache with data that's about to be
  used elsewhere "real soon now", so eliminating the memcpy will just shift
  the stalling elsewhere - it will reduce the amount of memory bandwidth
  being used, though!


Adrian

-- 
- Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -
