Hi,
I am seeing strange performance behavior with pthread mutex
operations on my 4-core Solaris system running S10U5.
The issue is that when the program is run under truss, I get
about 3 times the throughput of a run without truss.
Strange...right?
The C program does the following:
- accepts 2 parameters [number of concurrent threads (N), and run-time]
- creates and runs N threads simultaneously for <run-time> seconds
- each thread does the following:
  - acquires the pthread mutex
  - increments its _own_ long variable by 1
  - releases the pthread mutex
- after the run-time expires, each thread reports the value of its long variable
Note that I did not use any special flags to compile it. Reply to
this email if you would like the full program source; a minimal
sketch of the per-thread loop follows.
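This is not the original source, just a sketch of what each worker
does; the names (lock, done, worker) and the shutdown flag are my
shorthand, not the real identifiers:

#include <pthread.h>

extern pthread_mutex_t lock;   /* the one mutex shared by all N threads (assumed name) */
extern volatile int done;      /* set by main() once <run-time> expires (assumed) */

void *worker(void *arg)
{
    long *ops = arg;           /* this thread's _own_ counter */

    while (!done) {
        pthread_mutex_lock(&lock);    /* acquire the shared mutex */
        (*ops)++;                     /* bump this thread's private counter */
        pthread_mutex_unlock(&lock);  /* release the mutex */
    }
    return NULL;
}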
Let's look at the program without truss enabled.
$ mutex_ops -w4 -r10
number of threads:4, run-time (in seconds)=10
Thread 1: total ops: 4580775, ops/second: 458077
Thread 3: total ops: 3798001, ops/second: 379800
Thread 0: total ops: 4867850, ops/second: 486785
Thread 2: total ops: 4925299, ops/second: 492529
$
<<vmstat 1>>
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 -- -- in sy cs us sy id
0 0 0 17121968 14961748 22 205 0 0 0 0 0 0 0 0 0 7960 16218 16155 28 2 71
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 19914 41572 41568 81 3 16
1 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 18711 39086 38937 80 3 17
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 19006 39373 39382 80 3 17
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 18719 39151 39007 80 3 16
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 19560 41243 40963 81 4 16
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 20498 43671 43203 81 4 16
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 20857 44636 44014 80 4 15
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 19149 41239 40583 80 4 16
0 0 0 17121968 14961748 0 0 0 0 0 0 0 0 0 0 0 21019 44189 43995 81 4 15
0 0 0 17122372 14962176 7 29 0 0 0 0 0 0 0 0 0 13062 27003 26973 46 2 52
0 0 0 17122372 14962176 0 0 0 0 0 0 0 0 0 0 0 461 160 274 0 0 100
As you can see above, I am executing the program with 4 concurrent
threads for 10 seconds. The output shows that each thread acquired
the mutex roughly 400,000-490,000 times a second. vmstat shows the
program used about 80% user time, ~4% system, and ~15% idle. So far
so good.
Now, let's look at the same program run under truss.
$ truss -fc mutex_ops -w4 -r10
number of threads:4, run-time (in seconds)=10
Thread 0: total ops: 13419838, ops/second: 1341983
Thread 2: total ops: 13959115, ops/second: 1395911
Thread 1: total ops: 13414291, ops/second: 1341429
Thread 3: total ops: 13535670, ops/second: 1353567
syscall               seconds   calls  errors
_exit                    .000       1
write                    .000       5
open                     .000       3       1
close                    .000       2
stat                     .000      10       7
getpid                   .000       1
fstat                    .000       1
ioctl                    .000       1
execve                   .000       1
lwp_park                 .905  106423
lwp_unpark               .757  106423
sigaction                .000       1
getcontext               .000       1
setustack                .000       5
mmap                     .000      16
munmap                   .000       2
getrlimit                .000       1
memcntl                  .000       1
sysconfig                .000       2
yield                    .182   26004
lwp_create               .000       4
lwp_exit                 .000       4
lwp_continue             .000       4
lwp_sigmask              .000       4
lwp_private              .000       1
lwp_wait                 .000       4
lwp_mutex_wakeup         .000       9
nanosleep                .000       1
schedctl                 .000       5
resolvepath              .000       4
lwp_mutex_timedlock      .000      32
getcwd                   .000       1
                     --------  ------   ----
sys totals:             1.846  238977      8
usr time:              21.363
elapsed:               10.060
$
<<vmstat 1>>
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 -- -- in sy cs us sy id
0 0 0 17122372 14962176 0 0 0 0 0 0 0 0 0 0 0 529 262 335 0 0 100
0 0 0 17121240 14960660 153 632 0 0 0 0 0 0 0 0 0 25259 91754 87992 37 15 48
0 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 62751 178228 179102 57 33 10
0 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 56922 169565 168507 59 30 11
0 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 60935 178967 178535 58 32 10
0 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 61581 180239 179751 58 32 10
0 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 61486 181664 180513 58 33 10
1 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 67941 186205 190703 55 35 10
0 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 59981 175708 175758 58 32 10
1 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 60485 174557 175091 58 32 11
1 0 0 17121104 14960524 0 0 0 0 0 0 0 0 0 0 0 62799 178328 179880 57 32 10
0 0 0 17122236 14962040 8 17 0 0 0 0 0 0 0 0 0 19483 58122 58317 19 10 71
0 0 0 17122236 14962040 0 0 0 0 0 0 0 0 0 0 0 449 160 279 0 0 100
Can you spot the issue? I used the same parameters (4 concurrent
threads, 10 seconds run-time), but this time *with truss enabled*.
The output shows that each thread acquired the mutex about
1,300,000 times a second, roughly 3x the earlier rate. vmstat shows
the program using a lot more system time than before. I don't see
how truss can drive the system time up and yet ultimately *increase*
throughput. Am I missing something?
BTW, I confirmed the mutex acquisition counts reported by the program
using DTrace. The values are indeed correct in both cases.
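In case it helps, the check was something along these lines, counting
pthread_mutex_lock() entries with the pid provider (the exact
one-liner and the @locks aggregation name are from memory, not a
paste):
$ dtrace -n 'pid$target::pthread_mutex_lock:entry { @locks = count(); }' -c 'mutex_ops -w4 -r10'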
Any inputs/comments are appreciated.
thanks.
senthil