[osol-discuss] Measuring average system load with millisecond resolution
Hi, Using either the uptime(1) command or the getloadavg(3C) call, we can measure the average system load with minute resolution. However, I would like to measure the average system load (the average number of active threads) with millisecond resolution. Could you suggest a way, please? Thank you -- This message posted from opensolaris.org ___ opensolaris-discuss mailing list opensolaris-discuss@opensolaris.org
[osol-discuss] Dramatic Performance Degradation with Binding
Hi, I am running a multithreaded application with 20 threads on my 24-core AMD Opteron (ccNUMA) machine running Solaris 10. When I bind threads to cores using pbind (one thread per core), performance degrades dramatically: around an 80% performance loss with binding. To understand this, I used prstat -m and found that without binding (the default case) the lock-contention percentage (LCK field) is around 13%, but with binding it is around 30%. Moreover, the latency percentage (LAT field) is almost zero without binding, but with binding it is around 37. Please find the USR, LCK and LAT fields of the prstat output below.

Configuration   USR  LCK  LAT
No-Binding       86   13  0.1
Binding          32   30   37

Therefore, with binding the application spends most of its time in lock contention or on the run queue. BTW, there is no significant difference in the cache miss ratio measured with cpustat(1).

Is it because of the following reasons? If not, please let me know how to find the reasons behind the above behavior. Since the application has heavy inter-thread communication, some threads need to wait for locks, so the binding configuration increases memory traffic among the chips. Moreover, because of the memory latency, the delay-loop time (the delay loop before retrying a lock) is increased exponentially, and therefore threads spend most of their time waiting for locks. In the default (no-binding) configuration, however, the load is balanced well by migrating threads among the cores, so threads get a chance to share the lock data structures, which improves performance compared with the binding configuration.
Please find the prstat -Lm output per thread in both configurations below:

No-Binding (Default) Configuration
==================================
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15637 user      93 0.2 0.0 0.0 0.0 6.6 0.0 0.1 186  34 437   0 myprogram/13
 15637 user      92 0.2 0.0 0.0 0.0 8.0 0.0 0.1 176  36 399   0 myprogram/11
 15637 user      91 0.2 0.0 0.0 0.0 8.8 0.0 0.1 201  34 398   0 myprogram/10
 15637 user      89 0.2 0.0 0.0 0.0  11 0.0 0.2 253  34 450   0 myprogram/12
 15637 user      87 0.2 0.0 0.0 0.0  13 0.0 0.1 194  34 414   0 myprogram/17
 15637 user      87 0.2 0.0 0.0 0.0  13 0.0 0.1 187  34 416   0 myprogram/9
 15637 user      86 0.2 0.0 0.0 0.0  13 0.0 0.1 188  34 420   0 myprogram/21
 15637 user      86 0.1 0.0 0.0 0.0  14 0.0 0.1 227  45 454   0 myprogram/3
 15637 user      86 0.2 0.0 0.0 0.0  14 0.0 0.1 215  37 443   0 myprogram/15
 15637 user      86 0.2 0.0 0.0 0.0  14 0.0 0.1 212  35 435   0 myprogram/7
 15637 user      85 0.2 0.0 0.0 0.0  14 0.0 0.3 258  43 520   0 myprogram/2
 15637 user      85 0.2 0.0 0.0 0.0  15 0.0 0.1 213  34 454   0 myprogram/5
 15637 user      85 0.2 0.0 0.0 0.0  15 0.0 0.1 216  80 438   0 myprogram/19
 15637 user      85 0.2 0.0 0.0 0.0  15 0.0 0.1 248  36 464   0 myprogram/6
 15637 user      84 0.2 0.0 0.0 0.0  15 0.0 0.1 257  35 474   0 myprogram/14
 15637 user      84 0.2 0.0 0.0 0.0  16 0.0 0.1 241  31 445   0 myprogram/18
 15637 user      83 0.2 0.0 0.0 0.0  17 0.0 0.2 256  30 467   0 myprogram/16
 15637 user      83 0.2 0.0 0.0 0.0  17 0.0 0.2 265  30 476   0 myprogram/8
 15637 user      83 0.2 0.0 0.0 0.0  17 0.0 0.2 257  31 467   0 myprogram/20
 15637 user      81 0.2 0.0 0.0 0.0  18 0.0 0.2 259  30 488   0 myprogram/4
 15637 user     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 myprogram/1

Binding (thread-to-core) Configuration
======================================
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15687 user     6.1 0.0 0.0 0.0 0.0  41 0.0  53  33   8  54   0 myprogram/13
 15687 user     5.7 0.0 0.0 0.0 0.0  32 0.0  62  31  10  38   0 myprogram/11
 15687 user     5.5 0.0 0.0 0.0 0.0  37 0.0  57  26  15  35   0 myprogram/10
 15687 user     5.4 0.0 0.0 0.0 0.0  47 0.0  47  34   6  78   0 myprogram/21
 15687 user     5.4 0.0 0.0 0.0 0.0  35 0.0  60  28  16  43   0 myprogram/17
 15687 user     5.2 0.0 0.0 0.0 0.0  42 0.0  53  33   6  59   0 myprogram/6
 15687 user     5.2 0.0 0.0 0.0 0.0  36 0.0  59  31   8  36   0 myprogram/15
 15687 user     5.2 0.0 0.0 0.0 0.0  56 0.0  39  36   7  72   0 myprogram/2
 15687 user     5.1 0.0 0.0 0.0 0.0  51 0.0  44  34   6  62   0 myprogram/5
 15687 user     5.0 0.0 0.0 0.0 0.0  50 0.0  45  33   6  54   0 myprogram/16
 15687 user     5.0 0.0 0.0 0.0 0.0  39 0.0  56  31   8  43   0 myprogram/7
 15687 user     4.9 0.0 0.0 0.0 0.0  38 0.0  57  33   7  41   0 myprogram/19
 15687 user     4.8 0.0 0.0 0.0 0.0  32 0.0  63  29  11  47   0 myprogram/12
 15687 user     4.7 0.0 0.0 0.0 0.0  43 0.0  53  31   8  36   0 myprogram/14
 15687 user     4.6 0.0 0.0 0.0 0.0  36 0.0  59  32   8  46   0 myprogram/8
 15687 user     4.5 0.0 0.0 0.0 0.0  51 0.0  45  33   5  63   0 myprogram/20
 15687 user     4.5 0.0 0.0 0.0 0.0  57 0.0  38  32   6  60   0 myprogram/18
 15687 user     4.4 0.0 0.0 0.0 0.0  59 0.0  37  31   7  66   0 myprogram/9
 15687 user     4.3 0.0
[osol-discuss] Running OpenMP program
Hi, I have been compiling and running OpenMP parallel programs successfully so far on my OpenSolaris 2009.10 machine. However, I am unable to run one program with more than one thread (LWP). Please see the details of the program below, and let me know if there is anything wrong with how I compile or run this particular program. I am compiling with g++ 4.2.3 with the -fopenmp option and also linking libmtsk.so.1; there are no compilation warnings or errors. I run the program with the following command line:

$ env OMP_NUM_THREADS=8 myprogram

$ ldd myprogram
        libmtsk.so.1 =>  /lib/libmtsk.so.1
        libstdc++.so.6 =>        /usr/lib/libstdc++.so.6
        libm.so.2 =>     /lib/libm.so.2
        libgomp.so.1 =>  /usr/lib/libgomp.so.1
        libgcc_s.so.1 =>         /usr/lib/libgcc_s.so.1
        libpthread.so.1 =>       /lib/libpthread.so.1
        libc.so.1 =>     /lib/libc.so.1
        libthread.so.1 =>        /lib/libthread.so.1
        libdl.so.1 =>    /lib/libdl.so.1
[osol-discuss] system traps with FX scheduling policy
Hi, I am playing with the FX scheduling class with different time quanta on the SPEC OMP multithreaded programs, using prstat -Lm to analyze the effect of the different quanta on their performance. Most of the programs experience system traps (TRP) with an FX 10ms time quantum; however, there are no traps with FX 100ms, 200ms, and higher quantum values. I understand that the time quantum will change other prstat fields such as context switches, lock contention, etc., but I don't understand why I get traps only with the 10ms quantum. My machine is a multi-core AMD Opteron running Solaris 10. Please see the prstat output below (for FX 200ms, 100ms, and 10ms); I am also providing stack traces from the FX 10ms run. Please help clarify my confusion. Many thanks.

FX with 200ms
-------------
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
23062 user      77 0.1 0.0 0.0 0.0 3.1 0.0  20 119  88  1K   0 myprogram/1
23062 user      73 0.0 0.0 0.0 0.0 9.6 0.0  18 185 351 206   0 myprogram/11
23062 user      72 0.0 0.0 0.0 0.0 9.3 0.0  19 185  72 204   0 myprogram/8
23062 user      71 0.0 0.0 0.0 0.0  10 0.0  19 178  83 194   0 myprogram/17
...
23062 user      69 0.0 0.0 0.0 0.0 9.5 0.0  21 180  85 196   0 myprogram/18
...
23062 user      69 0.0 0.0 0.0 0.0 9.8 0.0  21 193  84 206   0 myprogram/31

FX with 100ms
-------------
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
23089 user      76 0.1 0.0 0.0 0.0 2.0 0.0  22 138 164  2K   0 myprogram/1
23089 user      70 0.0 0.0 0.0 0.0  10 0.0  20 211 157 227   0 myprogram/10
23089 user      70 0.0 0.0 0.0 0.0  10 0.0  20 220 435 238   0 myprogram/7
23089 user      69 0.0 0.0 0.0 0.0  10 0.0  20 214 153 228   0 myprogram/21
23089 user      69 0.0 0.0 0.0 0.0  10 0.0  21 221 138 241   0 myprogram/4
...
23089 user      68 0.0 0.0 0.0 0.0  10 0.0  22 206 155 223   0 myprogram/9
23089 user      68 0.0 0.0 0.0 0.0  10 0.0  22 215 136 232   0 myprogram/24

FX with 10ms
------------
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
23105 user      78 0.1 0.1 0.0 0.0 1.3 0.0  21 168  1K  2K   0 myprogram/1
23105 user      68 0.0 0.1 0.0 0.0  10 0.0  21 241  1K 267   0 myprogram/2
23105 user      68 0.0 0.2 0.0 0.0  10 0.0  22 245  1K 270   0 myprogram/16
23105 user      68 0.0 0.2 0.0 0.0  10 0.0  22 246  1K 271   0 myprogram/25
...
23105 user      68 0.0 0.1 0.0 0.0  10 0.0  22 255  1K 281   0 myprogram/8
23105 user      68 0.0 0.1 0.0 0.0  10 0.0  22 235  1K 260   0 myprogram/21
...
23105 user      67 0.0 0.1 0.0 0.0  10 0.0  22 250  1K 273   0 myprogram/20

Stack traces of the program with FX 10ms:

$ pstack 23137/25
23137:  ./myprogram
-----------------  lwp# 25 / thread# 25  --------------------
 fd7ffa86abcb omp_set_lock () + 8b
 0041e813 _$d1A593.mm_fv_update_nonbon () + 10c3
 01bd0001 ()
 4054eef5073a9994 ()

$ pstack 23137/16
23137:  ./myprogram
-----------------  lwp# 16 / thread# 16  --------------------
 0041eb6d _$d1A593.mm_fv_update_nonbon () + 141d
 01bd0001 ()
 4054ec19011d549c ()

$ pstack 23137/21
23137:  ./myprogram
-----------------  lwp# 21 / thread# 21  --------------------
 0041ebfa _$d1A593.mm_fv_update_nonbon () + 14aa
 01bd0001 ()
 4054ea1dc63c899c ()

$ pstack 23137/2
23137:  ./myprogram
-----------------  lwp# 2 / thread# 2  --------------------
 fd7ffa896f12 atomic_store () + 2
 0041ec68 _$d1A593.mm_fv_update_nonbon () + 1518
 01bd0001 ()
 4054e9adebe3c3da ()
[osol-discuss] Microbenchmarks for synchronization mechanisms
Hi, Could you help me find or develop microbenchmarks for stressing the synchronization mechanisms of OpenSolaris? Thank you.
[osol-discuss] plockstat(1) along with ppgsz(1)
Hi, I have been using plockstat to analyze user-space lock contention in multithreaded programs. I am also optimizing the performance of those programs with larger pages (using ppgsz(1)). However, I am unable to get any data (output) from plockstat(1M) when I run the program under ppgsz, although I do get output without ppgsz. Moreover, there are no error or warning messages in the failing case. Could you tell me what is happening here, please? The command line is:

$ pfexec plockstat -C ppgsz -o heap=2M my-program
[osol-discuss] Fixed-Priority Scheduling Policy vs Round-Robin Scheduling
Does the Fixed-Priority (FX) scheduling class function similarly to round-robin scheduling with a fixed time quantum (somewhat like SCHED_RR in Linux)? Please let me know. Thank you.
[osol-discuss] Change scheduling class of a thread (lwp) of a process.
Hi, We can change the scheduling class of a process (either single-threaded or multithreaded) using the priocntl(1) utility. However, I would like to use different scheduling classes for different threads (LWPs) of a process on OpenSolaris 2009.06, and there is no such option in priocntl(1) as far as I have seen: the utility seems to change the scheduling class for the whole process (the same class for all of its threads). Could anyone tell me if there is a way to do this, please? Thank you.
[osol-discuss] Calculating the size of shared memory of a multi-threaded application
Since heap and stack represent private memory, can I say that the shared memory (among the 16 threads) in this process is: 307008 - (size of heap + size of stacks) = 307008 - [306528 + (16 * 4)] = 416 KB?

25908:  ./process -timing -threads 16 -lastframe 100
 Address   Kbytes     RSS    Anon  Locked Mode   Mapped File
08045000       12      12      12       - rwx--  [ stack ]
08052288     1768       -       -       - r-x--  process
0829B000      260     260      12       - rwx--  process
082DC000   336972  306528  306528       - rwx--  [ heap ]
FDD8C000        4       4       4       - rwx-R  [ stack tid=16 ]
FDE8B000        4       4       4       - rwx-R  [ stack tid=15 ]
FDF8A000        4       4       4       - rwx-R  [ stack tid=14 ]
FE089000        4       4       4       - rwx-R  [ stack tid=13 ]
FE188000        4       4       4       - rwx-R  [ stack tid=12 ]
FE287000        4       4       4       - rwx-R  [ stack tid=11 ]
FE386000        4       4       4       - rwx-R  [ stack tid=10 ]
FE485000        4       4       4       - rwx-R  [ stack tid=9 ]
FE584000        4       4       4       - rwx-R  [ stack tid=8 ]
FE683000        4       4       4       - rwx-R  [ stack tid=7 ]
FE782000        4       4       4       - rwx-R  [ stack tid=6 ]
FE881000        4       4       4       - rwx-R  [ stack tid=5 ]
FE98          4       4       4       - rwx-R  [ stack tid=4 ]
FEA7F000        4       4       4       - rwx-R  [ stack tid=3 ]
FEB7E000        4       4       4       - rwx-R  [ stack tid=2 ]
FEB8        320     252       -       - r-x--  libm.so.2
FEBDF000        8       8       8       - rwx--  libm.so.2
FEC3         64      64      64       - rwx--  [ anon ]
FEC5         64      64      64       - rw---  [ anon ]
FEC6C000      128     128     128       - rw---  [ anon ]
FEC8D000        4       4       -       - rwxs-  [ anon ]
FEC9         24      12      12       - rwx--  [ anon ]
FECA          4       4       4       - rwx--  [ anon ]
FECB       1276     976       -       - r-x--  libc_hwcap2.so.1
FEDFF000       28      28      28       - rwx--  libc_hwcap2.so.1
FEE06000        8       8       8       - rwx--  libc_hwcap2.so.1
FEE1         12      12       -       - r-x--  libpthread.so.1
FEE2         52      48       -       - r-x--  libgcc_s.so.1
FEE3C000        4       4       4       - rwx--  libgcc_s.so.1
FEE4          4       4       4       - rwx--  [ anon ]
FEE5        856     732       -       - r-x--  libstdc++.so.6.0.10
FEF35000      160     108      32       - rwx--  libstdc++.so.6.0.10
FEF5D000       24      12      12       - rwx--  libstdc++.so.6.0.10
FEF7         12      12       -       - r-x--  libmtmalloc.so.1
FEF83000        4       4       4       - rw---  libmtmalloc.so.1
FEF9          4       4       4       - rwx--  [ anon ]
FEFA          4       4       4       - rw---  [ anon ]
FEFB          4       4       4       - rw---  [ anon ]
FEFBE000      180     180       -       - r-x--  ld.so.1
FEFFB000        8       8       8       - rwx--  ld.so.1
FEFFD000        4       4       4       - rwx--  ld.so.1
--------  ------- ------- ------- -------
total Kb   342852  311316  307008       -

(Some addresses above were truncated in transmission and are left as received.)
[osol-discuss] Shared-memory Size of an application
Hi, I used pmap -x to find the shared memory size of a multithreaded application. The output of the command is:

 Address   Kbytes     RSS    Anon  Locked Mode   Mapped File
08045000       12      12      12       - rwx--  [ stack ]
...
FEFFB000        8       8       8       - rwx--  ld.so.1
FEFFD000        4       4       4       - rwx--  ld.so.1
--------  ------- ------- ------- -------
total Kb   342852  311296  306988       -

Kbytes: 342852, RSS: 311296, Anon: 306988. Can I say that the shared memory size is (RSS - Anon), i.e. around 4MB? And the private memory size is the Anon size, right? Please let me know.
[osol-discuss] Changing Memory Placement Policies through a C program
Hi, I know how to change memory placement policies using mdb. For example, the following applies the round-robin policy instead of the default first-touch policy:

$ pfexec mdb -kw
> lgrp_mem_default_policy/W 5

However, could you tell me how to do the same from a C program, please? Thank you
[osol-discuss] Using madvise(3C)
Hi, I am trying to play with madvise on my AMD machine running OpenSolaris 2009.06. However, I get the following errors when I compile the program below with /usr/sfw/bin/g++. Please help me resolve this.

457: error: `madvise' undeclared (first use this function)
457: error: (Each undeclared identifier is reported only once for each function it appears in.)

Program:

#include <sys/types.h>
#include <sys/mman.h>
...
int main (void) {
    int size = numOptions * sizeof(OptionData);
    data = (OptionData *)malloc(size);
    if (data == NULL) {
        perror("Fatal Error: malloc failed");
        exit(-1);
    }
    int ret = madvise((caddr_t)data, size, MADV_ACCESS_MANY);
    if (ret == -1) {
        perror("Fatal Error: madvise failed");
        exit(-2);
    }
    ...
    return 0;
}
Re: [osol-discuss] Measuring cost of TLB misses
Thank you, Marty.
[osol-discuss] Using collect utility
Hi, I am trying to play with the collect utility to study the TLB misses of a multi-threaded program running on my AMD multi-core machine with OpenSolaris 2009.06. However, the program hangs (with and also without the umask event qualifier) when I use collect. Please find the prstat -m output of the program with and without collect below. Is this a problem with the collect utility or with the way I used it? Please let me know.

$ collect -h DC_dtlb_L1_miss_L2_miss~umask=0x01 ./program
$ uname -a
SunOS opensolaris 5.11 snv_111b i86pc i386 i86pc Solaris

prstat output with collect:
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 2201 pusukuri 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 program/16

prstat output without collect:
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 2311 pusukuri  51 0.5 0.0 0.0 0.0  48 0.0 0.2  1K  23  2K   0 program/16
Re: [osol-discuss] Measuring cost of TLB misses
Thank you, Bart.
[osol-discuss] How to change Memory Placement Optimization tunable parameters using mdb
Hi, Please let me know how to change MPO tunable parameters using mdb on my AMD Opteron machine running OpenSolaris 2009.06. For example, how do I change lgrp_mem_default_policy to LGRP_MEM_POLICY_RANDOM?
[osol-discuss] Measuring cost of TLB misses
Hi, I am able to measure the TLB miss rate of a multi-threaded application running on my multi-core AMD Opteron machine by reading the performance monitoring event counters with the cpustat utility. However, I would like to measure the amount of time spent on TLB misses; specifically, I am looking for something that works the way the trapstat utility does. Please share your ideas. Thank you.
[osol-discuss] Performance Degradation with 1GB Pages for Heap
Hi, My AMD Opteron supports 4KB, 2MB, and 1GB page sizes. I observed a performance improvement (reduced elapsed time) for some multi-threaded applications when I used a 2MB page size for the heap. These applications need around 650MB of heap (each reads a huge file of around 650MB). However, when I used 1GB pages for the heap, these programs got slower. If we use 1GB pages, then one page is enough to cover the whole heap, right? Then why do we see performance degradation in this case? Please let me know.

$ pagesize -a
4096
2097152
1073741824
[osol-discuss] libmtmalloc vs libumem
Hi, I am able to understand how libmtmalloc works from the documentation in the libmtmalloc.c source file. However, I am unable to find comparable documentation for libumem. Could someone describe the key differences between libmtmalloc and libumem, please? Please also provide links to documentation or papers on the design of libumem. Thank you.
[osol-discuss] Change segvn cache size
Hi, I observed that one multi-threaded application generates a great many cross-calls (xcalls) on my AMD multi-core machine; a snapshot of the stack traces is shown below. I think this is because of segvn activity, i.e. unmapping pages and generating cross-call activity to maintain MMU-level coherence across the processors (per the Solaris Internals book). I read that increasing the segmap cache size can improve the performance of some multi-threaded applications that do serious file I/O, and using adb I am able to change the size of the segmap cache. However, that benefit applies only to file systems other than ZFS, and I have ZFS; I don't know whether ZFS uses the segvn cache or not. I tried to change the size of the segvn cache with adb the same way as segmap, but failed with the message shown below.

$ pfexec adb -kw /dev/ksyms /dev/mem
physmem 7ff23f
segmapsize/D
segmapsize: 67108864
segvnsize/D
adb: failed to dereference symbol: unknown symbol name

Could anyone tell me how I can increase the size of the segvn cache on my machine?

$ pfexec dtrace -n 'xcalls /execname == "my_multithreaded"/ { @[stack()] = count(); }'
dtrace: description 'xcalls ' matched 2 probes

              unix`xc_do_call+0x135
              unix`xc_call+0x4b
              unix`hat_tlb_inval+0x2af
              unix`unlink_ptp+0x92
              unix`htable_release+0xfa
              unix`hat_unload_callback+0x1d8
              genunix`segvn_unmap+0x255
              genunix`as_unmap+0xf2
              genunix`munmap+0x80
              unix`sys_syscall32+0x101
              377

              unix`xc_do_call+0x135
              unix`xc_call+0x4b
              unix`hat_tlb_inval+0x2af
              unix`x86pte_update+0x69
              unix`hati_update_pte+0x10c
              unix`hat_pagesync+0x169
              genunix`pvn_getdirty+0x5d
              zfs`zfs_putpage+0x1c7
              genunix`fop_putpage+0x74
              genunix`segvn_sync+0x137
              genunix`as_ctl+0x200
              genunix`memcntl+0x764
              unix`sys_syscall32+0x101
              946

              unix`xc_do_call+0x135
              unix`xc_call+0x4b
              unix`hat_tlb_inval+0x2af
              unix`unlink_ptp+0x92
              unix`htable_release+0xfa
              unix`hat_unload_callback+0x24a
              genunix`segvn_unmap+0x255
              genunix`as_unmap+0xf2
              genunix`munmap+0x80
              unix`sys_syscall32+0x101
              2494
[osol-discuss] Mapping the kernel heap with large pages
Hi, One of my applications spends around 90% of its total execution time reading a huge file with the read system call. I thought I could improve its performance by increasing the page size for the kernel heap. I know that I can increase the page size of an application's heap on the fly with the ppgsz command, but I don't know how to change the page size for the kernel heap. Could anyone tell me if there is a command for this, or whether it is possible through kmdb/adb, please? My machine is a multi-core AMD Opteron running OpenSolaris 2009.06, and the file system is ZFS. Please also let me know if there are any ideas (tunable parameters) for improving file I/O on ZFS. Thank you. Best, Kishore
[osol-discuss] Segmentation fault (core dump) when using the libhoard memory allocator
I am trying to see the impact of different memory allocators on multi-threaded workloads on my AMD machine running OpenSolaris 2009.06. I successfully used libmtmalloc and libumem; however, the program dumps core (SIGSEGV) when I use libhoard_32.so, even though compiling against libhoard_32.so produced no errors. Moreover, it runs fine with 1 to 6 threads and dumps core only with 7 threads and above. The program I compiled is a multi-threaded benchmark called swaptions; the stack trace of the core dump is below. Could you please tell me what the problem is, or give me directions for finding it?

$ file /usr/lib/libhoard_32.so
/usr/lib/libhoard_32.so: ELF 32-bit LSB dynamic lib 80386 Version 1 [SSE2 SSE AMD_3DNow CMOV], dynamically linked, not stripped

$ ldd swaptions
        libpthread.so.1 =>       /lib/libpthread.so.1
        /usr/lib/libhoard_32.so
        libCrun.so.1 =>  /usr/lib/libCrun.so.1
        libstdc++.so.6 =>        /usr/lib/libstdc++.so.6
        libm.so.2 =>     /lib/libm.so.2
        libgcc_s.so.1 =>         /usr/lib/libgcc_s.so.1
        libc.so.1 =>     /lib/libc.so.1
        libthread.so.1 =>        /usr/lib/lwp/libthread.so.1
        libdl.so.1 =>    /lib/libdl.so.1

$ mdb core
Loading modules: [ ld.so.1 ]
> ::stack
libhoard_32.so`__1cFHoardMHoardManager4n0AVAlignedSuperblockHeap4nCHLMSpinLockType_udsyQ__n0AKGlobalHeap4udsyQiIn0C___n0APHoardSuperblock4n0C_idsyQn0AJSmallHeap___iIn0C_n0AbBhoardThresholdFunctionClass_n0F__Gmalloc6MI_pv_+0x12d(fef155b4, 1088, febdfee4, feefad51)
libhoard_32.so`__1cCHLKHybridHeap4ibwjWnFHoardOThreadPoolHeap4ibnKieYn0BSPerThreadHoardHeap___n0BHBigHeap__Gmalloc6MI_pv_+0x82(fef13850, 1088, fd9c2d48, feef8364)
libhoard_32.so`malloc+0xe3(1088, fd9c2d78, feefe59d, fef155b4)
_Z7dmatrix+0x51(0, 2, 0, af, fef11d50, fd9c2dac)
_Z28HJM_SimPath_Forward_BlockingPPdiidS_S_S0_Pli+0x51(fe4d0224, b, 3, 0, 4016, fea83ac8)
_Z21HJM_Swaption_BlockingPddiidS_PS_llii+0x438(fd9c2fa4, 999a, 3fb9, 0, 0, 0)
_Z6workerPv+0xae(8047c48, fed4f000, fd9c2fec, fecbcd1e)
libc_hwcap2.so.1`_thrp_setup+0x7e(fe937400)
libc_hwcap2.so.1`_lwp_start(fe937400, 0, 0, fecbcd1e, 0, 0)
> ::status
debugging core file of swaptions (32-bit) from opensolaris
...
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=C
[osol-discuss] Using multiple page sizes
I would like to see the impact of different page sizes on the performance of multi-threaded applications. However, the pagesize -a command reports only 3 supported page sizes, including the default 4KB, on my AMD machine (shown below). Are these the only page sizes I can use, or is there any way I can use more than these?

kish...@opensolaris:~$ pagesize -a
4096
2097152
1073741824
[osol-discuss] Changing page coloring algorithm using mdb/adb
Hi, I would like to see the impact of different page coloring algorithms on my AMD machine running OpenSolaris 2009.06. Following the book Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (page 515), I tried to change the page coloring algorithm on the fly using mdb -kw, but was not successful, although I am able to change segmapsize using adb. Could someone help me play with the page coloring algorithms using either mdb or adb?

$ pfexec mdb -kw
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd sockfs ip hook neti sctp arp usba fctl md lofs fcip fcp cpc random crypto logindmux ptm ufs nsmb sppp ipc nfs ]
> consistent_coloring/D
mdb: failed to dereference symbol: unknown symbol name