[osol-discuss] Measuring average system load with millisecond resolution

2011-02-27 Thread Kishore Kumar Pusukuri
Hi,
Using either the uptime(1) command or the getloadavg(3C) library function, we
can measure the average system load only at a resolution of minutes. However,
I would like to measure the average system load (the average number of active
threads) at millisecond resolution.

Could you suggest a way to do this, please?

Thank you
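
A minimal sketch of the averaging side of this question, under stated
assumptions: something outside the sketch samples the number of active
threads once per millisecond, and a plain mean over a sliding window is an
acceptable substitute for the damped average that uptime(1) reports. The
names load_window, load_window_add, and LOAD_WINDOW are hypothetical.

```c
#include <stddef.h>

/* Assumed window: 1000 samples taken 1 ms apart = 1 s of history. */
#define LOAD_WINDOW 1000

struct load_window {
    unsigned samples[LOAD_WINDOW]; /* active-thread counts, one per tick */
    size_t next;                   /* slot the next sample overwrites */
    size_t count;                  /* number of valid samples held */
    unsigned long sum;             /* running sum of the valid samples */
};

/* Reset the window to empty. */
static void load_window_init(struct load_window *w)
{
    w->next = 0;
    w->count = 0;
    w->sum = 0;
}

/* Record one sample; the caller is assumed to call this every millisecond. */
static void load_window_add(struct load_window *w, unsigned active_threads)
{
    if (w->count == LOAD_WINDOW)
        w->sum -= w->samples[w->next];  /* window full: drop the oldest */
    else
        w->count++;
    w->samples[w->next] = active_threads;
    w->sum += active_threads;
    w->next = (w->next + 1) % LOAD_WINDOW;
}

/* Mean number of active threads over the samples currently held. */
static double load_window_avg(const struct load_window *w)
{
    return w->count ? (double)w->sum / (double)w->count : 0.0;
}
```

The running sum keeps each update O(1), so the averaging itself adds little
overhead to a millisecond sampling loop. Where the per-millisecond samples
come from is left open here.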
-- 
This message posted from opensolaris.org
___
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org


[osol-discuss] Dramatic Performance Degradation with Binding

2011-02-11 Thread Kishore Kumar Pusukuri
Hi,

I am running a multithreaded application with 20 threads on my 24-core
AMD Opteron (ccNUMA) machine running Solaris 10. When I bind the threads to
cores using pbind (one thread per core), performance degrades dramatically:
the loss is around 80%. To understand this, I used prstat -m and found that
without binding (the default case) the lock-contention percentage (LCK field)
is around 13%, but with binding it is around 30%. Moreover, the latency
percentage (LAT field) is almost zero without binding, but around 37% with
binding. The USR, LCK, and LAT fields of the prstat output are summarized
below.

Configuration   USR   LCK   LAT

No-Binding       86    13   0.1
Binding          32    30    37


Therefore, with binding the application spends most of its time in lock
contention or waiting in the ready queue. BTW, there is no significant
difference in the cache miss ratio measured with cpustat(1).

Could the following be the reason? If not, please let me know how to find the
cause of this behavior.

Since the application does heavy inter-thread communication, some threads
need to wait for locks, and the binding configuration increases memory
traffic among the chips. Moreover, because of the memory latency, the
delay-loop time (the delay before retrying a lock) grows exponentially, and
therefore threads spend most of their time waiting for locks.
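
To make the delay loop in this hypothesis concrete, here is a minimal sketch
of a spin lock that doubles its retry delay after each failed attempt. It
illustrates the general technique only and is not the Solaris lock
implementation; the names backoff_lock, backoff_next, BACKOFF_MIN, and
BACKOFF_MAX, and the delay values, are assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_MIN 1u     /* assumed initial delay, in loop iterations */
#define BACKOFF_MAX 1024u  /* assumed cap on the delay */

struct backoff_lock {
    atomic_flag held;      /* set while some thread owns the lock */
};

/* Put the lock into the unlocked state. */
static void backoff_lock_init(struct backoff_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Double the delay, saturating at BACKOFF_MAX. */
static unsigned backoff_next(unsigned delay)
{
    return delay >= BACKOFF_MAX / 2 ? BACKOFF_MAX : delay * 2;
}

/* Try once; returns true if the lock was acquired. */
static bool backoff_trylock(struct backoff_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until acquired, waiting longer after every failed attempt. */
static void backoff_lock_acquire(struct backoff_lock *l)
{
    unsigned delay = BACKOFF_MIN;
    while (!backoff_trylock(l)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                        /* busy-wait without touching the lock */
        delay = backoff_next(delay); /* exponential growth of the delay */
    }
}

/* Release the lock so that another thread's next attempt can succeed. */
static void backoff_lock_release(struct backoff_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

With this shape, every failed attempt lengthens the next wait, so a thread
that keeps losing the race spends a growing share of its time in the delay
loop rather than retrying.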

However, in the default configuration (no binding), the load is balanced well
by migrating threads among the cores, so threads get a chance to share the
lock data structures, which improves performance compared with the binding
configuration.

Please find the prstat -Lm output per thread in both the configurations below:

No-Binding (Default) Configuration
==================================
 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 15637 user   93 0.2 0.0 0.0 0.0 6.6 0.0 0.1 186  34 437   0 myprogram/13
 15637 user   92 0.2 0.0 0.0 0.0 8.0 0.0 0.1 176  36 399   0 myprogram/11
 15637 user   91 0.2 0.0 0.0 0.0 8.8 0.0 0.1 201  34 398   0 myprogram/10
 15637 user   89 0.2 0.0 0.0 0.0  11 0.0 0.2 253  34 450   0 myprogram/12
 15637 user   87 0.2 0.0 0.0 0.0  13 0.0 0.1 194  34 414   0 myprogram/17
 15637 user   87 0.2 0.0 0.0 0.0  13 0.0 0.1 187  34 416   0 myprogram/9
 15637 user   86 0.2 0.0 0.0 0.0  13 0.0 0.1 188  34 420   0 myprogram/21
 15637 user   86 0.1 0.0 0.0 0.0  14 0.0 0.1 227  45 454   0 myprogram/3
 15637 user   86 0.2 0.0 0.0 0.0  14 0.0 0.1 215  37 443   0 myprogram/15
 15637 user   86 0.2 0.0 0.0 0.0  14 0.0 0.1 212  35 435   0 myprogram/7
 15637 user   85 0.2 0.0 0.0 0.0  14 0.0 0.3 258  43 520   0 myprogram/2
 15637 user   85 0.2 0.0 0.0 0.0  15 0.0 0.1 213  34 454   0 myprogram/5
 15637 user   85 0.2 0.0 0.0 0.0  15 0.0 0.1 216  80 438   0 myprogram/19
 15637 user   85 0.2 0.0 0.0 0.0  15 0.0 0.1 248  36 464   0 myprogram/6
 15637 user   84 0.2 0.0 0.0 0.0  15 0.0 0.1 257  35 474   0 myprogram/14
 15637 user   84 0.2 0.0 0.0 0.0  16 0.0 0.1 241  31 445   0 myprogram/18
 15637 user   83 0.2 0.0 0.0 0.0  17 0.0 0.2 256  30 467   0 myprogram/16
 15637 user   83 0.2 0.0 0.0 0.0  17 0.0 0.2 265  30 476   0 myprogram/8
 15637 user   83 0.2 0.0 0.0 0.0  17 0.0 0.2 257  31 467   0 myprogram/20
 15637 user   81 0.2 0.0 0.0 0.0  18 0.0 0.2 259  30 488   0 myprogram/4
 15637 user  0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 myprogram/1


Binding (thread-to-core) Configuration
======================================
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 15687 user  6.1 0.0 0.0 0.0 0.0  41 0.0  53  33   8  54   0 myprogram/13
 15687 user  5.7 0.0 0.0 0.0 0.0  32 0.0  62  31  10  38   0 myprogram/11
 15687 user  5.5 0.0 0.0 0.0 0.0  37 0.0  57  26  15  35   0 myprogram/10
 15687 user  5.4 0.0 0.0 0.0 0.0  47 0.0  47  34   6  78   0 myprogram/21
 15687 user  5.4 0.0 0.0 0.0 0.0  35 0.0  60  28  16  43   0 myprogram/17
 15687 user  5.2 0.0 0.0 0.0 0.0  42 0.0  53  33   6  59   0 myprogram/6
 15687 user  5.2 0.0 0.0 0.0 0.0  36 0.0  59  31   8  36   0 myprogram/15
 15687 user  5.2 0.0 0.0 0.0 0.0  56 0.0  39  36   7  72   0 myprogram/2
 15687 user  5.1 0.0 0.0 0.0 0.0  51 0.0  44  34   6  62   0 myprogram/5
 15687 user  5.0 0.0 0.0 0.0 0.0  50 0.0  45  33   6  54   0 myprogram/16
 15687 user  5.0 0.0 0.0 0.0 0.0  39 0.0  56  31   8  43   0 myprogram/7
 15687 user  4.9 0.0 0.0 0.0 0.0  38 0.0  57  33   7  41   0 myprogram/19
 15687 user  4.8 0.0 0.0 0.0 0.0  32 0.0  63  29  11  47   0 myprogram/12
 15687 user  4.7 0.0 0.0 0.0 0.0  43 0.0  53  31   8  36   0 myprogram/14
 15687 user  4.6 0.0 0.0 0.0 0.0  36 0.0  59  32   8  46   0 myprogram/8
 15687 user  4.5 0.0 0.0 0.0 0.0  51 0.0  45  33   5  63   0 myprogram/20
 15687 user  4.5 0.0 0.0 0.0 0.0  57 0.0  38  32   6  60   0 myprogram/18
 15687 user  4.4 0.0 0.0 0.0 0.0  59 0.0  37  31   7  66   0 myprogram/9
 15687 user  4.3 0.0 

[osol-discuss] Running OpenMP program

2011-02-02 Thread Kishore Kumar Pusukuri
Hi,
I have been compiling and running OpenMP parallel programs successfully so far
on my OpenSolaris.2009.10 machine. However, I am unable to run one program with
more than one thread (lwp). The details are below. Please let me know if there
is anything wrong with how I compile or run this particular program. I compile
it with g++-4.2.3 using the -fopenmp option and also link libmtsk.so.1; there
are no compilation warnings or errors. I run it with the following command
line:

$ env OMP_NUM_THREADS=8 myprogram

$ ldd myprogram
 libmtsk.so.1 =>     /lib/libmtsk.so.1
 libstdc++.so.6 =>   /usr/lib/libstdc++.so.6
 libm.so.2 =>        /lib/libm.so.2
 libgomp.so.1 =>     /usr/lib/libgomp.so.1
 libgcc_s.so.1 =>    /usr/lib/libgcc_s.so.1
 libpthread.so.1 =>  /lib/libpthread.so.1
 libc.so.1 =>        /lib/libc.so.1
 libthread.so.1 =>   /lib/libthread.so.1
 libdl.so.1 =>       /lib/libdl.so.1


[osol-discuss] system traps with FX scheduling policy

2011-01-28 Thread Kishore Kumar Pusukuri
Hi,

I am experimenting with the FX scheduling policy using different time quanta
on SPECOMP multithreaded programs. I am using prstat -Lm to analyze the effect
of the different time quanta on the performance of the programs.

Most of the programs experience system traps (TRP) with an FX time quantum of
10ms. However, there are no traps with time quanta of 100ms, 200ms, and
higher. I understand that changing the time quantum changes other prstat
fields such as context switches and lock contention, but I don't understand
why I am getting traps only with the 10ms time quantum. My machine is a
multi-core AMD Opteron running Solaris 10.

Please see the prstat output below (for FX 200ms, 100ms, and 10ms). I am
also providing stack traces from the FX 10ms run.

Please clarify my confusion. Many thanks.


FX with 200ms
-------------
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
23062 user   77 0.1 0.0 0.0 0.0 3.1 0.0  20 119  88  1K   0 myprogram/1
23062 user   73 0.0 0.0 0.0 0.0 9.6 0.0  18 185 351 206   0 myprogram/11
23062 user   72 0.0 0.0 0.0 0.0 9.3 0.0  19 185  72 204   0 myprogram/8
23062 user   71 0.0 0.0 0.0 0.0  10 0.0  19 178  83 194   0 myprogram/17
.
.
23062 user   69 0.0 0.0 0.0 0.0 9.5 0.0  21 180  85 196   0 myprogram/18
...
...
23062 user   69 0.0 0.0 0.0 0.0 9.8 0.0  21 193  84 206   0 myprogram/31



FX with 100ms
-------------
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
23089 user   76 0.1 0.0 0.0 0.0 2.0 0.0  22 138 164  2K   0 myprogram/1
23089 user   70 0.0 0.0 0.0 0.0  10 0.0  20 211 157 227   0 myprogram/10
23089 user   70 0.0 0.0 0.0 0.0  10 0.0  20 220 435 238   0 myprogram/7


23089 user   69 0.0 0.0 0.0 0.0  10 0.0  20 214 153 228   0 myprogram/21
23089 user   69 0.0 0.0 0.0 0.0  10 0.0  21 221 138 241   0 myprogram/4

...
23089 user   68 0.0 0.0 0.0 0.0  10 0.0  22 206 155 223   0 myprogram/9
23089 user   68 0.0 0.0 0.0 0.0  10 0.0  22 215 136 232   0 myprogram/24



FX with 10ms
------------
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
23105 user   78 0.1 0.1 0.0 0.0 1.3 0.0  21 168  1K  2K   0 myprogram/1
23105 user   68 0.0 0.1 0.0 0.0  10 0.0  21 241  1K 267   0 myprogram/2
23105 user   68 0.0 0.2 0.0 0.0  10 0.0  22 245  1K 270   0 myprogram/16
23105 user   68 0.0 0.2 0.0 0.0  10 0.0  22 246  1K 271   0 myprogram/25

...
23105 user   68 0.0 0.1 0.0 0.0  10 0.0  22 255  1K 281   0 myprogram/8
23105 user   68 0.0 0.1 0.0 0.0  10 0.0  22 235  1K 260   0 myprogram/21
...
... 
23105 user   67 0.0 0.1 0.0 0.0  10 0.0  22 250  1K 273   0 myprogram/20



Stack traces of the program with FX 10ms

$ pstack 23137/25
23137:  ./myprogram
-  lwp# 25 / thread# 25  
 fd7ffa86abcb omp_set_lock () + 8b
 0041e813 _$d1A593.mm_fv_update_nonbon () + 10c3
 01bd0001  ()
 4054eef5073a9994  ()

$ pstack 23137/16
23137:  ./myprogram
-  lwp# 16 / thread# 16  
 0041eb6d _$d1A593.mm_fv_update_nonbon () + 141d
 01bd0001  ()
 4054ec19011d549c  ()

$ pstack 23137/21
23137:  ./myprogram
-  lwp# 21 / thread# 21  
 0041ebfa _$d1A593.mm_fv_update_nonbon () + 14aa
 01bd0001  ()
 4054ea1dc63c899c  ()

$ pstack 23137/2
23137:  ./myprogram
-  lwp# 2 / thread# 2  
 fd7ffa896f12 atomic_store () + 2
 0041ec68 _$d1A593.mm_fv_update_nonbon () + 1518
 01bd0001  ()
 4054e9adebe3c3da  ()


[osol-discuss] Microbenchmarks for synchronization mechanisms

2011-01-26 Thread Kishore Kumar Pusukuri
Hi,

Could you help me find or develop microbenchmarks for stressing the
synchronization mechanisms of OpenSolaris?

Thank you.


[osol-discuss] plockstat(1) along with ppgsz(1)

2011-01-26 Thread Kishore Kumar Pusukuri
Hi,

I have been using plockstat to analyze user-space lock contention in
multithreaded programs. I am also optimizing the performance of those
programs using larger pages (with ppgsz(1)). I am unable to get any data
(output) from plockstat(1) when I run the program under ppgsz, although I do
get output without ppgsz. Moreover, there are no error or warning messages in
the ppgsz case.

Could you tell me what is happening here, please?

The following is the command line:

$ pfexec plockstat -C ppgsz -o heap=2M my-program


[osol-discuss] Fixed-Priority Scheduling Policy vs Round-Robin Scheduling

2010-11-16 Thread Kishore Kumar Pusukuri
Does the Fixed-Priority scheduling policy function similarly to round-robin
scheduling with a fixed time quantum (somewhat like SCHED_RR in Linux)?

Please let me know.

Thank you.


[osol-discuss] Change scheduling class of a thread (lwp) of a process.

2010-11-12 Thread Kishore Kumar Pusukuri
Hi,
We can change the scheduling class of a process (either single-threaded or
multithreaded) using the priocntl(1) utility. However, I would like to use
different scheduling classes for different threads (lwps) of one process on
OpenSolaris.2009.06. As far as I can see, priocntl(1) has no such option: it
seems to change the scheduling class for the whole process (the same class
for all of its threads).

Could anyone tell me if there is a way to do this, please?

Thank you.


[osol-discuss] Calculating the size of shared memory of a multi-threaded application

2010-10-15 Thread Kishore Kumar Pusukuri
Since heap and stack represent private memory, can I say that the shared memory 
(among the 16 threads) in this process is: 307008 - (size of heap + size of 
stacks) = 307008 - [306528 + (16 * 4)]?

25908:  ./process -timing -threads 16 -lastframe 100
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08045000      12      12      12       - rwx--  [ stack ]
08050000    2288    1768       -       - r-x--  process
0829B000     260     260      12       - rwx--  process
082DC000  336972  306528  306528       - rwx--  [ heap ]
FDD8C000       4       4       4       - rwx-R  [ stack tid=16 ]
FDE8B000       4       4       4       - rwx-R  [ stack tid=15 ]
FDF8A000       4       4       4       - rwx-R  [ stack tid=14 ]
FE089000       4       4       4       - rwx-R  [ stack tid=13 ]
FE188000       4       4       4       - rwx-R  [ stack tid=12 ]
FE287000       4       4       4       - rwx-R  [ stack tid=11 ]
FE386000       4       4       4       - rwx-R  [ stack tid=10 ]
FE485000       4       4       4       - rwx-R  [ stack tid=9 ]
FE584000       4       4       4       - rwx-R  [ stack tid=8 ]
FE683000       4       4       4       - rwx-R  [ stack tid=7 ]
FE782000       4       4       4       - rwx-R  [ stack tid=6 ]
FE881000       4       4       4       - rwx-R  [ stack tid=5 ]
FE980000       4       4       4       - rwx-R  [ stack tid=4 ]
FEA7F000       4       4       4       - rwx-R  [ stack tid=3 ]
FEB7E000       4       4       4       - rwx-R  [ stack tid=2 ]
FEB80000     320     252       -       - r-x--  libm.so.2
FEBDF000       8       8       8       - rwx--  libm.so.2
FEC30000      64      64      64       - rwx--  [ anon ]
FEC50000      64      64      64       - rw---  [ anon ]
FEC6C000     128     128     128       - rw---  [ anon ]
FEC8D000       4       4       -       - rwxs-  [ anon ]
FEC90000      24      12      12       - rwx--  [ anon ]
FECA0000       4       4       4       - rwx--  [ anon ]
FECB0000    1276     976       -       - r-x--  libc_hwcap2.so.1
FEDFF000      28      28      28       - rwx--  libc_hwcap2.so.1
FEE06000       8       8       8       - rwx--  libc_hwcap2.so.1
FEE10000      12      12       -       - r-x--  libpthread.so.1
FEE20000      52      48       -       - r-x--  libgcc_s.so.1
FEE3C000       4       4       4       - rwx--  libgcc_s.so.1
FEE40000       4       4       4       - rwx--  [ anon ]
FEE50000     856     732       -       - r-x--  libstdc++.so.6.0.10
FEF35000     160     108      32       - rwx--  libstdc++.so.6.0.10
FEF5D000      24      12      12       - rwx--  libstdc++.so.6.0.10
FEF70000      12      12       -       - r-x--  libmtmalloc.so.1
FEF83000       4       4       4       - rw---  libmtmalloc.so.1
FEF90000       4       4       4       - rwx--  [ anon ]
FEFA0000       4       4       4       - rw---  [ anon ]
FEFB0000       4       4       4       - rw---  [ anon ]
FEFBE000     180     180       -       - r-x--  ld.so.1
FEFFB000       8       8       8       - rwx--  ld.so.1
FEFFD000       4       4       4       - rwx--  ld.so.1
-------- ------- ------- ------- -------
total Kb  342852  311316  307008       -
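
The subtraction proposed at the top of this message can be worked through as
a small sketch. It only evaluates the formula; whether the remainder really
is memory shared among the threads is the open question, not something the
code settles. The helper name shared_estimate_kb is hypothetical, and all
values are in kilobytes as printed by pmap -x.

```c
/*
 * Evaluate: total Anon - (heap Anon + stack Anon).
 * All arguments and the result are in kilobytes.
 */
static long shared_estimate_kb(long total_anon_kb, long heap_anon_kb,
                               long stacks_anon_kb)
{
    return total_anon_kb - (heap_anon_kb + stacks_anon_kb);
}
```

With the figures from the message, shared_estimate_kb(307008, 306528, 16 * 4)
gives 416 KB. The listing itself shows one 12 KB main stack plus fifteen 4 KB
thread stacks (tid=2 through tid=16), i.e. 72 KB of stack rather than 64 KB,
which gives 408 KB.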


[osol-discuss] Shared-memory Size of an application

2010-10-14 Thread Kishore Kumar Pusukuri
Hi,
I used pmap -x to find the shared memory size of a multithreaded application. 
The output of this command is:
.
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08045000      12      12      12       - rwx--  [ stack ]
.
.
.
FEFFB000       8       8       8       - rwx--  ld.so.1
FEFFD000       4       4       4       - rwx--  ld.so.1
-------- ------- ------- ------- -------
total Kb  342852  311296  306988       -

Kbytes: 342852
Res:   311296
Anon: 306988

Can I say that the shared memory size is (Res - Anon), i.e., around 4MB?
And the private memory size is the Anon size, right?

Please let me know.


[osol-discuss] Changing Memory Placement Policies through a C program

2010-09-21 Thread Kishore Kumar Pusukuri
Hi,
I know how to change memory placement policies using mdb.
For example, the following applies the Round Robin policy instead of the
default First-Touch policy:
$ pfexec mdb -kw
...
lgrp_mem_default_policy/W 5
.

However, could you tell me how to do the same through a C program, please? 

Thank you


[osol-discuss] Using madvise(3C)

2010-08-27 Thread Kishore Kumar Pusukuri
Hi,
I am trying out madvise on my AMD machine running OpenSolaris.2009.06.
However, I get the following error when I compile the program below with
/usr/sfw/bin/g++. Please help me resolve this.

 457: error: `madvise' undeclared (first use this function)
 457:  error: (Each undeclared identifier is reported only once for each 
function it appears in.)

Program
=======

#include <sys/types.h>
#include <sys/mman.h>
...
...

int
main (void)
{
  int size = numOptions*sizeof(OptionData);
  data = (OptionData*)malloc(size);

  if (data == NULL) {
    perror("Fatal Error: malloc failed");
    exit(-1);
  }

  int ret = madvise(data, size, MADV_ACCESS_MANY);
  if (ret == -1) {
    perror("Fatal Error: madvise failed");
    exit(-2);
  }
  ...
  ...

  return 0;
}


[osol-discuss] Using madvise(3C)

2010-08-27 Thread Kishore Kumar Pusukuri
Hi,
I am trying out madvise on my AMD machine running OpenSolaris.2009.06.
However, I get the following error when I compile the program below with
/usr/sfw/bin/g++. Please help me resolve this. I am also not sure whether my
usage of madvise is correct. Please let me know.

error: `madvise' undeclared (first use this function)
error: (Each undeclared identifier is reported only once for each function it 
appears in.)

Program
=======

#include <sys/types.h>
#include <sys/mman.h>
...
...

int
main (void)
{
  int size = numOptions*sizeof(OptionData);
  data = (OptionData*)malloc(size);

  if (data == NULL) {
    perror("Fatal Error: malloc failed");
    exit(-1);
  }

  int ret = madvise(data, size, MADV_ACCESS_MANY);
  if (ret == -1) {
    perror("Fatal Error: madvise failed");
    exit(-2);
  }
  ...
  ...

  return 0;
}
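
On the usage question, a hedged sketch follows. It is not a confirmed fix for
the compile error above. madvise works on whole pages, and on some systems
(Linux, for one) it fails with EINVAL when the start address is not
page-aligned, which a pointer returned by malloc generally is not. The sketch
therefore takes page-aligned memory from mmap. Assumptions: alloc_advised is
a hypothetical helper, the feature-test macros at the top are a guess at what
exposes the madvise declaration on each platform, and MADV_NORMAL is
substituted where MADV_ACCESS_MANY is not defined.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* glibc: expose madvise() and MAP_ANON */
#endif
#ifndef __EXTENSIONS__
#define __EXTENSIONS__    /* assumption: exposes non-POSIX names on Solaris */
#endif
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Use the Solaris-specific advice when the platform defines it. */
#ifdef MADV_ACCESS_MANY
#define ADVICE MADV_ACCESS_MANY
#else
#define ADVICE MADV_NORMAL
#endif

/*
 * Map `size` bytes of page-aligned anonymous memory and apply ADVICE to it.
 * Returns the mapping, or NULL if either mmap or madvise fails.
 */
static void *alloc_advised(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (madvise(p, size, ADVICE) == -1) {
        munmap(p, size);   /* do not leak the mapping on failure */
        return NULL;
    }
    return p;
}
```

The caller releases the region with munmap(p, size) rather than free(),
since the memory does not come from malloc.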


Re: [osol-discuss] Measuring cost of TLB misses

2010-08-27 Thread Kishore Kumar Pusukuri
Thank you, Marty.


[osol-discuss] Using collect utility

2010-08-27 Thread Kishore Kumar Pusukuri
Hi,
I am trying to use the collect utility to study TLB misses of a
multi-threaded program running on my AMD multi-core machine with
OpenSolaris.2009.06. However, the program hangs (with and also without
umask) when I run it under collect. Please find below the prstat -m output of
the program with and without the collect utility. Is this a problem with the
collect utility or with the way I used it? Please let me know.


$ collect -h DC_dtlb_L1_miss_L2_miss~umask=0x01 ./program

$ uname -a
SunOS opensolaris 5.11 snv_111b i86pc i386 i86pc Solaris

prstat output with collect
==========================
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
2201 pusukuri 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 program/16


prstat output without collect
=============================
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
2311 pusukuri 51 0.5 0.0 0.0 0.0  48 0.0 0.2  1K  23  2K   0 program/16


Re: [osol-discuss] Measuring cost of TLB misses

2010-08-24 Thread Kishore Kumar Pusukuri
Thank you, Bart.


[osol-discuss] How to change Memory Placement Optimization tunable parameters using mdb

2010-08-23 Thread Kishore Kumar Pusukuri
Hi,
Please let me know how to change MPO tunable parameters using mdb on my AMD
Opteron machine running OpenSolaris 2009.06.

For example, how do I change lgrp_mem_default_policy to LGRP_MEM_POLICY_RANDOM?


[osol-discuss] Measuring cost of TLB misses

2010-08-20 Thread Kishore Kumar Pusukuri
Hi,
I am able to measure the TLB miss rate of a multi-threaded application running
on my multi-core AMD Opteron machine by reading performance monitoring event
counters with the cpustat utility. However, I would like to measure the
amount of time spent on TLB misses. Specifically, I am looking for a way that
works like the trapstat utility. Please share your ideas.

Thank you.


[osol-discuss] Performance Degradation with 1GB Pages for Heap

2010-08-20 Thread Kishore Kumar Pusukuri
Hi,
My AMD Opteron supports 4KB, 2MB and 1GB page sizes. I observed a performance
improvement (reduced elapsed time) for some multi-threaded applications when
I used a 2MB page size for the heap. These applications need around 650MB of
heap (each reads a huge file of around 650MB). However, when I used 1GB pages
for the heap, the performance of these programs degraded. If we use 1GB
pages, then one page is enough to satisfy the heap space, right? Then why do
we see performance degradation in this case? Please let me know.

$ pagesize -a
4096
2097152
1073741824
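
The page-count reasoning above can be checked with a short sketch. It assumes
the heap is exactly 650 MB (650 * 1024 * 1024 bytes), where the message says
"around 650MB"; the helper names pages_needed and slack_bytes are
hypothetical.

```c
#include <stdint.h>

/* Number of pages of `page_size` bytes needed to cover `bytes` (rounds up). */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Bytes mapped beyond what was asked for, due to rounding up to whole pages. */
static uint64_t slack_bytes(uint64_t bytes, uint64_t page_size)
{
    return pages_needed(bytes, page_size) * page_size - bytes;
}
```

For a 650 MB heap this gives 166400 pages at 4096 bytes, 325 pages at 2097152
bytes, and a single page at 1073741824 bytes, so one 1GB page does cover the
heap. That single page also carries 374 MB of slack beyond the 650 MB in use,
whereas the two smaller sizes divide 650 MB evenly.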


[osol-discuss] libmtmalloc vs libumem

2010-08-13 Thread Kishore Kumar Pusukuri
Hi,
I am able to understand how libmtmalloc works from the documentation in the
libmtmalloc.c source file. However, I am unable to find proper documentation
for libumem. Could someone describe the key differences between libmtmalloc
and libumem, please? Please also point me to any documentation or papers on
the design of libumem.

Thank you.


[osol-discuss] Change segvn cache size

2010-08-13 Thread Kishore Kumar Pusukuri
Hi,
I observed that one multi-threaded application generates a large number of
cross-calls (xcalls) on my AMD multi-core machine. A snapshot of the stack
traces is shown below. I think this is because of segvn activity, i.e.
unmapping pages generates cross-calls to maintain MMU-level coherence across
the processors (according to the Solaris Internals book). I read that
increasing the segmap cache size can improve the performance of some
multi-threaded applications (those that do heavy file I/O). Using adb, I am
able to change the size of the segmap cache. However, this benefit applies
only to file systems other than ZFS. I have ZFS, and I am unable to change
the size of the segvn cache. I don't know whether ZFS uses the segvn cache or
not. I tried to change the size of the segvn cache with adb, as I did for
segmap, but failed with the message shown below.

$ pfexec adb -kw /dev/ksyms /dev/mem
physmem 7ff23f
segmapsize/D
segmapsize: 67108864

segvnsize/D
adb: failed to dereference symbol: unknown symbol name


Could anyone tell me how I can increase the size of the segvn cache on my machine?

$ pfexec dtrace -n 'xcalls /execname=="my_multithreaded"/ { @[stack()] = count() }'
dtrace: description 'xcalls ' matched 2 probes

  unix`xc_do_call+0x135
  unix`xc_call+0x4b
  unix`hat_tlb_inval+0x2af
  unix`unlink_ptp+0x92
  unix`htable_release+0xfa
  unix`hat_unload_callback+0x1d8
  genunix`segvn_unmap+0x255
  genunix`as_unmap+0xf2
  genunix`munmap+0x80
  unix`sys_syscall32+0x101
  377

  unix`xc_do_call+0x135
  unix`xc_call+0x4b
  unix`hat_tlb_inval+0x2af
  unix`x86pte_update+0x69
  unix`hati_update_pte+0x10c
  unix`hat_pagesync+0x169
  genunix`pvn_getdirty+0x5d
  zfs`zfs_putpage+0x1c7
  genunix`fop_putpage+0x74
  genunix`segvn_sync+0x137
  genunix`as_ctl+0x200
  genunix`memcntl+0x764
  unix`sys_syscall32+0x101
  946

  unix`xc_do_call+0x135
  unix`xc_call+0x4b
  unix`hat_tlb_inval+0x2af
  unix`unlink_ptp+0x92
  unix`htable_release+0xfa
  unix`hat_unload_callback+0x24a
  genunix`segvn_unmap+0x255
  genunix`as_unmap+0xf2
  genunix`munmap+0x80
  unix`sys_syscall32+0x101
  2494
. 



[osol-discuss] Mapping the kernel heap with large pages

2010-08-13 Thread Kishore Kumar Pusukuri
Hi,

One of my applications spends around 90% of its total execution time reading
a huge file using the read system call. I thought that I could improve the
performance of the application by increasing the page size for the kernel
heap. I know that I can increase the page size of the application heap on the
fly using the ppgsz command, but I don't know how to change the page size for
the kernel heap. Could anyone tell me whether there is a command for this, or
whether it is possible through kdb/adb, please?

My machine is a multi-core AMD Opteron running OpenSolaris.2009.06. File system 
is ZFS.

Also, please let me know if there are any ideas (tunable parameters) for
improving file I/O on ZFS.

Thank you.

Best,
Kishore


[osol-discuss] Segmentation fault when using libhoard memory allocator and getting core dump

2010-08-10 Thread Kishore Kumar Pusukuri
I am trying to see the impact of different memory allocators on multi-threaded
workloads on my AMD machine running OpenSolaris 2009.06. I successfully used
libmtmalloc and libumem; however, the program dumps core (with a segmentation
fault) when I use libhoard_32.so. I didn't get any errors when I compiled
against libhoard_32.so. Moreover, it runs fine with 1 to 6 threads; it dumps
core with 7 threads and above.

The program is a multi-threaded program called swaptions. The stack trace of
the core dump is provided below. Could you please tell me what the problem is,
or give me any directions for finding the actual problem?

$ file /usr/lib/libhoard_32.so 
/usr/lib/libhoard_32.so:ELF 32-bit LSB dynamic lib 80386 Version 1 
[SSE2 SSE AMD_3DNow CMOV], dynamically linked, not stripped

$ ldd swaptions 
libpthread.so.1 =>  /lib/libpthread.so.1
/usr/lib/libhoard_32.so
libCrun.so.1 =>     /usr/lib/libCrun.so.1
libstdc++.so.6 =>   /usr/lib/libstdc++.so.6
libm.so.2 =>        /lib/libm.so.2
libgcc_s.so.1 =>    /usr/lib/libgcc_s.so.1
libc.so.1 =>        /lib/libc.so.1
libthread.so.1 =>   /usr/lib/lwp/libthread.so.1
libdl.so.1 =>       /lib/libdl.so.1
 
$ mdb core
Loading modules: [ ld.so.1 ]
> ::stack
libhoard_32.so`__1cFHoardMHoardManager4n0AVAlignedSuperblockHeap4nCHLMSpinLock
Type_udsyQ__n0AKGlobalHeap4udsyQiIn0C___n0APHoardSuperblock4n0C_idsyQn0AJSmall
Heap___iIn0C_n0AbBhoardThresholdFunctionClass_n0F__Gmalloc6MI_pv_+0x12d(
fef155b4, 1088, febdfee4, feefad51)
libhoard_32.so`__1cCHLKHybridHeap4ibwjWnFHoardOThreadPoolHeap4ibnKieYn0BSPerTh
readHoardHeap___n0BHBigHeap__Gmalloc6MI_pv_+0x82(fef13850, 1088, fd9c2d48, 
feef8364)
libhoard_32.so`malloc+0xe3(1088, fd9c2d78, feefe59d, fef155b4)
_Z7dmatrix+0x51(0, 2, 0, af, fef11d50, fd9c2dac)
_Z28HJM_SimPath_Forward_BlockingPPdiidS_S_S0_Pli+0x51(fe4d0224, b, 3, 0, 
4016, fea83ac8)
_Z21HJM_Swaption_BlockingPddiidS_PS_llii+0x438(fd9c2fa4, 999a, 
3fb9, 0, 0, 0)
_Z6workerPv+0xae(8047c48, fed4f000, fd9c2fec, fecbcd1e)
libc_hwcap2.so.1`_thrp_setup+0x7e(fe937400)
libc_hwcap2.so.1`_lwp_start(fe937400, 0, 0, fecbcd1e, 0, 0)

> ::status
debugging core file of swaptions (32-bit) from opensolaris

...
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=C


[osol-discuss] Using multiple page sizes

2010-08-09 Thread Kishore Kumar Pusukuri
I would like to see the impact of different page sizes on the performance of
multi-threaded applications. However, the pagesize -a command reports only
3 possible page sizes, including the default 4KB, on my AMD machine (shown
below). Are these the only page sizes I can use, or is there any way to use
others?

kish...@opensolaris:~$ pagesize -a
4096
2097152
1073741824


[osol-discuss] Changing page coloring algorithm using mdb/adb

2010-08-06 Thread Kishore Kumar Pusukuri
Hi,
I would like to see the impact of different page coloring algorithms on my AMD
machine running OpenSolaris 2009.06. Following the book Solaris Internals -
Solaris 10 and OpenSolaris (page 515), I tried to change the page coloring
algorithm on the fly using mdb -kw, but without success. However, I am able
to change segmapsize using adb. Could someone help me experiment with page
coloring algorithms using either mdb or adb?

$ pfexec mdb -kw
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp 
scsi_vhci zfs mpt sd sockfs ip hook neti sctp arp usba fctl md lofs fcip fcp 
cpc random crypto logindmux ptm ufs nsmb sppp ipc nfs ]
> consistent_coloring/D
mdb: failed to dereference symbol: unknown symbol name
