Rolf,

> Is it possible that everything is working just as it should?
That's what I'm afraid of :-). But I did not expect to see such
communication overhead due to blocking from mpiBLAST, which is very
coarse-grained. I then tried HPL, which is computation-heavy, and found
the same thing. Also, the system time seemed to correspond to the MPI
processes cycling between run and sleep (as seen via top), and I thought
that setting the mpi_yield_when_idle parameter to 0 would keep the
processes from entering sleep state when blocking. But it doesn't.

Todd

On 3/23/07 2:06 PM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:

> Todd:
>
> I assume the system time is being consumed by the calls to send and
> receive data over the TCP sockets. As the number of processes in the
> job increases, more time is spent waiting for data from one of the
> other processes.
>
> I did a little experiment on a single node to see the difference in
> system time consumed when running over TCP vs. when running over
> shared memory. When running on a single node and using the sm btl,
> I see almost 100% user time. I assume this is because the sm btl
> handles sending and receiving its data within a shared memory
> segment. However, when I switch over to TCP, I see my system time
> go up. Note that this is on Solaris.
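[Editor's note: Todd's attempt to set mpi_yield_when_idle can be done two
ways; a minimal sketch, with the executable name taken from the runs
shown later in this thread. Note that, as Ralph explains further down,
Open MPI may reset this parameter internally if it believes the node is
oversubscribed.]

```shell
# Force "aggressive" mode (do not call sched_yield when idle).
# Either export the MCA parameter before launching...
export OMPI_MCA_mpi_yield_when_idle=0
mpirun -np 8 ./hpcc.amd64

# ...or pass it directly on the mpirun command line:
mpirun -np 8 --mca mpi_yield_when_idle 0 ./hpcc.amd64
```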
> RUNNING OVER SELF,SM
>> mpirun -np 8 -mca btl self,sm hpcc.amd64
>
>  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
> 3505 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
> 3503 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
> 3499 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
> 3497 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
> 3501 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
> 3507 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
> 3509 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
> 3495 rolfv     97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1
>
> RUNNING OVER SELF,TCP
>> mpirun -np 8 -mca btl self,tcp hpcc.amd64
>
>  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
> 4316 rolfv     93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
> 4328 rolfv     91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
> 4324 rolfv     98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
> 4320 rolfv     88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
> 4322 rolfv     94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
> 4318 rolfv     92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
> 4326 rolfv     93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
> 4314 rolfv     91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1
>
> I also ran HPL over a larger cluster of 6 nodes, and noticed even
> higher system times.
>
> And lastly, I ran a simple MPI test over a cluster of 64 nodes,
> 2 procs per node, using Sun HPC ClusterTools 6, and saw about a
> 50/50 split between user and system time.
>
>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
> 11525 rolfv     55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0 maxtrunc_ct6/1
> 11526 rolfv     54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0 maxtrunc_ct6/1
>
> Is it possible that everything is working just as it should?
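[Editor's note: the per-process tables above are Solaris microstate
accounting output; that column set (USR, SYS, TRP, ..., VCX voluntary
and ICX involuntary context switches) is what `prstat -m` prints. A
sketch of how such a comparison might be collected; the user name and
refresh interval are illustrative.]

```shell
# Launch the job in the background...
mpirun -np 8 -mca btl self,tcp hpcc.amd64 &

# ...then watch per-process microstates: -m enables the microstate
# accounting columns (USR/SYS/.../VCX/ICX), -u filters by user,
# and the trailing 1 refreshes every second.
prstat -m -u rolfv 1
```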
> Rolf
>
> Heywood, Todd wrote On 03/22/07 13:30,:
>
>> Ralph,
>>
>> Well, according to the FAQ, aggressive mode can be "forced", so I did
>> try setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also
>> tried turning processor/memory affinity on. Effects were minor. The
>> MPI tasks still cycle between run and sleep states, driving up
>> system time well over user time.
>>
>> Mpstat shows SGE is indeed giving 4 or 2 slots per node as
>> appropriate (depending on memory) and the MPI tasks are using 4 or 2
>> cores, but to be sure, I also tried running directly with a hostfile
>> with slots=4 or slots=2. The same behavior occurs.
>>
>> This behavior is a function of the size of the job. I.e., as I scale
>> from 200 to 800 tasks, the run/sleep cycling increases, so that
>> system time grows from maybe half the user time to maybe 5 times
>> user time.
>>
>> This is for TCP/gigE.
>>
>> Todd
>>
>> On 3/22/07 12:19 PM, "Ralph Castain" <r...@lanl.gov> wrote:
>>
>>> Just for clarification: ompi_info only shows the *default* value of
>>> the MCA parameter. In this case, mpi_yield_when_idle defaults to
>>> aggressive, but that value is reset internally if the system sees
>>> an "oversubscribed" condition.
>>>
>>> The issue here isn't how many cores are on the node, but rather how
>>> many were specifically allocated to this job. If the allocation
>>> wasn't at least 2 (in your example), then we would automatically
>>> reset mpi_yield_when_idle to be non-aggressive, regardless of how
>>> many cores are actually on the node.
>>>
>>> Ralph
>>>
>>> On 3/22/07 7:14 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>>
>>>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots
>>>> run on a 4-core node, the 2 tasks are still cycling between run
>>>> and sleep, with higher system time than user time.
>>>>
>>>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0
>>>> (aggressive), so that suggests the tasks aren't swapping out on
>>>> blocking calls.
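[Editor's note: running "directly with a hostfile with slots=4 or
slots=2", as Todd describes above, would look roughly like this; the
hostnames and process count are placeholders.]

```shell
# myhosts: declare how many slots each node actually has, so Open MPI
# does not assume oversubscription and switch to sched_yield mode.
cat > myhosts <<'EOF'
node01 slots=4
node02 slots=4
node03 slots=2
EOF

mpirun -np 10 --hostfile myhosts ./myapp
```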
>>>> Still puzzled.
>>>>
>>>> Thanks,
>>>> Todd
>>>>
>>>> On 3/22/07 7:36 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>
>>>>> Are you using a scheduler on your system?
>>>>>
>>>>> More specifically, does Open MPI know that you have four process
>>>>> slots on each node? If you are using a hostfile and didn't
>>>>> specify "slots=4" for each host, Open MPI will think that it's
>>>>> oversubscribing and will therefore call sched_yield() in the
>>>>> depths of its progress engine.
>>>>>
>>>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>>>>
>>>>>> P.S. I should have said this is a pretty coarse-grained
>>>>>> application, and netstat doesn't show much communication going
>>>>>> on (except in stages).
>>>>>>
>>>>>> On 3/21/07 4:21 PM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>>>>>
>>>>>>> I noticed that my OpenMPI processes are using larger amounts of
>>>>>>> system time than user time (via vmstat, top). I'm running on
>>>>>>> dual-core, dual-CPU Opterons, with 4 slots per node, where the
>>>>>>> program has the nodes to themselves. A closer look showed that
>>>>>>> they are constantly switching between run and sleep states with
>>>>>>> 4-8 page faults per second.
>>>>>>>
>>>>>>> Why would this be? It doesn't happen with 4 sequential jobs
>>>>>>> running on a node, where I get 99% user time, maybe 1% system
>>>>>>> time.
>>>>>>>
>>>>>>> The processes have plenty of memory. This behavior occurs
>>>>>>> whether I use processor/memory affinity or not (there is no
>>>>>>> oversubscription).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Todd
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users