Rolf,

> Is it possible that everything is working just as it should?

That's what I'm afraid of :-). But I did not expect to see such
communication overhead due to blocking from mpiBLAST, which is very
coarse-grained. I then tried HPL, which is computation-heavy, and found
the same thing. Also, the system time seemed to correspond to the MPI
processes cycling between run and sleep states (as seen via top), and I
thought that setting the mpi_yield_when_idle parameter to 0 would keep
the processes from entering the sleep state when blocking. But it
doesn't.
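
For reference, here is the kind of thing I tried; the process count and
binary name below are just placeholders:

  # force aggressive mode for a single run:
  mpirun -np 256 -mca mpi_yield_when_idle 0 ./mpiblast

  # or set it in the environment before launching:
  export OMPI_MCA_mpi_yield_when_idle=0
  mpirun -np 256 ./mpiblast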

Todd



On 3/23/07 2:06 PM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:

> 
> Todd:
> 
> I assume the system time is being consumed by
> the calls to send and receive data over the TCP sockets.
> As the number of processes in the job increases, more
> time is spent waiting for data from one of the other processes.
> 
> I did a little experiment on a single node to see the difference
> in system time consumed when running over TCP vs. shared memory.
> When running on a single node and using the sm btl, I see almost
> 100% user time. I assume this is because the sm btl sends and
> receives its data within a shared memory segment. However, when I
> switch over to TCP, my system time goes up. Note that this is on
> Solaris.
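> 
> (The per-process numbers below are Solaris microstate accounting; a
> display like this comes from something along the lines of
> 
>    prstat -m
> 
> which breaks out user/system time and context switches per process.)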
> 
> RUNNING OVER SELF,SM
>> mpirun -np 8 -mca btl self,sm hpcc.amd64
> 
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>   3505 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
>   3503 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
>   3499 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
>   3497 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
>   3501 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
>   3507 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
>   3509 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
>   3495 rolfv     97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1
> 
> RUNNING OVER SELF,TCP
>> mpirun -np 8 -mca btl self,tcp hpcc.amd64
> 
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>   4316 rolfv     93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
>   4328 rolfv     91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
>   4324 rolfv     98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
>   4320 rolfv     88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
>   4322 rolfv     94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
>   4318 rolfv     92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
>   4326 rolfv     93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
>   4314 rolfv     91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1
> 
> I also ran HPL over a larger cluster of 6 nodes, and noticed even higher
> system times. 
> 
> And lastly, I ran a simple MPI test over a cluster of 64 nodes,
> 2 procs per node, using Sun HPC ClusterTools 6, and saw about a
> 50/50 split between user and system time.
> 
>   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>  11525 rolfv     55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0 maxtrunc_ct6/1
>  11526 rolfv     54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0 maxtrunc_ct6/1
> 
> Is it possible that everything is working just as it should?
> 
> Rolf
> 
> Heywood, Todd wrote on 03/22/07 13:30:
> 
>> Ralph,
>> 
>> Well, according to the FAQ, aggressive mode can be "forced," so I did
>> try setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried
>> turning processor/memory affinity on. The effects were minor: the MPI
>> tasks still cycle between run and sleep states, driving system time
>> well above user time.
>> 
>> Mpstat shows that SGE is indeed giving 4 or 2 slots per node as
>> appropriate (depending on memory) and that the MPI tasks are using 4 or
>> 2 cores, but to be sure, I also tried running directly with a hostfile
>> with slots=4 or slots=2. The same behavior occurs.
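>> 
>> In case it helps, this is the shape of the invocation I've been using;
>> the hostfile name, process count, and binary are placeholders, and
>> mpi_paffinity_alone is the processor-affinity switch I mentioned:
>> 
>>   export OMPI_MCA_mpi_yield_when_idle=0
>>   mpirun -np 800 -hostfile hosts -mca mpi_paffinity_alone 1 ./mpiblast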
>> 
>> This behavior is a function of job size: as I scale from 200 to 800
>> tasks, the run/sleep cycling increases, and system time grows from
>> roughly half the user time to roughly 5 times the user time.
>> 
>> This is for TCP/gigE.
>> 
>> Todd
>> 
>> 
>> On 3/22/07 12:19 PM, "Ralph Castain" <r...@lanl.gov> wrote:
>> 
>>  
>> 
>>> Just for clarification: ompi_info only shows the *default* value of the MCA
>>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>>> that value is reset internally if the system sees an "oversubscribed"
>>> condition.
>>> 
>>> The issue here isn't how many cores are on the node, but rather how many
>>> were specifically allocated to this job. If the allocation wasn't at least 2
>>> (in your example), then we would automatically reset mpi_yield_when_idle to
>>> be non-aggressive, regardless of how many cores are actually on the node.
>>> 
>>> Ralph
>>> 
>>> 
>>> On 3/22/07 7:14 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>> 
>>>    
>>> 
>>>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>>>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>>>> higher system time than user time.
>>>> 
>>>> ompi_info shows the MCA parameter mpi_yield_when_idle to be 0
>>>> (aggressive), which suggests the tasks aren't being swapped out on
>>>> blocking calls.
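>>>> 
>>>> (I checked it with something along the lines of
>>>> 
>>>>   ompi_info --param mpi all | grep yield
>>>> 
>>>> which lists mpi_yield_when_idle and its value.)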
>>>> 
>>>> Still puzzled.
>>>> 
>>>> Thanks,
>>>> Todd
>>>> 
>>>> 
>>>> On 3/22/07 7:36 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>> 
>>>>      
>>>> 
>>>>> Are you using a scheduler on your system?
>>>>> 
>>>>> More specifically, does Open MPI know that you have four process slots
>>>>> on each node?  If you are using a hostfile and didn't specify
>>>>> "slots=4" for each host, Open MPI will think that it's
>>>>> oversubscribing the nodes and will therefore call sched_yield() in
>>>>> the depths of its progress engine.
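>>>>> 
>>>>> For example, a hostfile along these lines (host names illustrative):
>>>>> 
>>>>>   node01 slots=4
>>>>>   node02 slots=4
>>>>> 
>>>>> launched with
>>>>> 
>>>>>   mpirun -np 8 -hostfile myhosts ./a.out
>>>>> 
>>>>> tells Open MPI that each node really has 4 slots, so it will spin
>>>>> rather than call sched_yield().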
>>>>> 
>>>>> 
>>>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>>>> 
>>>>>        
>>>>> 
>>>>>> P.S. I should have said that this is a pretty coarse-grained
>>>>>> application, and netstat doesn't show much communication going on
>>>>>> (except in stages).
>>>>>> 
>>>>>> 
>>>>>> On 3/21/07 4:21 PM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>>>>> 
>>>>>>          
>>>>>> 
>>>>>>> I noticed that my Open MPI processes are using larger amounts of
>>>>>>> system time than user time (via vmstat, top). I'm running on
>>>>>>> dual-core, dual-CPU Opterons, with 4 slots per node, where the
>>>>>>> program has the nodes to themselves. A closer look showed that
>>>>>>> they are constantly switching between run and sleep states, with
>>>>>>> 4-8 page faults per second.
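>>>>>>> 
>>>>>>> (Observed with nothing fancier than
>>>>>>> 
>>>>>>>   vmstat 5    # compare the us and sy columns
>>>>>>>   top         # per-process run/sleep state and CPU split
>>>>>>> 
>>>>>>> on the compute nodes.)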
>>>>>>> 
>>>>>>> Why would this be? It doesn't happen with 4 sequential jobs
>>>>>>> running on a
>>>>>>> node, where I get 99% user time, maybe 1% system time.
>>>>>>> 
>>>>>>> The processes have plenty of memory. This behavior occurs whether
>>>>>>> I use
>>>>>>> processor/memory affinity or not (there is no oversubscription).
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Todd
>>>>>>> 