Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-14 Thread Eugene Loh

David Mathog wrote:

> Is there a tool in openmpi that will reveal how much "spin time" the 
> processes are using?


I don't know what sort of answer is helpful for you, but I'll describe 
one option.


With Oracle Message Passing Toolkit (formerly Sun ClusterTools; in any case, 
an OMPI distribution from Oracle/Sun) and Oracle Solaris Studio 
Performance Analyzer (formerly Sun Studio Performance Analyzer), you can 
see how much time is spent in MPI work, MPI wait, and so on.  
Specifically, by process, you could see (I'm making an example up) that 
process 2 spent:

* 35% of its time in application-level computation
* 5% of its time in MPI moving data
* 60% of its time in MPI waiting
but process 7 spent:
* 90% of its time in application-level computation
* 5% of its time in MPI moving data
* only 5% of its time in MPI waiting
That is, beyond the usual profiling support you might find in other 
tools, with Performance Analyzer you can distinguish time spent in MPI 
moving data from time spent in MPI waiting.


On the other hand, you perhaps don't need that much detail.  For your 
purposes, it may suffice just to know how much time each process is 
spending in MPI.  There are various profiling tools that will give you 
that.  See http://www.open-mpi.org/faq/?category=perftools  Load 
balancing is a common problem people investigate with such tools.
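
If you just want a rough number without installing anything, you can even 
roll a crude version of that yourself with PMPI.  An untested sketch, for 
MPI_Bcast only (the same idea extends to whatever other calls you use):

  // Sketch only: accumulate time spent in MPI_Bcast per rank and report it
  // at MPI_Finalize.  Measures work + wait together, i.e. "time in MPI".
  #include <mpi.h>
  #include <cstdio>

  static double bcast_seconds = 0.0;

  extern "C" int MPI_Bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
  {
      double t0 = PMPI_Wtime();
      int rc = PMPI_Bcast(buf, count, type, root, comm);
      bcast_seconds += PMPI_Wtime() - t0;
      return rc;
  }

  extern "C" int MPI_Finalize(void)
  {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      std::printf("rank %d: %.2f s in MPI_Bcast\n", rank, bcast_seconds);
      return PMPI_Finalize();
  }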


Finally, if you want to stick to tools like top, maybe another 
alternative is to get your application to go into sleep waits.  I can't 
say this is the best choice, but it could be fun/interesting.  Let's say 
your application only calls a handful of different MPI functions.  Write 
PMPI wrappers for them that convert blocking functions 
(MPI_Send/MPI_Recv) to non-blocking ones mixed with short sleep calls.  
Not pretty, but might just be doable for your case.  I don't know.  
Anyhow, that might make MPI wait time detectable with tools like top.
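
A minimal sketch of such a wrapper (untested; just the idea): intercept 
MPI_Recv, post a PMPI_Irecv instead, and poll it with short sleeps.  The 
1 ms nap below is an arbitrary choice and directly trades latency for 
lower CPU usage:

  // Sketch only: turn the blocking receive into a non-blocking one that
  // sleeps between polls, so the wait shows up as idle time in top.
  #include <mpi.h>
  #include <unistd.h>   // usleep()

  extern "C" int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                          int source, int tag, MPI_Comm comm,
                          MPI_Status *status)
  {
      MPI_Request req;
      int done = 0;
      int rc = PMPI_Irecv(buf, count, datatype, source, tag, comm, &req);
      if (rc != MPI_SUCCESS) return rc;

      while (!done) {
          rc = PMPI_Test(&req, &done, status);   // also drives MPI progress
          if (rc != MPI_SUCCESS) return rc;
          if (!done) usleep(1000);               // nap ~1 ms between polls
      }
      return MPI_SUCCESS;
  }

Link that (or LD_PRELOAD it) ahead of the MPI library, and the wait time 
should show up as sleep rather than spin.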


Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread David Mathog
Ralph Castain wrote:
> Bottom line for users: the results remain the same. If no other
process wants time, you'll continue to see near 100% utilization even if
we yield because we will always poll for some time before deciding to yield.

Not surprisingly, I am seeing this with recv/send too, at least when
nothing else is running.  This is true even though all workers are on
different nodes (so no need for shared memory connection between them).

Is there a tool in openmpi that will reveal how much "spin time" the
processes are using?  The previous version of the program I'm currently
working on used PVM, and for that implementation gstat, top, etc.
provided a good idea of the percent activity on the compute nodes.  Not
so here.  At the moment our cluster is heterogeneous with 3 nodes about
3X faster than the other 20.  Because of a lack of load balancing
(that's what I am trying to address now) the fast nodes must be idle
around 60% of the time, since they will finish their task long before
the other nodes, but I can't see it, can you?  Here are the relevant
columns from one gstat reading; the idle values jump around between 
machines with no apparent pattern.  The 3 faster ones are 02, 05, and 
15, but there is no way to tell that from this data:

[  User,  Nice, System, Idle, Wio]
01 [  49.7,   0.0,  50.3,   0.0,   0.0]
02 [  41.4,   0.0,  58.6,   0.0,   0.0]
03 [  43.2,   0.0,  49.7,   7.0,   0.0]
04 [  38.8,   0.0,  46.0,  15.2,   0.0]
05 [  38.6,   0.0,  46.4,  15.0,   0.0]
06 [  48.3,   0.0,  51.7,   0.0,   0.0]
07 [  38.5,   0.0,  46.6,  14.9,   0.0]
08 [  43.8,   0.0,  51.3,   4.8,   0.0]
09 [  44.9,   0.0,  48.8,   6.3,   0.0]
10 [  48.9,   0.0,  49.1,   2.0,   0.0]
11 [  50.7,   0.0,  49.3,   0.0,   0.0]
12 [  46.8,   0.0,  53.2,   0.0,   0.0]
13 [  48.4,   0.0,  51.6,   0.0,   0.0]
14 [  44.2,   0.0,  48.2,   7.6,   0.0]
15 [  43.3,   0.0,  56.7,   0.0,   0.0]
16 [  44.7,   0.0,  50.3,   5.0,   0.0]
17 [  42.8,   0.0,  57.2,   0.0,   0.0]
18 [  50.7,   0.0,  49.3,   0.0,   0.0]
19 [  46.9,   0.0,  45.2,   7.9,   0.0]
20 [  46.0,   0.0,  48.9,   5.1,   0.0]

Top is even less help; it just shows the worker process on each node at
>98% CPU.

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Jeff Squyres
On Dec 13, 2010, at 11:00 AM, Hicham Mouline wrote:

> In various interfaces, like network sockets, or threads waiting for data from 
> somewhere, there are various solutions based on _not_ checking the state of 
> the socket or some sort of  queue continuously, but sort of getting 
> _interrupted_ when there is data around, or like condition variables for 
> threads.

OMPI currently busy polls for all progress in the MPI layer, even for TCP 
sockets.  Our progression engine is (currently) based on the premise of not 
blocking, so we have to poll everything.

This design decision was based on several things, including assumptions that 
others have already mentioned in this thread (e.g., MPI jobs typically have 
complete "ownership" of the resources that they're running on, shared memory 
and other transports require polling to check for progress, ...etc.).

We developers have previously discussed how to make the MPI layer block 
(instead of busy poll), but there has never been a compelling need to do so.  
Specifically, conversion to allow a blocking model is a fair amount of complex 
work that no one has allocated any time/resources to do.  :-\

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Ralph Castain
OMPI does use those methods, but they can't be used for something like shared 
memory. So if you want the performance benefit of shared memory, then we have 
to poll.
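
To make that concrete (an illustration only, not our actual progress code): 
a TCP endpoint has a file descriptor the kernel can block on, while a 
shared-memory queue is just bytes in a mapped segment, so the only way to 
notice a new fragment is to keep looking at it:

  // Illustration only - not Open MPI's progress engine.
  #include <poll.h>
  #include <atomic>

  // A socket has a file descriptor, so the kernel can put us to sleep and
  // wake us when data arrives (near-zero CPU while waiting).
  bool wait_on_socket(int fd, int timeout_ms)
  {
      struct pollfd p = { fd, POLLIN, 0 };
      return poll(&p, 1, timeout_ms) > 0;
  }

  // A shared-memory queue has no descriptor to hand to the kernel; the
  // receiver can only re-read a flag until it changes, which is why the
  // waiting process shows up at ~100% CPU.
  void wait_on_shmem(const std::atomic<int> &fragment_count, int last_seen)
  {
      while (fragment_count.load(std::memory_order_acquire) == last_seen) {
          // spin: lowest latency, but burns the CPU
      }
  }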


On Dec 13, 2010, at 9:00 AM, Hicham Mouline wrote:

> I don't understand one thing, though, and would appreciate your comments.
>  
> In various interfaces, like network sockets, or threads waiting for data from 
> somewhere, there are solutions based on _not_ checking the state of the 
> socket or some queue continuously, but instead getting _interrupted_ 
> when data arrives, or using something like condition variables for threads.
> I am not very clear on these points, but it seems that in these contexts, 
> continuous polling is avoided and so actual CPU usage is usually not close to 
> 100%.
>  
> Why can't something similar be implemented for broadcast, for example?
>  
> -Original Message-
> From: "Jeff Squyres" [jsquy...@cisco.com]
> Date: 13/12/2010 03:55 PM
> To: "Open MPI Users" 
> Subject: Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu
> 
> I think there *was* a decision: it effectively changed how sched_yield() 
> operates, and it may not do what we expect any more.
> 
> See this thread (the discussion of Linux/sched_yield() comes in the later 
> messages):
> 
> http://www.open-mpi.org/community/lists/users/2010/07/13729.php
> 
> I believe there are similar threads in the MPICH mailing list archives; that's 
> why Dave posted on the OMPI list about it.
> 
> We briefly discussed replacing OMPI's sched_yield() with a usleep(1), but it 
> was shot down.
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Hicham Mouline
I don't understand one thing, though, and would appreciate your comments.
 
In various interfaces, like network sockets, or threads waiting for data from 
somewhere, there are solutions based on _not_ checking the state of the 
socket or some queue continuously, but instead getting _interrupted_ 
when data arrives, or using something like condition variables for threads.
I am not very clear on these points, but it seems that in these contexts, 
continuous polling is avoided and so actual CPU usage is usually not close to 
100%.
 
Why can't something similar be implemented for broadcast, for example?
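
For example (just to make concrete the kind of waiting I mean; nothing 
MPI-specific, using a condition variable as found in newer C++ or the Boost 
equivalent), a thread can sleep at essentially 0% CPU until another thread 
signals that data has arrived:

  // Sketch of an "interrupt"-style wait using a condition variable.
  #include <condition_variable>
  #include <mutex>

  std::mutex m;
  std::condition_variable cv;
  bool data_ready = false;

  void consumer()                      // sleeps (no CPU) until notified
  {
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [] { return data_ready; });
      // ... handle the data ...
  }

  void producer()                      // called when the data arrives
  {
      { std::lock_guard<std::mutex> lock(m); data_ready = true; }
      cv.notify_one();
  }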
 
-Original Message-
From: "Jeff Squyres" [jsquy...@cisco.com]
Date: 13/12/2010 03:55 PM
To: "Open MPI Users" 
Subject: Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

I think there *was* a decision: it effectively changed how sched_yield() 
operates, and it may not do what we expect any more.

See this thread (the discussion of Linux/sched_yield() comes in the later 
messages):

http://www.open-mpi.org/community/lists/users/2010/07/13729.php

I believe there are similar threads in the MPICH mailing list archives; that's 
why Dave posted on the OMPI list about it.

We briefly discussed replacing OMPI's sched_yield() with a usleep(1), but it 
was shot down.





Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Hicham Mouline
very clear, thanks very much.
 
 
-Original Message-
From: "Ralph Castain" [r...@open-mpi.org]
Date: 13/12/2010 03:49 PM
To: "Open MPI Users" 
Subject: Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

Thanks for the link!

Just to clarify for the list, my original statement is essentially correct. 
When calling sched_yield, we give up the remaining portion of our time slice.

The issue in the kernel world centers around where to put you in the scheduling 
cycle once you have called sched_yield. Do you go to the end of the schedule 
for your priority? Do you go to the end of the schedule for all priorities? 
Or...where?

Looks like they decided not to decide and left several options available. I'm not 
entirely clear on the default, and they recommend we not use sched_yield and 
instead release the time by some other method. We'll take this up on the developer list to 
see what (if anything) we want to do about it.

Bottom line for users: the results remain the same. If no other process wants 
time, you'll continue to see near 100% utilization even if we yield because we 
will always poll for some time before deciding to yield.






Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Jeff Squyres
I think there *was* a decision: it effectively changed how sched_yield() 
operates, and it may not do what we expect any more.

See this thread (the discussion of Linux/sched_yield() comes in the later 
messages):

http://www.open-mpi.org/community/lists/users/2010/07/13729.php

I believe there are similar threads in the MPICH mailing list archives; that's 
why Dave posted on the OMPI list about it.

We briefly discussed replacing OMPI's sched_yield() with a usleep(1), but it 
was shot down.


On Dec 13, 2010, at 10:47 AM, Ralph Castain wrote:

> Thanks for the link!
> 
> Just to clarify for the list, my original statement is essentially correct. 
> When calling sched_yield, we give up the remaining portion of our time slice.
> 
> The issue in the kernel world centers around where to put you in the 
> scheduling cycle once you have called sched_yield. Do you go to the end of 
> the schedule for your priority? Do you go to the end of the schedule for all 
> priorities? Or...where?
> 
> Looks like they decided not to decide and left several options available. 
> I'm not entirely clear on the default, and they recommend we not use sched_yield 
> and instead release the time by some other method. We'll take this up on the developer 
> list to see what (if anything) we want to do about it.
> 
> Bottom line for users: the results remain the same. If no other process wants 
> time, you'll continue to see near 100% utilization even if we yield because 
> we will always poll for some time before deciding to yield.
> 
> 
> On Dec 13, 2010, at 7:52 AM, Jeff Squyres wrote:
> 
>> See the discussion on kerneltrap:
>> 
>>   http://kerneltrap.org/Linux/CFS_and_sched_yield
>> 
>> Looks like the change came in somewhere around 2.6.23 or so...?
>> 
>> 
>> 
>> On Dec 13, 2010, at 9:38 AM, Ralph Castain wrote:
>> 
>>> Could you at least provide a one-line explanation of that statement?
>>> 
>>> 
>>> On Dec 13, 2010, at 7:31 AM, Jeff Squyres wrote:
>>> 
>>>> Also note that recent versions of the Linux kernel have changed what 
>>>> sched_yield() does -- it no longer does essentially what Ralph describes 
>>>> below.  Google around to find those discussions.
>>>> 
>>>> 
>>>> On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:
>>>> 
>>>>> Sorry for delay - am occupied with my day job.
>>>>> 
>>>>> Yes, that is correct to an extent. When you yield the processor, all that 
>>>>> happens is that you surrender the rest of your scheduled time slice back 
>>>>> to the OS. The OS then cycles thru its scheduler and sequentially assigns 
>>>>> the processor to the line of waiting processes. Eventually, the OS will 
>>>>> cycle back to your process, and you'll begin cranking again.
>>>>> 
>>>>> So if no other process wants or needs attention, then yes - it will cycle 
>>>>> back around to you pretty quickly. In cases where only system processes 
>>>>> are running (besides my MPI ones, of course), then I'll typically see cpu 
>>>>> usage drop a few percentage points - down to like 95% - because most 
>>>>> system tools are very courteous and call yield if they don't need to do 
>>>>> something. If there is something out there that wants time, or is less 
>>>>> courteous, then my cpu usage can change a great deal.
>>>>> 
>>>>> Note, though, that top and ps are -very- coarse measuring tools. You'll 
>>>>> probably see them reading more like 100% simply because, averaged out 
>>>>> over their sampling periods, nobody else is using enough to measure the 
>>>>> difference.
>>>>> 
>>>>> 
>>>>> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
>>>>> 
>>>>>>> -Original Message-
>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>>>>> Behalf Of Eugene Loh
>>>>>>> Sent: 08 December 2010 16:19
>>>>>>> To: Open MPI Users
>>>>>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>>>>>> 100% cpu
>>>>>>> 
>>>>>>> I wouldn't mind some clarification here.  Would CPU usage really
>>>>>>> decrease, or would other processes simply have an easier time getting
>>>>>>> cycles?  My impression of yield was that if there were no one to yield
>>>>>>>

Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Ralph Castain
Thanks for the link!

Just to clarify for the list, my original statement is essentially correct. 
When calling sched_yield, we give up the remaining portion of our time slice.

The issue in the kernel world centers around where to put you in the scheduling 
cycle once you have called sched_yield. Do you go to the end of the schedule 
for your priority? Do you go to the end of the schedule for all priorities? 
Or...where?

Looks like they decided not to decide and left several options available. I'm not 
entirely clear on the default, and they recommend we not use sched_yield and 
instead release the time by some other method. We'll take this up on the developer list to 
see what (if anything) we want to do about it.

Bottom line for users: the results remain the same. If no other process wants 
time, you'll continue to see near 100% utilization even if we yield because we 
will always poll for some time before deciding to yield.
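
Schematically (an illustration of the pattern only, not a copy of the actual 
code, and with a made-up helper name), the wait loop behaves roughly like 
this:

  // Illustration of the "poll hard, optionally yield" pattern.
  #include <sched.h>

  void progress_all_transports() { /* hypothetical stand-in: poll every transport once */ }

  void wait_for_completion(volatile int *done, bool yield_when_idle)
  {
      while (!*done) {
          progress_all_transports();   // always poll first, for low latency
          if (yield_when_idle)
              sched_yield();           // offer back the rest of the time slice;
                                       // if nothing else is runnable the
                                       // scheduler hands it straight back, so
                                       // top still reports ~100% utilization
      }
  }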


On Dec 13, 2010, at 7:52 AM, Jeff Squyres wrote:

> See the discussion on kerneltrap:
> 
>http://kerneltrap.org/Linux/CFS_and_sched_yield
> 
> Looks like the change came in somewhere around 2.6.23 or so...?
> 
> 
> 
> On Dec 13, 2010, at 9:38 AM, Ralph Castain wrote:
> 
>> Could you at least provide a one-line explanation of that statement?
>> 
>> 
>> On Dec 13, 2010, at 7:31 AM, Jeff Squyres wrote:
>> 
>>> Also note that recent versions of the Linux kernel have changed what 
>>> sched_yield() does -- it no longer does essentially what Ralph describes 
>>> below.  Google around to find those discussions.
>>> 
>>> 
>>> On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:
>>> 
>>>> Sorry for delay - am occupied with my day job.
>>>> 
>>>> Yes, that is correct to an extent. When you yield the processor, all that 
>>>> happens is that you surrender the rest of your scheduled time slice back 
>>>> to the OS. The OS then cycles thru its scheduler and sequentially assigns 
>>>> the processor to the line of waiting processes. Eventually, the OS will 
>>>> cycle back to your process, and you'll begin cranking again.
>>>> 
>>>> So if no other process wants or needs attention, then yes - it will cycle 
>>>> back around to you pretty quickly. In cases where only system processes 
>>>> are running (besides my MPI ones, of course), then I'll typically see cpu 
>>>> usage drop a few percentage points - down to like 95% - because most 
>>>> system tools are very courteous and call yield if they don't need to do 
>>>> something. If there is something out there that wants time, or is less 
>>>> courteous, then my cpu usage can change a great deal.
>>>> 
>>>> Note, though, that top and ps are -very- coarse measuring tools. You'll 
>>>> probably see them reading more like 100% simply because, averaged out over 
>>>> their sampling periods, nobody else is using enough to measure the 
>>>> difference.
>>>> 
>>>> 
>>>> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
>>>> 
>>>>>> -Original Message-
>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>>>> Behalf Of Eugene Loh
>>>>>> Sent: 08 December 2010 16:19
>>>>>> To: Open MPI Users
>>>>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>>>>> 100% cpu
>>>>>> 
>>>>>> I wouldn't mind some clarification here.  Would CPU usage really
>>>>>> decrease, or would other processes simply have an easier time getting
>>>>>> cycles?  My impression of yield was that if there were no one to yield
>>>>>> to, the "yielding" process would still go hard.  Conversely, turning on
>>>>>> "yield" would still show 100% cpu, but it would be easier for other
>>>>>> processes to get time.
>>>>>> 
>>>>> Any clarifications?
>>>>> 
>>>>> ___
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>>> 
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Jeff Squyres
See the discussion on kerneltrap:

http://kerneltrap.org/Linux/CFS_and_sched_yield

Looks like the change came in somewhere around 2.6.23 or so...?



On Dec 13, 2010, at 9:38 AM, Ralph Castain wrote:

> Could you at least provide a one-line explanation of that statement?
> 
> 
> On Dec 13, 2010, at 7:31 AM, Jeff Squyres wrote:
> 
>> Also note that recent versions of the Linux kernel have changed what 
>> sched_yield() does -- it no longer does essentially what Ralph describes 
>> below.  Google around to find those discussions.
>> 
>> 
>> On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:
>> 
>>> Sorry for delay - am occupied with my day job.
>>> 
>>> Yes, that is correct to an extent. When you yield the processor, all that 
>>> happens is that you surrender the rest of your scheduled time slice back to 
>>> the OS. The OS then cycles thru its scheduler and sequentially assigns the 
>>> processor to the line of waiting processes. Eventually, the OS will cycle 
>>> back to your process, and you'll begin cranking again.
>>> 
>>> So if no other process wants or needs attention, then yes - it will cycle 
>>> back around to you pretty quickly. In cases where only system processes are 
>>> running (besides my MPI ones, of course), then I'll typically see cpu usage 
>>> drop a few percentage points - down to like 95% - because most system tools 
>>> are very courteous and call yield if they don't need to do something. If 
>>> there is something out there that wants time, or is less courteous, then my 
>>> cpu usage can change a great deal.
>>> 
>>> Note, though, that top and ps are -very- coarse measuring tools. You'll 
>>> probably see them reading more like 100% simply because, averaged out over 
>>> their sampling periods, nobody else is using enough to measure the 
>>> difference.
>>> 
>>> 
>>> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
>>> 
>>>>> -Original Message-
>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>>> Behalf Of Eugene Loh
>>>>> Sent: 08 December 2010 16:19
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>>>> 100% cpu
>>>>> 
>>>>> I wouldn't mind some clarification here.  Would CPU usage really
>>>>> decrease, or would other processes simply have an easier time getting
>>>>> cycles?  My impression of yield was that if there were no one to yield
>>>>> to, the "yielding" process would still go hard.  Conversely, turning on
>>>>> "yield" would still show 100% cpu, but it would be easier for other
>>>>> processes to get time.
>>>>> 
>>>> Any clarifications?
>>>> 
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Ralph Castain
Could you at least provide a one-line explanation of that statement?


On Dec 13, 2010, at 7:31 AM, Jeff Squyres wrote:

> Also note that recent versions of the Linux kernel have changed what 
> sched_yield() does -- it no longer does essentially what Ralph describes 
> below.  Google around to find those discussions.
> 
> 
> On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:
> 
>> Sorry for delay - am occupied with my day job.
>> 
>> Yes, that is correct to an extent. When you yield the processor, all that 
>> happens is that you surrender the rest of your scheduled time slice back to 
>> the OS. The OS then cycles thru its scheduler and sequentially assigns the 
>> processor to the line of waiting processes. Eventually, the OS will cycle 
>> back to your process, and you'll begin cranking again.
>> 
>> So if no other process wants or needs attention, then yes - it will cycle 
>> back around to you pretty quickly. In cases where only system processes are 
>> running (besides my MPI ones, of course), then I'll typically see cpu usage 
>> drop a few percentage points - down to like 95% - because most system tools 
>> are very courteous and call yield if they don't need to do something. If 
>> there is something out there that wants time, or is less courteous, then my 
>> cpu usage can change a great deal.
>> 
>> Note, though, that top and ps are -very- coarse measuring tools. You'll 
>> probably see them reading more like 100% simply because, averaged out over 
>> their sampling periods, nobody else is using enough to measure the 
>> difference.
>> 
>> 
>> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
>> 
>>>> -Original Message-
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>> Behalf Of Eugene Loh
>>>> Sent: 08 December 2010 16:19
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>>> 100% cpu
>>>> 
>>>> I wouldn't mind some clarification here.  Would CPU usage really
>>>> decrease, or would other processes simply have an easier time getting
>>>> cycles?  My impression of yield was that if there were no one to yield
>>>> to, the "yielding" process would still go hard.  Conversely, turning on
>>>> "yield" would still show 100% cpu, but it would be easier for other
>>>> processes to get time.
>>>> 
>>> Any clarifications?
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-13 Thread Jeff Squyres
Also note that recent versions of the Linux kernel have changed what 
sched_yield() does -- it no longer does essentially what Ralph describes below. 
 Google around to find those discussions.


On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:

> Sorry for delay - am occupied with my day job.
> 
> Yes, that is correct to an extent. When you yield the processor, all that 
> happens is that you surrender the rest of your scheduled time slice back to 
> the OS. The OS then cycles thru its scheduler and sequentially assigns the 
> processor to the line of waiting processes. Eventually, the OS will cycle 
> back to your process, and you'll begin cranking again.
> 
> So if no other process wants or needs attention, then yes - it will cycle 
> back around to you pretty quickly. In cases where only system processes are 
> running (besides my MPI ones, of course), then I'll typically see cpu usage 
> drop a few percentage points - down to like 95% - because most system tools 
> are very courteous and call yield if they don't need to do something. If 
> there is something out there that wants time, or is less courteous, then my 
> cpu usage can change a great deal.
> 
> Note, though, that top and ps are -very- coarse measuring tools. You'll 
> probably see them reading more like 100% simply because, averaged out over 
> their sampling periods, nobody else is using enough to measure the difference.
> 
> 
> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
> 
>>> -Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Eugene Loh
>>> Sent: 08 December 2010 16:19
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>> 100% cpu
>>> 
>>> I wouldn't mind some clarification here.  Would CPU usage really
>>> decrease, or would other processes simply have an easier time getting
>>> cycles?  My impression of yield was that if there were no one to yield
>>> to, the "yielding" process would still go hard.  Conversely, turning on
>>> "yield" would still show 100% cpu, but it would be easier for other
>>> processes to get time.
>>> 
>> Any clarifications?
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-09 Thread Ralph Castain
Sorry for delay - am occupied with my day job.

Yes, that is correct to an extent. When you yield the processor, all that 
happens is that you surrender the rest of your scheduled time slice back to the 
OS. The OS then cycles thru its scheduler and sequentially assigns the 
processor to the line of waiting processes. Eventually, the OS will cycle back 
to your process, and you'll begin cranking again.

So if no other process wants or needs attention, then yes - it will cycle back 
around to you pretty quickly. In cases where only system processes are running 
(besides my MPI ones, of course), I'll typically see cpu usage drop a few 
percentage points - down to like 95% - because most system tools are very 
courteous and call yield if they don't need to do something. If there is 
something out there that wants time, or is less courteous, then my cpu usage 
can change a great deal.

Note, though, that top and ps are -very- coarse measuring tools. You'll 
probably see them reading more like 100% simply because, averaged out over 
their sampling periods, nobody else is using enough to measure the difference.


On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:

>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Eugene Loh
>> Sent: 08 December 2010 16:19
>> To: Open MPI Users
>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>> 100% cpu
>> 
>> I wouldn't mind some clarification here.  Would CPU usage really
>> decrease, or would other processes simply have an easier time getting
>> cycles?  My impression of yield was that if there were no one to yield
>> to, the "yielding" process would still go hard.  Conversely, turning on
>> "yield" would still show 100% cpu, but it would be easier for other
>> processes to get time.
>> 
> Any clarifications?
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-09 Thread Hicham Mouline
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Eugene Loh
> Sent: 08 December 2010 16:19
> To: Open MPI Users
> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
> 100% cpu
> 
> I wouldn't mind some clarification here.  Would CPU usage really
> decrease, or would other processes simply have an easier time getting
> cycles?  My impression of yield was that if there were no one to yield
> to, the "yielding" process would still go hard.  Conversely, turning on
> "yield" would still show 100% cpu, but it would be easier for other
> processes to get time.
> 
Any clarifications?



Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-08 Thread Eugene Loh

Ralph Castain wrote:


> I know we have said this many times - OMPI made a design decision to poll hard 
> while waiting for messages to arrive to minimize latency.
> 
> If you want to decrease cpu usage, you can use the yield_when_idle option (it 
> will cost you some latency, though) - see ompi_info --param ompi all
 

I wouldn't mind some clarification here.  Would CPU usage really 
decrease, or would other processes simply have an easier time getting 
cycles?  My impression of yield was that if there were no one to yield 
to, the "yielding" process would still go hard.  Conversely, turning on 
"yield" would still show 100% cpu, but it would be easier for other 
processes to get time.



> Or don't set affinity and we won't be as aggressive - but you'll lose some 
> performance
> 
> Choice is yours! :-)
 



Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-08 Thread Richard Treumann
Also -

HPC clusters are commonly dedicated to running parallel jobs with exactly 
one process per CPU.  HPC is about getting computation done and letting a 
CPU time slice among competing processes always has overhead (CPU time not 
spent on the computation).

Unless you are trying to run extra processes that take turns for the 
available processors, there is no gain from freeing up a CPU during a 
blocking call.  It is the difference between spinning in the poll and 
spinning in the OS "idle process".

Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




From:
Ralph Castain <r...@open-mpi.org>
To:
Open MPI Users <us...@open-mpi.org>
Date:
12/08/2010 10:36 AM
Subject:
Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu
Sent by:
users-boun...@open-mpi.org



I know we have said this many times - OMPI made a design decision to poll 
hard while waiting for messages to arrive to minimize latency.

If you want to decrease cpu usage, you can use the yield_when_idle option 
(it will cost you some latency, though) - see ompi_info --param ompi all

Or don't set affinity and we won't be as aggressive - but you'll lose some 
performance

Choice is yours! :-)

On Dec 8, 2010, at 8:08 AM, Hicham Mouline wrote:

> Hello,
> 
> on win32 openmpi 1.4.3, I have a slave process that reaches this 
pseudo-code and then blocks and the CPU usage for that process stays at 
25% all the time (I have a quadcore processor). When I set the affinity to 
1 of the cores, that core is 100% busy because of my slave process.
> 
> main()
> {
> 
> .
> MPI_ISEND
> 
> std::cout<< "about to get broadcast"<<std::endl;
> MPI_Bcast of an integer
> std::cout<< " broadcast received"<<std::endl;
> ...
> }
> 
> The first printout is seen but not the next, which makes me think the 
process is inside the MPI_Bcast call. Should the CPU be 100% busy while 
this call is waiting for the broadcast message to arrive?
> 
> Any ideas? below the output of ompi-info:
> 
--
> 
> Package: Open MPI 
>  Distribution
>Open MPI: 1.4.3
>   Open MPI SVN revision: r23834
>   Open MPI release date: Oct 05, 2010
>Open RTE: 1.4.3
>   Open RTE SVN revision: r23834
>   Open RTE release date: Oct 05, 2010
>OPAL: 1.4.3
>   OPAL SVN revision: r23834
>   OPAL release date: Oct 05, 2010
>Ident string: 1.4.3
>  Prefix: C:/Program Files/openmpi
> Configured architecture: x86 Windows-5.1
>  Configure host: LC12-003-D-055A
>   Configured by: hicham.mouline
>   Configured on: 18:07 19/11/2010
>  Configure host: 
>Built by: hicham.mouline
>Built on: 18:07 19/11/2010
>  Built host: 
>  C bindings: yes
>C++ bindings: yes
>  Fortran77 bindings: no
>  Fortran90 bindings: no
> Fortran90 bindings size: na
>  C compiler: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
> C compiler absolute: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
>C++ compiler: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
>   C++ compiler absolute: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
>  Fortran77 compiler: CMAKE_Fortran_COMPILER-NOTFOUND
>  Fortran77 compiler abs: none
>  Fortran90 compiler:
>  Fortran90 compiler abs: none
> C profiling: yes
>   C++ profiling: yes
> Fortran77 profiling: no
> Fortran90 profiling: no
>  C++ exceptions: no
>  Thread support: no
>   Sparse Groups: no
>  Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: no
>   Heterogeneous support: no
> mpirun default --prefix: yes
> MPI I/O support: yes
>   MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
>   FT Checkpoint support: yes  (checkpoint thread: no)
>   MCA backtrace: none (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA paffinity: windows (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA carto: auto_detect (MCA v2.0, API v2.0, Component 
v1.4.3)
>   MCA maffinity

Re: [OMPI users] curious behavior during wait for broadcast: 100% cpu

2010-12-08 Thread Ralph Castain
I know we have said this many times - OMPI made a design decision to poll hard 
while waiting for messages to arrive to minimize latency.

If you want to decrease cpu usage, you can use the yield_when_idle option (it 
will cost you some latency, though) - see ompi_info --param ompi all

Or don't set affinity and we won't be as aggressive - but you'll lose some 
performance

Choice is yours! :-)

On Dec 8, 2010, at 8:08 AM, Hicham Mouline wrote:

> Hello,
> 
> on win32 openmpi 1.4.3, I have a slave process that reaches this pseudo-code 
> and then blocks and the CPU usage for that process stays at 25% all the time 
> (I have a quadcore processor). When I set the affinity to 1 of the cores, 
> that core is 100% busy because of my slave process.
> 
> main()
> {
> 
> .
> MPI_ISEND
> 
> std::cout<< "about to get broadcast"<<std::endl;
> MPI_Bcast of an integer
> std::cout<< " broadcast received"<<std::endl;
> ...
> }
> 
> The first printout is seen but not the next, which makes me think the process 
> is inside the MPI_Bcast call. Should the CPU be 100% busy while this call is 
> waiting for the broadcast message to arrive?
> 
> Any ideas? below the output of ompi-info:
> --
> 
> Package: Open MPI 
>  Distribution
>Open MPI: 1.4.3
>   Open MPI SVN revision: r23834
>   Open MPI release date: Oct 05, 2010
>Open RTE: 1.4.3
>   Open RTE SVN revision: r23834
>   Open RTE release date: Oct 05, 2010
>OPAL: 1.4.3
>   OPAL SVN revision: r23834
>   OPAL release date: Oct 05, 2010
>Ident string: 1.4.3
>  Prefix: C:/Program Files/openmpi
> Configured architecture: x86 Windows-5.1
>  Configure host: LC12-003-D-055A
>   Configured by: hicham.mouline
>   Configured on: 18:07 19/11/2010
>  Configure host: 
>Built by: hicham.mouline
>Built on: 18:07 19/11/2010
>  Built host: 
>  C bindings: yes
>C++ bindings: yes
>  Fortran77 bindings: no
>  Fortran90 bindings: no
> Fortran90 bindings size: na
>  C compiler: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
> C compiler absolute: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
>C++ compiler: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
>   C++ compiler absolute: C:/Program Files/Microsoft Visual Studio
>  9.0/VC/bin/cl.exe
>  Fortran77 compiler: CMAKE_Fortran_COMPILER-NOTFOUND
>  Fortran77 compiler abs: none
>  Fortran90 compiler:
>  Fortran90 compiler abs: none
> C profiling: yes
>   C++ profiling: yes
> Fortran77 profiling: no
> Fortran90 profiling: no
>  C++ exceptions: no
>  Thread support: no
>   Sparse Groups: no
>  Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: no
>   Heterogeneous support: no
> mpirun default --prefix: yes
> MPI I/O support: yes
>   MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
>   FT Checkpoint support: yes  (checkpoint thread: no)
>   MCA backtrace: none (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA paffinity: windows (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA timer: windows (MCA v2.0, API v2.0, Component v1.4.3)
> MCA installdirs: windows (MCA v2.0, API v2.0, Component v1.4.3)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.3)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.3)
> MCA crs: none (MCA v2.0, API v2.0, Component v1.4.3)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4.3)
>  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4.3)
>MCA coll: basic (MCA v2.0, API v2.0, Component v1.4.3)
>MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4.3)
>MCA coll: self (MCA v2.0, API v2.0, Component v1.4.3)
>MCA coll: sm (MCA v2.0, API v2.0, Component v1.4.3)
>MCA coll: sync (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.4.3)
>   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.4.3)
> MCA pml: ob1 (MCA v2.0, API v2.0,