Ralph and I chatted on the phone about this.  Let's clarify a few things here 
for the user list:

1. It looks like we don't have this issue explicitly discussed on the FAQ.  We 
obliquely discuss it in:

http://www.open-mpi.org/faq/?category=all#oversubscribing
and
http://www.open-mpi.org/faq/?category=all#force-aggressive-degraded

I'll try to fix that this week.

2. Ralph's initial description is still correct.  OMPI calls sched_yield() in 
the middle of its progress loop when you enable the yield_when_idle behavior.  
This will *not* cause a (significant) reduction of CPU utilization because OMPI 
is still busy-polling.  But it will yield periodically so that other processes 
*can* run if the OS allows them to.  Due to OS bookkeeping, this yielding 
behavior may result in a slight reduction of top/ps-reported CPU utilization.  
But it's likely not significant.
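
To make the pattern concrete, here's a minimal sketch of what a polling 
progress loop with an optional yield looks like.  This is illustrative only -- 
it is *not* OMPI's actual progress engine, and poll_for_events() is just a 
made-up placeholder for whatever checks the network / shared memory for 
completions:

    #include <sched.h>
    #include <stdbool.h>

    /* Illustrative sketch only -- not OMPI's real code. */
    extern int poll_for_events(void);   /* hypothetical: returns # of completions */

    void progress_until(volatile bool *done, bool yield_when_idle)
    {
        while (!*done) {
            int completed = poll_for_events();   /* busy-poll: burns CPU */
            if (completed == 0 && yield_when_idle) {
                /* Give up the rest of our time slice so other runnable
                   processes can be scheduled.  We resume polling as soon as
                   the scheduler picks us again, so top/ps will still show
                   near-100% CPU if nothing else wants to run. */
                sched_yield();
            }
        }
    }

(For reference: you typically turn this behavior on with something like 
"mpirun --mca mpi_yield_when_idle 1 ...", and OMPI enables it automatically 
when it detects oversubscription -- see the first FAQ link above.)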

3. I was trying to point out that the exact behavior of sched_yield() (which 
OMPI uses to yield in Linux) has changed in the Linux kernel over time.  
There's an interesting discussion in the MPICH mailing list archives from back 
in 2007 about what exactly this means for MPI process performance -- read this 
thread all the way through (the Linux sched_yield() discussion is near the end):

    https://lists.mcs.anl.gov/mailman/htdig/mpich-discuss/2007-September/002711.html
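
If you want a rough feel for what sched_yield() does on your particular 
kernel, a quick-and-dirty micro-test like the one below can help.  This is 
purely illustrative (not anything OMPI ships); how quickly the call returns 
when nothing else is runnable depends on your kernel scheduler's yield 
semantics, which is exactly what changed around 2.6.23:

    #include <sched.h>
    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        const int iters = 1000000;
        struct timeval t0, t1;
        double usec;
        int i;

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++) {
            sched_yield();              /* yield on an otherwise idle core */
        }
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("average sched_yield(): %.3f usec over %d calls\n",
               usec / iters, iters);
        return 0;
    }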


On Dec 13, 2010, at 10:52 AM, Jeff Squyres wrote:

> I think there *was* a decision, and it effectively changed how sched_yield() 
> operates; it may not do what we expect any more.  
> 
> See this thread (the discussion of Linux/sched_yield() comes in the later 
> messages):
> 
>    http://www.open-mpi.org/community/lists/users/2010/07/13729.php
> 
> I believe there's similar threads in the MPICH mailing list archives; that's 
> why Dave posted on the OMPI list about it.
> 
> We briefly discussed replacing OMPI's sched_yield() with a usleep(1), but it 
> was shot down.
> 
> 
> On Dec 13, 2010, at 10:47 AM, Ralph Castain wrote:
> 
>> Thanks for the link!
>> 
>> Just to clarify for the list, my original statement is essentially correct. 
>> When calling sched_yield, we give up the remaining portion of our time slice.
>> 
>> The issue in the kernel world centers around where to put you in the 
>> scheduling cycle once you have called sched_yield. Do you go to the end of 
>> the schedule for your priority? Do you go to the end of the schedule for all 
>> priorities? Or...where?
>> 
>> Looks like they decided not to decide, and left several options available. 
>> It's not entirely clear what the default is, and they recommend we not use 
>> sched_yield and instead release the time some other way. We'll take this up 
>> on the developer list to see what (if anything) we want to do about it.
>> 
>> Bottom line for users: the results remain the same. If no other process 
>> wants time, you'll continue to see near 100% utilization even if we yield 
>> because we will always poll for some time before deciding to yield.
>> 
>> 
>> On Dec 13, 2010, at 7:52 AM, Jeff Squyres wrote:
>> 
>>> See the discussion on kerneltrap:
>>> 
>>>  http://kerneltrap.org/Linux/CFS_and_sched_yield
>>> 
>>> Looks like the change came in somewhere around 2.6.23 or so...?
>>> 
>>> 
>>> 
>>> On Dec 13, 2010, at 9:38 AM, Ralph Castain wrote:
>>> 
>>>> Could you at least provide a one-line explanation of that statement?
>>>> 
>>>> 
>>>> On Dec 13, 2010, at 7:31 AM, Jeff Squyres wrote:
>>>> 
>>>>> Also note that recent versions of the Linux kernel have changed what 
>>>>> sched_yield() does -- it no longer does essentially what Ralph describes 
>>>>> below.  Google around to find those discussions.
>>>>> 
>>>>> 
>>>>> On Dec 9, 2010, at 4:07 PM, Ralph Castain wrote:
>>>>> 
>>>>>> Sorry for delay - am occupied with my day job.
>>>>>> 
>>>>>> Yes, that is correct to an extent. When you yield the processor, all 
>>>>>> that happens is that you surrender the rest of your scheduled time slice 
>>>>>> back to the OS. The OS then cycles thru its scheduler and sequentially 
>>>>>> assigns the processor to the line of waiting processes. Eventually, the 
>>>>>> OS will cycle back to your process, and you'll begin cranking again.
>>>>>> 
>>>>>> So if no other process wants or needs attention, then yes - it will 
>>>>>> cycle back around to you pretty quickly. In cases where only system 
>>>>>> processes are running (besides my MPI ones, of course), then I'll 
>>>>>> typically see cpu usage drop a few percentage points - down to like 95% 
>>>>>> - because most system tools are very courteous and call yield if they 
>>>>>> don't need to do anything. If there is something out there that wants 
>>>>>> time, or is less courteous, then my cpu usage can change a great deal.
>>>>>> 
>>>>>> Note, though, that top and ps are -very- coarse measuring tools. You'll 
>>>>>> probably see them reading more like 100% simply because, averaged out 
>>>>>> over their sampling periods, nobody else is using enough to measure the 
>>>>>> difference.
>>>>>> 
>>>>>> 
>>>>>> On Dec 9, 2010, at 1:37 PM, Hicham Mouline wrote:
>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>>>>>> Behalf Of Eugene Loh
>>>>>>>> Sent: 08 December 2010 16:19
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] curious behavior during wait for broadcast:
>>>>>>>> 100% cpu
>>>>>>>> 
>>>>>>>> I wouldn't mind some clarification here.  Would CPU usage really
>>>>>>>> decrease, or would other processes simply have an easier time getting
>>>>>>>> cycles?  My impression of yield was that if there were no one to yield
>>>>>>>> to, the "yielding" process would still go hard.  Conversely, turning on
>>>>>>>> "yield" would still show 100% cpu, but it would be easier for other
>>>>>>>> processes to get time.
>>>>>>>> 
>>>>>>> Any clarifications?


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

