Re: [OMPI users] CPU burning in Wait state

2008-09-04 Thread Jeff Squyres

On Sep 4, 2008, at 4:35 PM, Eugene Loh wrote:

There are many alternatives to polling hard.  One is to yield the  
CPU if someone else is asking for it.  Again, Open MPI has some  
support for this today with the "mpi_yield_when_idle" variable.   
Right?  Might not be all of what someone wants, but the above  
discussion just seems not to account for this.  In any case, I  
verified that it does do something useful in at least one case.


Yes, we do do that, as you verified.  The main intent of that feature  
was to support oversubscription (e.g., developing a parallel app on a  
laptop or limited-processor desktop).


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] CPU burning in Wait state

2008-09-04 Thread Eugene Loh

Jeff Squyres wrote:

OMPI currently polls for message passing progress.  While you're in 
MPI_BCAST, it's quite possible/likely that OMPI will poll hard until 
the BCAST is done.  It is possible that a future version of OMPI will 
use a hybrid polling+non-polling approach for progress, such that if 
you call MPI_BCAST, we'll poll for a while.  And if nothing 
"interesting" happens after a while (i.e., the BCAST hasn't finished 
and nothing else seems to be happening), we'll allow OMPI's internal 
progression engine to block/go to sleep until something interesting 
happens.


There are many alternatives to polling hard.  One is to yield the CPU if 
someone else is asking for it.  Again, Open MPI has some support for 
this today with the "mpi_yield_when_idle" variable.  Right?  Might not 
be all of what someone wants, but the above discussion just seems not to 
account for this.  In any case, I verified that it does do something 
useful in at least one case.


Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Eugene Loh

I hope the following helps, but maybe I'm just repeating myself and Dick.

Let's say you're stuck in an MPI_Recv, MPI_Bcast, or MPI_Barrier call 
waiting on someone else.  You want to free up the CPU for more 
productive purposes.  There are basically two cases:


1)  If you want to free the CPU up for the calling thread, the main 
trick is returning program control to the caller.  This requires a 
non-blocking MPI call.  There is such a thing for MPI_Recv (it's 
MPI_Irecv, you know how to use it), but no such thing for MPI_Bcast or 
MPI_Barrier.  Anyhow, given a non-blocking call, you can return control 
to the caller, who can do productive work while occasionally testing for 
completion of the original operation.
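
As a rough sketch of that first case (not from any of the posts in this 
thread; the tag value and the dummy work loop are made up for 
illustration), a receiver posts an MPI_Irecv, returns to its own 
computation, and checks for completion with MPI_Test from time to time:

program irecv_overlap
 use mpi
 implicit none

 integer :: rank, ierr, req, data, i
 logical :: done
 integer, dimension(MPI_STATUS_SIZE) :: status
 double precision :: x

 call mpi_init(ierr)
 call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)   ! run with at least 2 processes

 if(rank.eq.0) then
    data = 10
    call mpi_send(data, 1, MPI_INTEGER, 1, 42, MPI_COMM_WORLD, ierr)
 else if(rank.eq.1) then
    call mpi_irecv(data, 1, MPI_INTEGER, 0, 42, MPI_COMM_WORLD, req, ierr)
    done = .false.
    x = 0.0d0
    do while (.not. done)
       do i = 1, 100000                          ! stand-in for the caller's own useful work
          x = x + 1.0d0/i
       end do
       call mpi_test(req, done, status, ierr)    ! cheap check, returns immediately
    end do
    print *, "rank 1 got data =", data, " after work =", x
 end if

 call mpi_finalize(ierr)
end program irecv_overlap

There is no equivalent non-blocking call for MPI_Bcast or MPI_Barrier at 
this point, which is exactly the gap discussed in the rest of this thread.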


2)  If you want to free the CPU up for anyone else, what you want is 
that the MPI implementation should not poll hard while it's waiting.  
You can do that in Open MPI by setting the "mpi_yield_when_idle" variable to 1.  
E.g.,


   % setenv OMPI_MCA_mpi_yield_when_idle 1
   % mpirun a.out

or

   % mpirun --mca mpi_yield_when_idle 1 a.out

I'm not sure about all systems, but I think yield might sometimes be 
observable only if there is someone to yield to.  It's like driving into 
a traffic circle.  You're supposed to yield to cars already in the 
circle.  This makes a difference only if there is someone in the 
circle!  Similarly, if you look at whether Open MPI is polling hard, you 
might see that it is, indeed, polling hard even if you turn yield on.  
The real test is to have another process compete for the same CPU.  You 
should see the MPI process and the competing process share the CPU in 
the default case, but the competing process win the CPU when yield is 
turned on.  I tried such a test on my system and confirmed that Open 
MPI yield does "work".


I hope that helps.


Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Richard Treumann

Vincent

1) Assume you are running an MPI program which has 16 tasks in
MPI_COMM_WORLD, you have 16 dedicated CPUs, and each task is single
threaded. (A task is a distinct process; a process can contain one or more
threads.) This is the most common traditional model.  In this model, when a
task makes a blocking call, the CPU is used to poll the communication
layer.  With only one thread per task, there is no way the CPU can be given
other useful work because the only thread is in the MPI_Bcast and not
available to compute.  With nothing else for the CPU to do anyway, it may
as well poll, because that is likely to complete the blocking operation in
the shortest time. Polling is the right choice. You should not worry that the
CPU is being "burned".  It will not wear out.

2) Now assume you have the same number of tasks and CPUs but you have
provided a compute thread and a communication thread in each task.  At the
moment you make an MPI_Bcast call on each task's communication thread you
have unfinished computation that the CPUs could process on the compute
threads.  In this case you want the CPU to be released by the blocked
MPI_Bcast so it can be used by the compute thread.  The MPI_Bcast may take
longer to complete because it is not burning the CPU, but if useful
computation is going forward you come out ahead. A non-polling mode for the
blocking MPI_Bcast is the better option.

3) Take a third case - the CPUs are not dedicated to your MPI job.  You
have only one thread per task, but when that thread is blocked in an
MPI_Bcast you want other processes to be able to run.  This is not a common
situation in production environments but may be common in learning or
development situations. Perhaps your MPI homework problem is running at the
same time someone else is trying to compile theirs on the same nodes.  In
this case you really do not need the MPI_Bcast to finish in the shortest
possible time, and you do want the people who share the node with you to
quit complaining.  Again, a non-polling mode that gives up the CPU and lets
your neighbor's compilation run is best.

Which of these is closest to your situation?  If it is situation 1, why
would you care that the CPU is burning?  If it is situation 2 or 3, then you
do have reason to care.

   Dick

Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


Vincent Rotival wrote on 09/03/2008 01:11 PM:
>
> Eugene,
>
> No what I'd like is that when doing something like
>
> call mpi_bcast(data, 1, MPI_INTEGER, 0, .)
>
> the program continues AFTER the Bcast is completed (so no control
> returned to user), but while threads with rank > 0 are waiting in Bcast
> they are not taking CPU resources
>
> I hope it is more clear, I apologize for not being clear in the first
place
>
> Vincent
>
>
>
> Eugene Loh wrote:
> >
> > Vincent Rotival wrote:
> >
> >> The solution I retained was for the main thread to isend data
> >> separately to each of the other threads, which use Irecv + a loop on
> >> mpi_test to test for the finish of the Irecv. It might be dirty but
> >> works much better than using Bcast
> >
> > Thanks for the clarification.
> >
> > But this strikes me more as a question about the MPI standard than
> > about the Open MPI implementation.  That is, what you really want is
> > for the MPI API to support a non-blocking form of collectives.  You
> > want control to return to the user program before the
> > barrier/bcast/etc. operation has completed.  That's an API change.

Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread George Bosilca
This program is 100% correct from an MPI perspective. However, in Open  
MPI (and, I think, in most other MPIs), a collective communication  
is something that will drain most of the resources, similar to all  
blocking functions.


Now I will answer your original post. Using non-blocking  
communications in this particular case will give you a benefit, as the  
data involved in the communications is small enough to achieve a  
perfect overlap. If you try to do exactly the same with  
larger data, using non-blocking communications will negatively impact  
performance, as MPI is not supposed to communicate when the user  
application is not in an MPI call.


  george.

On Sep 3, 2008, at 6:32 PM, Vincent Rotival wrote:

Ok, let's take the simple example here; I might have used the wrong terms  
and I apologize for it


While the rank 0 process is sleeping the other ones are in bcast  
waiting for data




program test
 use mpi
 implicit none

 integer :: mpi_wsize, mpi_rank, mpi_err
 integer :: data

 call mpi_init(mpi_err)
 call mpi_comm_size(MPI_COMM_WORLD, mpi_wsize, mpi_err)
 call mpi_comm_rank(MPI_COMM_WORLD, mpi_rank, mpi_err)

 if(mpi_rank.eq.0) then
    call sleep(100)
    data = 10
 end if

 call mpi_bcast(data, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, mpi_err)

 print *, "Done in #", mpi_rank, " => data=", data

end program test


George Bosilca wrote:


On Sep 3, 2008, at 6:11 PM, Vincent Rotival wrote:


Eugene,

No what I'd like is that when doing something like

call mpi_bcast(data, 1, MPI_INTEGER, 0, .)

the program continues AFTER the Bcast is completed (so no control  
returned to user), but while threads with rank > 0 are waiting in  
Bcast they are not taking CPU resources


Threads with rank > 0?  Now, this scares me!!!  If all your threads  
are going into the bcast, then I guess the application is not correct  
from the MPI standard's perspective (i.e., on each communicator there  
is only one collective at any moment). In MPI, each process (and not  
each thread) has a rank, and each process exists in each communicator  
only once. In other words, as each collective is bound to a specific  
communicator, on each of your processes only one thread should go  
into the MPI_Bcast, if you want only ONE collective.


 george.




I hope it is more clear, I apologize for not being clear in the  
first place


Vincent



Eugene Loh wrote:


Vincent Rotival wrote:

The solution I retained was for the main thread to isend data  
separately to each of the other threads, which use Irecv + a loop on  
mpi_test to test for the finish of the Irecv. It might be dirty but  
works much better than using Bcast


Thanks for the clarification.

But this strikes me more as a question about the MPI standard  
than about the Open MPI implementation.  That is, what you really  
want is for the MPI API to support a non-blocking form of  
collectives.  You want control to return to the user program  
before the barrier/bcast/etc. operation has completed.  That's an  
API change.



Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Vincent Rotival
Ok, let's take the simple example here; I might have used the wrong terms and 
I apologize for it


While the rank 0 process is sleeping the other ones are in bcast waiting 
for data




program test
 use mpi
 implicit none

 integer :: mpi_wsize, mpi_rank, mpi_err
 integer :: data

 call mpi_init(mpi_err)
 call mpi_comm_size(MPI_COMM_WORLD, mpi_wsize, mpi_err)
 call mpi_comm_rank(MPI_COMM_WORLD, mpi_rank, mpi_err)

 if(mpi_rank.eq.0) then
call sleep(100)
data = 10
 end if

 call mpi_bcast(data, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, mpi_err)

 print *, "Done in #", mpi_rank, " => data=", data

end program test


George Bosilca wrote:


On Sep 3, 2008, at 6:11 PM, Vincent Rotival wrote:


Eugene,

No what I'd like is that when doing something like

call mpi_bcast(data, 1, MPI_INTEGER, 0, .)

the program continues AFTER the Bcast is completed (so no control 
returned to user), but while threads with rank > 0 are waiting in 
Bcast they are not taking CPU resources


Threads with rank > 0?  Now, this scares me!!!  If all your threads 
are going into the bcast, then I guess the application is not correct 
from the MPI standard's perspective (i.e., on each communicator there is 
only one collective at any moment). In MPI, each process (and not 
each thread) has a rank, and each process exists in each communicator 
only once. In other words, as each collective is bound to a specific 
communicator, on each of your processes only one thread should go into 
the MPI_Bcast, if you want only ONE collective.


  george.




I hope it is more clear, I apologize for not being clear in the first 
place


Vincent



Eugene Loh wrote:


Vincent Rotival wrote:

The solution I retained was for the main thread to isend data 
separately to each of the other threads, which use Irecv + a loop on 
mpi_test to test for the finish of the Irecv. It might be dirty but 
works much better than using Bcast


Thanks for the clarification.

But this strikes me more as a question about the MPI standard than 
about the Open MPI implementation.  That is, what you really want is 
for the MPI API to support a non-blocking form of collectives.  You 
want control to return to the user program before the 
barrier/bcast/etc. operation has completed.  That's an API change.





Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread George Bosilca


On Sep 3, 2008, at 6:11 PM, Vincent Rotival wrote:


Eugene,

No what I'd like is that when doing something like

call mpi_bcast(data, 1, MPI_INTEGER, 0, .)

the program continues AFTER the Bcast is completed (so no control  
returned to user), but while threads with rank > 0 are waiting in  
Bcast they are not taking CPU resources


Threads with rank > 0?  Now, this scares me!!!  If all your threads  
are going into the bcast, then I guess the application is not correct  
from the MPI standard's perspective (i.e., on each communicator there is  
only one collective at any moment). In MPI, each process (and not  
each thread) has a rank, and each process exists in each communicator  
only once. In other words, as each collective is bound to a specific  
communicator, on each of your processes only one thread should go into  
the MPI_Bcast, if you want only ONE collective.


  george.




I hope it is more clear, I apologize for not being clear in the  
first place


Vincent



Eugene Loh wrote:


Vincent Rotival wrote:

The solution I retained was for the main thread to isend data  
separately to each of the other threads, which use Irecv + a loop on  
mpi_test to test for the finish of the Irecv. It might be dirty but  
works much better than using Bcast


Thanks for the clarification.

But this strikes me more as a question about the MPI standard than  
about the Open MPI implementation.  That is, what you really want  
is for the MPI API to support a non-blocking form of collectives.   
You want control to return to the user program before the barrier/ 
bcast/etc. operation has completed.  That's an API change.



Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Vincent Rotival

Eugene,

No what I'd like is that when doing something like

call mpi_bcast(data, 1, MPI_INTEGER, 0, .)

the program continues AFTER the Bcast is completed (so no control 
returned to user), but while threads with rank > 0 are waiting in Bcast 
they are not taking CPU resources


I hope it is more clear, I apologize for not being clear in the first place

Vincent



Eugene Loh wrote:


Vincent Rotival wrote:

The solution I retained was for the main thread to isend data 
separately to each of the other threads, which use Irecv + a loop on 
mpi_test to test for the finish of the Irecv. It might be dirty but 
works much better than using Bcast


Thanks for the clarification.

But this strikes me more as a question about the MPI standard than 
about the Open MPI implementation.  That is, what you really want is 
for the MPI API to support a non-blocking form of collectives.  You 
want control to return to the user program before the 
barrier/bcast/etc. operation has completed.  That's an API change.







Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Vincent Rotival


Eugene Loh wrote:


Jeff Squyres wrote:


On Sep 2, 2008, at 7:25 PM, Vincent Rotival wrote:

I think I already read some comments on this issue, but I'd like to  
know if the latest versions of Open MPI have managed to solve it. I am  
now running 1.2.5.


If I run an MPI program with synchronization routines (e.g.  
MPI_barrier, MPI_bcast...), all threads waiting for data are still  
burning CPU. On the other hand, when using non-blocking receives, all  
threads waiting for data are not consuming any CPU.


Would there be a possibility to use MPI_Bcast without burning CPU  
power?


I'm afraid not at this time.  We've talked about adding a blocking  
mode for progress, but it hasn't happened yet (and is very unlikely 
to happen for the v1.3 series).


I'd like to understand this issue better.

What about the variable mpi_yield_when_idle?  Is the point that this 
variable will cause a polling process to yield, but if there is no one 
to yield to then the process resumes burning CPU?  If so, I can 
imagine this solution being sufficient in some cases but not in others.


Also, Vincent, what do you mean by waiting threads not consuming any 
CPU for non-blocking receives?  In what state are these threads?  Are 
they in an MPI call (like MPI_Wait)?  Or, have they returned from an 
MPI call (like MPI_Irecv) and the user application can then park these 
threads to the side?


Dear Eugene

The solution I retained was for the main thread to isend data separately 
to each of the other threads, which use Irecv + a loop on mpi_test to test 
for the finish of the Irecv. It might be dirty but works much better than 
using Bcast
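
(Not Vincent's actual code, just a minimal sketch of the workaround he 
describes; the tag value and the one-second sleep between tests are 
invented here, the sleep being what keeps the waiting ranks off the CPU.)

program manual_bcast
 use mpi
 implicit none

 integer :: wsize, rank, ierr, i, data, req
 logical :: done
 integer, dimension(MPI_STATUS_SIZE) :: status
 integer, allocatable :: reqs(:)

 call mpi_init(ierr)
 call mpi_comm_size(MPI_COMM_WORLD, wsize, ierr)
 call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

 if(rank.eq.0) then
    call sleep(100)                  ! sleep() is a compiler extension, as in the test program elsewhere in the thread
    data = 10
    allocate(reqs(wsize-1))
    do i = 1, wsize-1                ! isend to each other rank instead of calling mpi_bcast
       call mpi_isend(data, 1, MPI_INTEGER, i, 7, MPI_COMM_WORLD, reqs(i), ierr)
    end do
    call mpi_waitall(wsize-1, reqs, MPI_STATUSES_IGNORE, ierr)
 else
    call mpi_irecv(data, 1, MPI_INTEGER, 0, 7, MPI_COMM_WORLD, req, ierr)
    done = .false.
    do while (.not. done)
       call mpi_test(req, done, status, ierr)   ! returns immediately, finished or not
       if(.not. done) call sleep(1)             ! give the CPU away between checks
    end do
 end if

 print *, "Done in #", rank, " => data=", data

 call mpi_finalize(ierr)
end program manual_bcast

The obvious trade-off is latency: a receiver may notice the message up to 
a sleep interval late, which hardly matters here since rank 0 is busy for 
a long time anyway.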


Cheers

Vincent



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Eugene Loh

Jeff Squyres wrote:


On Sep 2, 2008, at 7:25 PM, Vincent Rotival wrote:

I think I already read some comments on this issue, but I'd like to  
know if the latest versions of Open MPI have managed to solve it. I am  
now running 1.2.5.


If I run an MPI program with synchronization routines (e.g.  
MPI_barrier, MPI_bcast...), all threads waiting for data are still  
burning CPU. On the other hand, when using non-blocking receives, all  
threads waiting for data are not consuming any CPU.


Would there be a possibility to use MPI_Bcast without burning CPU  
power?


I'm afraid not at this time.  We've talked about adding a blocking  
mode for progress, but it hasn't happened yet (and is very unlikely 
to happen for the v1.3 series).


I'd like to understand this issue better.

What about the variable mpi_yield_when_idle?  Is the point that this 
variable will cause a polling process to yield, but if there is no one 
to yield to then the process resumes burning CPU?  If so, I can imagine 
this solution being sufficient in some cases but not in others.


Also, Vincent, what do you mean by waiting threads not consuming any CPU 
for non-blocking receives?  In what state are these threads?  Are they 
in an MPI call (like MPI_Wait)?  Or, have they returned from an MPI call 
(like MPI_Irecv) and the user application can then park these threads to 
the side?


Re: [OMPI users] CPU burning in Wait state

2008-09-03 Thread Jeff Squyres

On Sep 2, 2008, at 7:25 PM, Vincent Rotival wrote:

I think I already read some comments on this issue, but I'd like to  
know if the latest versions of Open MPI have managed to solve it. I am  
now running 1.2.5.


If I run an MPI program with synchronization routines (e.g.  
MPI_barrier, MPI_bcast...), all threads waiting for data are still  
burning CPU. On the other hand, when using non-blocking receives, all  
threads waiting for data are not consuming any CPU.


Would there be a possibility to use MPI_Bcast without burning CPU  
power?



I'm afraid not at this time.  We've talked about adding a blocking  
mode for progress, but it hasn't happened yet (and is very unlikely to  
happen for the v1.3 series).


--
Jeff Squyres
Cisco Systems