Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t

2010-05-26 Thread Josh Hursey

(Sorry for the delay, I missed the C/R question in the mail)

On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:


On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:

| > 2) I have installed blcr V0.8.2 but when I try to build OMPI and I
| > point to the full installation it complains it cannot find it. Note
| > that I build BLCR with GCC but I am building OMPI with Intel
| > compilers (V11.1)
|
| Can you be more specific here?

I pointed to the installation path for BLCR but configure complained
that it couldn't find it. If BLCR is only needed for checkpoint/restart
then we can live without it. Is BLCR needed for suspend/resume of MPI
jobs?


You mean suspend with ctrl-Z?  If so, correct -- BLCR is *only* used
for checkpoint/restart.  Ctrl-Z just uses the SIGTSTP functionality.


So BLCR is used for the checkpoint/restart functionality in Open MPI.  
We have a webpage with some more details and examples at the link below:

  http://osl.iu.edu/research/ft/ompi-cr/

You should be able to suspend/resume an Open MPI job using
SIGSTOP/SIGCONT without the C/R functionality. We have an FAQ item that
talks about how to enable this functionality:

  http://www.open-mpi.org/faq/?category=running#suspend-resume
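
As a concrete sketch (not from the original mail -- see that FAQ entry
for the exact parameter name in your version), the idea is to tell
mpirun to forward job-control signals to the whole job:

  $ mpirun -np 4 -mca orte_forward_job_control 1 ./my_app

With that MCA parameter set, sending SIGTSTP/SIGCONT to mpirun
suspends/resumes all of the MPI processes.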

You can combine the C/R and the SIGSTOP/SIGCONT functionality so that
when you 'suspend' a job a checkpoint is taken and the process is
stopped. You can continue the job by sending SIGCONT as normal.
Additionally, this way if the job needs to be terminated for some
reason (e.g., memory footprint, maintenance), it can be safely
terminated and restarted from the checkpoint. I have an example of how
this works at the link below:

  http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-stop
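
In rough outline (a sketch, not from the original mail; option
spellings may differ between versions -- the examples page above is
authoritative):

  $ ompi-checkpoint --stop <PID_of_mpirun>   # checkpoint, then SIGSTOP
  $ kill -CONT <PID_of_mpirun>               # resume the stopped job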

As far as C/R integration with schedulers/resource managers, I know  
that the BLCR folks have been working with Torque to better integrate  
Open MPI+BLCR+Torque. If this is of interest, you might want to check  
with them on the progress of that project.


-- Josh



Re: [OMPI users] Building from the SRPM version creates an rpm with stripped libraries

2010-05-26 Thread Peter Thompson
Thanks Ashley,  that did work, though I must say that %define __strip  
/bin/true is NOT very intuitive!


I did get my symbols in the needed libraries, but unfortunately, at 
least for the compiler I used to build, I still have a typedef 
undefined, and that also prevents that method of launching TV.  But we 
have our own workarounds for that. 


Cheers,
PeterT

Ashley Pittman wrote:

This is a standard rpm feature, although like most things it can be disabled.

According to this mail and its replies the two %defines below will prevent
stripping and the building of debuginfo rpms.

http://lists.rpm.org/pipermail/rpm-list/2009-January/000122.html

%define debug_package %{nil}
%define __strip /bin/true
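
If you'd rather not edit the spec file, the same overrides can usually
be passed on the rpmbuild command line (a sketch; adjust the srpm name
to whatever you are rebuilding):

  $ rpmbuild --rebuild \
      --define 'debug_package %{nil}' \
      --define '__strip /bin/true' \
      openmpi-1.4.1-1.src.rpm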

Ashley.

On 25 May 2010, at 00:25, Peter Thompson wrote:

  

I have a user who prefers building rpms from the srpm.  That's okay, but for
debugging via TotalView it creates a version with the openmpi .so files
stripped, and we can't gain control of the processes when launched via mpirun
-tv.  I've verified this with my own build of a 1.4.1 rpm which I then
installed, and noticed the same behavior that the user reports.  I was hoping
to give them some advice as to how to avoid the stripping, as it appears that
the actual build of those libraries is done with -g and everything looks fine.
But I can't figure out in the build (from the log file I created) just where
that stripping takes place, or how to get around it if need be.  The best
guess I have is that it may be happening at the very end when an rpm-tmp file
is executed, but that file has disappeared so I don't really know what it
does.  I thought it might be apparent in the spec file, but it's certainly
not apparent to me!  Any help or advice would be appreciated.



  





Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t

2010-05-26 Thread Michael E. Thomadakis

Hi Josh,

Thanks for the reply; please see below.


On 05/26/10 09:24, Josh Hursey wrote:

(Sorry for the delay, I missed the C/R question in the mail)

[...]

You should be able to suspend/resume an Open MPI job using
SIGSTOP/SIGCONT without the C/R functionality. We have an FAQ item that
talks about how to enable this functionality:

  http://www.open-mpi.org/faq/?category=running#suspend-resume

[...]


So suspend/resume of Open MPI jobs does not require BLCR. OK, so I will
proceed without it.


best regards,

Michael



-- Josh




--
%  \
% Michael E. Thomadakis, Ph.D.  Senior Lead Supercomputer Engineer/Res \
% E-mail: miket AT tamu DOT edu   Texas A&M University \
% web:http://alphamike.tamu.edu  Supercomputing Center \
% Voice:  979-862-3931Teague Research Center, 104B \
% FAX:979-847-8643  College Station, TX 77843, USA \
%  \



Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t

2010-05-26 Thread Michael E. Thomadakis

Hi Jeff,

Thanks for the reply; please see below.

And a new question:

How do you handle thread/task and memory affinity? Do you pass the
requested affinity preferences to the batch scheduler and then let it
issue the specific placements for threads to the nodes?

This is something we are concerned about, as we are running multiple
jobs on the same node and we don't want to oversubscribe cores by
binding their threads inadvertently.


Looking at ompi_info
 $ ompi_info | grep -i aff
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)

does this mean we have full affinity support included, or do I need to
involve hwloc in any way?
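
For what it's worth, a quick Linux-specific way to check what each rank
is actually bound to (a minimal sketch, not from the original mail) is
to query the kernel directly:

--
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);

    /* Query the calling process's CPU affinity mask */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("pid %d may run on CPUs:", (int)getpid());
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}
--

Launching one copy per rank under mpirun shows whether the processes
ended up pinned where you expected.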




On 05/25/10 08:35, Jeff Squyres wrote:

On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:

   

|>  1) high-resolution timers: how do I specify the HRT linux timers in the
|>  --with-timer=TYPE
|>   line of ./configure ?
|
| You shouldn't need to do anything; the "linux" timer component of Open MPI
| should get automatically selected.  You should be able to see this in the
| stdout of Open MPI's "configure", and/or if you run ompi_info | grep timer
| -- there should only be one entry: linux.

If nothing is mentioned, will it by default select 'linux' timers?


Yes.

   

Or do I have to specify --with-timer=linux in the configure line?

Nope.  The philosophy of Open MPI is that whenever possible, we try to choose a 
sensible default.  It never hurts to double check, but we try to do the Right 
Thing whenever it's possible to automatically choose it (within reason, of 
course).

You can also check the output of ompi_info -- ompi_info tells you lots of 
things about your Open MPI installation.

   

I actually spent some time looking around in the source trying to see which
actual timer is the base. Is this a high-resolution timer such as the POSIX
timers (timer_gettime or clock_nanosleep, etc.) or the Intel processor's TSC?

I am just trying to stay away from gettimeofday()
 

Understood.

Ugh; I just poked into the code -- it's complicated how we resolve the timer
functions.  It looks like we put in the infrastructure for getting high
resolution timers, but at least on Linux, we don't use it (the code falls back
to gettimeofday).  It looks like we're only using the high-resolution timers
on AIX (!) and Solaris.

Patches would be greatly appreciated; I'd be happy to walk someone through what 
to do.

   


Which HR timer is recommended for a Linux environment? timer_gettime
usually gives decent resolution and it is portable. I don't want to
promise anything as I am already bogged down with several ongoing
projects. You can give me *brief* instructions to see if this can be
squeezed in.
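
For reference, a minimal sketch (an illustration, not the patch Jeff
asked for) of the kind of monotonic high-resolution read that could
replace gettimeofday() on Linux:

--
#include <stdio.h>
#include <time.h>

/* Monotonic wall-clock time in seconds; nanosecond resolution where
   the hardware/kernel provide it.  Link with -lrt on older glibc. */
static double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    double t0 = wtime();
    /* ... work being timed ... */
    double t1 = wtime();
    printf("elapsed: %.9f s\n", t1 - t0);
    return 0;
}
--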

...


Just as feedback from one of the many HPC centers, for us it is most
important to have

a) a light-weight, efficient MPI stack which makes the underlying IB h/w
capabilities available, and

b) one that can smoothly cooperate with a batch scheduler / resource manager
so that a mixture of jobs gets a decent allocation of the cluster resources.
 

Cool; good to know.  We try to make these things very workable in Open MPI -- 
it's been a goal from day 1 to integrate with job schedulers, etc.  And without 
high performance, we wouldn't have much to talk about.

Please be sure to let us know of questions / problems / etc.  I admit that 
we're sometimes a little slow to answer on the users list, but we do the best 
we can.  So don't hesitate to bump us if we don't reply.

Thanks!

   


Thanks again...
michael


--
%  \
% Michael E. Thomadakis, Ph.D.  Senior Lead Supercomputer Engineer/Res \
% E-mail: miket AT tamu DOT edu   Texas A&M University \
% web:http://alphamike.tamu.edu  Supercomputing Center \
% Voice:  979-862-3931Teague Research Center, 104B \
% FAX:979-847-8643  College Station, TX 77843, USA \
%  \



Re: [OMPI users] Deadlock question

2010-05-26 Thread Gijsbert Wiesenekker

On May 24, 2010, at 19:42 , Gijsbert Wiesenekker wrote:

> My MPI program consists of a number of processes that send 0 or more 
> messages (using MPI_Isend) to 0 or more other processes. The processes check 
> periodically if messages are available to be processed. It was running fine 
> until I increased the message size, and then I got deadlock problems. 
> Googling taught me that I was running into a classic deadlock problem (see 
> for example http://www.cs.ucsb.edu/~hnielsen/cs140/mpi-deadlocks.html). The 
> workarounds suggested, like changing the order of MPI_Send and MPI_Recv, do 
> not work in my case, as it could be that one processor does not send any 
> message at all to the other processes, so MPI_Recv would wait indefinitely.
> Any suggestions on how to avoid deadlock in this case?
> 
> Thanks,
> Gijsbert
> 

An approach that seems to work in my case is the following:
I was using separate message-tags for 'update_message' and 'no_more_messages'. 
All these were sent asynchronously. The receive code in pseudo-code looked like:
--
if (probe_for_update_message() == FALSE)
{
    if (probe_for_no_more_messages() == TRUE)
    {
        //we are done
    }
    else
    {
        //do some work
    }
}
else
{
    //process update message
}
--
The problem with this receive code was that in between the 
probe_for_update_message() and the probe_for_no_more_messages() a processor 
could send several update messages, followed by 'no_more_messages', so I still 
needed to check for any pending update messages after a 
probe_for_no_more_messages(), which complicated handling deadlock. So I first 
created a special update message that signals 'no_more_messages', which 
simplified the receive code to:
--
//probe_for_update_message() returns INVALID if no more messages,
//TRUE if a message was received, FALSE if not
if ((result = probe_for_update_message()) == INVALID)
{
    //we are done
}
else if (result == TRUE)
{
    //process update message
}
else //result == FALSE
{
    //do some work
}
--
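
As a point of reference (a minimal sketch, not from the original post;
the tag value and the sentinel payload are assumptions),
probe_for_update_message() could be built on MPI_Iprobe like this:

--
#include <mpi.h>

#define TAG_UPDATE   1     /* assumed tag for update messages */
#define MSG_NO_MORE (-99)  /* assumed sentinel payload for 'no_more_messages' */
#define INVALID     (-1)
#define TRUE         1
#define FALSE        0

/* Returns INVALID if the 'no_more_messages' sentinel arrived, TRUE if a
   regular update message was received, FALSE if nothing is pending. */
int probe_for_update_message(MPI_Comm comm, int *payload)
{
    int flag;
    MPI_Status status;

    /* Non-blocking check for a pending update from any source */
    MPI_Iprobe(MPI_ANY_SOURCE, TAG_UPDATE, comm, &flag, &status);
    if (!flag)
        return FALSE;

    /* A message is pending, so this receive completes immediately */
    MPI_Recv(payload, 1, MPI_INT, status.MPI_SOURCE, TAG_UPDATE,
             comm, MPI_STATUS_IGNORE);

    return (*payload == MSG_NO_MORE) ? INVALID : TRUE;
}
--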
Now to deal with the deadlock I first created a function recv_update_message() 
that probes for update messages and pushes them onto a FIFO queue (for several 
reasons I cannot process the update message right away). In pseudo-code:
--
int recv_update_message()
{
    int result;

    if ((result = probe_for_update_message()) == TRUE)
        queue(update_message);
    return(result);
}
--
The asynchronous send code in pseudo-code looks like:
--
MPI_Isend(update_message, &request);
while (TRUE)
{
    //deal with deadlock
    //I assume my deadlocks are caused by running out of system buffer space
    //hopefully polling pending update messages frees up buffer space
    recv_update_message();

    MPI_Test(&request, &flag);
    if (flag) break;
}
--
The asynchronous receive code in pseudo-code looks like:
--
//first check the FIFO queue
if (dequeue(update_message))
    return(TRUE);
else
{
    int result;

    if ((result = recv_update_message()) == INVALID)
        return(INVALID);
    if (result == TRUE)
        dequeue(update_message);
    return(result);
}
--
As a further refinement I use a queue per processor, and recv_update_message 
first tries to receive messages for the least used queues, but if deadlock is 
detected it tries to receive messages for all queues:
--
nwaitx = 16; //threshold for deadlock
nwait = 0;
MPI_Isend(update_message, &request);
while (TRUE)
{
    if (nwait > 2 * nwaitx)
    {
        printf("possible deadlock detected\n");
        nwaitx = nwait;
        recv_update_message(all_queues);
    }
    else
    {
        recv_update_message(least_used_queues_only);
    }

    MPI_Test(&request, &flag);
    if (flag) break;
    nwait++;
}
--

Gijsbert