Re: [OMPI users] MPI-2.2: do you care?

2010-10-26 Thread Douglas Guptill
Hello Jeff:

On Tue, Oct 26, 2010 at 07:27:42PM -0400, Jeff Squyres wrote:
> Open MPI users --
> 
> I took a little heat at the last the MPI Forum for not having Open MPI be 
> fully compliant with MPI-2.2 yet (OMPI is compliant with MPI-2.1).  
> Specifically, there's still 4 open issues in Open MPI that are necessary for 
> full MPI-2.2 compliance:
> 
> https://svn.open-mpi.org/trac/ompi/query?status=accepted=assigned=new=reopened=~MPI-2.2=id=summary=status=type=priority=milestone=version=priority
> 
> We haven't made these items a priority because -- to be blunt -- no one 
> really has been asking for them.  No one has come forward and said "I *must* 
> have these features!" (to be fair, they're somewhat obscure features).  
> 
> Other than not having the obvious "OMPI is MPI-2.2 compliant" checkmark for 
> marketing reasons, is there anyone who *needs* the functionality represented 
> by those still-open tickets?
> 
> Thanks for your input.

From my point of view, I would be glad to see more emphasis on
stability in OpenMPI (where stability means absence of bugs) than on
new features.  I am still using OpenMPI-1.2.8.

Just my $0.02,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] spin-wait backoff

2010-09-02 Thread Douglas Guptill
Hi David:

On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:
>
> I'm sure this has been discussed before but having watched hundreds of
> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
> be keen to know why there isn't some sort of "spin-wait backoff" option.
> For example, a way to specify spin-wait for x seconds/cycles/iterations
> then backoff to lighter and lighter cpu usage.  At least that way, hung
> jobs would become self-evident.
>
> Maybe there is already some way of doing this?

For my solution to this, see

  http://www.open-mpi.org/community/lists/users/2010/07/13731.php

HTH,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] Do MPI calls ever sleep?

2010-07-21 Thread Douglas Guptill
Hi David:

On Wed, Jul 21, 2010 at 02:10:53PM -0400, David Ronis wrote:
> I've got a mpi program on an 8-core box that runs in a master-slave
> mode.   The slaves calculate something, pass data to the master, and
> then call MPI_Bcast waiting for the master to update and return some
> data via a MPI_Bcast originating on the master.  
> 
> One of the things the master does while the slaves are waiting is to
> make heavy use of fftw3 FFT routines which can support multi-threading.
> However, for threading to make sense, the slaves on same physical
> machine have to give up their CPU usage, and this doesn't seem to be the
> case (top shows them running at close to 100%).  Is there another MPI
> routine that polls for data and then gives up its time-slice? 
> 
> Any other suggestions?

I ran into a similar problem some time ago.  My situation seems similar
to yours:
  1. the data in the MPI application has a to-and-fro nature.
  2. I cannot afford an MPI process that consumes 100% cpu 
 while doing nothing.

My solution was to link two extra routines with my (FORTRAN)
application.  These routines intercept mpi_recv and mpi_send, test the
status of the request, and sleep if it is not ready.  The sleep time
has an exponential curve; it has a start value, factor, and maximum
value.

I made no source code changes to my application.  When I include these
two routines at link time, the load from the application changes from
2.0 to 1.0.

I use these with OpenMPI-1.2.8.

I have not tried "--mca mpi_yield_when_idle 1", which may not be in
1.2.8; I am not sure.

Hope that helps
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada

/*
 * Intercept MPI_Recv, and
 * call PMPI_Irecv, loop over PMPI_Request_get_status and sleep, until done
 *
 * Revision History:
 *  2008-12-17: copied from MPI_Send.c
 *  2008-12-18: tweaking.
 *
 * See MPI_Send.c for additional comments, 
 *  especially w.r.t. PMPI_Request_get_status.
 **/

#include "mpi.h"
#define _POSIX_C_SOURCE 199309 
#include <time.h>

int MPI_Recv(void *buff, int count, MPI_Datatype datatype, 
	  int from, int tag, MPI_Comm comm, MPI_Status *status) {

  int flag, nsec_start=1000, nsec_max=10;
  struct timespec ts;
  MPI_Request req;

  ts.tv_sec = 0;
  ts.tv_nsec = nsec_start;

  PMPI_Irecv(buff, count, datatype, from, tag, comm, &req);
  do {
    nanosleep(&ts, NULL);
    ts.tv_nsec *= 2;
    ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
    PMPI_Request_get_status(req, &flag, status);
  } while (!flag);

  return (*status).MPI_ERROR;
}
/*
 * Intercept MPI_Send, and
 * call PMPI_Isend, loop over PMPI_Request_get_status and sleep, until done
 *
 * Revision History:
 *  2008-12-12: skeleton by Jeff Squyres <jsquy...@cisco.com>
 *  2008-12-16->18: adding parameters, variable wait, 
 *     change MPI_Test to MPI_Request_get_status
 *  Douglas Guptill <douglas.gupt...@dal.ca>
 **/

/* When we use this:
 *   PMPI_Test(&req, &flag, &status);
 * we get:
 * dguptill@DOME:$ mpirun -np 2 mpi_send_recv_test_mine
 * This is process0  of2 .
 * This is process1  of2 .
 * error: proc0 ,mpi_send returned -1208109376
 * error: proc1 ,mpi_send returned -1208310080
 * 1 changed to3
 *
 * Using MPI_Request_get_status cures the problem.
 *
 * A read of mpi21-report.pdf confirms that MPI_Request_get_status
 * is the appropriate choice, since there seems to be something
 * between the call to MPI_SEND (MPI_RECV) in my FORTRAN program
 * and MPI_Send.c (MPI_Recv.c)
 **/


#include "mpi.h"
#define _POSIX_C_SOURCE 199309 
#include <time.h>

int MPI_Send(void *buff, int count, MPI_Datatype datatype, 
	  int dest, int tag, MPI_Comm comm) {

  int flag, nsec_start=1000, nsec_max=10;
  struct timespec ts;
  MPI_Request req;
  MPI_Status status;

  ts.tv_sec = 0;
  ts.tv_nsec = nsec_start;

  PMPI_Isend(buff, count, datatype, dest, tag, comm, &req);
  do {
    nanosleep(&ts, NULL);
    ts.tv_nsec *= 2;
    ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
    PMPI_Request_get_status(req, &flag, &status);
  } while (!flag);

  return status.MPI_ERROR;
}


Re: [OMPI users] first cluster

2010-07-15 Thread Douglas Guptill
On Wed, Jul 14, 2010 at 04:27:11PM -0400, Jeff Squyres wrote:
> On Jul 9, 2010, at 12:43 PM, Douglas Guptill wrote:
> 
> > After some lurking and reading, I plan this:
> >   Debian (lenny)
> >   + fai   - for compute-node operating system install
> >   + Torque- job scheduler/manager
> >   + MPI (Intel MPI)   - for the application
> >   + MPI (Open MPI)  - alternative MPI
> > 
> > Does anyone see holes in this plan?
> 
> HPC is very much a "what is best for *your* requirements" kind of
> environment.  There are many different recipes out there for
> different kinds of HPC environments.

Very wise words.

We will be running only one application, and have one, maybe two, users.

> What you listed above is a reasonable list of meta software
> packages.

Thanks,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



[OMPI users] first cluster [was trouble using openmpi under slurm]

2010-07-09 Thread Douglas Guptill
On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote:
> Douglas Guptill wrote:
>> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote:
>>
>>> No...afraid not. Things work pretty well, but there are places
>>> where things just don't mesh. Sub-node allocation in particular is
>>> an issue as it implies binding, and slurm and ompi have conflicting
>>> methods.
>>>
>>> It all can get worked out, but we have limited time and nobody cares
>>> enough to put in the effort. Slurm just isn't used enough to make it
>>> worthwhile (too small an audience).
>>
>> I am about to get my first HPC cluster (128 nodes), and was
>> considering slurm.  We do use MPI.
>>
>> Should I be looking at Torque instead for a queue manager?
>>
> Hi Douglas
>
> Yes, works like a charm along with OpenMPI.
> I also have MVAPICH2 and MPICH2, no integration w/ Torque,
> but no conflicts either.

Thanks, Gus.

After some lurking and reading, I plan this:
  Debian (lenny)
  + fai   - for compute-node operating system install
  + Torque- job scheduler/manager
  + MPI (Intel MPI)   - for the application
  + MPI (Open MPI)  - alternative MPI

Does anyone see holes in this plan?

Thanks,
Douglas
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] trouble using openmpi under slurm

2010-07-07 Thread Douglas Guptill
On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote:

> No...afraid not. Things work pretty well, but there are places
> where things just don't mesh. Sub-node allocation in particular is
> an issue as it implies binding, and slurm and ompi have conflicting
> methods.
>
> It all can get worked out, but we have limited time and nobody cares
> enough to put in the effort. Slurm just isn't used enough to make it
> worthwhile (too small an audience).

I am about to get my first HPC cluster (128 nodes), and was
considering slurm.  We do use MPI.

Should I be looking at Torque instead for a queue manager?

Suggestions appreciated,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Douglas Guptill
Hello Gus:

On Thu, May 06, 2010 at 11:26:57AM -0400, Gus Correa wrote:

> Douglas:

> Would you know which gcc you used to build your Open MPI?
> Or did you use Intel icc instead?

Intel ifort and icc.  I build OpenMPI with the same compiler, and same
options, that I build my application with.

I have been tempted to try and duplicate your problem.  Would that be a
helpful experiment?  gcc, OpenMPI 1.4.1, IIRC ?

Regards,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-05 Thread Douglas Guptill
On Wed, May 05, 2010 at 06:08:57PM -0400, Gus Correa wrote:

> If anybody else has Open MPI working with hyperthreading and "sm"
> on a Nehalem box, I would appreciate any information about the
> Linux distro and kernel version being used.

Debian 5 (lenny), Core i7 920, Asus P6T MoBo, 12GB RAM, OpenMPI 1.2.8
(with a custom-built MPI_Recv.c and MPI_Send.c, which cut down on the
cpu load caused by the busy-wait polling).  We have six (6) of these
machines.  All configured the same.

uname -a yields:
Linux screm 2.6.26-2-amd64 #1 SMP Thu Feb 11 00:59:32 UTC 2010 x86_64 GNU/Linux

HyperThreading is on.

Applications are -np 2 only:
  mpirun --host localhost,localhost --byslot --mca btl sm,self -np 2 ${BIN}

We normally run (up to) 4 of these jobs on each machine.

Using the Intel 11.0.074 and 11.1.0** compilers; we have trouble with the
11.1.0** and "-mcmodel=large -shared-intel" builds, meaning that the
numerical results vary strangely.  Still working on that problem.

Hope that helps,
Douglas.

P.S.  Yes, I know OpenMPI 1.2.8 is old.  We have been using it for 2
years with no apparent problems.  When I saw comments like "machine
hung" for 1.4.1, and "data loss" for 1.3.x, I put aside thoughts of
upgrading.

-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-04 Thread Douglas Guptill
On Tue, May 04, 2010 at 05:34:40PM -0600, Ralph Castain wrote:
> 
> On May 4, 2010, at 4:51 PM, Gus Correa wrote:
> 
> > Hi Ralph
> > 
> > Ralph Castain wrote:
> >> One possibility is that the sm btl might not like that you have 
> >> hyperthreading enabled.
> > 
> > I remember that hyperthreading was discussed months ago,
> > in the previous incarnation of this problem/thread/discussion on "Nehalem 
> > vs. Open MPI".
> > (It sounds like one of those supreme court cases ... )
> > 
> > I don't really administer that machine,
> > or any machine with hyperthreading,
> > so I am not much familiar to the HT nitty-gritty.
> > How do I turn off hyperthreading?
> > Is it a BIOS or a Linux thing?
> > I may try that.
> 
> I believe it can be turned off via an admin-level cmd, but I'm not certain 
> about it

The challenge was too great to resist, so I yielded, and rebooted my
Nehalem (Core i7 920 @ 2.67 GHz) to confirm my thoughts on the issue.

Entering the BIOS setup by pressing "DEL", and "right-arrowing" over
to "Advanced", then "down arrow" to "CPU configuration", I found a
setting called "Intel (R) HT Technology".  The help dialogue says
"When Disabled only one thread per core is enabled".

Mine is "Enabled", and I see 8 cpus.  The Core i7, to my
understanding, is a 4 core chip.

Hope that helps,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] open-mpi behaviour on Fedora, Ubuntu, Debian and CentOS

2010-04-28 Thread Douglas Guptill
Hello Gus:

Thank you for your excellent and well-considered thoughts on the
subject.  You educate us all.

Douglas.

On Wed, Apr 28, 2010 at 02:39:20PM -0400, Gus Correa wrote:
> Hi Asad
>
> I think the speed vs. accuracy tradeoff will always be there.
> Getting both at the same time is kind of a holy grail,
> everybody wants it!
> Whoever asked you to get both gotta be kidding.

..



Re: [OMPI users] MPI_Comm_accept() busy waiting?

2010-03-09 Thread Douglas Guptill
On Tue, Mar 09, 2010 at 05:43:02PM +0100, Ramon wrote:
> Am I the only one experiencing such problem?  Is there any solution?

No, you are not the only one.  Several others have mentioned the "busy
wait" problem.

The response from the OpenMPI developers, as I understand it, is that
the MPI job should be the only one running, so a 100% busy wait is not
a problem.  I hope the OpenMPI developers will correct me if I have
mis-stated their position.

I posted my cure for the problem some time ago.  I have attached it
again to this message.

Hope that helps,
Douglas.


> Ramon wrote:
>> Hi,
>>
>> I've recently been trying to develop a client-server distributed file  
>> system (for my thesis) using the MPI.  The communication between the  
>> machines is working great, however when ever the MPI_Comm_accept()  
>> function is called, the server starts like consuming 100% of the CPU.
>>
>> One interesting thing is that I tried to compile the same code using  
>> the LAM/MPI library and the mentioned behaviour could not be observed.
>>
>> Is this a bug?
>>
>> On a side note, I'm using Ubuntu 9.10's default OpenMPI deb package.   
>> Its version is 1.3.2.
>>
>> Regards
>>
>> Ramon.

-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada

/*
 * Intercept MPI_Recv, and
 * call PMPI_Irecv, loop over PMPI_Request_get_status and sleep, until done
 *
 * Revision History:
 *  2008-12-17: copied from MPI_Send.c
 *  2008-12-18: tweaking.
 *
 * See MPI_Send.c for additional comments, 
 *  especially w.r.t. PMPI_Request_get_status.
 **/

#include "mpi.h"
#define _POSIX_C_SOURCE 199309 
#include <time.h>

int MPI_Recv(void *buff, int count, MPI_Datatype datatype, 
	  int from, int tag, MPI_Comm comm, MPI_Status *status) {

  int flag, nsec_start=1000, nsec_max=10;
  struct timespec ts;
  MPI_Request req;

  ts.tv_sec = 0;
  ts.tv_nsec = nsec_start;

  PMPI_Irecv(buff, count, datatype, from, tag, comm, &req);
  do {
    nanosleep(&ts, NULL);
    ts.tv_nsec *= 2;
    ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
    PMPI_Request_get_status(req, &flag, status);
  } while (!flag);

  return (*status).MPI_ERROR;
}
/*
 * Intercept MPI_Send, and
 * call PMPI_Isend, loop over PMPI_Request_get_status and sleep, until done
 *
 * Revision History:
 *  2008-12-12: skeleton by Jeff Squyres <jsquy...@cisco.com>
 *  2008-12-16->18: adding parameters, variable wait, 
 * change MPI_Test to MPI_Request_get_status
 *  Douglas Guptill <douglas.gupt...@dal.ca>
 **/

/* When we use this:
 *   PMPI_Test(&req, &flag, &status);
 * we get:
 * dguptill@DOME:$ mpirun -np 2 mpi_send_recv_test_mine
 * This is process0  of2 .
 * This is process1  of2 .
 * error: proc0 ,mpi_send returned -1208109376
 * error: proc1 ,mpi_send returned -1208310080
 * 1 changed to3
 *
 * Using MPI_Request_get_status cures the problem.
 *
 * A read of mpi21-report.pdf confirms that MPI_Request_get_status
 * is the appropriate choice, since there seems to be something
 * between the call to MPI_SEND (MPI_RECV) in my FORTRAN program
 * and MPI_Send.c (MPI_Recv.c)
 **/


#include "mpi.h"
#define _POSIX_C_SOURCE 199309 
#include <time.h>

int MPI_Send(void *buff, int count, MPI_Datatype datatype, 
	  int dest, int tag, MPI_Comm comm) {

  int flag, nsec_start=1000, nsec_max=10;
  struct timespec ts;
  MPI_Request req;
  MPI_Status status;

  ts.tv_sec = 0;
  ts.tv_nsec = nsec_start;

  PMPI_Isend(buff, count, datatype, dest, tag, comm, &req);
  do {
    nanosleep(&ts, NULL);
    ts.tv_nsec *= 2;
    ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
    PMPI_Request_get_status(req, &flag, &status);
  } while (!flag);

  return status.MPI_ERROR;
}


Re: [OMPI users] openmpi fails to terminate for errors/signals on some but not all processes

2010-02-10 Thread Douglas Guptill
Hello Laurence:

If I correctly remember your code which created this problem, perhaps
you could solve it by using the iostat specifier:

   read(unit,*,iostat=ierror) some_variable
   if (ierror.ne.0) then
c  handle error
   endif

Hope that helps,
Douglas.

On Mon, Feb 08, 2010 at 01:29:38PM -0600, Laurence Marks wrote:
> This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie
> mpi processes", but it is more severe than this.
> 
> Sorry, but it appears that at least with ifort most run-time errors
> and signals will leave zombie processes behind with openmpi if they
> only occur on some of the processors and not all. You can test this
> with the attached using (for instance)
> 
> mpicc -c doraise.c
> mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3
> 
> Then, as appropriate mpirun -np 8 crash_test
> 
> The output is self explanatory, and has an option to both try and
> simulate common fortran problems as well as to send fortran or C
> signals to the process. Please note that the results can be dependent
> upon the level of optimization, and with other compilers there could
> be problems where the compiler complains about SIGSEV or other errors
> since the code deliberately tries to create these.
> 
> -- 
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Electron crystallography is the branch of science that uses electron
> scattering and imaging to study the structure of matter.

> #include <stdio.h>
> #include <signal.h>
> 
> void doraise(isig)
> long isig[1] ;
> {
> int i, j ;
>i = isig[0];
>raise( i );   /* signal i is raised */
> }
> 
> void doraise_(isig)
> long isig[1] ;
> {
>  doraise(isig) ;
> }
> 
> void whatsig(isig)
> long isig[1] ;
> {
> int i ;
> i = isig[0];
> psignal( i , "Testing Signal");
> }
> 
> void whatsig_(isig)
> long isig[1] ;
> {
>  whatsig(isig) ;
> }
> 
> void showallsignals()
> {
> int i ;
> char buf[15];
> for ( i = 1; i < 32; i++ ) {
>sprintf(buf, "Signal code %d ", i);
>psignal( i , buf );
> }
> }
> 
> void showallsignals_()
> {
>  showallsignals() ;
> }
> 





Re: [OMPI users] How to start MPI_Spawn child processes early?

2010-01-27 Thread Douglas Guptill
It sounds to me a bit like asking to be born before your mother.

Unless I misunderstand the question...
Douglas.

On Thu, Jan 28, 2010 at 09:24:29AM +1100, Jaison Paul wrote:
> Hi, I am just reposting my early query once again. If anyone can  
> give some hint, that would be great.
>
> Thanks, Jaison
> ANU
>
> Jaison Paul wrote:
>> Hi All,
>>
>> I am trying to use MPI for scientific High Performance (hpc)  
>> applications. I use MPI_Spawn to create child processes. Is there a  
>> way to start child processes early than the parent process, using  
>> MPI_Spawn?
>>
>> I want this because, my experiments showed that the time to spawn the  
>> children by parent is too long for HPC apps which slows down the whole  
>> process. If the children are ready when parent application process  
>> seeks for them, that initial delay can be avoided. Is there a way to  
>> do that?
>>
>> Thanks in advance,
>>
>> Jaison
>> Australian National University


Re: [OMPI users] Mimicking timeout for MPI_Wait

2009-12-07 Thread Douglas Guptill
On Mon, Dec 07, 2009 at 08:21:46AM -0500, Richard Treumann wrote:

> The need for a "better" timeout depends on what else there is for the CPU
> to do.
> 
> If you get creative and shift from {99% MPI_WAIT , 1% OS_idle_process} to
> {1% MPI_Wait, 99% OS_idle_process} at a cost of only a few extra
> microseconds added lag on MPI_Wait, you may be pleased by the CPU load
> statistic but still have only hurt yourself. Perhaps you have not hurt
> yourself much but for what? The CPU does not get tired of spinning in
> MPI_Wait rather than in the OS_idle_process
> 
> Most MPI applications run with an essentially dedicated CPU per process.

Not true in our case.  The computer in question (Intel Core i7, one
cpu, four cores) has several other uses.

It is a general purpose desktop/server for myself, and potential other
users.  I edit and compile the MPI application on it.  I read and
write email from it.  My subversion repositories and server will soon
be on it.  My Trac server (and Apache2) will soon be on it.

Now that MPI does not do busy waits, it can do all that, and run 4
copies of our MPI application.

> In most MPI applications if even one task is sharing its CPU with
> other processes, like users doing compiles, the whole job slows down
> too much.

I have not found that to be the case.

Regards,
Douglas.
-- 
  Douglas Guptill   voice: 902-461-9749
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada



Re: [OMPI users] Release date for 1.3.4?

2009-11-12 Thread Douglas Guptill
Hello Eugene:

On Thu, Nov 12, 2009 at 07:20:08AM -0800, Eugene Loh wrote:
> Jeff Squyres wrote:
>
>> I think Eugene will have to answer this one -- Eugeue?
>>
>> On Nov 12, 2009, at 6:35 AM, John R. Cary wrote:
>>
>>> From http://svn.open-mpi.org/svn/ompi/branches/v1.3/NEWS I see:
>>>
>>> - Many updates and fixes to the (non-default) "sm" collective
>>>   component (i.e., native shared memory MPI collective operations).
>>>
>>> Will this fix the problem noted at
>>>
>>> https://svn.open-mpi.org/trac/ompi/ticket/2043
>>
> I've been devoting a lot of time to this one.  There seems (to me) to be  
> something very goofy going on here, but I've whittled it down a lot.  I  
> hope soon to have a simple demonstration of the problem so that it'll be  
> possible to decide if there is a problem with OMPI, GCC 4.4.0, both, or  
> something else.  So, I've made a lot of progress, but you're asking how  
> much longer until it's solved.  Different question.  I don't know.  This  
> is not a straightforward problem.

I love that answer.  Sincerely.  It should be taught in schools.  It
should be part of every programmer's toolkit.

Douglas.


Re: [OMPI users] mpirun example program fail on multiple nodes - unable to launch specified application on client node

2009-11-05 Thread Douglas Guptill
On Thu, Nov 05, 2009 at 03:15:33PM -0600, Qing Pang wrote:

> Thank you Jeff! That solves the problem. :-) You are the lifesaver!
> So does that means I always need to copy my application to all the  
> nodes? Or should I give the pathname of the my executable in a different  
> way to avoid this? Do I need a network file system for that?

I am currently using sshfs to mount both OpenMPI and my application on
the "other" computers/nodes.  The advantage to this is that I have
only one copy of OpenMPI and my application.  There may be a
performance penalty, but I haven't seen it yet.

Douglas.


[OMPI users] 100% CPU doing nothing!?

2009-04-22 Thread Douglas Guptill
Hi Ross:

On Tue, Apr 21, 2009 at 07:19:53PM -0700, Ross Boylan wrote:
> I'm using Rmpi (a pretty thin wrapper around MPI for R) on Debian Lenny
> (amd64).  My set up has a central calculator and a bunch of slaves to
> which work is distributed.
> 
> The slaves wait like this:
> mpi.send(as.double(0), doubleType, root, requestCode, comm=comm)
> request <- request+1
> cases <- mpi.recv(cases, integerType, root, mpi.any.tag(),
> comm=comm)
> 
> I.e., they do a simple send and then a receive.
> 
> It's possible there's no one to talk to, so it could be stuck at
> mpi.send or mpi.recv.
> 
> Are either of those operations that should chew up CPU?  At this point,
> I'm just trying to figure out where to look for the source of the
> problem.

Search the list archives
  http://www.open-mpi.org/community/lists/users/
for "100% CPU" and you will get lots to look at.

I had a similar problem with a FORTRAN program.  With the help of Jeff
Squyres and Eugene Loh I wrote a solution: user-written MPI_Recv.c and
MPI_Send.c which I link with my application, and the MPI problem
"100% CPU usage while doing nothing" is cured.

The code for MPI_Recv.c and MPI_Send.c is here:
  http://www.open-mpi.org/community/lists/users/2008/12/7563.php

Cheers,
Douglas.



Re: [OMPI users] Intel compiler libraries (was: libnuma issue)

2009-04-16 Thread Douglas Guptill
On Thu, Apr 16, 2009 at 05:29:14PM +0200, Francesco Pietra wrote:
> On Thu, Apr 16, 2009 at 3:04 PM, Jeff Squyres  wrote:
...
> Given my inexperience as a system analyzer, I assume that I am messing
> something up. Unfortunately, I was unable to discover where I am messing up.
> An editor is waiting completion of calculations requested by a
> referee, and I am unable to answer.
> 
> thanks a lot for all you have tried to put me on the right road

I wonder if the confusion stems from the requirement to "source" the
intel compiler setup files in (at least) two situations:
  1. when compiling the (MPI) application
  2. when running the (MPI) application

My solution to the second has been to create - as part of the build
process for my application - a "run" script for it.  That script
sources the intel setup files, then runs the application.

Here is part of the script that runs my application:

==
# If it is defined, source the intel setup script.
#
if test "x/opt/intel/Compiler/11.0/074/bin/ifortvars.sh intel64" != x ; then
echo "setup the intel compiler with 
"
. /opt/intel/Compiler/11.0/074/bin/ifortvars.sh intel64
if test -z $(echo ${LD_LIBRARY_PATH} | grep intel) ; then
echo "Don't see intel in LD_LIBRARY_PATH=<${LD_LIBRARY_PATH}>"
echo "you may have trouble"
fi
fi
...
# run my program
==

I am running only on the 4 cores of one machine, so this solution may
not work for MPI applications that run on multiple machines.

Hope that helps,
Douglas.


Re: [OMPI users] Open MPI 2009 released

2009-04-02 Thread Douglas Guptill
On Wed, Apr 01, 2009 at 06:04:15PM -0400, George Bosilca wrote:

> The Open MPI Team, representing a consortium of bailed-out banks, car
> manufacturers, and insurance companies, is pleased to announce the
> release of the "unbreakable" / bug-free version Open MPI 2009,
> (expected to be available by mid-2011).  This release is essentially a
..

Marvellous.  A keeper.

Douglas.


Re: [OMPI users] threading bug?

2009-03-06 Thread Douglas Guptill
I once had a crash in libpthread, something like the one below.  The
very un-obvious cause was a stack overflow on subroutine entry: a large
automatic array.
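
A minimal C sketch of the same kind of stack use (in my case the culprit
was a Fortran automatic array; the routine name and array size below are
made up for illustration):

#include <stdio.h>

/* A large automatic (stack) variable: about 16 MB here, bigger than a
 * typical 8 MB default stack limit.  The program can then die on entry
 * to the routine with a confusing SIGSEGV that appears to come from a
 * system library (libpthread, libc) rather than from user code. */
void big_automatic_array(void)
{
    double work[2 * 1024 * 1024];    /* ~16 MB on the stack */
    work[0] = 1.0;
    printf("%f\n", work[0]);
}

int main(void)
{
    big_automatic_array();
    return 0;
}

Raising the stack limit (ulimit -s), or making the array allocatable so
it lives on the heap, avoids the crash.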

HTH,
Douglas.

On Wed, Mar 04, 2009 at 03:04:20PM -0500, Jeff Squyres wrote:
> On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:
> 
> >I am using intel lc_prof-11 (and its own mkl) and have built  
> >openmpi-1.3.1 with connfigure options: "FC=ifort F77=ifort CC=icc  
> >CXX=icpc". Then I have built my application.
> >The linux box is 2Xamd64 quad. In the middle of running of my  
> >application (after some 15 iterations), I receive the message and  
> >stops.
> >I tried to configure openmpi using "--disable-mpi-threads" but it  
> >automatically assumes "posix".
> 
> This doesn't sound like a threading problem, thankfully.  Open MPI has  
> two levels of threading issues:
> 
> - whether MPI_THREAD_MULTIPLE is supported or not (which is what -- 
> enable|disable-mpi-threads does)
> - whether thread support is present at all on the system (e.g.,  
> solaris or posix threads)
> 
> You see "posix" in the configure output mainly because OMPI still  
> detects that posix threads are available on the system.  It doesn't  
> necessarily mean that threads will be used in your application's run.
> 
> >This problem does not happen in openmpi-1.2.9.
> >Any comment is highly appreciated.
> >Best regards,
> >mahmoud payami
> >
> >
> >[hpc1:25353] *** Process received signal ***
> >[hpc1:25353] Signal: Segmentation fault (11)
> >[hpc1:25353] Signal code: Address not mapped (1)
> >[hpc1:25353] Failing at address: 0x51
> >[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
> >[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/
> >openmpi/mca_pml_ob1.so [0x2e350d96]
> >[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/
> >openmpi/mca_pml_ob1.so [0x2e3514a8]
> >[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so  
> >[0x2eb7c72a]
> >[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so. 
> >0(opal_progress+0x89) [0x2b42b7d9]
> >[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so  
> >[0x2e34d27c]
> >[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv 
> >+0x210) [0x2af46010]
> >[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv 
> >+0xa4) [0x2acd6af4]
> >[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_ 
> >+0x13da) [0x513d8a]
> >[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
> >[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
> >[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
> >[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
> >[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
> >[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
> >[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4)  
> >[0x303b21d8a4]
> >[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
> >[hpc1:25353] *** End of error message ***
> >--
> >mpirun noticed that process rank 6 with PID 25353 on node hpc1  
> >exited on signal 11 (Segmentation fault).
> >--
> 
> What this stack trace tells us is that Open MPI crashed somewhere  
> while trying to use shared memory for message passing, but it doesn't  
> really tell us much else.  It's not clear, either, whether this is  
> OMPI's fault or your app's fault (or something else).
> 
> Can you run your application through a memory-checking debugger to see  
> if anything obvious pops out?
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 


Re: [OMPI users] valgrind problems

2009-02-27 Thread Douglas Guptill
On Thu, Feb 26, 2009 at 08:27:15PM -0700, Justin wrote:
> Also the stable version of openmpi on Debian is 1.2.7rc2.  Are there any 
> known issues with this version and valgrind?

For a now-forgotten reason, I ditched the openmpi that comes on Debian
etch, and installed 1.2.8 in /usr/local.

HTH,
Douglas.


Re: [OMPI users] Supporting OpenMPI compiled for multiple compilers

2009-02-10 Thread Douglas Guptill
Hello Prentice:

On Tue, Feb 10, 2009 at 12:04:47PM -0500, Prentice Bisbal wrote:
> I need to support multiple compilers: Portland, Intel and GCC, so I've
> been compiling OpenMPI with each compiler, to avoid the Fortran symbol
> naming problems. When compiling, I'd use the --prefix and -exec-prefix
> switches like so:
> 
> GCC:
> ../configure CC=gcc CXX=g++ F77=gfortran FC=gfortran
> --prefix=/usr/local/openmpi-1.2.8
> --exec-prefix=/usr/local/openmpi-1.2.8/gcc-4.1.2/x86

..

> Does each compiler need a completely separate tree under --prefix?

That is how I do it.  I haven't needed "--exec-prefix".  Here is part
of my openmpi-configure:


OPENMPI_BASE=/usr/local/openmpi-1.2.8

# Create a name for the build directory
#
BUILD_DIR=${OPENMPI_BASE}/build_${COMPILER}-${OPTIONS}
echo "building in ${BUILD_DIR}"

# create the build directory
#
[[ -d ${BUILD_DIR} ]] || mkdir ${BUILD_DIR}
[[ -d ${BUILD_DIR} ]] || {
    echo "Failed to create ${BUILD_DIR}.  exit."
    exit 0
}

# make the configure command
#
COM="../configure --prefix=${BUILD_DIR} "

Hope that helps,
Douglas.


Re: [OMPI users] Handling output of processes

2009-01-26 Thread Douglas Guptill
Hello Ralph:

Please forgive if this has already been covered...

Have you considered prefixing each line of output from each process
with something like "process_number" and a colon?

That is what IBM's poe does.  Separating the output is then easy: 
  cat file | grep 0: > file.0
  cat file | grep 1: > file.1
etc.

In the meantime, I got around the problem by having each process open
its own file (instead of using stdout), and embedding the process
number in the file name.
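
A minimal C sketch of that approach (the file name pattern here is made
up for illustration; the real application is Fortran):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char fname[64];
    FILE *log;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each process writes to its own file: output.0, output.1, ... */
    snprintf(fname, sizeof(fname), "output.%d", rank);
    log = fopen(fname, "w");
    if (log != NULL) {
        fprintf(log, "hello from rank %d\n", rank);
        fclose(log);
    }

    MPI_Finalize();
    return 0;
}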

Hope that helps,
Douglas.


On Sun, Jan 25, 2009 at 05:20:47AM -0700, Ralph Castain wrote:
> For those of you following this thread:
> 
> I have been impressed by the various methods used to grab the output  
> from processes. Since this is clearly something of interest to a broad  
> audience, I would like to try and make this easier to do by adding  
> some options to mpirun. Coming in 1.3.1 will be --tag-output, which  
> will automatically tag each line of output with the rank of the  
> process - this was already in the works, but obviously doesn't meet  
> the needs expressed here.
> 
> I have done some prelim work on a couple of options based on this  
> thread:
> 
> 1. spawn a screen and redirect process output to it, with the ability  
> to request separate screens for each specified rank. Obviously,  
> specifying all ranks would be the equivalent of replacing "my_app" on  
> the mpirun cmd line with "xterm my_app". However, there are cases  
> where you only need to see the output from a subset of the ranks, and  
> that is the intent of this option.
> 
> 2. redirect output of specified processes to files using the provided  
> filename appended with ".rank". You can do this for all ranks, or a  
> specified subset of them.
> 
> 3. timestamp output
> 
> Is there anything else people would like to see?
> 
> It is also possible to write a dedicated app such as Jody described,  
> but that is outside my purview for now due to priorities. However, I  
> can provide technical advice to such an effort, so feel free to ask.
> 
> Ralph
> 
> 
> On Jan 23, 2009, at 12:19 PM, Gijsbert Wiesenekker wrote:
> 
> >jody wrote:
> >>Hi
> >>I have a small cluster consisting of 9 computers (8x2 CPUs, 1x4  
> >>CPUs).
> >>I would like to be able to observe the output of the processes
> >>separately during an mpirun.
> >>
> >>What i currently do is to apply the mpirun to a shell script which
> >>opens a xterm for each process,
> >>which then starts the actual application.
> >>
> >>This works, but is a bit complicated, e.g. finding the window you're
> >>interested in among 19 others.
> >>
> >>So i was wondering is there a possibility to capture the processes'
> >>outputs separately, so
> >>i can make an application in which i can switch between the different
> >>processor outputs?
> >>I could imagine that could be done by wrapper applications which
> >>redirect the output over a TCP
> >>socket to a server application.
> >>
> >>But perhaps there is an easier way, or something like this alread  
> >>does exist?
> >>
> >>Thank You
> >> Jody
> >>
> >>
> >For C I use a printf wrapper function that writes the output to a  
> >logfile. I derive the name of the logfile from the mpi_id. It  
> >prefixes the lines with a time-stamp, so you also get some basic  
> >profile information. I can send you the source code if you like.
> >
> >Regards,
> >Gijsbert
> >
> >
> 

 FORTRAN?  
 The syntactically incorrect statement "DO 10 I = 1.10" will parse and
 generate code creating a variable, DO10I, as follows: "DO10I = 1.10"  If that
 doesn't terrify you, it should.



Re: [OMPI users] Problem compiling open mpi 1.3 with sunstudio12 express

2009-01-19 Thread Douglas Guptill
When I use the Intel compilers, I have to add to my PATH and
LD_LIBRARY_PATH before using "mpif90".  I wonder if this needs to be
done in your case?

Douglas.

On Mon, Jan 19, 2009 at 05:49:53PM +0100, Olivier Marsden wrote:
> Hello,
> 
> I'm trying to compile ompi 1.3rc7 with the sun studio express comilers.
> 
> I'm using the following configure command:
> 
> CC=/opt/sun/express/sunstudioceres/bin/cc 
> CXX=/opt/sun/express/sunstudioceres/bin/CC   
> F77=/opt/sun/express/sunstudioceres/bin/f77 
> FC=/opt/sun/express/sunstudioceres/bin/f90  ./configure 
> --prefix=/opt/mpi_sun --enable-heterogeneous  --enable-shared 
> --enable-mpi-f90 --with-mpi-f90-size=small --disable-mpi-threads 
> --disable-progress-threads --disable-debug  --without-udapl 
> --disable-io-romio
> 
> The build and install execute correctly. However, I get the following 
> when trying to use mpif90:
> >> /opt/mpi_sun/bin/mpif90
> gfortran: no input files
> 
> My /opt/mpi_sun/share/openmpi/mpif90-wrapper-data.txt file  appears to 
> my layman eye to be correct, but just
> in case, its contents is the following:
> 
> project=Open MPI
> project_short=OMPI
> version=1.3rc7
> language=Fortran 90
> compiler_env=FC
> compiler_flags_env=FCFLAGS
> compiler=/opt/sun/express/sunstudioceres/bin/f90
> module_option=-M
> extra_includes=
> preprocessor_flags=
> compiler_flags=
> linker_flags=
> libs=-lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal   -ldl   
> -Wl,--export-dynamic -lnsl -lutil -lm -ldl
> required_file=
> includedir=${includedir}
> libdir=${libdir}
> 
> 
> Can anyone see why gfortran is being used? (the config.log says that sun 
> f90 is used )
> 
> Thanks,
> 
> Olivier



Re: [OMPI users] trouble using --mca mpi_yield_when_idle 1

2008-12-18 Thread douglas . guptill
Hello Jeff, Eugene:

On Fri, Dec 12, 2008 at 04:47:11PM -0500, Jeff Squyres wrote:

..

> The "P" is MPI's profiling interface.  See chapter 14 in the MPI-2.1  
> doc.

Ah...Thank you, both Jeff and Eugene, for pointing that out.

I think there is a typo in chapter 14 - the first sentence isn't a
sentence - but that's another story.


> >Based on my re-read of the MPI standard, it appears that I may have
> >slightly mis-stated my issue.  The spin is probably taking place in
> >"mpi_send".  "mpi_send", according to my understanding of the MPI
> >standard, may not exit until a matching "mpi_recv" has been initiated,
> >or completed.  At least that is the conclusion I came to.
> 
> Perhaps something like this:
> 
> int MPI_Send(...) {
>MPI_Request req;
>int flag;
>    PMPI_Isend(..., &req);
>    do {
>       nanosleep(short);
>       PMPI_Test(&req, &flag, MPI_STATUS_IGNORE);
>} while (!flag);
> }
> 
> That is, *you* provide MPI_Send and intercept all your apps calls to  
> MPI_Send.  But you implement it by doing a non-blocking send and  
> sleeping and polling MPI to know when it's done.  Of course, you don't  
> have to implement this as MPI_Send -- you could always have  
> your_func_prefix_send(...) instead of explicitly using the MPI  
> profiling interface.  But using the profiling interface allows you to  
> swap in/out different implementations of MPI_Send (etc.) at link time,  
> if that's desirable to you.
> 
> Looping over sleep/test is not the most efficient way of doing it, but  
> it may be suitable for your purposes.

Indeed, it is very suitable.  Thank you, both Jeff and Eugene, for
pointing the way.  That solution changes the load for my job from 2.0
to 1.0, as indicated by "xload" over a 40-minute run.

That means I can *double* the throughput of my machine.

Some gory details:

I ignored the suggestion to use MPI_STATUS_IGNORE, and that got me
some trouble, as you may not be surprised to hear.  The solution was
to use MPI_Request_get_status instead of MPI_Test.

As some of my waits (both in MPI_SEND and MPI_RECV) will be very
short, and some will be up to 4 minutes, I implemented a graduated
sleep time; it starts at 1 millisecond, and doubles after each sleep
up to a maximum of 100 milliseconds.  Interestingly, when I left the
sleep time at a constant 1 millisecond, the run load went up
significantly; it varied over the range 1.3 -> 1.7.

I have attached my MPI_Send.c and MPI_Recv.c .  Comments welcome and
appreciated.

Regards,
Douglas.
-- 
  Douglas Guptill   
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada
/*
 * Intercept MPI_Recv, and
 * call PMPI_Irecv, loop over PMPI_Request_get_status and sleep, until done
 *
 * Revision History:
 *  2008-12-17: copied from MPI_Send.c
 *  2008-12-18: tweaking.
 *
 * See MPI_Send.c for additional comments, 
 *  especially w.r.t. PMPI_Request_get_status.
 **/

#include "mpi.h"
#define _POSIX_C_SOURCE 199309 
#include <time.h>

int MPI_Recv(void *buff, int count, MPI_Datatype datatype, 
	  int from, int tag, MPI_Comm comm, MPI_Status *status) {

  int flag, nsec_start=1000, nsec_max=10;
  struct timespec ts;
  MPI_Request req;

  ts.tv_sec = 0;
  ts.tv_nsec = nsec_start;

  PMPI_Irecv(buff, count, datatype, from, tag, comm, &req);
  do {
    nanosleep(&ts, NULL);
    ts.tv_nsec *= 2;
    ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
    PMPI_Request_get_status(req, &flag, status);
  } while (!flag);

  return (*status).MPI_ERROR;
}
/*
 * Intercept MPI_Send, and
 * call PMPI_Isend, loop over PMPI_Request_get_status and sleep, until done
 *
 * Revision History:
 *  2008-12-12: skeleton by Jeff Squyres <jsquy...@cisco.com>
 *  2008-12-16->18: adding parameters, variable wait, 
 * change MPI_Test to MPI_Request_get_status
 *  Douglas Guptill <douglas.gupt...@dal.ca>
 **/

/* When we use this:
 *   PMPI_Test(&req, &flag, &status);
 * we get:
 * dguptill@DOME:$ mpirun -np 2 mpi_send_recv_test_mine
 * This is process0  of2 .
 * This is process1  of2 .
 * error: proc0 ,mpi_send returned -1208109376
 * error: proc1 ,mpi_send returned -1208310080
 * 1 changed to3
 *
 * Using MPI_Request_get_status cures the problem.
 *
 * A read of mpi21-report.pdf confirms that MPI_Request_get_status
 * is the appropriate choice, since there seems to be something
 * between the call to MPI_SEND (MPI_RECV) in my FORTRAN program
 * and MPI_Send.c (MPI_Recv.c)
 **/


#include "mpi.h"
#define _POSIX_C_SOURCE 199309 
#include <time.h>

int MPI_Send(void *buff, int count, MPI_Datatype datatype, 
	  int dest, int tag, MPI_Comm comm) {

  int flag, nsec_start=1000, nsec_max=10;
  struct timespec ts;
  MPI_Request req;
  MPI_Status status;

  ts.tv_sec = 0;
  ts.tv_nsec = nsec_start;

  PMPI_Isend(buff, count, datatype, dest, tag, comm, &req);
  do {
    nanosleep(&ts, NULL);
    ts.tv_nsec *= 2;
    ts.tv_nsec = (ts.tv_nsec > nsec_max) ? nsec_max : ts.tv_nsec;
    PMPI_Request_get_status(req, &flag, &status);
  } while (!flag);

  return status.MPI_ERROR;
}

Re: [OMPI users] trouble using --mca mpi_yield_when_idle 1

2008-12-12 Thread douglas . guptill
Hello Jeff:

On Fri, Dec 12, 2008 at 08:37:14AM -0500, Jeff Squyres wrote:

> FWIW, Open MPI does have on its long-term roadmap to have "blocking"  
> progress -- meaning that it'll (probably) spin aggressively for a  
> while and if nothing "interesting" is happening, it'll go into a  
> blocking mode and let the process block in some kind of OS call.
> 
> Although we have some interesting ideas on how to do this, it's not  
> entirely clear when we'll get this done.  There's been a few requests  
> for this kind of feature before, but not a huge demand.

Please count me as wanting that feature.  And it would be nice - for
our application - to block immediately.

> This is probably because most users running MPI jobs tend to devote
> the entire core/CPU/server to the MPI job and don't try to run other
> jobs concurrently on the same resources.

Our situation is different.  While our number-cruncher application is
running, we would like to be able to do some editing, compiling,
post-processing.

I once ran three jobs, hence 6 processes, on our 4-cpu system, and was
unable to ssh into the machine.  Or maybe I did not wait long
enough...

The number-cruncher has two processes, and each needs intermediate
results from the other, inside a
  do i=1,3
  enddo 

As I mentioned earlier, most of the time, only one process is
executing, and the other is waiting for results.  My guess is that,
with the blocking feature you describe, I could double the number of
number-cruncher jobs running at one time, thus doubling throughput.
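
To illustrate the pattern, a rough C sketch (the loop count, tags, and
the compute_step routine are placeholders, not the real code):

#include <mpi.h>

/* placeholder for the real number-crunching step */
static void compute_step(double *x, int n) { (void)x; (void)n; }

/* Two ranks exchange intermediate results each iteration.  While one
 * rank is computing, the other sits in MPI_Recv; with busy-wait
 * progress, that waiting rank still occupies a full CPU. */
void exchange_loop(int rank, double *mine, double *theirs, int n)
{
    int other = 1 - rank;        /* two-process job: ranks 0 and 1 */
    MPI_Status st;
    int i;

    for (i = 0; i < 3; i++) {
        compute_step(mine, n);
        if (rank == 0) {
            MPI_Send(mine, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
            MPI_Recv(theirs, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(theirs, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &st);
            MPI_Send(mine, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        }
    }
}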

Regards,
Douglas.
-- 
  Douglas Guptill   
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada


Re: [OMPI users] trouble using --mca mpi_yield_when_idle 1

2008-12-12 Thread douglas . guptill
Hello Eugene:

On Mon, Dec 08, 2008 at 11:14:10AM -0800, Eugene Loh wrote:

..

>> Proceeding from that, it seems that "mpi_recv" is implemented as 
>>   "poll forever until the message comes"
>> and NOT as 
>>"sleep until the message comes" 
>> 
>> I had assumed, until now, that mpi_recv would be implemented as the
>> second.

> It isn't a binary situation. The first extreme you mention is "consume
> a lot of resources and spring into action as soon as there is work to
> do." The second extreme you mention is "don't use any extra resources,
> but take a long time to wake up once there is something to do".
> There are a bunch of alternatives in-between. E.g., "don't use as much
> resource and wake up kind of fast when there is something to do."
> 
> OMPI's yield behavior is such an in-between alternative.

Ah...I didn't realize there were more than two alternatives.

> I could imagine another alternative. Construct a wrapper function that
> intercepts MPI_Recv and turn it into something like
> 
> PMPI_Irecv
> while ( ! done ) {
>  nanosleep(short);
>  PMPI_Test(done);
> }
> 
> I don't know how useful this would be for your particular case.
> 

Thank you for the suggestion.  I didn't know about "PMPI_Irecv" (my
question was: where did the "P" prefix to MPI come from?), so I
went back to the MPI standard, and re-read the description of
"mpi_send" and "mpi_recv".

Based on my re-read of the MPI standard, it appears that I may have
slightly mis-stated my issue.  The spin is probably taking place in
"mpi_send".  "mpi_send", according to my understanding of the MPI
standard, may not exit until a matching "mpi_recv" has been initiated,
or completed.  At least that is the conclusion I came to.

However my complaint - sorry, I wish I could think of a better word -
remains.  It appears that openmpi spin-waits, as opposed to, say,
going to sleep and waiting for a wake-up call.  Like a semaphore - if
those things still exist.

Regards,
Douglas.
-- 
  Douglas Guptill   
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada


Re: [OMPI users] trouble using --mca mpi_yield_when_idle 1

2008-12-08 Thread Douglas Guptill
On Mon, Dec 08, 2008 at 08:56:59PM +1100, Terry Frankcombe wrote:

> As Eugene said:  Why are you desperate for an idle CPU?

So I can run another job.  :-)

Douglas.
-- 
  Douglas Guptill   
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada


Re: [OMPI users] trouble using --mca mpi_yield_when_idle 1

2008-12-08 Thread douglas . guptill
Hello Eugene:

On Sun, Dec 07, 2008 at 11:15:21PM -0800, Eugene Loh wrote:
> Douglas Guptill wrote:
> 
> >Hi:
> >
> >I am using openmpi-1.2.8 to run a 2 processor job on an Intel
> >Quad-core cpu.  Opsys is Debian etch.  I am reasonably sure that, most
> >of the time, one process is waiting for results from the other.  The
> >code is fortran 90, and uses mpi_send and mpi_recv.  Yet
> >"gnome-system-monitor" shows 2 cpus at 100%.
> >
> >So I read, and re-read, the FAQs, and found the mpi_yield_when_idle
> >flag, and tried it:
> >
> >mpirun --host localhost,localhost,localhost,localhost --mca btl sm,self 
> >--mca mpi_yield_when_idle 1 --byslot -np 2 
> >/home/dguptill/software/sopale_nested_2008-10-24/bin/sopale_nested_openmpi-intel-noopt
> >
> >And still get, for each run, two cpus are at 100%.
> >
> >My goal is to get the system to a minimum usage state, where only one
> >cpu is being used, if one process is waiting for results from the
> >other.
> >
> >Can anyone suggest if this is possible, and if so, how?
> > 
> >
> I'm no expert on this, but I've played with the same problem.  I think I 
> did this on Solaris, but perhaps the behavior is the same on other OSes.
> 
> One issue is that "yield" might mean "yield if there is someone else 
> ready to run".  Like a traffic sign:  if someone else is there, you 
> yield.  If no one else is there, there's no way to tell that someone is 
> yielding.
> 
> Next, even if someone else is trying to run, "yield" doesn't give give 
> up the CPU 100%.  It's still rather pesky.
> 
> So, one question is whether you really want to have an idle CPU.  Do 
> you, or do you simply want another process, if there is one, to be able 
> to run?
> 
> Not a real answer to your question, but hopefully this helps.

It does.  I think you have raised an excellent question: yield to whom?

I can think of 3 classes of process:
  1. other processes created by the same "mpirun"
  2. other processes created by a different "mpirun"
  3. other processes.

Classes 2 and 3 seem to be out of the range of possibility under the
circumstances, so we are left with class 1.  In my case, there are
only two processes:
  one is computing, 
  the other is in "mpi_recv".

Proceeding from that, it seems that "mpi_recv" is implemented as 
  "poll forever until the message comes"
and NOT as 
   "sleep until the message comes" 

I had assumed, until now, that mpi_recv would be implemented as the
second.

If "mpi_recv" is implemented as "poll forever", openmpi (or any MPI
with the same implementation) would seem to be unsuitable for my
application, since the application is using two cpus, but only getting
real work out of one.
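
For illustration, the "sleep until the message comes" alternative could
be sketched like this (this is only a sketch of the idea, not how
openmpi implements mpi_recv):

#include <mpi.h>
#include <time.h>

/* Post a non-blocking receive, then sleep briefly between completion
 * checks instead of spinning at 100% cpu. */
int recv_with_sleep(void *buf, int count, MPI_Datatype type,
                    int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    MPI_Request req;
    int done = 0;
    struct timespec ts = { 0, 1000000 };    /* 1 millisecond */

    MPI_Irecv(buf, count, type, src, tag, comm, &req);
    while (!done) {
        nanosleep(&ts, NULL);
        MPI_Test(&req, &done, status);
    }
    return MPI_SUCCESS;
}

(The later messages in this thread arrive at essentially this, done
through the PMPI profiling interface so that no application changes
are needed.)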

Regards,
Douglas.
-- 
  Douglas Guptill   
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada


[OMPI users] trouble using --mca mpi_yield_when_idle 1

2008-12-06 Thread Douglas Guptill
Hi:

I am using openmpi-1.2.8 to run a 2 processor job on an Intel
Quad-core cpu.  Opsys is Debian etch.  I am reasonably sure that, most
of the time, one process is waiting for results from the other.  The
code is fortran 90, and uses mpi_send and mpi_recv.  Yet
"gnome-system-monitor" shows 2 cpus at 100%.

So I read, and re-read, the FAQs, and found the mpi_yield_when_idle
flag, and tried it:

 mpirun --host localhost,localhost,localhost,localhost --mca btl sm,self --mca 
mpi_yield_when_idle 1 --byslot -np 2 
/home/dguptill/software/sopale_nested_2008-10-24/bin/sopale_nested_openmpi-intel-noopt

And still get, for each run, two cpus are at 100%.

My goal is to get the system to a minimum usage state, where only one
cpu is being used, if one process is waiting for results from the
other.

Can anyone suggest if this is possible, and if so, how?

Thanks,
Douglas.
-- 
  Douglas Guptill   
  Research Assistant, LSC 4640  email: douglas.gupt...@dal.ca
  Oceanography Department   fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada