Re: [OMPI users] [mpich-discuss] problem with MPI_Get_count() for very long (but legal length) messages.

2010-02-08 Thread Jeff Squyres
FWIW, I filed https://svn.open-mpi.org/trac/ompi/ticket/2241 about this.

Thanks Jed!


On Feb 6, 2010, at 10:56 AM, Jed Brown wrote:

> On Fri, 5 Feb 2010 14:28:40 -0600, Barry Smith  wrote:
> > To cheer you up, when I run with openMPI it runs forever sucking down 
> > 100% CPU trying to send the messages :-)
> 
> On my test box (x86 with 8GB memory), Open MPI (1.4.1) does complete
> after several seconds, but still prints the wrong count.
> 
> MPICH2 does not actually send the message, as you can see by running the
> attached code.
> 
>   # Open MPI 1.4.1, correct cols[0]
>   [0] sending...
>   [1] receiving...
>   count -103432106, cols[0] 0
> 
>   # MPICH2 1.2.1, incorrect cols[0]
>   [1] receiving...
>   [0] sending...
>   [1] count -103432106, cols[0] 1
> 
> 
> How much memory does crush have (you need about 7GB to do this without
> swapping)?  In particular, most of the time it took Open MPI to send the
> message (with your source) was actually just spent faulting the
> send/recv buffers.  The attached faults the buffers first, and the
> subsequent send/recv takes less than 2 seconds.
> 
> Actually, it's clear that MPICH2 never touches either buffer because it
> returns immediately regardless of whether they have been faulted first.
> 
> Jed
> 
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> 
> int main(int argc,char **argv)
> {
>   int        ierr,i,size,rank;
>   int        cnt = 433438806;
>   MPI_Status status;
>   long long  *cols;
> 
>   MPI_Init(&argc,&argv);
>   ierr = MPI_Comm_size(MPI_COMM_WORLD,&size);
>   ierr = MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>   if (size != 2) {
>     fprintf(stderr,"[%d] usage: mpiexec -n 2 %s\n",rank,argv[0]);
>     MPI_Abort(MPI_COMM_WORLD,1);
>   }
> 
>   cols = malloc(cnt*sizeof(long long));
>   for (i=0; i<cnt; i++) cols[i] = rank;
>   if (rank == 0) {
>     printf("[%d] sending...\n",rank);
>     ierr = MPI_Send(cols,cnt,MPI_LONG_LONG_INT,1,0,MPI_COMM_WORLD);
>   } else {
>     printf("[%d] receiving...\n",rank);
>     ierr = MPI_Recv(cols,cnt,MPI_LONG_LONG_INT,0,0,MPI_COMM_WORLD,&status);
>     ierr = MPI_Get_count(&status,MPI_LONG_LONG_INT,&cnt);
>     printf("[%d] count %d, cols[0] %lld\n",rank,cnt,cols[0]);
>   }
>   ierr = MPI_Finalize();
>   return 0;
> }
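
Side note: the bogus count above is exactly what you would expect if a 32-bit
byte count overflows somewhere underneath MPI_Get_count -- that is an
assumption on my part, not verified against the source, but the arithmetic
lines up:

/* Illustrative overflow check (not part of Jed's test program):
 * 433438806 long longs is 3467510448 bytes, which wraps past INT_MAX;
 * dividing the wrapped value back by 8 gives the -103432106 printed above. */
#include <stdio.h>

int main(void)
{
  long long bytes   = 433438806LL * 8;        /* 3467510448 bytes on the wire    */
  long long wrapped = bytes - 4294967296LL;   /* what a 32-bit counter would see */
  printf("%lld -> %lld -> count %lld\n", bytes, wrapped, wrapped / 8);
  return 0;
}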


-- 
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] libtool compile error

2010-02-08 Thread Jeff Squyres
You shouldn't need to do this in a tarball build.

Did you run autogen manually, or did you just untar the OMPI tarball and run 
configure / make?


On Feb 6, 2010, at 10:49 AM, Caciano Machado wrote:

> Hi,
> 
> You can solve this by installing libtool 2.2.6b and running autogen.sh.
> 
> Regards,
> Caciano Machado
> 
> On Thu, Feb 4, 2010 at 8:25 PM, Peter C. Lichtner  wrote:
> > I'm trying to compile openmpi-1.4.1 on MacOSX 10.5.8 using Absoft Fortran
> > 90 11.0 and gcc --version i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple
> > Inc. build 5493). I get the following error:
> >
> > make
> > ...
> >
> > Making all in mca/io/romio
> > Making all in romio
> > Making all in include
> > make[4]: Nothing to be done for `all'.
> > Making all in adio
> > Making all in common
> > /bin/sh ../../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.
> > -I../../adio/include  -DOMPI_BUILDING=1
> > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/../../../../..
> > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/../../../../../opal/include
> > -I../../../../../../../opal/include -I../../../../../../../ompi/include
> > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/include
> > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/adio/include
> > -D_REENTRANT  -O3 -DNDEBUG -finline-functions -fno-strict-aliasing
> > -DHAVE_ROMIOCONF_H -DHAVE_ROMIOCONF_H  -I../../include -MT ad_aggregate.lo
> > -MD -MP -MF .deps/ad_aggregate.Tpo -c -o ad_aggregate.lo ad_aggregate.c
> > ../../libtool: line 460: CDPATH: command not found
> > /Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/libtool: line
> > 460: CDPATH: command not found
> > /Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/libtool: line
> > 1138: func_opt_split: command not found
> > libtool: Version mismatch error.  This is libtool 2.2.6b, but the
> > libtool: definition of this LT_INIT comes from an older release.
> > libtool: You should recreate aclocal.m4 with macros from libtool 2.2.6b
> > libtool: and run autoconf again.
> > make[5]: *** [ad_aggregate.lo] Error 63
> > make[4]: *** [all-recursive] Error 1
> > make[3]: *** [all-recursive] Error 1
> > make[2]: *** [all-recursive] Error 1
> > make[1]: *** [all-recursive] Error 1
> > make: *** [all-recursive] Error 1
> >
> > Any help appreciated.
> > ...Peter
> >
> >


-- 
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] openmpi fails to terminate for errors/signals on some but not all processes

2010-02-08 Thread Laurence Marks
Correction on a correction: I did not goof; however, zombies remaining is
not a reproducible problem, but it can occur.

On Mon, Feb 8, 2010 at 2:34 PM, Laurence Marks  wrote:
> I goofed, openmpi does trap these errors but the system I tested them
> on had a very sluggish response. However, an end-of-file is NOT
> trapped.
>
> On Mon, Feb 8, 2010 at 1:29 PM, Laurence Marks  
> wrote:
>> This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie
>> mpi processes", but it is more severe than this.
>>
>> Sorry, but it appears that at least with ifort most run-time errors
>> and signals will leave zombie processes behind with openmpi if they
>> only occur on some of the processors and not all. You can test this
>> with the attached using (for instance)
>>
>> mpicc -c doraise.c
>> mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3
>>
>> Then, as appropriate mpirun -np 8 crash_test
>>
>> The output is self-explanatory, and has an option to both try and
>> simulate common fortran problems as well as to send fortran or C
>> signals to the process. Please note that the results can be dependent
>> upon the level of optimization, and with other compilers there could
>> be problems where the compiler complains about SIGSEGV or other errors
>> since the code deliberately tries to create these.
>>
>> --
>> Laurence Marks
>> Department of Materials Science and Engineering
>> MSE Rm 2036 Cook Hall
>> 2220 N Campus Drive
>> Northwestern University
>> Evanston, IL 60208, USA
>> Tel: (847) 491-3996 Fax: (847) 491-7820
>> email: L-marks at northwestern dot edu
>> Web: www.numis.northwestern.edu
>> Chair, Commission on Electron Crystallography of IUCR
>> www.numis.northwestern.edu/
>> Electron crystallography is the branch of science that uses electron
>> scattering and imaging to study the structure of matter.
>>
>
>
>
> --
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Electron crystallography is the branch of science that uses electron
> scattering and imaging to study the structure of matter.
>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.


Re: [OMPI users] openmpi fails to terminate for errors/signals on some but not all processes

2010-02-08 Thread Laurence Marks
I goofed, openmpi does trap these errors but the system I tested them
on had a very sluggish response. However, an end-of-file is NOT
trapped.

On Mon, Feb 8, 2010 at 1:29 PM, Laurence Marks  wrote:
> This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie
> mpi processes", but it is more severe than this.
>
> Sorry, but it appears that at least with ifort most run-time errors
> and signals will leave zombie processes behind with openmpi if they
> only occur on some of the processors and not all. You can test this
> with the attached using (for instance)
>
> mpicc -c doraise.c
> mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3
>
> Then, as appropriate mpirun -np 8 crash_test
>
> The output is self-explanatory, and has an option to both try and
> simulate common fortran problems as well as to send fortran or C
> signals to the process. Please note that the results can be dependent
> upon the level of optimization, and with other compilers there could
> be problems where the compiler complains about SIGSEGV or other errors
> since the code deliberately tries to create these.
>
> --
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Electron crystallography is the branch of science that uses electron
> scattering and imaging to study the structure of matter.
>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.


Re: [OMPI users] Executing of external programs

2010-02-08 Thread Jeff Squyres
On Feb 8, 2010, at 2:34 PM, Lubomir Klimes wrote:

> I am new to the world of MPI and I would like to ask you for help. For my 
> diploma thesis I need to write a program in C++ using MPI that will execute 
> another external program - the optimization software GAMS. My question is 
> whether it is sufficient to simply use the command system(); to execute GAMS. 
> In other words, will the external program "work" in parallel?

It depends on what you mean, and what your system setup is.

Calling system() may (will) cause problems if you're using a Myrinet or 
OpenFabrics-based network in MPI (for deep, dark, voodoo reasons -- we can 
explain if you care).  If you're using TCP, you should likely be fine -- but be 
aware that your resulting program may not be portable.

Calling system() in your MPI application will effectively fork/exec the 
specified command.  Hence, if you "mpirun -np 8 a.out", and a.out calls 
system("foo"), you'll get 8 copies of foo running independently of each other.  
If your project is supposed to parallelize foo, then it depends on the input / 
computation / output of foo as to whether this is a good approach.
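
For instance, here is a minimal sketch of that pattern -- the GAMS command
line and the file names are made up for illustration, not taken from GAMS
documentation:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int  rank;
  char cmd[256];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Each rank fork/execs its own copy of the external program; give every
   * rank a different input file so the N copies actually do different work. */
  snprintf(cmd, sizeof(cmd), "gams model_%d.gms > gams_%d.log 2>&1", rank, rank);
  if (system(cmd) != 0)
    fprintf(stderr, "[%d] command failed: %s\n", rank, cmd);

  MPI_Finalize();
  return 0;
}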

That being said, if you're just using MPI effectively as a launcher to launch N 
copies of foo, note that you can use Open MPI's "mpirun" to launch non-MPI 
applications (e.g., "mpirun -np 4 hostname").

> If the answer is 'Yes', does someone know whether it will also work with 
> LAM/MPI instead of OpenMPI?

As a former developer of LAM/MPI, I can pretty confidently say that, just as 
Mac replied to your initial question on the LAM/MPI list: LAM/MPI is pretty 
much dead.  If you're just starting with MPI, you're much better off starting 
with Open MPI than LAM/MPI.  :-)

-- 
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Similar question about MPI_Create_type

2010-02-08 Thread Jed Brown
On Mon, 08 Feb 2010 14:42:15 -0500, Prentice Bisbal  wrote:
> I'll give that a try, too. IMHO, MPI_Pack/Unpack looks easier and less
> error prone, but Pacheco advocates using derived types over
> MPI_Pack/Unpack.

I would recommend using derived types for big structures, or perhaps for
long-lived medium-sized structures.  If your structure is static
(i.e. doesn't contain pointers), then derived types definitely make
sense and allow you to use that type in collectives.
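
For a static struct -- say, a fixed-size coords array -- a sketch of what that
looks like (the helper name is mine; displacements are taken relative to the
start of the struct, so the committed type works for any instance, including
in collectives):

#include <mpi.h>

typedef struct { int index; int coords[3]; } point3;   /* no pointers */

static MPI_Datatype make_point3_type(void)
{
  int          blocklens[2] = {1, 3};
  MPI_Datatype types[2]     = {MPI_INT, MPI_INT};
  MPI_Aint     base, disps[2];
  point3       dummy;
  MPI_Datatype t;

  /* Compute member offsets from a dummy instance. */
  MPI_Get_address(&dummy, &base);
  MPI_Get_address(&dummy.index, &disps[0]);
  MPI_Get_address(dummy.coords, &disps[1]);
  disps[0] -= base;
  disps[1] -= base;

  MPI_Type_create_struct(2, blocklens, disps, types, &t);
  MPI_Type_commit(&t);
  return t;
}

/* e.g.  point3 pt;  MPI_Datatype t = make_point3_type();
 *       MPI_Bcast(&pt, 1, t, 0, MPI_COMM_WORLD);  MPI_Type_free(&t);  */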

> In my situation, rank 0 is reading in a file containing all the coords.
> So even if other ranks don't have the data, I still need to create the
> structure on all the nodes, even if I don't populate it with data?

You're populating it by receiving data.  MPI can't allocate the space
for you, so you have to set it up yourself.

> To clarify: I thought adding a similar structure, b_point in rank 1
> would be adequate to receive the data from rank 0

You have allocated memory by the time you call MPI_Recv, but you were
passing an undefined value to MPI_Address, and you certainly can't base
derived_type on a_point and use it to receive into b_point.

It would be fine to receive into a_point on rank 1, but whatever you do,
derived_type has to be created correctly on each process.

Jed


Re: [OMPI users] Similar question about MPI_Create_type

2010-02-08 Thread Prentice Bisbal


Prentice Bisbal wrote:
> I hit send too early on my last reply, please forgive me...
> 
> Jed Brown wrote:
>> On Mon, 08 Feb 2010 13:54:10 -0500, Prentice Bisbal  wrote:
>>> but I don't have that book handy
>> The standard has lots of examples.
>>
>>   http://www.mpi-forum.org/docs/docs.html
> 
> Thanks, I'll check out those examples.
>> You can do this, but for small structures, you're better off just
>> packing buffers.  For large structures containing variable-size fields,
>> I think it is clearer to use MPI_BOTTOM instead of offsets from an
>> arbitrary (instance-dependent) address.
> 
> I'll give that a try, too. IMHO, MPI_Pack/Unpack looks easier and less
> error prone, but Pacheco advocates using derived types over
> MPI_Pack/Unpack.
> 
>> [...]
>>
>>>   if (rank == 0) {
>>> a_point.index = 1;
>>> a_point.coords = malloc(3 * sizeof(int));
>>> a_point.coords[0] = 3;
>>> a_point.coords[1] = 6;
>>> a_point.coords[2] = 9;
>>>   }
>>>
>>>   block_lengths[0] = 1;
>>>   block_lengths[1] = 3;
>>>
>>>   type_list[0] = MPI_INT;
>>>   type_list[1] = MPI_INT;
>>>
>>>   displacements[0] = 0;
>>>   MPI_Address(&a_point.index, &start_address);
>>>   MPI_Address(a_point.coords, &address);
>> ^^
>>
>> Rank 1 has not allocated this yet.
> 
> I'm glad you brought that up. I wanted to ask about that:
> 
> In my situation, rank 0 is reading in a file containing all the coords.
> So even if other ranks don't have the data, I still need to create the
> structure on all the nodes, even if I don't populate it with data?

To clarify: I thought adding a similar structure, b_point in rank 1
would be adequate to receive the data from rank 0.

-- 
Prentice


[OMPI users] Executing of external programs

2010-02-08 Thread Lubomir Klimes
Hi,

I am new to the world of MPI and I would like to ask you for help. For my
diploma thesis I need to write a program in C++ using MPI that will execute
another external program - the optimization software GAMS. My question is
whether it is sufficient to simply use the command system(); to execute GAMS.
In other words, will the external program "work" in parallel? If the
answer is 'Yes', does someone know whether it will also work with LAM/MPI
instead of OpenMPI?

Thank you for the answer.

Best regards,
Lubajz


[OMPI users] openmpi fails to terminate for errors/signals on some but not all processes

2010-02-08 Thread Laurence Marks
This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie
mpi processes", but it is more severe than this.

Sorry, but it appears that at least with ifort most run-time errors
and signals will leave zombie processes behind with openmpi if they
only occur on some of the processors and not all. You can test this
with the attached using (for instance)

mpicc -c doraise.c
mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3

Then, as appropriate mpirun -np 8 crash_test

The output is self-explanatory, and has an option to both try and
simulate common fortran problems as well as to send fortran or C
signals to the process. Please note that the results can be dependent
upon the level of optimization, and with other compilers there could
be problems where the compiler complains about SIGSEGV or other errors
since the code deliberately tries to create these.

-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.
#include <stdio.h>
#include <signal.h>

void doraise(isig)
long isig[1] ;
{
   int i ;
   i = isig[0];
   raise( i );   /* signal i is raised */
}

void doraise_(isig)
long isig[1] ;
{
   doraise(isig) ;
}

void whatsig(isig)
long isig[1] ;
{
   int i ;
   i = isig[0];
   psignal( i , "Testing Signal");
}

void whatsig_(isig)
long isig[1] ;
{
   whatsig(isig) ;
}

void showallsignals()
{
   int i ;
   char buf[32];   /* room for "Signal code NN " plus the terminating NUL */
   for ( i = 1; i < 32; i++ ) {
      sprintf(buf, "Signal code %d ", i);
      psignal( i , buf );
   }
}

void showallsignals_()
{
   showallsignals() ;
}



crash_test.F
Description: Binary data


Re: [OMPI users] Similar question about MPI_Create_type

2010-02-08 Thread Jed Brown
On Mon, 08 Feb 2010 13:54:10 -0500, Prentice Bisbal  wrote:
> but I don't have that book handy

The standard has lots of examples.

  http://www.mpi-forum.org/docs/docs.html

You can do this, but for small structures, you're better off just
packing buffers.  For large structures containing variable-size fields,
I think it is clearer to use MPI_BOTTOM instead of offsets from an
arbitrary (instance-dependent) address.
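
Concretely, a minimal sketch of the MPI_BOTTOM route, assuming the length of
coords is known on both ranks (the helper below is illustrative, not Pacheco's
code):

#include <stdlib.h>
#include <mpi.h>

typedef struct { int index; int *coords; } point;

/* Build a struct type from absolute addresses; every rank must call this
 * with its own, already-allocated point. */
static MPI_Datatype make_point_type(point *p, int ncoords)
{
  int          blocklens[2] = {1, ncoords};
  MPI_Datatype types[2]     = {MPI_INT, MPI_INT};
  MPI_Aint     disps[2];
  MPI_Datatype t;

  MPI_Get_address(&p->index, &disps[0]);
  MPI_Get_address(p->coords, &disps[1]);   /* coords must be malloc'd first */

  MPI_Type_create_struct(2, blocklens, disps, types, &t);
  MPI_Type_commit(&t);
  return t;
}

/* Usage sketch: both ranks allocate coords and build their own type, then
 *   rank 0: MPI_Send(MPI_BOTTOM, 1, t, 1, 0, MPI_COMM_WORLD);
 *   rank 1: MPI_Recv(MPI_BOTTOM, 1, t, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 * and MPI_Type_free(&t) afterwards. */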

[...]

>   if (rank == 0) {
> a_point.index = 1;
> a_point.coords = malloc(3 * sizeof(int));
> a_point.coords[0] = 3;
> a_point.coords[1] = 6;
> a_point.coords[2] = 9;
>   }
> 
>   block_lengths[0] = 1;
>   block_lengths[1] = 3;
> 
>   type_list[0] = MPI_INT;
>   type_list[1] = MPI_INT;
> 
>   displacements[0] = 0;
>   MPI_Address(&a_point.index, &start_address);
>   MPI_Address(a_point.coords, &address);
^^

Rank 1 has not allocated this yet.

Jed


[OMPI users] Similar question about MPI_Create_type

2010-02-08 Thread Prentice Bisbal
Hello again, MPI Users:

This question is similar to my earlier one about MPI_Pack/Unpack.

I'm trying to send the following structure, which has a dynamically
allocated array in it, as an MPI derived type using MPI_Type_create_struct():

typedef struct{
   int index;
   int* coords;
}point;

I would think that this can't be done, since the coords array will not be
contiguous in memory with the rest of the structure, so calculating
the displacements between point.index and point.coords will be
meaningless. However, I'm pretty sure that Pacheco's book implies that
this can be done (I'd list the exact page(s), but I don't have that book
handy).

Am I wrong or right?

Below my signature is the code I'm using to test this, which fails as
I'd expect. Is my thinking right, or is my program wrong? When I run the
program I get this error:

 *** An error occurred in MPI_Address
 *** on communicator MPI_COMM_WORLD
 *** MPI_ERR_ARG: invalid argument of some other kind
 *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 28286 on node juno.sns.ias.edu
exited on signal 15 (Terminated).

-- 
Prentice

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int rank;
MPI_Status status;
int size;
int tag;

typedef struct{
  int index;
  int* coords;
}point;

int block_lengths[2];
MPI_Datatype type_list[2];
MPI_Aint displacements[2];
MPI_Aint start_address;
MPI_Aint address;
MPI_Datatype derived_point;
point a_point, b_point;

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
a_point.index = 1;
a_point.coords = malloc(3 * sizeof(int));
a_point.coords[0] = 3;
a_point.coords[1] = 6;
a_point.coords[2] = 9;
  }

  block_lengths[0] = 1;
  block_lengths[1] = 3;

  type_list[0] = MPI_INT;
  type_list[1] = MPI_INT;

  displacements[0] = 0;
  MPI_Address(&a_point.index, &start_address);
  MPI_Address(a_point.coords, &address);
  displacements[1] = address - start_address;

  MPI_Type_create_struct(2, block_lengths, displacements, type_list,
    &derived_point);
  MPI_Type_commit(&derived_point);

  if (rank == 0) {
    MPI_Send(&a_point, 1, derived_point, 1, 0, MPI_COMM_WORLD);
  }
  if (rank == 1) {
b_point.coords = malloc(3 *sizeof(int));
    MPI_Recv(&b_point, 1, derived_point, 0, 0, MPI_COMM_WORLD, &status);
printf("b_point.index = %i\n", b_point.index);
printf("b_point.coords:(%i, %i, %i)\n", b_point.coords[0],
b_point.coords[1], b_point.coords[2]);

  }
  MPI_Finalize();
  exit(0);
}




[OMPI users] openmpi-default-hostfile

2010-02-08 Thread Benjamin Gaudio
I'm using ClusterTools 8.2.1 on Solaris 10 and according to the HPC
docs,

"Open MPI includes a commented default hostfile at
/opt/SUNWhpc/HPC8.2/etc/openmpi-default-hostfile. Unless you
specify
a different hostfile at a different location, this is the hostfile
that OpenMPI uses."

I have added my list of hosts to that file. If I don't specify a
hostfile in the mpirun command, it doesn't use any of the hosts in
the file; it just runs everything on the node that I run the command
on. However, if I explicitly pass the hostfile to the mpirun command
with -hostfile /opt/SUNWhpc/HPC8.2.1/etc/openmpi-default-hostfile,
then it works as it should. So, I have come to the conclusion that
mpirun is not reading my default hostfile for some reason. Is there a
way to figure out why?

Benj


Re: [OMPI users] Difficulty with MPI_Unpack

2010-02-08 Thread Prentice Bisbal
Jed Brown wrote:
> On Sun, 07 Feb 2010 22:40:55 -0500, Prentice Bisbal  wrote:
>> Hello, everyone. I'm having trouble packing/unpacking this structure:
>>
>> typedef struct{
>>   int index;
>>   int* coords;
>> }point;
>>
>> The size of the coords array is not known a priori, so it needs to be a
>> dynamic array. I'm trying to send it from one node to another using
>> MPI_Pack/MPI_Unpack as shown below. When I unpack it, I get this error
>> when unpacking the coords array:
>>
>> [fatboy:07360] *** Process received signal ***
>> [fatboy:07360] Signal: Segmentation fault (11)
>> [fatboy:07360] Signal code: Address not mapped (1)
>> [fatboy:07360] Failing at address: (nil)
> 
> Looks like b_point.coords = NULL.  Has this been allocated on rank=1?

Yep, that was the problem. I left that out. I can't believe I overlooked
something so obvious. Thanks for the code review. Thanks to Brian
Austin, too,  who also found that mistake.

> 
> You might need to use MPI_Get_count to decide how much to allocate.
> Also, if you don't have a convenient upper bound on the size of the
> receive buffer, you can use MPI_Probe followed by MPI_Get_count to
> determine this before calling MPI_Recv.

Thanks for the tip. I'll take a look at those functions.

-- 
Prentice


Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-08 Thread Joshua Hursey
You can use the 'checkpoint to local disk' example to checkpoint and restart 
without access to a globally shared storage device. There is an example on the 
website that does not use a globally mounted file system:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local

What version of Open MPI are you using? This functionality is known to be 
broken on the v1.3/1.4 branches, per the ticket below:
  https://svn.open-mpi.org/trac/ompi/ticket/2139

Try the nightly snapshot of the 1.5 branch or the development trunk, and see if 
this issue still occurs.

-- Josh

On Feb 8, 2010, at 8:35 AM, Andreea Costea wrote:

> I asked this question because checkpointing to NFS is successful, but 
> checkpointing without a mounted filesystem or shared storage throws this 
> warning:
> 
> WARNING: Could not preload specified file: File already exists. 
> Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0 
> Host: X 
> 
> Will continue attempting to launch the process. 
> 
> 
> filem:rsh: wait_all(): Wait failed (-1) 
> [[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054 
> 
> even if I set the mca-parameters like this:
> snapc_base_store_in_place=0
> 
> crs_base_snapshot_dir
> =/home/andreea/checkpoints/local
> 
> snapc_base_global_snapshot_dir
> =/home/andreea/checkpoints/global
> and the nodes can connect through ssh without a password. 
> 
> Thanks,
> Andreea
> 
> On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea  
> wrote:
> Hi,
> 
> Let's say I have an MPI application running on several hosts. Is there any 
> way to checkpoint this application without having a shared storage between 
> the nodes?
> I already took a look at the examples here 
> http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in 
> both cases there is a globally mounted file system. 
> 
> Thanks,
> Andreea
> 
> 




Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-08 Thread Andreea Costea
I asked this question because checkpointing to NFS is successful, but
checkpointing without a mounted filesystem or shared storage throws this
warning:

WARNING: Could not preload specified file: File already exists.
Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0
Host: X

Will continue attempting to launch the process.


filem:rsh: wait_all(): Wait failed (-1)
[[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054


even if I set the mca-parameters like this:

snapc_base_store_in_place=0
crs_base_snapshot_dir=/home/andreea/checkpoints/local
snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global

and the nodes can connect through ssh without a password.

Thanks,
Andreea

On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea wrote:

> Hi,
>
> Let's say I have an MPI application running on several hosts. Is there any
> way to checkpoint this application without having a shared storage between
> the nodes?
> I already took a look at the examples here
> http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that
> in both cases there is a globally mounted file system.
>
> Thanks,
> Andreea
>
>


Re: [OMPI users] Difficulty with MPI_Unpack

2010-02-08 Thread Jed Brown
On Sun, 07 Feb 2010 22:40:55 -0500, Prentice Bisbal  wrote:
> Hello, everyone. I'm having trouble packing/unpacking this structure:
> 
> typedef struct{
>   int index;
>   int* coords;
> }point;
> 
> The size of the coords array is not known a priori, so it needs to be a
> dynamic array. I'm trying to send it from one node to another using
> MPI_Pack/MPI_Unpack as shown below. When I unpack it, I get this error
> when unpacking the coords array:
> 
> [fatboy:07360] *** Process received signal ***
> [fatboy:07360] Signal: Segmentation fault (11)
> [fatboy:07360] Signal code: Address not mapped (1)
> [fatboy:07360] Failing at address: (nil)

Looks like b_point.coords = NULL.  Has this been allocated on rank=1?

You might need to use MPI_Get_count to decide how much to allocate.
Also, if you don't have a convenient upper bound on the size of the
receive buffer, you can use MPI_Probe followed by MPI_Get_count to
determine this before calling MPI_Recv.
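
Something along these lines, as a sketch, assuming the coords travel as a
plain MPI_INT message:

#include <stdlib.h>
#include <mpi.h>

/* Receive-side pattern: probe first, size the buffer from the status. */
int *recv_coords(int src, int tag, int *n_out)
{
  MPI_Status status;
  int n, *coords;

  MPI_Probe(src, tag, MPI_COMM_WORLD, &status);   /* wait until a matching message is pending */
  MPI_Get_count(&status, MPI_INT, &n);            /* element count of that pending message    */
  coords = malloc(n * sizeof(int));               /* allocate exactly enough                  */
  MPI_Recv(coords, n, MPI_INT, src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  *n_out = n;
  return coords;
}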

Jed


Re: [OMPI users] Problems building Open MPI 1.4.1 with Pathscale

2010-02-08 Thread Rafael Arco Arredondo
Hello,

It does work with version 1.4. This is the hello world that hangs with
1.4.1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int node, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &node);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Hello World from Node %d of %d.\n", node, size);

  MPI_Finalize();
  return 0;
}

On Tue, 26-01-2010 at 03:57 -0500, Åke Sandgren wrote:
> 1 - Do you have problems with openmpi 1.4 too? (I don't, haven't built
> 1.4.1 yet)
> 2 - There is a bug in the pathscale compiler with -fPIC and -g that
> generates incorrect dwarf2 data so debuggers get really confused and
> will have BIG problems debugging the code. I'm chasing them to get a
> fix...
> 3 - Do you have an example code that has problems?

-- 
Rafael Arco Arredondo
Centro de Servicios de Informática y Redes de Comunicaciones
Universidad de Granada