Re: [OMPI users] [mpich-discuss] problem with MPI_Get_count() for very long (but legal length) messages.
FWIW, I filed https://svn.open-mpi.org/trac/ompi/ticket/2241 about this. Thanks Jed! On Feb 6, 2010, at 10:56 AM, Jed Brown wrote: > On Fri, 5 Feb 2010 14:28:40 -0600, Barry Smith wrote: > > To cheer you up, when I run with openMPI it runs forever sucking down > > 100% CPU trying to send the messages :-) > > On my test box (x86 with 8GB memory), Open MPI (1.4.1) does complete > after several seconds, but still prints the wrong count. > > MPICH2 does not actually send the message, as you can see by running the > attached code. > > # Open MPI 1.4.1, correct cols[0] > [0] sending... > [1] receiving... > count -103432106, cols[0] 0 > > # MPICH2 1.2.1, incorrect cols[0] > [1] receiving... > [0] sending... > [1] count -103432106, cols[0] 1 > > > How much memory does crush have (you need about 7GB to do this without > swapping)? In particular, most of the time it took Open MPI to send the > message (with your source) was actually just spent faulting the > send/recv buffers. The attached faults the buffers first, and the > subsequent send/recv takes less than 2 seconds. > > Actually, it's clear that MPICH2 never touches either buffer because it > returns immediately regardless of whether they have been faulted first. 
> 
> Jed
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> 
> int main(int argc,char **argv)
> {
>   int ierr,i,size,rank;
>   int cnt = 433438806;
>   MPI_Status status;
>   long long *cols;
> 
>   MPI_Init(&argc,&argv);
>   ierr = MPI_Comm_size(MPI_COMM_WORLD,&size);
>   ierr = MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>   if (size != 2) {
>     fprintf(stderr,"[%d] usage: mpiexec -n 2 %s\n",rank,argv[0]);
>     MPI_Abort(MPI_COMM_WORLD,1);
>   }
> 
>   cols = malloc(cnt*sizeof(long long));
>   for (i=0; i<cnt; i++) cols[i] = rank;  /* fault the buffers first */
>   if (rank == 0) {
>     printf("[%d] sending...\n",rank);
>     ierr = MPI_Send(cols,cnt,MPI_LONG_LONG_INT,1,0,MPI_COMM_WORLD);
>   } else {
>     printf("[%d] receiving...\n",rank);
>     ierr = MPI_Recv(cols,cnt,MPI_LONG_LONG_INT,0,0,MPI_COMM_WORLD,&status);
>     ierr = MPI_Get_count(&status,MPI_LONG_LONG_INT,&cnt);
>     printf("[%d] count %d, cols[0] %lld\n",rank,cnt,cols[0]);
>   }
>   ierr = MPI_Finalize();
>   return 0;
> }
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] libtool compile error
You shouldn't need to do this in a tarball build. Did you run autogen manually, or did you just untar the OMPI tarball and configure / make? On Feb 6, 2010, at 10:49 AM, Caciano Machado wrote: > Hi, > > You can solve this by installing libtool 2.2.6b and running autogen.sh. > > Regards, > Caciano Machado > > On Thu, Feb 4, 2010 at 8:25 PM, Peter C. Lichtner wrote: > > I'm trying to compile openmpi-1.4.1 on MacOSX 10.5.8 using Absoft Fortran > > 90 11.0 and gcc --version i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple > > Inc. build 5493). I get the following error: > > > > make > > ... > > > > Making all in mca/io/romio > > Making all in romio > > Making all in include > > make[4]: Nothing to be done for `all'. > > Making all in adio > > Making all in common > > /bin/sh ../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. > > -I../../adio/include -DOMPI_BUILDING=1 > > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/../../../../.. > > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/../../../../../opal/include > > -I../../../../../../../opal/include -I../../../../../../../ompi/include > > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/include > > -I/Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/adio/include > > -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing > > -DHAVE_ROMIOCONF_H -DHAVE_ROMIOCONF_H -I../../include -MT ad_aggregate.lo > > -MD -MP -MF .deps/ad_aggregate.Tpo -c -o ad_aggregate.lo ad_aggregate.c > > ../../libtool: line 460: CDPATH: command not found > > /Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/libtool: line > > 460: CDPATH: command not found > > /Users/lichtner/petsc/openmpi-1.4.1/ompi/mca/io/romio/romio/libtool: line > > 1138: func_opt_split: command not found > > libtool: Version mismatch error. This is libtool 2.2.6b, but the > > libtool: definition of this LT_INIT comes from an older release. 
> > libtool: You should recreate aclocal.m4 with macros from libtool 2.2.6b > > libtool: and run autoconf again. > > make[5]: *** [ad_aggregate.lo] Error 63 > > make[4]: *** [all-recursive] Error 1 > > make[3]: *** [all-recursive] Error 1 > > make[2]: *** [all-recursive] Error 1 > > make[1]: *** [all-recursive] Error 1 > > make: *** [all-recursive] Error 1 > > > > Any help appreciated. > > ...Peter -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openmpi fails to terminate for errors/signals on some but not all processes
Correction on a correction: I did not goof; however, zombies remaining is not a reliably reproducible problem, but it can occur. On Mon, Feb 8, 2010 at 2:34 PM, Laurence Marks wrote: > I goofed, openmpi does trap these errors but the system I tested them > on had a very sluggish response. However, an end-of-file is NOT > trapped. > > On Mon, Feb 8, 2010 at 1:29 PM, Laurence Marks > wrote: >> This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie >> mpi processes", but it is more severe than this. >> >> Sorry, but it appears that at least with ifort most run-time errors >> and signals will leave zombie processes behind with openmpi if they >> only occur on some of the processors and not all. You can test this >> with the attached using (for instance) >> >> mpicc -c doraise.c >> mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3 >> >> Then, as appropriate mpirun -np 8 crash_test >> >> The output is self-explanatory, and has options both to simulate >> common fortran problems and to send fortran or C signals to the >> process. Please note that the results can be dependent upon the level >> of optimization, and with other compilers there could be problems >> where the compiler complains about SIGSEGV or other errors since the >> code deliberately tries to create these. >> >> -- >> Laurence Marks >> Department of Materials Science and Engineering >> MSE Rm 2036 Cook Hall >> 2220 N Campus Drive >> Northwestern University >> Evanston, IL 60208, USA >> Tel: (847) 491-3996 Fax: (847) 491-7820 >> email: L-marks at northwestern dot edu >> Web: www.numis.northwestern.edu >> Chair, Commission on Electron Crystallography of IUCR >> www.numis.northwestern.edu/ >> Electron crystallography is the branch of science that uses electron >> scattering and imaging to study the structure of matter. 
Re: [OMPI users] openmpi fails to terminate for errors/signals on some but not all processes
I goofed, openmpi does trap these errors but the system I tested them on had a very sluggish response. However, an end-of-file is NOT trapped. On Mon, Feb 8, 2010 at 1:29 PM, Laurence Marks wrote: > This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie > mpi processes", but it is more severe than this. > > Sorry, but it appears that at least with ifort most run-time errors > and signals will leave zombie processes behind with openmpi if they > only occur on some of the processors and not all. You can test this > with the attached using (for instance) > > mpicc -c doraise.c > mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3 > > Then, as appropriate mpirun -np 8 crash_test > > The output is self-explanatory, and has options both to simulate > common fortran problems and to send fortran or C signals to the > process. Please note that the results can be dependent upon the level > of optimization, and with other compilers there could be problems > where the compiler complains about SIGSEGV or other errors since the > code deliberately tries to create these. 
Re: [OMPI users] Executing of external programs
On Feb 8, 2010, at 2:34 PM, Lubomir Klimes wrote: > I am new to the world of MPI and I would like to ask you for help. In my > diploma thesis I need to write a program in C++ using MPI that will execute > another external program - an optimization software GAMS. My question is > whether it is sufficient to simply use the command system(); for executing GAMS. > In other words, will the external program "work" in parallel? It depends on what you mean, and what your system setup is. Calling system() may (will) cause problems if you're using a Myrinet or OpenFabrics-based network in MPI (for deep, dark, voodoo reasons -- we can explain if you care). If you're using TCP, you should likely be fine -- but be aware that your resulting program may not be portable. Calling system() in your MPI application will effectively fork/exec the specified command. Hence, if you "mpirun -np 8 a.out", and a.out calls system("foo"), you'll get 8 copies of foo running independently of each other. If your project is supposed to parallelize foo, then it depends on the input / computation / output of foo as to whether this is a good approach. That being said, if you're just using MPI effectively as a launcher to launch N copies of foo, note that you can use Open MPI's "mpirun" to launch non-MPI applications (e.g., "mpirun -np 4 hostname"). > If the answer is 'Yes', does someone know whether it will work also with > LAM/MPI instead of OpenMPI? As a former developer of LAM/MPI, I can pretty confidently say that, just like Mac replied to your initial question on the LAM/MPI list: LAM/MPI is pretty much dead. If you're just starting with MPI, you're much better off starting with Open MPI than LAM/MPI. :-) -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Similar question about MPI_Create_type
On Mon, 08 Feb 2010 14:42:15 -0500, Prentice Bisbal wrote: > I'll give that a try, too. IMHO, MPI_Pack/Unpack looks easier and less > error prone, but Pacheco advocates using derived types over > MPI_Pack/Unpack. I would recommend using derived types for big structures, or perhaps for long-lived medium-sized structures. If your structure is static (i.e. doesn't contain pointers), then derived types definitely make sense and allow you to use that type in collectives. > In my situation, rank 0 is reading in a file containing all the coords. > So even if other ranks don't have the data, I still need to create the > structure on all the nodes, even if I don't populate it with data? You're populating it by receiving data. MPI can't allocate the space for you, so you have to set it up. > To clarify: I thought adding a similar structure, b_point in rank 1 > would be adequate to receive the data from rank 0 You have allocated memory by the time you call MPI_Recv, but you were passing an undefined value to MPI_Address, and you certainly can't base derived_type on a_point and use it to receive into b_point. It would be fine to receive into a_point on rank 1, but whatever you do, derived_type has to be created correctly on each process. Jed
Re: [OMPI users] Similar question about MPI_Create_type
Prentice Bisbal wrote: > I hit send too early on my last reply, please forgive me... > > Jed Brown wrote: >> On Mon, 08 Feb 2010 13:54:10 -0500, Prentice Bisbal wrote: >>> but I don't have that book handy >> The standard has lots of examples. >> >> http://www.mpi-forum.org/docs/docs.html > > Thanks, I'll check out those examples. >> You can do this, but for small structures, you're better off just >> packing buffers. For large structures containing variable-size fields, >> I think it is clearer to use MPI_BOTTOM instead of offsets from an >> arbitrary (instance-dependent) address. > > I'll give that a try, too. IMHO, MPI_Pack/Unpack looks easier and less > error prone, but Pacheco advocates using derived types over > MPI_Pack/Unpack. > >> [...] >> >>> if (rank == 0) { >>> a_point.index = 1; >>> a_point.coords = malloc(3 * sizeof(int)); >>> a_point.coords[0] = 3; >>> a_point.coords[1] = 6; >>> a_point.coords[2] = 9; >>> } >>> >>> block_lengths[0] = 1; >>> block_lengths[1] = 3; >>> >>> type_list[0] = MPI_INT; >>> type_list[1] = MPI_INT; >>> >>> displacements[0] = 0; >>> MPI_Address(&a_point.index, &start_address); >>> MPI_Address(a_point.coords, &address); >> ^^ >> >> Rank 1 has not allocated this yet. > > I'm glad you brought that up. I wanted to ask about that: > > In my situation, rank 0 is reading in a file containing all the coords. > So even if other ranks don't have the data, I still need to create the > structure on all the nodes, even if I don't populate it with data? To clarify: I thought adding a similar structure, b_point in rank 1 would be adequate to receive the data from rank 0. -- Prentice
[OMPI users] Executing of external programs
Hi, I am new to the world of MPI and I would like to ask you for help. In my diploma thesis I need to write a program in C++ using MPI that will execute another external program - an optimization software GAMS. My question is whether it is sufficient to simply use the command system(); for executing GAMS. In other words, will the external program "work" in parallel? If the answer is 'Yes', does someone know whether it will work also with LAM/MPI instead of OpenMPI? Thank you for the answer. Best regards, Lubajz
[OMPI users] openmpi fails to terminate for errors/signals on some but not all processes
This was "Re: [OMPI users] Trapping fortran I/O errors leaving zombie mpi processes", but it is more severe than this. Sorry, but it appears that at least with ifort most run-time errors and signals will leave zombie processes behind with openmpi if they only occur on some of the processors and not all. You can test this with the attached using (for instance)

mpicc -c doraise.c
mpif90 -o crash_test crash_test.F doraise.o -FR -xHost -O3

Then, as appropriate, mpirun -np 8 crash_test

The output is self-explanatory, and has options both to simulate common fortran problems and to send fortran or C signals to the process. Please note that the results can be dependent upon the level of optimization, and with other compilers there could be problems where the compiler complains about SIGSEGV or other errors since the code deliberately tries to create these.

doraise.c:

#include <signal.h>
#include <stdio.h>

void doraise(isig)
long isig[1] ;
{
  int i ;
  i = isig[0];
  raise( i );  /* signal i is raised */
}

void doraise_(isig)
long isig[1] ;
{
  doraise(isig) ;
}

void whatsig(isig)
long isig[1] ;
{
  int i ;
  i = isig[0];
  psignal( i , "Testing Signal");
}

void whatsig_(isig)
long isig[1] ;
{
  whatsig(isig) ;
}

void showallsignals()
{
  int i ;
  char buf[32];  /* large enough for "Signal code NN " plus the terminator */
  for ( i = 1; i < 32; i++ ) {
    sprintf(buf, "Signal code %d ", i);
    psignal( i , buf );
  }
}

void showallsignals_()
{
  showallsignals() ;
}

[attachment: crash_test.F, binary data]
Re: [OMPI users] Similar question about MPI_Create_type
On Mon, 08 Feb 2010 13:54:10 -0500, Prentice Bisbal wrote: > but I don't have that book handy The standard has lots of examples. http://www.mpi-forum.org/docs/docs.html You can do this, but for small structures, you're better off just packing buffers. For large structures containing variable-size fields, I think it is clearer to use MPI_BOTTOM instead of offsets from an arbitrary (instance-dependent) address. [...] > if (rank == 0) { > a_point.index = 1; > a_point.coords = malloc(3 * sizeof(int)); > a_point.coords[0] = 3; > a_point.coords[1] = 6; > a_point.coords[2] = 9; > } > > block_lengths[0] = 1; > block_lengths[1] = 3; > > type_list[0] = MPI_INT; > type_list[1] = MPI_INT; > > displacements[0] = 0; > MPI_Address(&a_point.index, &start_address); > MPI_Address(a_point.coords, &address); ^^ Rank 1 has not allocated this yet. Jed
[OMPI users] Similar question about MPI_Create_type
Hello again, MPI Users: This question is similar to my earlier one about MPI_Pack/Unpack. I'm trying to send the following structure, which has a dynamically allocated array in it, as an MPI derived type using MPI_Type_create_struct(): typedef struct{ int index; int* coords; }point; I would think that this can't be done since the coords array will not be contiguous in memory with the rest of the structure, so calculating the displacements between point.index and point.coords will be meaningless. However, I'm pretty sure that Pacheco's book implies that this can be done (I'd list the exact page(s), but I don't have that book handy). Am I wrong or right? Below my signature is the code I'm using to test this, which fails as I'd expect. Is my thinking right, or is my program wrong? When I run the program I get this error: *** An error occurred in MPI_Address *** on communicator MPI_COMM_WORLD *** MPI_ERR_ARG: invalid argument of some other kind *** MPI_ERRORS_ARE_FATAL (goodbye) mpirun noticed that job rank 0 with PID 28286 on node juno.sns.ias.edu exited on signal 15 (Terminated). 
-- Prentice

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int rank;
MPI_Status status;
int size;
int tag;

typedef struct{
  int index;
  int* coords;
}point;

int block_lengths[2];
MPI_Datatype type_list[2];
MPI_Aint displacements[2];
MPI_Aint start_address;
MPI_Aint address;
MPI_Datatype derived_point;
point a_point, b_point;

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    a_point.index = 1;
    a_point.coords = malloc(3 * sizeof(int));
    a_point.coords[0] = 3;
    a_point.coords[1] = 6;
    a_point.coords[2] = 9;
  }

  block_lengths[0] = 1;
  block_lengths[1] = 3;

  type_list[0] = MPI_INT;
  type_list[1] = MPI_INT;

  displacements[0] = 0;
  MPI_Address(&a_point.index, &start_address);
  MPI_Address(a_point.coords, &address);
  displacements[1] = address - start_address;

  MPI_Type_create_struct(2, block_lengths, displacements, type_list, &derived_point);
  MPI_Type_commit(&derived_point);

  if (rank == 0) {
    MPI_Send(&a_point, 1, derived_point, 1, 0, MPI_COMM_WORLD);
  }

  if (rank == 1) {
    b_point.coords = malloc(3 * sizeof(int));
    MPI_Recv(&b_point, 1, derived_point, 0, 0, MPI_COMM_WORLD, &status);
    printf("b_point.index = %i\n", b_point.index);
    printf("b_point.coords:(%i, %i, %i)\n", b_point.coords[0], b_point.coords[1], b_point.coords[2]);
  }

  MPI_Finalize();
  exit(0);
}
[OMPI users] openmpi-default-hostfile
I'm using ClusterTools 8.2.1 on Solaris 10 and according to the HPC docs, "Open MPI includes a commented default hostfile at /opt/SUNWhpc/HPC8.2/etc/openmpi-default-hostfile. Unless you specify a different hostfile at a different location, this is the hostfile that OpenMPI uses." I have added my list of hosts to that file. If I don't specify a hostfile in the mpirun command, it doesn't use any of the hosts in the file, it just runs everything on the node that I run the command on. However, if I explicitly specify the hostfile in the mpirun command with -hostfile /opt/SUNWhpc/HPC8.2.1/etc/openmpi-default-hostfile, then it works as it should. So, I have come to the conclusion that mpirun is not reading my default file for some reason. Is there a way to figure out why? Benj
Re: [OMPI users] Difficulty with MPI_Unpack
Jed Brown wrote: > On Sun, 07 Feb 2010 22:40:55 -0500, Prentice Bisbal wrote: >> Hello, everyone. I'm having trouble packing/unpacking this structure: >> >> typedef struct{ >> int index; >> int* coords; >> }point; >> >> The size of the coords array is not known a priori, so it needs to be a >> dynamic array. I'm trying to send it from one node to another using >> MPI_Pack/MPI_Unpack as shown below. When I unpack it, I get this error >> when unpacking the coords array: >> >> [fatboy:07360] *** Process received signal *** >> [fatboy:07360] Signal: Segmentation fault (11) >> [fatboy:07360] Signal code: Address not mapped (1) >> [fatboy:07360] Failing at address: (nil) > > Looks like b_point.coords = NULL. Has this been allocated on rank=1? Yep, that was the problem. I left that out. I can't believe I overlooked something so obvious. Thanks for the code review. Thanks to Brian Austin, too, who also found that mistake. > > You might need to use MPI_Get_count to decide how much to allocate. > Also, if you don't have a convenient upper bound on the size of the > receive buffer, you can use MPI_Probe followed by MPI_Get_count to > determine this before calling MPI_Recv. Thanks for the tip. I'll take a look at those functions. -- Prentice
Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes
You can use the 'checkpoint to local disk' example to checkpoint and restart without access to a globally shared storage device. There is an example on the website that does not use a globally mounted file system: http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local What version of Open MPI are you using? This functionality is known to be broken on the v1.3/1.4 branches, per the ticket below: https://svn.open-mpi.org/trac/ompi/ticket/2139 Try the nightly snapshot of the 1.5 branch or the development trunk, and see if this issue still occurs. -- Josh On Feb 8, 2010, at 8:35 AM, Andreea Costea wrote: > I asked this question because checkpointing to NFS is successful, but > checkpointing without a mounted filesystem or a shared storage throws this > warning: > > WARNING: Could not preload specified file: File already exists. > Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0 > Host: X > > Will continue attempting to launch the process. > > > filem:rsh: wait_all(): Wait failed (-1) > [[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054 > > even if I set the mca-parameters like this: > snapc_base_store_in_place=0 > crs_base_snapshot_dir=/home/andreea/checkpoints/local > snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global > and the nodes can connect through ssh without a password. > > Thanks, > Andreea > > On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea wrote: > Hi, > > Let's say I have an MPI application running on several hosts. Is there any > way to checkpoint this application without having a shared storage between > the nodes? > I already took a look at the examples here > http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in > both cases there is a globally mounted file system. > > Thanks, > Andreea
Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes
I asked this question because checkpointing to NFS is successful, but checkpointing without a mounted filesystem or a shared storage throws this warning: WARNING: Could not preload specified file: File already exists. Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0 Host: X Will continue attempting to launch the process. filem:rsh: wait_all(): Wait failed (-1) [[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054 even if I set the mca-parameters like this: snapc_base_store_in_place=0 crs_base_snapshot_dir=/home/andreea/checkpoints/local snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global and the nodes can connect through ssh without a password. Thanks, Andreea On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea wrote: > Hi, > > Let's say I have an MPI application running on several hosts. Is there any > way to checkpoint this application without having a shared storage between > the nodes? > I already took a look at the examples here > http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that > in both cases there is a globally mounted file system. > > Thanks, > Andreea > >
Re: [OMPI users] Difficulty with MPI_Unpack
On Sun, 07 Feb 2010 22:40:55 -0500, Prentice Bisbal wrote: > Hello, everyone. I'm having trouble packing/unpacking this structure: > > typedef struct{ > int index; > int* coords; > }point; > > The size of the coords array is not known a priori, so it needs to be a > dynamic array. I'm trying to send it from one node to another using > MPI_Pack/MPI_Unpack as shown below. When I unpack it, I get this error > when unpacking the coords array: > > [fatboy:07360] *** Process received signal *** > [fatboy:07360] Signal: Segmentation fault (11) > [fatboy:07360] Signal code: Address not mapped (1) > [fatboy:07360] Failing at address: (nil) Looks like b_point.coords = NULL. Has this been allocated on rank=1? You might need to use MPI_Get_count to decide how much to allocate. Also, if you don't have a convenient upper bound on the size of the receive buffer, you can use MPI_Probe followed by MPI_Get_count to determine this before calling MPI_Recv. Jed
Re: [OMPI users] Problems building Open MPI 1.4.1 with Pathscale
Hello,

It does work with version 1.4. This is the hello world that hangs with 1.4.1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int node, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &node);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Hello World from Node %d of %d.\n", node, size);

  MPI_Finalize();
  return 0;
}

On Tue, 26-01-2010 at 03:57 -0500, Åke Sandgren wrote: > 1 - Do you have problems with openmpi 1.4 too? (I don't, haven't built > 1.4.1 yet) > 2 - There is a bug in the pathscale compiler with -fPIC and -g that > generates incorrect dwarf2 data so debuggers get really confused and > will have BIG problems debugging the code. I'm chasing them to get a > fix... > 3 - Do you have an example code that has problems? -- Rafael Arco Arredondo Centro de Servicios de Informática y Redes de Comunicaciones Universidad de Granada