Re: [OMPI users] valgrind complaint in openmpi1.3 (mca_mpool_sm_alloc)
I set it based on the only available information we have in the init function. This way the variable is always initialized, and the upper layer (whatever it is) has the responsibility to set it to something useful. Looking at the code, it seems that the upper layer in question is the mpool sm component, which has this information. r20780 fixes this problem.

george.

On Mar 14, 2009, at 09:23, Jeff Squyres wrote:

George -- Any particular reason you fixed it this way?

On Mar 10, 2009, at 1:40 PM, Åke Sandgren wrote:

On Tue, 2009-03-10 at 09:23 -0800, Eugene Loh wrote:
> Åke Sandgren wrote:
> >Hi!
> >
> >Valgrind seems to think that there is a use of an uninitialized value in
> >mca_mpool_sm_alloc, i.e. the if(mpool_sm->mem_node >= 0) {
> >Backtracking that, I found that mem_node is not set during initialization
> >in mca_mpool_sm_init.
> >The resources parameter is never used and the mpool_module->mem_node is
> >never initialized.
> >
> >Bug or not?
>
> Apparently George fixed this in the trunk in r19257
> https://svn.open-mpi.org/source/history/ompi-trunk/ompi/mca/mpool/sm/mpool_sm_module.c
> So, the resources parameter is never used, but you call
> mca_mpool_sm_module_init(), which has the decency to set mem_node to
> -1. Not a helpful value, but a legal one.

So why not set it in the calling function, which has access to the precomputed resources value?

-- Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users

-- Jeff Squyres, Cisco Systems
[OMPI users] core dump while running openmpi
Hi there: I'm trying to install an old version of openmpi (1.1.1) on a 32-bit cluster and run a program with it. This program runs fine on another 64-bit cluster which has openmpi 1.1.1 installed, but when running it on the 32-bit cluster, I get the following error:

/var/spool/pbs/mom_priv/jobs/282832.borg.SC: line 37: 13154 Segmentation fault (core dumped) /ul/tedhyu/openmpi/openmpi-1.1.1/install/bin/mpirun -machinefile ${PBS_NODEFILE} -np ${NPROCS} ${CODE}

Has anybody encountered this error before? If you have any advice, it would be much appreciated.

Regards, Ted
Re: [OMPI users] valgrind complaint in openmpi1.3 (mca_mpool_sm_alloc)
George -- Any particular reason you fixed it this way?

On Mar 10, 2009, at 1:40 PM, Åke Sandgren wrote:

On Tue, 2009-03-10 at 09:23 -0800, Eugene Loh wrote:
> Åke Sandgren wrote:
> >Hi!
> >
> >Valgrind seems to think that there is a use of an uninitialized value in
> >mca_mpool_sm_alloc, i.e. the if(mpool_sm->mem_node >= 0) {
> >Backtracking that, I found that mem_node is not set during initialization
> >in mca_mpool_sm_init.
> >The resources parameter is never used and the mpool_module->mem_node is
> >never initialized.
> >
> >Bug or not?
>
> Apparently George fixed this in the trunk in r19257
> https://svn.open-mpi.org/source/history/ompi-trunk/ompi/mca/mpool/sm/mpool_sm_module.c
> So, the resources parameter is never used, but you call
> mca_mpool_sm_module_init(), which has the decency to set mem_node to
> -1. Not a helpful value, but a legal one.

So why not set it in the calling function, which has access to the precomputed resources value?

-- Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] Run-time problem
Sorry for the delay in replying; this week unexpectedly turned exceptionally hectic for several of us...

On Mar 9, 2009, at 2:53 PM, justin oppenheim wrote:

Yes. As I indicated earlier, I did use these options to compile my program:

MPI_CXX=/programs/openmpi/bin/mpicxx
MPI_CC=/programs/openmpi/bin/mpicc
MPI_INCLUDE=/programs/openmpi/include/
MPI_LIB=mpi
MPI_LIBDIR=/programs/openmpi/lib/
MPI_LINKERFORPROGRAMS=/programs/openmpi/bin/mpicxx

Ah; I think Ralph was asking because we don't know exactly how these "environment variables" are being used to build your application.

where /programs/openmpi/ is the chosen location for installing the openmpi package (specifically, openmpi-1.3.tar.gz) that I downloaded from www.open-mpi.org.

Can you ensure that you have exactly the same version of Open MPI installed on all nodes in exactly the same location in the filesystem? (It doesn't *have* to be the same location on the filesystem on all the nodes, but it sure is easier if it is.) Also be sure that when you mpirun across multiple nodes, the same version of Open MPI (both executables and libraries) is being found on all nodes.

Any clue? Again, my system is Suse 10.3 64-bit, which should be pretty standard. Would another package, openmpi-1.3-1.src.rpm, work better for my system? Thanks, JO

--- On Mon, 3/9/09, Ralph Castain wrote:
From: Ralph Castain
Subject: Re: [OMPI users] Run-time problem
To: jl09...@yahoo.com
Cc: us...@open-mpi.org
Date: Monday, March 9, 2009, 7:59 AM

Did you try compiling your program with the provided mpicc (or mpiCC, mpif90, etc. - as appropriate) wrapper compiler? The wrapper compilers contain all the required library definitions to make the application work. Compiling without the wrapper compilers is a very bad idea...

Ralph

On Mar 6, 2009, at 11:02 AM, justin oppenheim wrote:

Please let me go over it again, and maybe it helps clarify things a bit better. All the OS involved are Suse 10.3.
I have a place for the installed programs, say /programs. In /programs I have installed openmpi and my mpi program, say my_mpi_program. When I am in the working directory, my LD_LIBRARY_PATH does include both

/programs/my_mpi_program/lib
/programs/openmpi/lib

And my PATH includes

/programs/my_mpi_program/bin
/programs/openmpi/bin

So, then I do

mpirun -machinefile machinefile -np 20 my_mpi_program

and I get

/programs/my_mpi_program: symbol lookup error: /programs/openmpi/lib/libmpi_cxx.so.0: undefined symbol: ompi_registered_datareps

When I configured openmpi, I did ./configure --prefix=/programs/openmpi and then compiled it. Subsequently, I compiled my_mpi_program with the options:

MPI_CXX=/programs/openmpi/bin/mpicxx
MPI_CC=/programs/openmpi/bin/mpicc
MPI_INCLUDE=/programs/openmpi/include/
MPI_LIB=mpi
MPI_LIBDIR=/programs/openmpi/lib/
MPI_LINKERFORPROGRAMS=/programs/openmpi/bin/mpicxx

Any clue? The directory /programs is NFS mounted on the nodes. Many thanks again, JO

--- On Thu, 3/5/09, justin oppenheim wrote:
From: justin oppenheim
Subject: Re: [OMPI users] Run-time problem
To: "Ralph Castain"
Date: Thursday, March 5, 2009, 5:28 PM

Hi Ralph: Sorry for my ignorance, but in your option 2: to what command should I add the option --prefix=path-to-install? When I configure openmpi? I already did that when I configured and compiled openmpi. Also, in response to your option 1, I did add the paths to the libraries of openmpi to the LD_LIBRARY_PATH in the .cshrc of the nodes. Thank you, JO

--- On Thu, 3/5/09, Ralph Castain wrote:
From: Ralph Castain
Subject: Re: [OMPI users] Run-time problem
To: jl09...@yahoo.com
Cc: "Open MPI Users"
Date: Thursday, March 5, 2009, 12:46 PM

First, you can add --launch-agent rsh to the command line and that will have OMPI use rsh. It sounds like your remote nodes may not be seeing your OMPI install directory. Several ways you can resolve that - here are a couple:

1.
add the install directory to your LD_LIBRARY_PATH in your .cshrc (or whatever shell rc you are using) - be sure this is being executed on the remote nodes

2. add --prefix=path-to-install on your cmd line - this will direct your remote procs to the proper libraries

Ralph

On Mar 5, 2009, at 10:18 AM, justin oppenheim wrote:

Maybe I should also add that the program my_mpi_executable is locally installed under the same root directory as that under which openmpi-1.3 is installed. This root directory is NFS mounted on the working nodes. Thanks, JO

--- On Thu, 3/5/09, justin oppenheim wrote:
From: justin oppenheim
Subject: Re: [OMPI users] Run-time problem
To: "Ralph Castain"
Date: Thursday, March 5, 2009, 12:04 PM

Hi Ralph: Thanks for your
Re: [OMPI users] Can't start program across network
Can you send all the information here: http://www.open-mpi.org/community/help/ (including the network information)? Thanks!

On Mar 13, 2009, at 9:12 PM, Raymond Wan wrote:

Hi Jeff,

Jeff Squyres wrote:
> On Mar 13, 2009, at 6:17 AM, Raymond Wan wrote:
>
>> What doesn't work is:
>>
>> [On Y] mpirun --host Y,Z --np 2 uname -a
>> [On Y] mpirun --host X,Y,Z --np 3 uname -a
>>
>> ...and similarly for machine Z. I can confirm that from any of the 3
>
> Do you see "rsh" or "ssh" in the output of "ps -eadf" when mpirun is
> hanging, perchance? If so, what happens if you copy-n-paste those
> command lines and run them manually?

No, I don't see either rsh or ssh when mpirun is hanging. Is that odd? Something I'm doing wrong? I only see an mpirun command and an orted command.

rwan 22800 22761 0 09:52 pts/2 00:00:00 mpirun --host X,Y,Z --np 3 sleep 1000
rwan 22804 1 0 09:52 ? 00:00:00 orted --bootproxy 1 --name 0.0.2 --num_procs 4 --vpid_start 0 --nodename Y --universe rwan@Y:default-universe-22800 --nsreplica "0.0.0;tcp://Y:36889" --gprreplica "0.0.0;tcp://Y:36889" --set-sid

Actually, when I run the above mpirun command, I don't see "sleep" running locally on machine Y, either. However, if I do this:

mpirun --host Y --np 3 sleep 1000

I see 3 instances of "sleep" when I do ps -aedf. Does mpirun try to "ssh" to all networked machines first before it starts the program (even if one of those instances will run locally)?

Perhaps unrelated... but when I am on Y and I do an rsh to Z, I get a "No route to host". I asked the sysadmin about it (I'm not the sysadmin of Y or Z) and he doesn't know why, but as we should be using ssh anyway, he isn't going to address the problem (unless it is a side-effect of my mpirun problem). I only presume rsh hasn't been set up properly; ssh works fine, though.

Thank you! Ray

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] Problem in MPI::Finalize when freeing intercommunicators
On Mar 13, 2009, at 5:15 PM, Mikael Djurfeldt wrote:

On Fri, Mar 13, 2009 at 9:28 PM, Jeff Squyres wrote:
> No, you should not need to do this.
>
> Is there any chance you could upgrade to Open MPI v1.3?

Yes. It works without a Barrier under v1.3. Is this a known problem?

Possibly...? I can't name any particular issue offhand that is a known culprit for this, but it's possible someone else can. There are many changes and fixes in the v1.3 series as compared to the v1.2 series.

What is the best way for me to test in my configure script that I'm running under Open MPI version >= 1.3, so that I can disable the Barrier for such versions?

In mpi.h, we have a few macros that should help you:

/*
 * Just in case you need it. :-)
 */
#define OPEN_MPI 1

/* Major, minor, and release version of Open MPI */
#define OMPI_MAJOR_VERSION 1
#define OMPI_MINOR_VERSION 3
#define OMPI_RELEASE_VERSION 0

You should be able to construct a fairly simple AC_TRY_RUN test that checks #if defined(), etc.

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] PGI 8.0-4 doesn't like ompi/mca/op/op.h
Oops! I sent the patch to George but didn't send it to everyone else. Here's a patch showing how I propose to fix this problem:

Index: ompi/mca/op/op.h
===================================================================
--- ompi/mca/op/op.h	(revision 20777)
+++ ompi/mca/op/op.h	(working copy)
@@ -258,14 +258,41 @@
 typedef ompi_op_base_handler_fn_1_0_0_t ompi_op_base_handler_fn_t;

 /*
+ * Per the thread starting here:
+ *
+ * http://www.open-mpi.org/community/lists/users/2009/03/8402.php
+ *
+ * We [re-]discovered that AC_C_RESTRICT only checks for "restrict" in
+ * the C compiler. But this header file is included in components.cc
+ * (i.e., ompi_info), so the "restrict" here may be problematic for
+ * the C++ compiler.
+ *
+ * Since we *know* that this function is only used in C code in OMPI
+ * (e.g., it's not used in ompi_info or the C++ bindings), just
+ * have an "alternate"
+ */
+#if defined(c_plusplus) || defined(__cplusplus)
+#define OMPI_SAFE_RESTRICT
+#else
+#define OMPI_SAFE_RESTRICT restrict
+#endif
+
+/*
  * Typedef for 3-buffer (two input and one output) op functions.
  */
-typedef void (*ompi_op_base_3buff_handler_fn_1_0_0_t)(void *,
-                                                      void *,
-                                                      void *, int *,
+typedef void (*ompi_op_base_3buff_handler_fn_1_0_0_t)(void *OMPI_SAFE_RESTRICT,
+                                                      void *OMPI_SAFE_RESTRICT,
+                                                      void *OMPI_SAFE_RESTRICT,
+                                                      int *,
                                                       struct ompi_datatype_t **,
                                                       struct ompi_op_base_module_1_0_0_t *);

+/*
+ * We don't want anyone else using OMPI_SAFE_RESTRICT elsewhere in the
+ * code base; this hack is only because we don't have an
+ * AC_CXX_RESTRICT Autoconf test.
+ */
+#undef OMPI_SAFE_RESTRICT
+
 typedef ompi_op_base_3buff_handler_fn_1_0_0_t ompi_op_base_3buff_handler_fn_t;

On Mar 14, 2009, at 8:22 AM, Jeff Squyres (jsquyres) wrote:

Yes, it does. In re-looking at this problem, it seemed to me:

1. The real fix is to talk to the AC people and get something like AC_CXX_RESTRICT. The PGI compiler is one place where "restrict" support may be different in the C and C++ compilers. I'm not sure what the Right answer is there, but I'll ask them about it.

2.
In this specific case, the use of "restrict" *does not matter* in components.cc. This particular part of the file is not what components.cc needs/uses. So it's OK to #define it away to nothing.

3. Since this problem now exists in at least *2* compilers that we know about (Sun, PGI), it seemed that -- at least while waiting for some kind of proper fix from AC -- just #define-ing restrict away for C++ in this particular case was OK, rather than trying to adapt to every compiler. Rolf's fix was OK previously because we thought it was specific to one compiler. But now the door is open to other compilers, so let's use a broad stroke to work around it for all C++ compilers.

That's why I coded it up this way.

On Mar 14, 2009, at 7:39 AM, Terry Dontje wrote:

> You know this all looks very similar to the reason why rolfv putback
> r20351, which essentially defined out restrict within
> opal_config_bottom.h when using Sun Studio.
>
> --td
>
> Date: Fri, 13 Mar 2009 16:40:49 -0400
> From: Jeff Squyres
> Subject: Re: [OMPI users] PGI 8.0-4 doesn't like ompi/mca/op/op.h
> To: "Open MPI Users"
> Message-ID: <2aca69ab-5f23-4ae9-8826-77a6348e9...@cisco.com>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> On Mar 13, 2009, at 4:37 PM, Mostyn Lewis wrote:
>
> > From config.log
> >
> > configure:21522: checking for C/C++ restrict keyword
> > configure:21558: pgcc -c -DNDEBUG -fast -Msignextend -tp p7-64
> > conftest.c >&5
> > configure:21564: $? = 0
> > configure:21582: result: restrict
> >
> > So you only check using pgcc (not pgCC)?
>
> The AC_C_RESTRICT test only checks the C compiler, yes. It's an
> Autoconf-builtin test; we didn't write it.
>
> Odd that you get "restrict" and I get "__restrict". Hrm.
> Well, I suppose that one solution might be to disable those prototypes
> in the op.h header file when they're included in components.cc (that's
> a source file in the ompi_info executable; it shouldn't need the
> specific MPI_Op callback prototypes). Fortunately, we have very little
> C++ code in OMPI, so this isn't a huge issue (C++ is only used for the
> MPI C++ bindings -- of course -- and in some of the command line
> executables).
>
> Let me see what I can cook up, and then let me see if I can convince
> George that it's the correct answer. ;-)

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] PGI 8.0-4 doesn't like ompi/mca/op/op.h
Yes, it does. In re-looking at this problem, it seemed to me:

1. The real fix is to talk to the AC people and get something like AC_CXX_RESTRICT. The PGI compiler is one place where "restrict" support may be different in the C and C++ compilers. I'm not sure what the Right answer is there, but I'll ask them about it.

2. In this specific case, the use of "restrict" *does not matter* in components.cc. This particular part of the file is not what components.cc needs/uses. So it's OK to #define it away to nothing.

3. Since this problem now exists in at least *2* compilers that we know about (Sun, PGI), it seemed that -- at least while waiting for some kind of proper fix from AC -- just #define-ing restrict away for C++ in this particular case was OK, rather than trying to adapt to every compiler. Rolf's fix was OK previously because we thought it was specific to one compiler. But now the door is open to other compilers, so let's use a broad stroke to work around it for all C++ compilers.

That's why I coded it up this way.

On Mar 14, 2009, at 7:39 AM, Terry Dontje wrote:

You know this all looks very similar to the reason why rolfv putback r20351, which essentially defined out restrict within opal_config_bottom.h when using Sun Studio.

--td

Date: Fri, 13 Mar 2009 16:40:49 -0400
From: Jeff Squyres
Subject: Re: [OMPI users] PGI 8.0-4 doesn't like ompi/mca/op/op.h
To: "Open MPI Users"
Message-ID: <2aca69ab-5f23-4ae9-8826-77a6348e9...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Mar 13, 2009, at 4:37 PM, Mostyn Lewis wrote:

> From config.log
>
> configure:21522: checking for C/C++ restrict keyword
> configure:21558: pgcc -c -DNDEBUG -fast -Msignextend -tp p7-64
> conftest.c >&5
> configure:21564: $? = 0
> configure:21582: result: restrict
>
> So you only check using pgcc (not pgCC)?

The AC_C_RESTRICT test only checks the C compiler, yes. It's an Autoconf-builtin test; we didn't write it.
Odd that you get "restrict" and I get "__restrict". Hrm.

Well, I suppose that one solution might be to disable those prototypes in the op.h header file when they're included in components.cc (that's a source file in the ompi_info executable; it shouldn't need the specific MPI_Op callback prototypes). Fortunately, we have very little C++ code in OMPI, so this isn't a huge issue (C++ is only used for the MPI C++ bindings -- of course -- and in some of the command line executables).

Let me see what I can cook up, and then let me see if I can convince George that it's the correct answer. ;-)

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] PGI 8.0-4 doesn't like ompi/mca/op/op.h
You know this all looks very similar to the reason why rolfv putback r20351, which essentially defined out restrict within opal_config_bottom.h when using Sun Studio.

--td

Date: Fri, 13 Mar 2009 16:40:49 -0400
From: Jeff Squyres
Subject: Re: [OMPI users] PGI 8.0-4 doesn't like ompi/mca/op/op.h
To: "Open MPI Users"
Message-ID: <2aca69ab-5f23-4ae9-8826-77a6348e9...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Mar 13, 2009, at 4:37 PM, Mostyn Lewis wrote:

> From config.log
>
> configure:21522: checking for C/C++ restrict keyword
> configure:21558: pgcc -c -DNDEBUG -fast -Msignextend -tp p7-64
> conftest.c >&5
> configure:21564: $? = 0
> configure:21582: result: restrict
>
> So you only check using pgcc (not pgCC)?

The AC_C_RESTRICT test only checks the C compiler, yes. It's an Autoconf-builtin test; we didn't write it.

Odd that you get "restrict" and I get "__restrict". Hrm.

Well, I suppose that one solution might be to disable those prototypes in the op.h header file when they're included in components.cc (that's a source file in the ompi_info executable; it shouldn't need the specific MPI_Op callback prototypes). Fortunately, we have very little C++ code in OMPI, so this isn't a huge issue (C++ is only used for the MPI C++ bindings -- of course -- and in some of the command line executables).

Let me see what I can cook up, and then let me see if I can convince George that it's the correct answer. ;-)

-- Jeff Squyres, Cisco Systems
Re: [OMPI users] Compiling ompi for use on another machine
Hi Ben,

ben rodriguez wrote:

I have compiled ompi and another program for use on another rhel5/x86_64 machine. After transferring the binaries and setting up environment variables, is there anything else I need to do for ompi to run properly? When executing my prog I get:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------

Just a few thoughts about your problem... Are the two machines identical in architecture and RH installation? Is there any reason why you cannot compile on the other machine too? (Sometimes the location of dynamic libraries, etc. changes, so I try to make a note to always recompile on each machine.) Are you having problems running your program on each node individually first? If not, you might try that first (i.e., with "--np 1").

Ray
Re: [OMPI users] MPI jobs ending up in one node
Oops... sorry... it is the Intel MPI library. Thanks!!!

On Fri, Mar 13, 2009 at 9:47 PM, Ralph Castain wrote:
> Hmmm... your comments don't sound like anything relating to Open MPI. Are you
> sure you are not using some other MPI?
>
> Our mpiexec isn't a script, for example, nor do we have anything named
> I_MPI_PIN_PROCESSOR_LIST in our code.
>
> :-)
>
> On Mar 13, 2009, at 4:00 AM, Peter Teoh wrote:
>
>> I saw the following problem posed somewhere - can anyone shed some
>> light? Thanks.
>>
>> I have a cluster of 8-socket quad-core systems running Redhat 5.2. It
>> seems that whenever I try to run multiple MPI jobs on a single node,
>> all the jobs end up running on the same processors. For example, if I
>> were to submit 4 8-way jobs to a single box, they all end up on CPUs 0
>> to 7, leaving 8 to 31 idle.
>>
>> I then tried all sorts of I_MPI_PIN_PROCESSOR_LIST combinations, but
>> short of explicitly listing out the processors at each run, they all
>> end up still hanging on to CPUs 0-7. Browsing through the mpiexec
>> script, I realise that it is doing a taskset on each run.
>> As my jobs are all submitted through a scheduler (PBS in this case), I
>> cannot possibly know at job submission time which CPUs are not used.
>> So is there a simple way to tell mpiexec to set the taskset affinity
>> correctly at each run so that it will choose only the idle processors?
>> Thanks.
>>
>> --
>> Regards,
>> Peter Teoh

-- Regards, Peter Teoh