Re: [OMPI devel] Getting the number of nodes
I'm running this on my Mac where I expected to get back only the localhost. I upgraded to 1.0.2 a little while back; up until that point I had been using one of the alphas (I think it was alpha 9, but I can't be sure), and this function returned '1' on my Mac.

-- Nathan
-----
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-----

Ralph H Castain wrote:
rc=0 indicates that the "get" function was successful, so this means that there were no nodes on the NODE_SEGMENT. Were you running this in an environment where nodes had been allocated to you? Or were you expecting to find only "localhost" on the segment?

I'm not entirely sure, but I don't believe there have been significant changes in 1.0.2 for some time. My guess is that something has changed on your system as opposed to in the OpenMPI code you're using. Did you do an update recently and then begin seeing this behavior? Your revision level is 1000+ behind the current repository, so my guess is that you haven't updated for a while - since 1.0.2 is under maintenance for bugs only, that shouldn't be a problem. I'm just trying to understand why your function is doing something different if the OpenMPI code you're using hasn't changed.

Ralph

On 7/5/06 2:40 PM, "Nathan DeBardeleben" <ndeb...@lanl.gov> wrote:
Open MPI: 1.0.2
Open MPI SVN revision: r9571

The rc value returned by the 'get' call is '0'. All I'm doing is calling init with my own daemon name; it comes up fine, then I immediately call this to figure out how many nodes are associated with this machine.

-- Nathan

Ralph H Castain wrote:
Hi Nathan

Could you tell us which version of the code you are using, and print out the rc value that was returned by the "get" call?
I see nothing obviously wrong with the code, but much depends on what happened prior to this call too. BTW: you might want to release the memory stored in the returned values - it could represent a substantial memory leak.

Ralph

On 7/5/06 9:28 AM, "Nathan DeBardeleben" <ndeb...@lanl.gov> wrote:
I used to use this code to get the number of nodes in a cluster / machine / whatever:

int get_num_nodes(void)
{
    int rc;
    size_t cnt;
    orte_gpr_value_t **values;

    rc = orte_gpr.get(ORTE_GPR_KEYS_OR | ORTE_GPR_TOKENS_OR,
                      ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);
    if (rc != ORTE_SUCCESS) {
        return 0;
    }
    return cnt;
}

This now returns '0' on my Mac when it used to return 1. Is this not an acceptable way of doing this? Is there a cleaner / better way these days?

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] Getting the number of nodes
I used to use this code to get the number of nodes in a cluster / machine / whatever:

int get_num_nodes(void)
{
    int rc;
    size_t cnt;
    orte_gpr_value_t **values;

    rc = orte_gpr.get(ORTE_GPR_KEYS_OR | ORTE_GPR_TOKENS_OR,
                      ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);
    if (rc != ORTE_SUCCESS) {
        return 0;
    }
    return cnt;
}

This now returns '0' on my Mac when it used to return 1. Is this not an acceptable way of doing this? Is there a cleaner / better way these days?

-- Nathan
Re: [O-MPI devel] Alpha 4 and job state transitions
I've coded a hacky workaround in our code to get past this. Basically, I capture all of the state transitions, and on the first one fired for a job I fire the 'init' state internally in our tool. Generally this occurs on one of the gate transitions, G1 or something. It'll work this way. Furthermore, we're telling our users to get 1.0.2a4 (or whatever 1.0.2 is available at the time). The way I coded it, when you guys put this into the main branch and the INIT state resumes firing, my code will start working that much better. I really only brought it up because I felt it was a bug you might not have been aware of. Thanks all.

-- Nathan

Jeff Squyres wrote:
Nathan -- Ralph and I talked about this and decided not to bring it over to the 1.0 branch -- the fix uses new functionality that exists on the trunk and not in the 1.0 branch. The fix could be re-crafted to use existing functionality on the 1.0 branch (we're really trying to only put bug fixes on the 1.0 branch -- not any new functionality) -- but we didn't know if you cared. :-) Do you mind if this fix stays on the trunk, or do you need it in the v1.0 branch?

On Feb 8, 2006, at 4:36 PM, Nathan DeBardeleben wrote:
Thanks Ralph.

Ralph H. Castain wrote:
Nathan

This should now be fixed on the trunk. Once it is checked out more thoroughly, I'll ask that it be moved to the 1.0 branch. For now, you might want to check out the trunk and verify it meets your needs.

Ralph

At 03:05 PM 2/1/2006, you wrote:
This was happening on Alpha 1 as well but I upgraded today to Alpha 4 to see if it's gone away - it has not.
I register a callback on a spawn() inside ORTE. That callback includes the current state and should be called as the job goes through those states. I am now noticing that jobs never go through the INIT state. They may also skip others, but they definitely do not go through ORTE_PROC_STATE_INIT. I was registering the IOForwarding callback during the INIT phase, so, consequently, I now do not have IOF. There are other side effects: jobs that I start appear to be perpetually in the 'starting' state and then, suddenly, they're done. Can someone look into / comment on this please? Thanks.

-- Nathan
[O-MPI devel] Alpha 4 and job state transitions
This was happening on Alpha 1 as well but I upgraded today to Alpha 4 to see if it's gone away - it has not.

I register a callback on a spawn() inside ORTE. That callback includes the current state and should be called as the job goes through those states. I am now noticing that jobs never go through the INIT state. They may also skip others, but they definitely do not go through ORTE_PROC_STATE_INIT. I was registering the IOForwarding callback during the INIT phase, so, consequently, I now do not have IOF. There are other side effects: jobs that I start appear to be perpetually in the 'starting' state and then, suddenly, they're done. Can someone look into / comment on this please? Thanks.

-- Nathan
[O-MPI devel] Back to 32bit on 64bit machines...
So is this an error or am I configuring wrong? Here's my configure:

[sparkplug]~/ompi > ./configure CFLAGS=-m32 FFLAGS=-m32 CXXFLAGS=-m32 --without-threads --prefix=/home/ndebard/local/ompi --with-devel-headers --without-gm

I've also tried adding --build=i586-suse-linux; that didn't help either. Basically the compile eventually ends here:

g++ -DHAVE_CONFIG_H -I. -I. -I../../../include -I../../../include -I../../../include -I../../.. -I../../.. -I../../../include -I../../../opal -I../../../orte -I../../../ompi -m32 -g -Wall -Wundef -Wno-long-long -finline-functions -MT comm.lo -MD -MP -MF .deps/comm.Tpo -c comm.cc -fPIC -DPIC -o .libs/comm.o
/bin/sh ../../../libtool --mode=link g++ -m32 -g -Wall -Wundef -Wno-long-long -finline-functions -export-dynamic -o libmpi_cxx.la -rpath /home/ndebard/local/ompi/lib mpicxx.lo intercepts.lo comm.lo -lm -lutil -lnsl
g++ -shared -nostdlib /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../lib/crti.o /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/32/crtbeginS.o .libs/mpicxx.o .libs/intercepts.o .libs/comm.o -lutil -lnsl -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/32 -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3 -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/lib/../lib -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/lib -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../lib -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../.. -L/lib/../lib -L/usr/lib/../lib /usr/lib64/libstdc++.so -lm -lc -lgcc_s_32 /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/32/crtendS.o /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../lib/crtn.o -m32 -Wl,-soname -Wl,libmpi_cxx.so.0 -o .libs/libmpi_cxx.so.0.0.0
/usr/lib64/libstdc++.so: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make[3]: *** [libmpi_cxx.la] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/ompi/mpi/cxx'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/ompi/mpi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/ompi'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >

I'm having problems I think might be 64bit related and want to prove it by building in 32bit mode. Oh, here's some basics if it helps:

[sparkplug]~/ompi > cat /etc/issue
Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).
[sparkplug]~/ompi > uname -a
Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64 x86_64 x86_64 GNU/Linux
[sparkplug]~/ompi >

-- Nathan
Re: [O-MPI devel] OMPI compile failing
I'm trying this on sparkplug. I have no real desire to use GM, so if it can be disabled then that'd be great.

-- Nathan

Tim S. Woodall wrote:
Nathan - What machine are you on? Galen - have you tried GM w/ your changes?

Nathan DeBardeleben wrote:
Compiling I get:

gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -fno-strict-aliasing -MT btl_gm.lo -MD -MP -MF .deps/btl_gm.Tpo -c btl_gm.c -fPIC -DPIC -o .libs/btl_gm.o
btl_gm.c: In function `mca_btl_gm_prepare_src':
btl_gm.c:237: error: `gm_btl' undeclared (first use in this function)
btl_gm.c:237: error: (Each undeclared identifier is reported only once
btl_gm.c:237: error: for each function it appears in.)
btl_gm.c: In function `mca_btl_gm_prepare_dst':
btl_gm.c:398: warning: ISO C89 forbids mixed declarations and code
btl_gm.c:404: error: structure has no member named `mpoo_retain'
btl_gm.c:381: warning: unused variable `gm_btl'
make[4]: *** [btl_gm.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/ompi/mca/btl/gm'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/ompi/dynamic-mca/btl'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/ompi/dynamic-mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/ompi'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >

I've configured using the option I thought would disable this: --enable-mca-no-build=ptl-gm. I even tried --enable-mca-no-build=btl-gm. No luck.
[O-MPI devel] OMPI compile failing
Compiling I get:

gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -fno-strict-aliasing -MT btl_gm.lo -MD -MP -MF .deps/btl_gm.Tpo -c btl_gm.c -fPIC -DPIC -o .libs/btl_gm.o
btl_gm.c: In function `mca_btl_gm_prepare_src':
btl_gm.c:237: error: `gm_btl' undeclared (first use in this function)
btl_gm.c:237: error: (Each undeclared identifier is reported only once
btl_gm.c:237: error: for each function it appears in.)
btl_gm.c: In function `mca_btl_gm_prepare_dst':
btl_gm.c:398: warning: ISO C89 forbids mixed declarations and code
btl_gm.c:404: error: structure has no member named `mpoo_retain'
btl_gm.c:381: warning: unused variable `gm_btl'
make[4]: *** [btl_gm.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/ompi/mca/btl/gm'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/ompi/dynamic-mca/btl'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/ompi/dynamic-mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/ompi'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >

I've configured using the option I thought would disable this: --enable-mca-no-build=ptl-gm. I even tried --enable-mca-no-build=btl-gm. No luck.

-- Nathan
[O-MPI devel] 64bit shared library problems
I've been having this problem for a week or so and I've been asking other people to weigh in if they know what I'm doing wrong. I've gotten nowhere on this, so I figure I'll finally drop it out on the list. First, here's the important info.

The machine:

[sparkplug]~ > cat /etc/issue
Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).
[sparkplug]~ > uname -a
Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64 x86_64 x86_64 GNU/Linux

My versions of libtool, autoconf, automake:

[sparkplug]~ > libtool --version
ltmain.sh (GNU libtool) 1.5.20 (1.1220.2.287 2005/08/31 18:54:15)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[sparkplug]~ > autoconf --version
autoconf (GNU Autoconf) 2.59
Written by David J. MacKenzie and Akim Demaille.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[sparkplug]~ > automake --version
automake (GNU automake) 1.8.5
Written by Tom Tromey <tro...@redhat.com>.
Copyright 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[sparkplug]~ >

My ompi version: 7322 - but this has been going on for a few days like I said, and I've been updating a lot with no progress.
Configured using:

$ ./configure --enable-static --disable-shared --without-threads --prefix=/home/ndebard/local/ompi --with-devel-headers --enable-mca-no-build=ptl-gm

Simple C file which I will compile into a shared library (named 'testlib.c'):

int test_compile(int x)
{
    int rc;
    rc = orte_init(true);
    printf("rc = %d\n", rc);
    return x + 1;
}

OK, so let's build this:

[sparkplug]~/ompi-test > mpicc -c testlib.c
[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: testlib.o: relocation R_X86_64_32 can not be used when making a shared object; recompile with -fPIC
testlib.o: could not read symbols: Bad value
collect2: ld returned 1 exit status

OK, so relocation problems. Maybe I'll follow the directions and -fPIC my file myself:

[sparkplug]~/ompi-test > mpicc -c testlib.c -fPIC
[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: /home/ndebard/local/ompi/lib/liborte.a(orte_init.o): relocation R_X86_64_32 can not be used when making a shared object; recompile with -fPIC
/home/ndebard/local/ompi/lib/liborte.a: could not read symbols: Bad value
collect2: ld returned 1 exit status

OK, so I read this as: there's a relocation problem in 'liborte.a'. I un-arred liborte.a and checked some of the files with 'file' and it says 64bit. I haven't yet written a script to check every file in here, but here's orte_init.o:

[sparkplug]~/<1>tmp > file orte_init.o
orte_init.o: ELF 64-bit LSB relocatable, AMD x86-64, version 1 (SYSV), not stripped

So that at least says it's 64bit.
And to confirm, my mpicc's 64bit too:

[sparkplug]~/<1>tmp > which mpicc
/home/ndebard/local/ompi/bin/mpicc
[sparkplug]~/<1>tmp > file /home/ndebard/local/ompi/bin/mpicc
/home/ndebard/local/ompi/bin/mpicc: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.1, dynamically linked (uses shared libs), not stripped

Someone suggested I take out the 'disable-shared' from the configure line, so I did. The result was the same. So the result is that I cannot build a shared library on a 64bit linux machine that uses orte calls. So then I tried taking out the orte calls and instead using MPI calls. Sure, this function makes no sense, but here it is now:

#include "orte_config.h"
#include <mpi.h>

int test_compile(int x)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    return x + 1;
}

And now, when I try and make a shared object I get relocation errors:

/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: /home/ndebard/local/ompi/lib/libmpi.a(comm_init.o): relocation R_X86_64_32 can not be used when making a shared object; recompile with -fPIC
/home/ndebard/local/ompi/lib/libmpi.a: could not read symbols: Bad value

So... could perhaps the build be messed up and not really be using 64bit code? Am I the only one seeing this? It's a trivial test for those of you with access to a 64bit machine, if you wouldn't mind testing for me. Help would be greatly appreciated.

-- Nathan
Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box
FYI, this only happens when I let OMPI compile 64bit on Linux. When I throw in CFLAGS=-m32 FFLAGS=-m32 CXXFLAGS=-m32, orted, my myriad of test codes, mpirun, registry subscription codes, and JNI all work like a champ. Something's wrong with the 64bit build, it appears to me.

-- Nathan

Tim S. Woodall wrote:
Nathan, I'll try to reproduce this sometime this week - but I'm pretty swamped. Is Greg also seeing the same behavior?

Thanks, Tim

Nathan DeBardeleben wrote:
To expand on this further, orte_init() seg faults on both bluesteel (32bit linux) and sparkplug (64bit linux) equally. The required condition is that orted must be running first (which of course we require for our work - a persistent orte daemon and registry).

[bluesteel]~/ptp > ./dump_info
Segmentation fault
[bluesteel]~/ptp > gdb dump_info
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /home/ndebard/ptp/dump_info

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x0045997d in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x004156a7 in orte_system_init () at orte_system_init.c:38
#3  0x004151c7 in orte_init () at orte_init.c:46
#4  0x00414cbb in main (argc=1, argv=0x7fb298) at dump_info.c:185
(gdb)

-- Nathan

Nathan DeBardeleben wrote:
Just to clarify:
1: no orted started (meaning the mpirun or registry programs will start one by themselves) causes those programs to lock up.
2: starting orted by hand (trying to get these programs to connect to a centralized one) causes the connecting programs to seg fault.

-- Nathan

Nathan DeBardeleben wrote:
So I dropped an .ompi_ignore into that directory, reconfigured, and the compile worked (yay!). However, not a lot of progress: mpirun locks up, and all my registry test programs lock up as well. If I start the orted by hand, then any of my registry-calling programs segfault:

[sparkplug]~/ptp > gdb sub_test
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /home/ndebard/ptp/sub_test

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x004155cf in orte_system_init () at orte_system_init.c:38
#3  0x004150ef in orte_init () at orte_init.c:46
#4  0x004148a1 in main (argc=1, argv=0x7fb178) at sub_test.c:60
(gdb)

Yes, I recompiled everything.
Here's an example of me trying something a little more complicated (which I believe locks up for the same reason - something borked with the registry interaction):

[sparkplug]~/ompi-test > bjssub -s 1 -n 10 -i bash
Waiting for interactive job nodes.
(nodes 18 16 17 18 19 20 21 22 23 24 25)
Starting interactive job.
NODES=16,17,18,19,20,21,22,23,24,25
JOBID=18

so i got my nodes

ndebard@sparkplug:~/ompi-test> export OMPI_MCA_
Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box
To expand on this further, orte_init() seg faults on both bluesteel (32bit linux) and sparkplug (64bit linux) equally. The required condition is that orted must be running first (which of course we require for our work - a persistent orte daemon and registry).

[bluesteel]~/ptp > ./dump_info
Segmentation fault
[bluesteel]~/ptp > gdb dump_info
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /home/ndebard/ptp/dump_info

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x0045997d in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x004156a7 in orte_system_init () at orte_system_init.c:38
#3  0x004151c7 in orte_init () at orte_init.c:46
#4  0x00414cbb in main (argc=1, argv=0x7fb298) at dump_info.c:185
(gdb)

-- Nathan

Nathan DeBardeleben wrote:
Just to clarify:
1: no orted started (meaning the mpirun or registry programs will start one by themselves) causes those programs to lock up.
2: starting orted by hand (trying to get these programs to connect to a centralized one) causes the connecting programs to seg fault.

-- Nathan

Nathan DeBardeleben wrote:
So I dropped an .ompi_ignore into that directory, reconfigured, and the compile worked (yay!). However, not a lot of progress: mpirun locks up, and all my registry test programs lock up as well. If I start the orted by hand, then any of my registry-calling programs segfault:

[sparkplug]~/ptp > gdb sub_test
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /home/ndebard/ptp/sub_test

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x004155cf in orte_system_init () at orte_system_init.c:38
#3  0x004150ef in orte_init () at orte_init.c:46
#4  0x004148a1 in main (argc=1, argv=0x7fb178) at sub_test.c:60
(gdb)

Yes, I recompiled everything. Here's an example of me trying something a little more complicated (which I believe locks up for the same reason - something borked with the registry interaction):

[sparkplug]~/ompi-test > bjssub -s 1 -n 10 -i bash
Waiting for interactive job nodes.
(nodes 18 16 17 18 19 20 21 22 23 24 25)
Starting interactive job.
NODES=16,17,18,19,20,21,22,23,24,25
JOBID=18

so i got my nodes

ndebard@sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
ndebard@sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101

and set these envvars like we need to use Greg's bproc; without the 2nd export the machine's load maxes and locks up.
ndebard@sparkplug:~/ompi-test> bpstat
Node(s)    Status  Mode        User     Group
100-128    down    --          root     root
0-15       up      ---x--      vchandu  vchandu
16-25      up      ---x--      ndebard  ndebard
26-27      up      ---x--      root     root
28-30      up      ---x--x--x  root     root
ndebard@sparkplug:~/ompi-test> env | grep NODES
NODES=16,17,18,19,20,21,22,23,24,25
Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box
So I'm seeing all these nice emails about people developing on OMPI today, yet I can't get it to compile. Am I out here in limbo on this, or are others in the same boat? The errors I'm seeing are about some bproc code calling undefined functions; they are listed again below.

-- Nathan

Nathan DeBardeleben wrote:
Back from training and trying to test this, but now OMPI doesn't compile at all:

gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -fno-strict-aliasing -MT ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c ras_lsf_bproc.c -o ras_lsf_bproc.o
ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
ras_lsf_bproc.c:32: error: implicit declaration of function `orte_ras_base_node_insert'
ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
ras_lsf_bproc.c:37: error: implicit declaration of function `orte_ras_base_node_query'
make[4]: *** [ras_lsf_bproc.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/orte'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >

Clean SVN checkout this morning with configure:

[sparkplug]~/ompi > ./configure --enable-static --disable-shared --without-threads --prefix=/home/ndebard/local/ompi --with-devel-headers

-- Nathan

Brian Barrett wrote:
This is now fixed in SVN. You should no longer need the --build=i586... hack to compile 32 bit code on Opterons.

Brian

On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
We've got a 64bit Linux (SUSE) box here. For a variety of reasons (Java, JNI, linking in with OMPI libraries, etc., which I won't get into) I need to compile OMPI 32 bit (or get 64bit versions of a lot of other libraries). I get various compile errors when I try different things, but first let me explain the system we have:
This goes on and on and on actually. And the 'is incompatible with i386:x86-64 output' looks to be repeated for every line before this error which actually caused the make to bomb. Any suggestions at all? Surely someone must have tried to force OMPI to build in 32bit mode on a 64bit machine.

I don't think anyone has tried to build 32 bit on an Opteron, which is the cause of the problems... I think I know how to fix this, but it won't happen until later in the weekend. I can't think of a good workaround until then. Well, one possibility is to set the target like you were doing and disable ROMIO. Actually, you'll also need to disable Fortran 77. So something like:

./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77

might just do the trick.

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box
Back from training and trying to test this, but now OMPI doesn't compile at all:

gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -fno-strict-aliasing -MT ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c ras_lsf_bproc.c -o ras_lsf_bproc.o
ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
ras_lsf_bproc.c:32: error: implicit declaration of function `orte_ras_base_node_insert'
ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
ras_lsf_bproc.c:37: error: implicit declaration of function `orte_ras_base_node_query'
make[4]: *** [ras_lsf_bproc.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/orte'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >

Clean SVN checkout this morning with configure:

[sparkplug]~/ompi > ./configure --enable-static --disable-shared --without-threads --prefix=/home/ndebard/local/ompi --with-devel-headers

-- Nathan

Brian Barrett wrote:
This is now fixed in SVN. You should no longer need the --build=i586... hack to compile 32 bit code on Opterons.

Brian

On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
We've got a 64bit Linux (SUSE) box here. For a variety of reasons (Java, JNI, linking in with OMPI libraries, etc., which I won't get into) I need to compile OMPI 32 bit (or get 64bit versions of a lot of other libraries). I get various compile errors when I try different things, but first let me explain the system we have:
This goes on and on and on actually. And the 'is incompatible with i386:x86-64 output' looks to be repeated for every line before this error which actually caused the make to bomb. Any suggestions at all? Surely someone must have tried to force OMPI to build in 32bit mode on a 64bit machine.

I don't think anyone has tried to build 32 bit on an Opteron, which is the cause of the problems... I think I know how to fix this, but it won't happen until later in the weekend. I can't think of a good workaround until then. Well, one possibility is to set the target like you were doing and disable ROMIO. Actually, you'll also need to disable Fortran 77. So something like:

./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77

might just do the trick.

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/