Hello, I see the "same" (well, probably not exactly the same) thing here on an Opteron with a 64-bit build (-g and so on). I get:
#0  0x0000000040085160 in orte_sds_base_contact_universe () at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
29          return orte_sds_base_module->contact_universe();
(gdb) where
#0  0x0000000040085160 in orte_sds_base_contact_universe () at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
#1  0x0000000040063e95 in orte_init_stage1 () at ../../../orte/runtime/orte_init_stage1.c:185
#2  0x0000000040017e7d in orte_system_init () at ../../../orte/runtime/orte_system_init.c:38
#3  0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
#4  0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8) at ../../../../orte/tools/orterun/orterun.c:291
#5  0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
#6  0x000000004000bf2a in _start ()
(gdb)

Within mpirun, orte_sds_base_module is NULL at this point, so the call through it in frame #0 crashes.
This is without a persistent orted; just mpirun.

CU,
ray

On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:
> FYI, this only happens when I let OMPI compile 64bit on Linux. When I
> throw in there CFLAGS=FFLAGS=CXXFLAGS=-m32 orted, my myriad of test
> codes, mpirun, registry subscription codes, and JNI all work like a champ.
> Something's wrong with the 64bit it appears to me.
>
> -- Nathan
> Correspondence
> ---------------------------------------------------------------------
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> Parallel Tools Team
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndeb...@lanl.gov
> ---------------------------------------------------------------------
>
> Tim S. Woodall wrote:
> >Nathan,
> >
> >I'll try to reproduce this sometime this week - but I'm pretty swamped.
> >Is Greg also seeing the same behavior?
> >
> >Thanks,
> >Tim
> >
> >Nathan DeBardeleben wrote:
> >>To expand on this further, orte_init() seg faults on both bluesteel
> >>(32bit linux) and sparkplug (64bit linux) equally.
> >>The required condition is that orted must be running first (which of course
> >>we require for our work - a persistent orte daemon and registry).
> >>
> >>>[bluesteel]~/ptp > ./dump_info
> >>>Segmentation fault
> >>>[bluesteel]~/ptp > gdb dump_info
> >>>GNU gdb 6.1
> >>>Copyright 2004 Free Software Foundation, Inc.
> >>>GDB is free software, covered by the GNU General Public License, and you are
> >>>welcome to change it and/or distribute copies of it under certain conditions.
> >>>Type "show copying" to see the conditions.
> >>>There is absolutely no warranty for GDB. Type "show warranty" for details.
> >>>This GDB was configured as "x86_64-suse-linux"...Using host
> >>>libthread_db library "/lib64/tls/libthread_db.so.1".
> >>>
> >>>(gdb) run
> >>>Starting program: /home/ndebard/ptp/dump_info
> >>>
> >>>Program received signal SIGSEGV, Segmentation fault.
> >>>0x0000000000000000 in ?? ()
> >>>(gdb) where
> >>>#0  0x0000000000000000 in ?? ()
> >>>#1  0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
> >>>#2  0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
> >>>#3  0x00000000004151c7 in orte_init () at orte_init.c:46
> >>>#4  0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at dump_info.c:185
> >>>(gdb)
> >>
> >>-- Nathan
> >>Correspondence
> >>---------------------------------------------------------------------
> >>Nathan DeBardeleben, Ph.D.
> >>Los Alamos National Laboratory
> >>Parallel Tools Team
> >>High Performance Computing Environments
> >>phone: 505-667-3428
> >>email: ndeb...@lanl.gov
> >>---------------------------------------------------------------------
> >>
> >>Nathan DeBardeleben wrote:
> >>>Just to clarify:
> >>>1: no orted started (meaning the MPIrun or registry programs will
> >>>start one by themselves) causes those programs to lock up.
> >>>2: starting orted by hand (trying to get these programs to connect to
> >>>a centralized one) causes the connecting programs to seg fault.
> >>>
> >>>-- Nathan
> >>>Correspondence
> >>>---------------------------------------------------------------------
> >>>Nathan DeBardeleben, Ph.D.
> >>>Los Alamos National Laboratory
> >>>Parallel Tools Team
> >>>High Performance Computing Environments
> >>>phone: 505-667-3428
> >>>email: ndeb...@lanl.gov
> >>>---------------------------------------------------------------------
> >>>
> >>>Nathan DeBardeleben wrote:
> >>>>So I dropped an .ompi_ignore into that directory, reconfigured, and
> >>>>compile worked (yay!).
> >>>>However, not a lot of progress: mpirun locks up, all my registry test
> >>>>programs lock up as well. If I start the orted by hand, then any of my
> >>>>registry calling programs cause segfault:
> >>>>>[sparkplug]~/ptp > gdb sub_test
> >>>>>GNU gdb 6.1
> >>>>>Copyright 2004 Free Software Foundation, Inc.
> >>>>>GDB is free software, covered by the GNU General Public License, and you are
> >>>>>welcome to change it and/or distribute copies of it under certain conditions.
> >>>>>Type "show copying" to see the conditions.
> >>>>>There is absolutely no warranty for GDB. Type "show warranty" for details.
> >>>>>This GDB was configured as "x86_64-suse-linux"...Using host
> >>>>>libthread_db library "/lib64/tls/libthread_db.so.1".
> >>>>>
> >>>>>(gdb) run
> >>>>>Starting program: /home/ndebard/ptp/sub_test
> >>>>>
> >>>>>Program received signal SIGSEGV, Segmentation fault.
> >>>>>0x0000000000000000 in ?? ()
> >>>>>(gdb) where
> >>>>>#0  0x0000000000000000 in ?? ()
> >>>>>#1  0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
> >>>>>#2  0x00000000004155cf in orte_system_init () at orte_system_init.c:38
> >>>>>#3  0x00000000004150ef in orte_init () at orte_init.c:46
> >>>>>#4  0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
> >>>>>(gdb)
> >>>>
> >>>>Yes, I recompiled everything.
> >>>>
> >>>>Here's an example of me trying something a little more complicated
> >>>>(which I believe locks up for the same reason - something borked with
> >>>>the registry interaction).
> >>>>
> >>>>>>[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
> >>>>>>Waiting for interactive job nodes.
> >>>>>>(nodes 18 16 17 18 19 20 21 22 23 24 25)
> >>>>>>Starting interactive job.
> >>>>>>NODES=16,17,18,19,20,21,22,23,24,25
> >>>>>>JOBID=18
> >>>>>
> >>>>>so i got my nodes
> >>>>>
> >>>>>>ndebard@sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
> >>>>>>ndebard@sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101
> >>>>>
> >>>>>and set these envvars like we need to use Greg's bproc, without the
> >>>>>2nd export the machine's load maxes and locks up.
> >>>>>
> >>>>>>ndebard@sparkplug:~/ompi-test> bpstat
> >>>>>>Node(s)   Status   Mode         User      Group
> >>>>>>100-128   down     ----------   root      root
> >>>>>>0-15      up       ---x------   vchandu   vchandu
> >>>>>>16-25     up       ---x------   ndebard   ndebard
> >>>>>>26-27     up       ---x------   root      root
> >>>>>>28-30     up       ---x--x--x   root      root
> >>>>>>ndebard@sparkplug:~/ompi-test> env | grep NODES
> >>>>>>NODES=16,17,18,19,20,21,22,23,24,25
> >>>>>
> >>>>>yes, i really have the nodes
> >>>>>
> >>>>>>ndebard@sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
> >>>>>>ndebard@sparkplug:~/ompi-test>
> >>>>>
> >>>>>recompile for good measure
> >>>>>
> >>>>>>ndebard@sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
> >>>>>>/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory
> >>>>>
> >>>>>proof that there's no left over old directory
> >>>>>
> >>>>>>ndebard@sparkplug:~/ompi-test> mpirun -np 1 test-mpi
> >>>>>
> >>>>>it never responds at this point - but I can kill it with ^C.
> >>>>>
> >>>>>>mpirun: killing job...
> >>>>>>Killed
> >>>>>>ndebard@sparkplug:~/ompi-test>
> >>>>
> >>>>-- Nathan
> >>>>Correspondence
> >>>>---------------------------------------------------------------------
> >>>>Nathan DeBardeleben, Ph.D.
> >>>>Los Alamos National Laboratory
> >>>>Parallel Tools Team
> >>>>High Performance Computing Environments
> >>>>phone: 505-667-3428
> >>>>email: ndeb...@lanl.gov
> >>>>---------------------------------------------------------------------
> >>>>
> >>>>Jeff Squyres wrote:
> >>>>>Is this what Tim Prins was working on?
> >>>>>
> >>>>>On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:
> >>>>>>I'm not sure why this is even building... Is someone working on this?
> >>>>>>I thought we had .ompi_ignore files in this directory.
> >>>>>>
> >>>>>>Tim
> >>>>>>
> >>>>>>Nathan DeBardeleben wrote:
> >>>>>>>So I'm seeing all these nice emails about people developing on OMPI
> >>>>>>>today yet I can't get it to compile. Am I out here in limbo on this or
> >>>>>>>are others in the same boat? The errors I'm seeing are about some bproc
> >>>>>>>code calling undefined functions and they are linked again below.
> >>>>>>>
> >>>>>>>-- Nathan
> >>>>>>>Correspondence
> >>>>>>>---------------------------------------------------------------------
> >>>>>>>Nathan DeBardeleben, Ph.D.
> >>>>>>>Los Alamos National Laboratory
> >>>>>>>Parallel Tools Team
> >>>>>>>High Performance Computing Environments
> >>>>>>>phone: 505-667-3428
> >>>>>>>email: ndeb...@lanl.gov
> >>>>>>>---------------------------------------------------------------------
> >>>>>>>
> >>>>>>>Nathan DeBardeleben wrote:
> >>>>>>>>Back from training and trying to test this but now OMPI doesn't
> >>>>>>>>compile at all:
> >>>>>>>>>gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
> >>>>>>>>>-I../../../../include -I../../../.. -I../../../..
> >>>>>>>>>-I../../../../include -I../../../../opal -I../../../../orte
> >>>>>>>>>-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
> >>>>>>>>>-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
> >>>>>>>>>-Werror-implicit-function-declaration -fno-strict-aliasing -MT
> >>>>>>>>>ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
> >>>>>>>>>ras_lsf_bproc.c -o ras_lsf_bproc.o
> >>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
> >>>>>>>>>ras_lsf_bproc.c:32: error: implicit declaration of function `orte_ras_base_node_insert'
> >>>>>>>>>ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
> >>>>>>>>>ras_lsf_bproc.c:37: error: implicit declaration of function `orte_ras_base_node_query'
> >>>>>>>>>make[4]: *** [ras_lsf_bproc.lo] Error 1
> >>>>>>>>>make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
> >>>>>>>>>make[3]: *** [all-recursive] Error 1
> >>>>>>>>>make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
> >>>>>>>>>make[2]: *** [all-recursive] Error 1
> >>>>>>>>>make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
> >>>>>>>>>make[1]: *** [all-recursive] Error 1
> >>>>>>>>>make[1]: Leaving directory `/home/ndebard/ompi/orte'
> >>>>>>>>>make: *** [all-recursive] Error 1
> >>>>>>>>>[sparkplug]~/ompi >
> >>>>>>>>
> >>>>>>>>Clean SVN checkout this morning with configure:
> >>>>>>>>>[sparkplug]~/ompi > ./configure --enable-static --disable-shared
> >>>>>>>>>--without-threads --prefix=/home/ndebard/local/ompi
> >>>>>>>>>--with-devel-headers
> >>>>>>>>
> >>>>>>>>-- Nathan
> >>>>>>>>Correspondence
> >>>>>>>>---------------------------------------------------------------------
> >>>>>>>>Nathan DeBardeleben, Ph.D.
> >>>>>>>>Los Alamos National Laboratory
> >>>>>>>>Parallel Tools Team
> >>>>>>>>High Performance Computing Environments
> >>>>>>>>phone: 505-667-3428
> >>>>>>>>email: ndeb...@lanl.gov
> >>>>>>>>---------------------------------------------------------------------
> >>>>>>>>
> >>>>>>>>Brian Barrett wrote:
> >>>>>>>>>This is now fixed in SVN. You should no longer need the
> >>>>>>>>>--build=i586... hack to compile 32 bit code on Opterons.
> >>>>>>>>>
> >>>>>>>>>Brian
> >>>>>>>>>
> >>>>>>>>>On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:
> >>>>>>>>>>On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:
> >>>>>>>>>>>We've got a 64bit Linux (SUSE) box here. For a variety of
> >>>>>>>>>>>reasons (Java, JNI, linking in with OMPI libraries, etc which I
> >>>>>>>>>>>won't get into) I need to compile OMPI 32 bit (or get 64bit
> >>>>>>>>>>>versions of a lot of other libraries).
> >>>>>>>>>>>I get various compile errors when I try different things, but
> >>>>>>>>>>>first let me explain the system we have:
> >>>>>>>>>>
> >>>>>>>>>><snip>
> >>>>>>>>>>
> >>>>>>>>>>>This goes on and on and on actually. And the 'is incompatible
> >>>>>>>>>>>with i386:x86-64 output' looks to be repeated for every line
> >>>>>>>>>>>before this error which actually caused the Make to bomb.
> >>>>>>>>>>>
> >>>>>>>>>>>Any suggestions at all? Surely someone must have tried to force
> >>>>>>>>>>>OMPI to build in 32bit mode on a 64bit machine.
> >>>>>>>>>>
> >>>>>>>>>>I don't think anyone has tried to build 32 bit on an Opteron,
> >>>>>>>>>>which is the cause of the problems...
> >>>>>>>>>>
> >>>>>>>>>>I think I know how to fix this, but won't happen until later in
> >>>>>>>>>>the weekend. I can't think of a good workaround until then.
> >>>>>>>>>>Well, one possibility is to set the target like you were doing
> >>>>>>>>>>and disable ROMIO.
> >>>>>>>>>>Actually, you'll also need to disable Fortran 77. So something like:
> >>>>>>>>>>
> >>>>>>>>>>./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77
> >>>>>>>>>>
> >>>>>>>>>>might just do the trick.
> >>>>>>>>>>
> >>>>>>>>>>Brian
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>--
> >>>>>>>>>>Brian Barrett
> >>>>>>>>>>Open MPI developer
> >>>>>>>>>>http://www.open-mpi.org/
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>_______________________________________________
> >>>>>>>>>>devel mailing list
> >>>>>>>>>>de...@open-mpi.org
> >>>>>>>>>>http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller           email: kel...@hlrs.de
High Performance Computing         Tel: ++49 (0)711-685 5858
Center Stuttgart (HLRS)            Fax: ++49 (0)711-678 7626
Nobelstrasse 19, R. O0.030         http://www.hlrs.de/people/keller
70550 Stuttgart
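
For reference, the failure reported at the top of this message (frame #0 calls through orte_sds_base_module while it is NULL) follows a common pattern: a component interface forwards to a function-pointer table that has not been selected yet. The sketch below only illustrates that pattern and one possible guard. Every name in it (demo_module_t, demo_contact_universe, demo_select_module, the DEMO_* codes) is a hypothetical stand-in, not the actual ORTE code, and it does not claim to be how the problem was or should be fixed in Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical return codes, for illustration only. */
#define DEMO_SUCCESS 0
#define DEMO_ERR_NOT_INITIALIZED (-1)

/* Hypothetical stand-in for a component module: a table of function pointers. */
typedef struct {
    int (*contact_universe)(void);
} demo_module_t;

/* The selected module; stays NULL until a component has been selected. */
static const demo_module_t *demo_module = NULL;

/*
 * The unguarded form, analogous to frame #0 in the backtrace, would be
 *     return demo_module->contact_universe();
 * which dereferences NULL if no module was selected.
 * The guarded form below reports an error instead.
 */
static int demo_contact_universe(void)
{
    if (demo_module == NULL || demo_module->contact_universe == NULL) {
        fprintf(stderr, "demo: no module selected yet\n");
        return DEMO_ERR_NOT_INITIALIZED;
    }
    return demo_module->contact_universe();
}

/* Select a module; a module without the callback is a programming error. */
static void demo_select_module(const demo_module_t *module)
{
    assert(module != NULL && module->contact_universe != NULL);
    demo_module = module;
}

/* A trivial module implementation used to exercise the interface. */
static int demo_contact_ok(void)
{
    return DEMO_SUCCESS;
}

static const demo_module_t demo_ok_module = { demo_contact_ok };
```

Calling demo_contact_universe() before demo_select_module(&demo_ok_module) returns DEMO_ERR_NOT_INITIALIZED instead of crashing; after selection it forwards to the module's callback. A guard like this only turns the crash into an error code. It does not explain why the module pointer was never set in the 64-bit build.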