Okay, I tracked this silliness down. On odin, my platform file builds both shared and static. It appears that mpicc in that situation defaults to picking the static build, and so I wind up with a static executable. This behavior was unexpected - I thought we would default to dynamic, but support static if that flag was given to mpicc. Call me surprised, but at least now I know.
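For anyone who wants to check which kind of executable mpicc actually produced, here is a minimal sketch in Python (not part of Open MPI). It assumes the binary is an ELF file and treats "no PT_INTERP program header" as "static", which is only a heuristic and only meaningful for executables, not shared libraries. The names has_interpreter and is_static are made up for illustration.

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def has_interpreter(data: bytes) -> bool:
    """Return True if the ELF image asks for a dynamic loader (PT_INTERP)."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    is_64 = data[4] == 2                   # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    if is_64:
        # ELF64 header: e_phoff at 32, e_phentsize at 54, e_phnum at 56
        (phoff,) = struct.unpack_from(endian + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        # ELF32 header: e_phoff at 28, e_phentsize at 42, e_phnum at 44
        (phoff,) = struct.unpack_from(endian + "I", data, 28)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 42)
    for i in range(phnum):
        # p_type is the first 4-byte field of every program header
        (p_type,) = struct.unpack_from(endian + "I", data, phoff + i * phentsize)
        if p_type == PT_INTERP:
            return True
    return False


def is_static(path: str) -> bool:
    """Heuristic: an executable without PT_INTERP is statically linked."""
    with open(path, "rb") as f:
        return not has_interpreter(f.read())
```

Calling is_static("./hello") on the test program would then say which case applies. Running ldd ./hello gives the same answer from the shell: it reports "not a dynamic executable" for a static binary.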
I found that the Lustre headers and libs are indeed on the system, and so your analysis of the problem is correct. When I build with nothing on the configure line, we only build shared and so the executable is dynamic - and the problem goes away.

HTH
Ralph

On Oct 30, 2012, at 12:06 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Sure - I can do that.
>
> On Oct 30, 2012, at 11:29 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>
>> glad to hear that. However, since we are also having the problem with
>> the lustre-fs module for static builds, I think it would still make
>> sense to disable fs/lustre/ for 1.7.0
>>
>> Edgar
>>
>> On 10/30/2012 12:34 PM, Ralph Castain wrote:
>>> I hate odin :-(
>>>
>>> FWIW: it all works fine today, no matter how I configure it. No earthly
>>> idea what happened.
>>>
>>> Ignore these droids....
>>>
>>> On Oct 30, 2012, at 7:28 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>>>
>>>> ok, so a couple of things.
>>>>
>>>> I still think it is the same issue that I observed 1-2 days ago. Could
>>>> you try to remove the fs/lustre component from your compilation, e.g. by
>>>> adding an .ompi_ignore file into that directory, and see whether this
>>>> fixes the issue?
>>>>
>>>> I tried on my machine (no lustre, no ib) compilations with
>>>> --disable-mpi-io *or* --disable-io-romio, and both worked correctly and
>>>> I could run things. Note that the flags are truly different now,
>>>> since the second flag is equivalent to --enable-mca-no-build=io:romio.
>>>> The first flag disables the io, fcoll, fs and sharedfp frameworks
>>>> (prior to ompio they had basically the same effect).
>>>>
>>>> In your particular case this means that you disabled romio, but the
>>>> entire ompio stack is still compiled, and the error must come from that
>>>> portion. If my suspicion is correct, it is still liblustre
>>>> messing around with the malloc hooks, and that causes the stack frame to
>>>> be completely broken.
>>>> I thought I fixed that since we did not have the
>>>> issue on trunk, but we did observe that in the 1.7 branch 1-2 days back
>>>> as well, and I was looking into that.
>>>>
>>>> That being said, there is another malloc-hooks issue that makes me a bit
>>>> nervous. The compilation of the otf stuff produced a ton of warnings on
>>>> my machine with gcc4.6.2 also with respect to the _malloc_hooks and
>>>> _realloc_hooks. Not sure whether this contributed to the problem as
>>>> well, just thought I'd bring it up since we seem to have a corrupted stack
>>>> frame problem.
>>>>
>>>> Thanks
>>>> Edgar
>>>>
>>>> On 10/30/2012 8:29 AM, Edgar Gabriel wrote:
>>>>> ok, I'll look into this. I noticed a problem with static builds on
>>>>> lustre file systems recently, and I was wondering whether it's the same
>>>>> issue or not. But I'll check what's going on.
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>>
>>>>> On 10/30/2012 7:22 AM, Ralph Castain wrote:
>>>>>> No to Lustre, and I didn't build static
>>>>>>
>>>>>> I'm not sure what, if any, parallel file system might be present. In the
>>>>>> case that works, I just built with no configure args other than prefix.
>>>>>> ompi_info shows both romio and ompio built, but nothing more about what
>>>>>> support they built internally.
>>>>>>
>>>>>> On Oct 30, 2012, at 4:14 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> just out of curiosity: is there a lustre file system on the machine and is
>>>>>>> this a static build?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Edgar
>>>>>>>
>>>>>>> On 10/29/2012 9:17 PM, Ralph Castain wrote:
>>>>>>>> Hmmm...I added that directory and tried this on odin (which is an
>>>>>>>> IB-based machine). Any MPI proc segfaults:
>>>>>>>>
>>>>>>>> Core was generated by `./hello'.
>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>>>>>> 574 src/inode.c: No such file or directory.
>>>>>>>> in src/inode.c
>>>>>>>> (gdb) where
>>>>>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>>>>>> #1 0x00002aaaabd3f3e9 in _sysio_path_walk (parent=0x0, nd=0x7fffffffd8e0) at src/namei.c:216
>>>>>>>> #2 0x00002aaaabd3faad in _sysio_namei (parent=0x0, path=<value optimized out>, flags=0, intnt=0x7fffffffd950, pnop=0x7fffffffd970) at src/namei.c:505
>>>>>>>> #3 0x00002aaaabd3fd98 in open (path=0x2aaaac24280f "/sys/devices/system/node", flags=<value optimized out>) at src/open.c:179
>>>>>>>> #4 0x00002aaaabd43d5b in opendir (name=0x2aaaac24280f "/sys/devices/system/node") at src/stddir.c:60
>>>>>>>> #5 0x00002aaaac241825 in numa_max_node () from /usr/lib64/libnuma.so.1
>>>>>>>> #6 0x00002aaaac241d13 in numa_init () from /usr/lib64/libnuma.so.1
>>>>>>>> #7 0x00002aaaaaab845b in call_init () from /lib64/ld-linux-x86-64.so.2
>>>>>>>> #8 0x00002aaaaaab8565 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>>>>>>>> #9 0x00002aaaaaaabaaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>>>>>>>> #10 0x0000000000000001 in ?? ()
>>>>>>>> #11 0x00007fffffffe03c in ?? ()
>>>>>>>> #12 0x0000000000000000 in ?? ()
>>>>>>>>
>>>>>>>> I got the same thing whether I excluded openib or not. I then ran on
>>>>>>>> my Linux cluster, which doesn't have IB at all - and it ran fine. Also
>>>>>>>> runs clean on the Mac. However, in both those cases, I had left IO
>>>>>>>> romio enabled.
>>>>>>>>
>>>>>>>> Now on odin, I always disable-io-romio. So I tried deliberately
>>>>>>>> enabling it, and everything works. So this appears to be something
>>>>>>>> that the IO work has broken.
>>>>>>>>
>>>>>>>> Edgar: can you please fix --disable-io-romio?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On Oct 29, 2012, at 11:55 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>>>>>>>>
>>>>>>>>> I'm sorry to add one more thing to the list, but beyond this file, it
>>>>>>>>> looks like the entire ompi/mca/common/verbs/ directory is also
>>>>>>>>> missing in the 1.7 branch, but is required to compile the bcol
>>>>>>>>> framework. It is there in the trunk, but missing in the 1.7 branch...
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Edgar
>>>>>>>>>
>>>>>>>>> On 10/26/2012 5:31 PM, Ralph Castain wrote:
>>>>>>>>>> Okay, I'll fix for tonight's tarball.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Oct 26, 2012, at 3:28 PM, "Shamis, Pavel" <sham...@ornl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> There is a bug in the Makefile. The file exists in svn, but it is not
>>>>>>>>>>> listed in the Makefile.am. As a result, it wasn't pulled into the
>>>>>>>>>>> tarball.
>>>>>>>>>>>
>>>>>>>>>>> Pavel (Pasha) Shamis
>>>>>>>>>>> ---
>>>>>>>>>>> Computer Science Research Group
>>>>>>>>>>> Computer Science and Math Division
>>>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>>>>
>>>>>>>>>>> On Oct 26, 2012, at 2:33 PM, Edgar Gabriel wrote:
>>>>>>>>>>>
>>>>>>>>>>> we have trouble compiling the 1.7 series on a machine in Dresden.
>>>>>>>>>>> Specifically, we receive an error message when compiling the
>>>>>>>>>>> bcol/iboffload component (other infiniband components compile fine).
>>>>>>>>>>>
>>>>>>>>>>> Any idea/suggestions what we might be doing wrong or what to look
>>>>>>>>>>> for?
>>>>>>>>>>>
>>>>>>>>>>> make[2]: Entering directory `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>>>>>> CC bcol_iboffload_module.lo
>>>>>>>>>>> CC bcol_iboffload_mca.lo
>>>>>>>>>>> CC bcol_iboffload_endpoint.lo
>>>>>>>>>>> CC bcol_iboffload_frag.lo
>>>>>>>>>>> In file included from bcol_iboffload_frag.c:16:0:
>>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>>>>>> compilation terminated.
>>>>>>>>>>> make[2]: *** [bcol_iboffload_frag.lo] Error 1
>>>>>>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>>>>>>> In file included from bcol_iboffload_mca.c:18:0:
>>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>>>>>> compilation terminated.
>>>>>>>>>>> make[2]: *** [bcol_iboffload_mca.lo] Error 1
>>>>>>>>>>> In file included from bcol_iboffload_endpoint.c:23:0:
>>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>>>>>> compilation terminated.
>>>>>>>>>>> make[2]: *** [bcol_iboffload_endpoint.lo] Error 1
>>>>>>>>>>> In file included from bcol_iboffload_module.c:39:0:
>>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such file or directory
>>>>>>>>>>> compilation terminated.
>>>>>>>>>>> make[2]: *** [bcol_iboffload_module.lo] Error 1
>>>>>>>>>>> make[2]: Leaving directory `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>> make[1]: Leaving directory `/home/h2/gabriel/openmpi-1.7rc4/ompi'
>>>>>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Edgar
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Edgar Gabriel
>>>>>>>>>>> Associate Professor
>>>>>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>>>>>> Department of Computer Science University of Houston
>>>>>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel