[OMPI devel] running top-level autogen.sh breaks romio in 1.6.3 tarball
Hello,

When using Open MPI from the 1.6.3 tarball, I have found that running the top-level autogen.sh breaks the romio component. Here are the steps to reproduce:

1) download openmpi-1.6.3.tar.bz2 from http://www.open-mpi.org/software/ompi/v1.6/
2) untar openmpi-1.6.3.tar.bz2
3) cd openmpi-1.6.3
4) ./autogen.sh
5) ./configure

In configure's output, the following error can be seen:

...output snipped...
*** Configuring ROMIO distribution
configure: OMPI configuring in ompi/mca/io/romio/romio
configure: running /bin/sh './configure' CFLAGS="-DNDEBUG -g -O2 -finline-functions -fno-strict-aliasing -pthread" CPPFLAGS=" -I/users/dshrader/tmp/openmpi-1.6.3/opal/mca/hwloc/hwloc132/hwloc/include -I/usr/include/infiniband -I/usr/include/infiniband" FFLAGS="" LDFLAGS=" " --enable-shared --disable-static --with-mpi=open_mpi --disable-aio --cache-file=/dev/null --srcdir=. --disable-option-checking
Configuring with args CFLAGS=-DNDEBUG -g -O2 -finline-functions -fno-strict-aliasing -pthread CPPFLAGS= -I/users/dshrader/tmp/openmpi-1.6.3/opal/mca/hwloc/hwloc132/hwloc/include -I/usr/include/infiniband -I/usr/include/infiniband FFLAGS= LDFLAGS= --enable-shared --disable-static --with-mpi=open_mpi --disable-aio --cache-file=/dev/null --srcdir=. --disable-option-checking
checking for Open MPI support files... in Open MPI source tree -- good
./configure: line 2805: PAC_PROG_MAKE: command not found
...output snipped...
./configure: line 7908: syntax error near unexpected token `newline'
./configure: line 7908: `PAC_FUNC_NEEDS_DECL(#include ,strdup)'
configure: /bin/sh './configure' *failed* for ompi/mca/io/romio/romio
configure: WARNING: ROMIO distribution did not configure successfully
checking if MCA component io:romio can compile... no
...remaining output snipped...

None of the MPI/IO components work, including ufs, if I continue with a 'make' and 'make install'.

Judging from the output from configure and a quick perusal of romio's configure script, it looks like some macros are not being correctly expanded in the creation of romio's configure script. For reference, here are the versions of my autotools:

m4: 1.4.16
autoconf: 2.69
automake: 1.11.5
libtool: 2.4.2

I have not yet submitted this as a Trac item on svn.open-mpi.org. I wasn't sure what to put in the "Version" field as 1.6.3 wasn't listed there and I don't know if this is an issue in the 1.6 branch.

Thank you very much for your time,
David

--
David Shrader
SICORP, Inc
1350 Central Ave Suite 104
Los Alamos, NM 87544
david.shra...@sicorp.com

LANL contact information:
LANL #: 505-664-0996
LANL email: dshra...@lanl.gov
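
A quick way to confirm the "macros not expanded" diagnosis is to search the generated script for leftover macro calls. The sketch below is only a heuristic: the function name find_unexpanded_macros is made up for illustration, and the PAC_ prefix is taken from the two errors in the configure output above.

```shell
#!/bin/sh
# List lines of a generated configure script that still contain macro calls
# with a given prefix. PAC_ is the prefix seen in the errors above.
# Returns 0 when leftovers are found and 1 when none are found.
find_unexpanded_macros() {
    script=$1
    prefix=${2:-PAC_}
    # Only match the prefix when no '#' precedes it on the line, so plain
    # shell comments that merely mention a macro name are skipped.
    grep -n "^[^#]*${prefix}" "$script"
}
```

For the tree in this report it would be called as: find_unexpanded_macros ompi/mca/io/romio/romio/configure. Any hit suggests that the m4 definitions for those macros were not visible when the script was generated.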
Re: [OMPI devel] running top-level autogen.sh breaks romio in 1.6.3 tarball
Hello,

Thank you for the reply! All of the autotools I am using have the same or higher versions than those specified at http://www.open-mpi.org/software/ompi/v1.6/. I referenced the specific versions at the end of my initial email.

After some digging on the svn branch and some help from Nathan Hjelm, I found and checked out ompi/mca/io/romio/romio/autogen.sh. If I put this file in the same location inside the tarball contents, romio is not broken by running the top-level autogen.sh. I am still using the same versions of autotools.

Thank you for the reminder on not needing to run autogen.sh on the tarball version. Unfortunately, we're doing some modifications to the romio Makefile.am files to add another MPI-IO type and want to test against Open MPI releases. Hence, we have to regenerate all the configure scripts and Makefile.in files by running autogen.sh.

Again, thank you all for your time!
David

On 10/31/2012 02:46 PM, Ralph Castain wrote:

We've seen this before - it's caused by using autotools that are too old. Please look at the HACKING file to see the required version levels.

BTW: you should not be running autogen.sh on a tarball version. You should only run configure.

--
David Shrader
SICORP, Inc
1350 Central Ave Suite 104
Los Alamos, NM 87544
david.shra...@sicorp.com

LANL contact information:
LANL #: 505-664-0996
LANL email: dshra...@lanl.gov

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
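
The workaround described in this reply (dropping the checked-out autogen.sh into the tarball tree) can be sketched as a small helper. This is an illustration under assumptions: the function name restore_romio_autogen is hypothetical, and the caller supplies the path to the file they checked out, since the thread does not give the exact svn location.

```shell
#!/bin/sh
# Copy a ROMIO autogen.sh obtained elsewhere (for example from the matching
# svn branch) into an unpacked tarball tree and make it executable.
# Nothing is downloaded here; both paths come from the caller.
restore_romio_autogen() {
    src=$1     # path to the autogen.sh that was checked out
    tree=$2    # top of the unpacked openmpi-1.6.3 tree
    dest="$tree/ompi/mca/io/romio/romio"
    [ -f "$src" ] || { echo "missing source file: $src" >&2; return 1; }
    [ -d "$dest" ] || { echo "not an Open MPI tree: $tree" >&2; return 1; }
    cp "$src" "$dest/autogen.sh" && chmod +x "$dest/autogen.sh"
}
```

After the copy, the top-level ./autogen.sh and ./configure would be rerun as in the original steps to reproduce.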
Re: [OMPI devel] [EXTERNAL] Re: running top-level autogen.sh breaks romio in 1.6.3 tarball
No problem! And thank you for the speedy reply. I'm glad it was a simple fix.

Thanks again,
David

On 11/01/2012 08:05 AM, Barrett, Brian W wrote:

David - Thanks for the bug report; the missing file will be in the tarball of the next release.

Brian

--
David Shrader
SICORP, Inc
1350 Central Ave Suite 104
Los Alamos, NM 87544
david.shra...@sicorp.com

LANL contact information:
LANL #: 505-664-0996
LANL email: dshra...@lanl.gov
[OMPI devel] seg fault when using yalla, XRC, and yalla
Hello,

I have been investigating using XRC on a cluster with a Mellanox interconnect. I have found that in a certain situation I get a seg fault. I am using 1.10.2 compiled with gcc 5.3.0, and the simplest configure line that I have found that still results in the seg fault is as follows:

$> ./configure --with-hcoll --with-mxm --prefix=...

I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space (/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault does not happen.

The seg fault happens even when using examples/hello_c.c, so here is an example of the seg fault using it:

$> mpicc hello_c.c -o hello_c.x
$> mpirun -n 1 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22819 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The seg fault happens no matter the number of ranks. I have tried the above command with '-mca pml_base_verbose', and it shows that I am using the yalla pml:

$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x
...output snipped...
[mu0001.localdomain:22825] select: component cm not selected / finalized
[mu0001.localdomain:22825] select: component ob1 not selected / finalized
[mu0001.localdomain:22825] select: component yalla selected
...output snipped...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22825 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Interestingly enough, if I tell mpirun what pml to use, the seg fault goes away. The following command does not get the seg fault:

$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x

Passing either ob1 or cm to '-mca pml' also works. So it seems that the seg fault comes about when the yalla pml is chosen by default, when mxm/hcoll is involved, and when XRC is in use. I'm not sure if mxm is to blame, however, as using '-mca pml cm -mca mtl mxm' with the XRC parameters doesn't throw the seg fault.

Other information...
OS: RHEL 6.7-based (TOSS)
OpenFabrics: RedHat provided
Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64

Config.log and 'ompi_info --all' are in the tarball ompi.tar.bz2 which is attached.

Is there something else I should be doing with the yalla pml when using XRC? Regardless, I hope reporting the seg fault is useful.

Thanks,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov

ompi.tar.bz2
Description: application/bzip
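
The comparison in this report (default pml selection versus an explicit '-mca pml' choice) can be scripted so each case is run the same way. A rough sketch, not from the thread: the function name run_xrc_matrix and the MPIRUN override variable are made up for illustration, while the mpirun options are the ones used above.

```shell
#!/bin/sh
# Run the same XRC test once with default pml selection and once per
# explicit pml, printing mpirun's exit status for each case.
# Set MPIRUN='echo mpirun' to preview the commands without running them.
XRC_QUEUES='X,4096,1024:X,12288,512:X,65536,512'

run_xrc_matrix() {
    prog=$1
    for pml in default yalla ob1 cm; do
        if [ "$pml" = default ]; then
            # No '-mca pml': Open MPI picks the pml on its own.
            ${MPIRUN:-mpirun} -n 1 -mca btl_openib_receive_queues "$XRC_QUEUES" "$prog"
        else
            ${MPIRUN:-mpirun} -n 1 -mca btl_openib_receive_queues "$XRC_QUEUES" -mca pml "$pml" "$prog"
        fi
        echo "pml=$pml exit=$?"
    done
}
```

Calling run_xrc_matrix ./hello_c.x on the affected system would be expected to show a nonzero exit status only for the default case, if the behaviour matches what is described above.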
Re: [OMPI devel] seg fault when using yalla, XRC, and yalla
Hello Alina,

Thank you for the information about how the pml components work. I knew that the other components were being opened and ultimately closed in favor of yalla, but I didn't realize that initial open would cause a persistent change in the ompi runtime.

Here's the information you requested about the ib network:

- MOFED version: We are using the Open Fabrics Software as bundled by RedHat, and my ib network folks say we're running something close to v1.5.4
- ibv_devinfo:

[dshrader@mu0001 examples]$ ibv_devinfo
hca_id: mlx4_0
        transport:          InfiniBand (0)
        fw_ver:             2.9.1000
        node_guid:          0025:90ff:ff16:78d8
        sys_image_guid:     0025:90ff:ff16:78db
        vendor_id:          0x02c9
        vendor_part_id:     26428
        hw_ver:             0xB0
        board_id:           SM_212101000
        phys_port_cnt:      1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         250
                        port_lid:       366
                        port_lmc:       0x00
                        link_layer:     InfiniBand

I still get the seg fault when specifying the hca:

$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include mlx4_0 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10045 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I don't know if this helps, but the first time I tried the command I mistyped the hca name. This got me a warning, but no seg fault:

$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include ml4_0 ./hello_c.x
--------------------------------------------------------------------------
WARNING: One or more nonexistent OpenFabrics devices/ports were specified:

  Host:                 mu0001
  MCA parameter:        mca_btl_if_include
  Nonexistent entities: ml4_0

These entities will be ignored. You can disable this warning by setting the btl_openib_warn_nonexistent_if MCA parameter to 0.
--------------------------------------------------------------------------
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)

So, telling the openib btl to use the actual hca didn't get the seg fault to go away, but giving it a dummy value did.

Thanks,
David

On 04/20/2016 08:13 AM, Alina Sklarevich wrote:

Hi David,

I was able to reproduce the issue you reported.

When the command line doesn't specify the components to use, ompi will try to load/open all the ones available (and close them in the end) and then choose the components according to their priority and whether or not they were opened successfully. This means that even if pml yalla was the one running, other components were opened and closed as well.

The parameter you are using, btl_openib_receive_queues, doesn't have an effect on pml yalla. It only affects the openib btl, which is used by pml ob1.

Using the verbosity of btl_base_verbose I see that when the segmentation fault happens, the code doesn't reach the phase of unloading the openib btl, so perhaps the problem originates there (since pml yalla was already unloaded).

Can you please try adding this mca parameter to your command line to specify the HCA you are using?

-mca btl_openib_if_include

It made the segv go away for me.

Can you please attach the output of ibv_devinfo and write the MOFED version you are using?

Thank you,
Alina.

On Wed, Apr 20, 2016 at 2:53 PM, Joshua Ladd wrote:

Hi, David

We are looking into your report.

Best,
Josh
Re: [OMPI devel] seg fault when using yalla, XRC, and yalla
Hey Nathan,

I thought only 1 pml could be loaded at a time, and the only pml that could use btl's was ob1. If that is the case, how can the openib btl run at the same time as cm and yalla?

Also, what is UD?

Thanks,
David

On 04/21/2016 09:25 AM, Nathan Hjelm wrote:

The openib btl should be able to run alongside cm/mxm or yalla. If I have time this weekend I will get on the mustang and see what the problem is.

The best answer is to change the openmpi-mca-params.conf in the install to have pml = ob1. I have seen little to no benefit with using MXM on mustang. In fact, the default configuration (which uses UD) gets terrible bandwidth.

-Nathan

On Thu, Apr 21, 2016 at 01:48:46PM +0300, Alina Sklarevich wrote:

David, thanks for the info you provided. I will try to dig in further to see what might be causing this issue. In the meantime, maybe Nathan can please comment about the openib btl behavior here?

Thanks,
Alina.
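
Nathan's suggestion of putting pml = ob1 into the install's openmpi-mca-params.conf could be applied with a helper along these lines. This is a sketch under assumptions: the function name set_default_pml is hypothetical, and the file is assumed to live at <prefix>/etc/openmpi-mca-params.conf, which is the usual location for a default install but is not stated in the thread.

```shell
#!/bin/sh
# Append "pml = <name>" to <prefix>/etc/openmpi-mca-params.conf unless a
# pml line is already present. The second argument defaults to ob1.
set_default_pml() {
    prefix=$1
    pml=${2:-ob1}
    conf="$prefix/etc/openmpi-mca-params.conf"
    [ -d "$prefix/etc" ] || { echo "no etc directory under $prefix" >&2; return 1; }
    # Do not touch an existing pml setting; parameters such as
    # pml_base_verbose do not match this pattern.
    if [ -f "$conf" ] && grep -q '^[[:space:]]*pml[[:space:]]*=' "$conf"; then
        echo "pml already set in $conf" >&2
        return 0
    fi
    printf 'pml = %s\n' "$pml" >> "$conf"
}
```

With that line in place, a plain mpirun with no '-mca pml' argument should behave like the explicit '-mca pml ob1' runs reported earlier in the thread, though that has not been verified here.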