Re: [OMPI users] No log_num_mtt in Ubuntu 14.04
Hi, what ofed version do you use? (ofed_info -s)

On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota wrote:
> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the
> following warning upon execution, which did not appear before the upgrade.
>
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory. This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> Everything that I could find on google suggests to change log_num_mtt, but
> I cannot do this for the following reasons:
> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
> 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf
>    doesn't seem to change anything
> 3. I am not sure how I can restart the driver because there is no
>    "/etc/init.d/openibd" file (I've rebooted the system but it didn't do
>    anything to create log_num_mtt)
>
> [Template information]
> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install
>    infiniband-diags ibutils ibverbs-utils libmlx4-dev"
> 2. OS is Ubuntu 14.04 LTS
> 3. Subnet manager is from the Ubuntu distribution using "apt-get install opensm"
> 4. Output of ibv_devinfo is:
>    hca_id: mlx4_0
>    transport: InfiniBand (0)
>    fw_ver: 2.10.600
>    node_guid: 0002:c903:003d:52b0
>    sys_image_guid: 0002:c903:003d:52b3
>    vendor_id: 0x02c9
>    vendor_part_id: 4099
>    hw_ver: 0x0
>    board_id: MT_1100120019
>    phys_port_cnt: 1
>    port: 1
>    state: PORT_ACTIVE (4)
>    max_mtu: 4096 (5)
>    active_mtu: 4096 (5)
>    sm_lid: 1
>    port_lid: 1
>    port_lmc: 0x00
>    link_layer: InfiniBand
> 5. Output of ifconfig for IB is
>    ib0  Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>         inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>         inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>         RX packets:26 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:34 errors:0 dropped:16 overruns:0 carrier:0
>         collisions:0 txqueuelen:256
>         RX bytes:5843 (5.8 KB)  TX bytes:4324 (4.3 KB)
> 6. ulimit -l is "unlimited"
>
> Thanks,
> Rio
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25048.php
>
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi, I just did compile without Cuda, and the result is the same. No output, exits with code 65. [mboisson@helios-login1 examples]$ ldd ring_c linux-vdso.so.1 => (0x7fff3ab31000) libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x7fab9ec2a000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00381c00) libc.so.6 => /lib64/libc.so.6 (0x00381bc0) librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00381c80) libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00381c40) libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x7fab9e932000) libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x00391820) libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x003917e0) libz.so.1 => /lib64/libz.so.1 (0x00381cc0) libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00382100) libssl.so.10 => /usr/lib64/libssl.so.10 (0x00382300) libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x7fab9e64a000) libdl.so.2 => /lib64/libdl.so.2 (0x00381b80) librt.so.1 => /lib64/librt.so.1 (0x0035b360) libm.so.6 => /lib64/libm.so.6 (0x003c25a0) libutil.so.1 => /lib64/libutil.so.1 (0x003f7100) /lib64/ld-linux-x86-64.so.2 (0x00381b40) libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003917a0) libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x7fab9e433000) libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00382240) libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00382140) libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00381e40) libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00382180) libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x003821c0) libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00382200) libresolv.so.2 => /lib64/libresolv.so.2 (0x00381dc0) libselinux.so.1 => /lib64/libselinux.so.1 (0x00381d00) [mboisson@helios-login1 examples]$ mpiexec ring_c [mboisson@helios-login1 examples]$ echo $? 65 Maxime Le 2014-08-16 06:22, Jeff Squyres (jsquyres) a écrit : Just out of curiosity, I saw that one of the segv stack traces involved the cuda stack. Can you try a build without CUDA and see if that resolves the problem? On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault wrote: Hi Jeff, Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit : On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote: Correct. Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it ? Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts). If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation. mpirun -np 4 hostname works fine : [mboisson@helios-login1 ~]$ which mpirun /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $? helios-login1 helios-login1 helios-login1 helios-login1 0 Try running with: mpirun -np 4 --mca plm_base_verbose 10 hostname This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging. 
Also, just to make sure, you have ensured that you're compiling everything with a single compiler toolchain, and the support libraries from that specific compiler toolchain are available on any server on which you're running (including the head node and compute nodes), right?

Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6, with the same results). Almost all software (compilers, toolchain, etc.) is installed on Lustre, from source, and is the same on both the login (head) node and the compute nodes. The few differences between the head node and the compute nodes:
1) Compute nodes run from RAMFS - the login node is installed on disk
2) Compute nodes and the login node have different hardware configurations (compute nodes have GPUs, the head node does not).
3) The login node has MORE CentOS 6 packages than the compute nodes (such as the -devel packages, some fonts/X11 libraries, etc.), but all the packages that are on the compute nodes are also on the login node.

And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the Open MPI installation that you expect it to point to. E.g., if you "ldd ring_c", it shows
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Maxime, Can you run with: mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples//ring_c On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault < maxime.boissonnea...@calculquebec.ca> wrote: > Hi, > I just did compile without Cuda, and the result is the same. No output, > exits with code 65. > > [mboisson@helios-login1 examples]$ ldd ring_c > linux-vdso.so.1 => (0x7fff3ab31000) > libmpi.so.1 => /software-gpu/mpi/openmpi/1.8. > 2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x7fab9ec2a000) > libpthread.so.0 => /lib64/libpthread.so.0 (0x00381c00) > libc.so.6 => /lib64/libc.so.6 (0x00381bc0) > librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00381c80) > libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00381c40) > libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8. > 2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x7fab9e932000) > libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x00391820) > libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x003917e0) > libz.so.1 => /lib64/libz.so.1 (0x00381cc0) > libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00382100) > libssl.so.10 => /usr/lib64/libssl.so.10 (0x00382300) > libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8. > 2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x7fab9e64a000) > libdl.so.2 => /lib64/libdl.so.2 (0x00381b80) > librt.so.1 => /lib64/librt.so.1 (0x0035b360) > libm.so.6 => /lib64/libm.so.6 (0x003c25a0) > libutil.so.1 => /lib64/libutil.so.1 (0x003f7100) > /lib64/ld-linux-x86-64.so.2 (0x00381b40) > libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003917a0) > libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 > (0x7fab9e433000) > libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 > (0x00382240) > libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00382140) > libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00381e40) > libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00382180) > libkrb5support.so.0 => /lib64/libkrb5support.so.0 > (0x003821c0) > libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00382200) > libresolv.so.2 => /lib64/libresolv.so.2 (0x00381dc0) > libselinux.so.1 => /lib64/libselinux.so.1 (0x00381d00) > > [mboisson@helios-login1 examples]$ mpiexec ring_c > [mboisson@helios-login1 examples]$ echo $? > 65 > > > Maxime > > > Le 2014-08-16 06:22, Jeff Squyres (jsquyres) a écrit : > > Just out of curiosity, I saw that one of the segv stack traces involved >> the cuda stack. >> >> Can you try a build without CUDA and see if that resolves the problem? >> >> >> >> On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault > calculquebec.ca> wrote: >> >> Hi Jeff, >>> >>> Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit : >>> On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault < maxime.boissonnea...@calculquebec.ca> wrote: Correct. > > Can it be because torque (pbs_mom) is not running on the head node and > mpiexec attempts to contact it ? > Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts). If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation. >>> mpirun -np 4 hostname works fine : >>> [mboisson@helios-login1 ~]$ which mpirun >>> /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun >>> [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $? 
>>> helios-login1 >>> helios-login1 >>> helios-login1 >>> helios-login1 >>> 0 >>> >>> Try running with: mpirun -np 4 --mca plm_base_verbose 10 hostname This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging. Also, just to make sure, you have ensured that you're compiling everything with a single compiler toolchain, and the support libraries from that specific compiler toolchain are available on any server on which you're running (to include the head node and compute nodes), right? >>> Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 >>> with the same results). Almost every software (that is compiler, toolchain, >>> etc.) is installed on lustre, from sources and is the same on both the >>> login (head) node and the compute. >>> >>> The few differences between th
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Here it is.

On 2014-08-18 12:30, Joshua Ladd wrote:
> mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component isolated
[helios-login1:27853] mca: base: components_register: component isolated has no register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component isolated
[helios-login1:27853] mca: base: components_open: component isolated open function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:( plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0
[helios-login1:27853] mca:base:select:( plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10
[helios-login1:27853] mca:base:select:( plm) Querying component [tm]
[helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[helios-login1:27853] mca:base:select:( plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65

Maxime
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: > Here it is > Le 2014-08-18 12:30, Joshua Ladd a écrit : >> mpirun -np 4 --mca plm_base_verbose 10 > [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 > ring_c > [helios-login1:27853] mca: base: components_register: registering plm > components > [helios-login1:27853] mca: base: components_register: found loaded component > isolated > [helios-login1:27853] mca: base: components_register: component isolated has > no register or open function > [helios-login1:27853] mca: base: components_register: found loaded component > rsh > [helios-login1:27853] mca: base: components_register: component rsh register > function successful > [helios-login1:27853] mca: base: components_register: found loaded component > tm > [helios-login1:27853] mca: base: components_register: component tm register > function successful > [helios-login1:27853] mca: base: components_open: opening plm components > [helios-login1:27853] mca: base: components_open: found loaded component > isolated > [helios-login1:27853] mca: base: components_open: component isolated open > function successful > [helios-login1:27853] mca: base: components_open: found loaded component rsh > [helios-login1:27853] mca: base: components_open: component rsh open function > successful > [helios-login1:27853] mca: base: components_open: found loaded component tm > [helios-login1:27853] mca: base: components_open: component tm open function > successful > [helios-login1:27853] mca:base:select: Auto-selecting plm components > [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] > [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] > set priority to 0 > [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] > [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [helios-login1:27853] mca:base:select:( plm) Querying component [tm] > [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query > failed to return a module > [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] > [helios-login1:27853] mca: base: close: component isolated closed > [helios-login1:27853] mca: base: close: unloading component isolated > [helios-login1:27853] mca: base: close: component tm closed > [helios-login1:27853] mca: base: close: unloading component tm > [helios-login1:27853] mca: base: close: component rsh closed > [helios-login1:27853] mca: base: close: unloading component rsh > [mboisson@helios-login1 examples]$ echo $? > 65 > > > Maxime > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25052.php
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 65 Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25053.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique output_ringc_verbose.txt.gz Description: GNU Zip compressed data
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: > This is all one one node indeed. > > Attached is the output of > mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca > state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee > output_ringc_verbose.txt > > > Maxime > > Le 2014-08-18 12:48, Ralph Castain a écrit : >> This is all on one node, yes? >> >> Try adding the following: >> >> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 >> >> Lot of garbage, but should tell us what is going on. >> >> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault >> wrote: >> >>> Here it is >>> Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 >>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 >>> ring_c >>> [helios-login1:27853] mca: base: components_register: registering plm >>> components >>> [helios-login1:27853] mca: base: components_register: found loaded >>> component isolated >>> [helios-login1:27853] mca: base: components_register: component isolated >>> has no register or open function >>> [helios-login1:27853] mca: base: components_register: found loaded >>> component rsh >>> [helios-login1:27853] mca: base: components_register: component rsh >>> register function successful >>> [helios-login1:27853] mca: base: components_register: found loaded >>> component tm >>> [helios-login1:27853] mca: base: components_register: component tm register >>> function successful >>> [helios-login1:27853] mca: base: components_open: opening plm components >>> [helios-login1:27853] mca: base: components_open: found loaded component >>> isolated >>> [helios-login1:27853] mca: base: components_open: component isolated open >>> function successful >>> [helios-login1:27853] mca: base: components_open: found loaded component rsh >>> [helios-login1:27853] mca: base: components_open: component rsh open >>> function successful >>> [helios-login1:27853] mca: base: components_open: found loaded component tm >>> [helios-login1:27853] mca: base: components_open: component tm open >>> function successful >>> [helios-login1:27853] mca:base:select: Auto-selecting plm components >>> [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] >>> [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] >>> set priority to 0 >>> [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] >>> [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set >>> priority to 10 >>> [helios-login1:27853] mca:base:select:( plm) Querying component [tm] >>> [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. >>> Query failed to return a module >>> [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] >>> [helios-login1:27853] mca: base: close: component isolated closed >>> [helios-login1:27853] mca: base: close: unloading component isolated >>> [helios-login1:27853] mca: base: close: component tm closed >>> [helios-login1:27853] mca: base: close: unloading component tm >>> [helios-login1:27853] mca: base: close: component rsh closed >>> [helios-login1:27853] mca: base: close: unloading component rsh >>> [mboisson@helios-login1 examples]$ echo $? 
>>> 65 >>> >>> >>> Maxime >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2014/08/25052.php >> ___ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2014/08/25053.php > > > -- > - > Maxime Boissonneault > Analyste de calcul - Calcul Québec, Université Laval > Ph. D. en physique > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25054.php
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Here it is. Maxime Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 
65 Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25053.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25054.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25055.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique output_ringc_verbose2.txt.gz Description: GNU Zip compressed data
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111) [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0] The apps are trying to connect back to mpirun using the following addresses: tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 The initial attempt is here [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution. On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault wrote: > Here it is. > > Maxime > > Le 2014-08-18 12:59, Ralph Castain a écrit : >> Ah...now that showed the problem. To pinpoint it better, please add >> >> -mca oob_base_verbose 10 >> >> and I think we'll have it >> >> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault >> wrote: >> >>> This is all one one node indeed. >>> >>> Attached is the output of >>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca >>> state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee >>> output_ringc_verbose.txt >>> >>> >>> Maxime >>> >>> Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. 
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: > Here it is > Le 2014-08-18 12:30, Joshua Ladd a écrit : >> mpirun -np 4 --mca plm_base_verbose 10 > [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 > ring_c > [helios-login1:27853] mca: base: components_register: registering plm > components > [helios-login1:27853] mca: base: components_register: found loaded > component isolated > [helios-login1:27853] mca: base: components_register: component isolated > has no register or open function > [helios-login1:27853] mca: base: components_register: found loaded > component rsh > [helios-login1:27853] mca: base: components_register: component rsh > register function successful > [helios-login1:27853] mca: base: components_register: found loaded > component tm > [helios-login1:27853] mca: base: components_register: component tm > register function successful > [helios-login1:27853] mca: base: components_open: opening plm components > [helios-login1:27853] mca: base: components_open: found loaded component > isolated > [helios-login1:27853] mca: base: components_open: component isolated open > function successful > [helios-login1:27853] mca: base: components_open: found loaded component > rsh > [helios-login1:27853] mca: base: components_open: component rsh open > function successful > [helios-login1:27853] mca: base: components_open: found loaded component > tm > [helios-login1:27853] mca: base: components_open: component tm open > function successful > [helios-login1:27853] mca:base:select: Auto-selecting plm components > [helios-login1:27853] mca:base:select:( plm) Querying component > [isolated] > [helios-login1:27853] mca:base:select:( plm) Query of component > [isolated] set priority to 0 > [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] > [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] > set priority to 10 > [helios-login1:27853] mca:base:select:( plm) Querying component [tm] > [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. > Query failed to return a module > [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] > [helios-login1:27853] mca: base: close: component isolated closed > [helios-login1:27853] mca: base: close: unloading component isolated > [helios-login1:27853] mca: base: close: component tm closed > [helios-login1:27853] mca: base: close: unloading component tm > [helios-login1:27853] mca: base: close: component rsh closed > [helios-login1:27853] mca: base: close: unloading component rsh > [mboisson@helios-login1 examples]$ echo $? > 65 > > > Maxime > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Lin
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Indeed, that makes sense now. Why isn't OpenMPI attempting to connect with the local loop for same node ? This used to work with 1.6.5. Maxime Le 2014-08-18 13:11, Ralph Castain a écrit : Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111) [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0] The apps are trying to connect back to mpirun using the following addresses: tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 The initial attempt is here [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution. On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault wrote: Here it is. Maxime Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. 
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 65 Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/2
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Yeah, there are some issues with the internal connection logic that need to get fixed. We haven't had many cases where it's been an issue, but a couple like this have cropped up - enough that I need to set aside some time to fix it. My apologies for the problem. On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault wrote: > Indeed, that makes sense now. > > Why isn't OpenMPI attempting to connect with the local loop for same node ? > This used to work with 1.6.5. > > Maxime > > Le 2014-08-18 13:11, Ralph Castain a écrit : >> Yep, that pinpointed the problem: >> >> [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING >> [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer >> [[63019,0],0] on socket 11 >> [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: >> connection failed: Connection refused (111) >> [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 >> state CONNECTING >> [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer >> [[63019,0],0] >> >> >> The apps are trying to connect back to mpirun using the following addresses: >> >> tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 >> >> The initial attempt is here >> >> [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to >> connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries >> >> I know there is a failover bug in the 1.8 series, and so if that connection >> got rejected the proc would abort. Should we be using a different network? >> If so, telling us via the oob_tcp_if_include param would be the solution. >> >> >> On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault >> wrote: >> >>> Here it is. >>> >>> Maxime >>> >>> Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: > This is all one one node indeed. > > Attached is the output of > mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca > state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee > output_ringc_verbose.txt > > > Maxime > > Le 2014-08-18 12:48, Ralph Castain a écrit : >> This is all on one node, yes? >> >> Try adding the following: >> >> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca >> errmgr_base_verbose 5 >> >> Lot of garbage, but should tell us what is going on. 
>> >> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault >> wrote: >> >>> Here it is >>> Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 >>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose >>> 10 ring_c >>> [helios-login1:27853] mca: base: components_register: registering plm >>> components >>> [helios-login1:27853] mca: base: components_register: found loaded >>> component isolated >>> [helios-login1:27853] mca: base: components_register: component >>> isolated has no register or open function >>> [helios-login1:27853] mca: base: components_register: found loaded >>> component rsh >>> [helios-login1:27853] mca: base: components_register: component rsh >>> register function successful >>> [helios-login1:27853] mca: base: components_register: found loaded >>> component tm >>> [helios-login1:27853] mca: base: components_register: component tm >>> register function successful >>> [helios-login1:27853] mca: base: components_open: opening plm components >>> [helios-login1:27853] mca: base: components_open: found loaded >>> component isolated >>> [helios-login1:27853] mca: base: components_open: component isolated >>> open function successful >>> [helios-login1:27853] mca: base: components_open: found loaded >>> component rsh >>> [helios-login1:27853] mca: base: components_open: component rsh open >>> function successful >>> [helios-login1:27853] mca: base: components_open: found loaded >>> component tm >>> [helios-login1:27853] mca: base: components_open: component tm open >>> function successful >>> [helios-login1:27853] mca:base:select: Auto-selecting plm components >>> [helios-login1:27853] mca:base:select:( plm) Querying component >>> [isolated] >>> [helios-login1:27853] mca:base:select:( plm) Query of component >>> [isolated] set priority to 0 >>> [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] >>> [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] >>> set priority to 10 >>> [helios-login1:27853] mca:base:select:( plm) Querying component [tm] >>> [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. >>> Query fail
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Ok, I confirm that with mpiexec -mca oob_tcp_if_include lo ring_c it works. It also works with mpiexec -mca oob_tcp_if_include ib0 ring_c We have 4 interfaces on this node. - lo, the local loop - ib0, infiniband - eth2, a management network - eth3, the public network It seems that mpiexec attempts to use the two addresses that do not work (eth2, eth3) and does not use the two that do work (ib0 and lo). However, according to the logs sent previously, it does see ib0 (despite not seeing lo), but does not attempt to use it. On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I am unsure why it does work on the compute nodes and not on the login nodes. The only difference is the presence of a public interface on the login node. Maxime Le 2014-08-18 13:37, Ralph Castain a écrit : Yeah, there are some issues with the internal connection logic that need to get fixed. We haven't had many cases where it's been an issue, but a couple like this have cropped up - enough that I need to set aside some time to fix it. My apologies for the problem. On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault wrote: Indeed, that makes sense now. Why isn't OpenMPI attempting to connect with the local loop for same node ? This used to work with 1.6.5. Maxime Le 2014-08-18 13:11, Ralph Castain a écrit : Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111) [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0] The apps are trying to connect back to mpirun using the following addresses: tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 The initial attempt is here [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution. On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault wrote: Here it is. Maxime Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. 
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set prio
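As an aside for anyone hitting the same symptom: the workaround above does not have to be typed on every command line. Assuming a standard Open MPI installation, the same MCA parameter can be set persistently, for example by exporting OMPI_MCA_oob_tcp_if_include=ib0,lo in the environment, or by adding the line "oob_tcp_if_include = ib0,lo" to $HOME/.openmpi/mca-params.conf (or to etc/openmpi-mca-params.conf under the installation prefix). The interface names ib0 and lo are the ones reported in this thread; adjust them to whatever interfaces actually work on a given node.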
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Indeed odd - I'm afraid that this is just the kind of case that has been causing problems. I think I've figured out the problem, but have been buried with my "day job" for the last few weeks and unable to pursue it. On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault wrote: > Ok, I confirm that with > mpiexec -mca oob_tcp_if_include lo ring_c > > it works. > > It also works with > mpiexec -mca oob_tcp_if_include ib0 ring_c > > We have 4 interfaces on this node. > - lo, the local loop > - ib0, infiniband > - eth2, a management network > - eth3, the public network > > It seems that mpiexec attempts to use the two addresses that do not work > (eth2, eth3) and does not use the two that do work (ib0 and lo). However, > according to the logs sent previously, it does see ib0 (despite not seeing > lo), but does not attempt to use it. > > > On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I > am unsure why it does work on the compute nodes and not on the login nodes. > The only difference is the presence of a public interface on the login node. > > Maxime > > > Le 2014-08-18 13:37, Ralph Castain a écrit : >> Yeah, there are some issues with the internal connection logic that need to >> get fixed. We haven't had many cases where it's been an issue, but a couple >> like this have cropped up - enough that I need to set aside some time to fix >> it. >> >> My apologies for the problem. >> >> >> On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault >> wrote: >> >>> Indeed, that makes sense now. >>> >>> Why isn't OpenMPI attempting to connect with the local loop for same node ? >>> This used to work with 1.6.5. >>> >>> Maxime >>> >>> Le 2014-08-18 13:11, Ralph Castain a écrit : Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111) [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0] The apps are trying to connect back to mpirun using the following addresses: tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 The initial attempt is here [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution. On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault wrote: > Here it is. > > Maxime > > Le 2014-08-18 12:59, Ralph Castain a écrit : >> Ah...now that showed the problem. To pinpoint it better, please add >> >> -mca oob_base_verbose 10 >> >> and I think we'll have it >> >> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault >> wrote: >> >>> This is all one one node indeed. >>> >>> Attached is the output of >>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca >>> state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee >>> output_ringc_verbose.txt >>> >>> >>> Maxime >>> >>> Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? 
Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: > Here it is > Le 2014-08-18 12:30, Joshua Ladd a écrit : >> mpirun -np 4 --mca plm_base_verbose 10 > [mboisson@helios-login1 examples]$ mpirun -np 4 --mca > plm_base_verbose 10 ring_c > [helios-login1:27853] mca: base: components_register: registering plm > components > [helios-login1:27853] mca: base: components_register: found loaded > component isolated > [helios-login1:27853] mca: base: components_register: component > isolated has no register or open function > [helios-login1:27853] mca: base: components_register: found loaded > component rsh > [helios-login1:27853] mca: base: components_register: component rsh > register function successful > [helios-login1:27853] mca: base: components_register: found loaded > component tm > [helios-login1:27853] mca: base: components_r
Re: [OMPI users] No log_num_mtt in Ubuntu 14.04
I get "ofed_info: command not found". Note that I don't install the entire OFED, but do a component wise installation by doing "apt-get install infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and utilities. > Hi, > what ofed version do you use? > (ofed_info -s) > > > On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota wrote: > I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the > following warning upon execution, which did not appear before the upgrade. > > WARNING: It appears that your OpenFabrics subsystem is configured to only > allow registering part of your physical memory. This can cause MPI jobs to > run with erratic performance, hang, and/or crash. > > Everything that I could find on google suggests to change log_num_mtt, but I > cannot do this for the following reasons: > 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/ > 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf > doesn't seem to change anything > 3. I am not sure how I can restart the driver because there is no > "/etc/init.d/openibd" file (I've rebooted the system but it didn't do > anything to create log_num_mtt) > > [Template information] > 1. OpenFabrics is from the Ubuntu distribution using "apt-get install > infiniband-diags ibutils ibverbs-utils libmlx4-dev" > 2. OS is Ubuntu 14.04 LTS > 3. Subnet manager is from the Ubuntu distribution using "apt-get install > opensm" > 4. Output of ibv_devinfo is: > hca_id: mlx4_0 > transport: InfiniBand (0) > fw_ver: 2.10.600 > node_guid: 0002:c903:003d:52b0 > sys_image_guid: 0002:c903:003d:52b3 > vendor_id: 0x02c9 > vendor_part_id: 4099 > hw_ver: 0x0 > board_id: MT_1100120019 > phys_port_cnt: 1 > port: 1 > state: PORT_ACTIVE (4) > max_mtu:4096 (5) > active_mtu: 4096 (5) > sm_lid: 1 > port_lid: 1 > port_lmc: 0x00 > link_layer: InfiniBand > 5. Output of ifconfig for IB is > ib0 Link encap:UNSPEC HWaddr > 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00 > inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0 > inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 > RX packets:26 errors:0 dropped:0 overruns:0 frame:0 > TX packets:34 errors:0 dropped:16 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:5843 (5.8 KB) TX bytes:4324 (4.3 KB) > 6. ulimit -l is "unlimited" > > Thanks, > Rio > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25048.php > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25049.php
Re: [OMPI users] No log_num_mtt in Ubuntu 14.04
most likely you installing old ofed which does not have this parameter: try: #modinfo mlx4_core and see if it is there. I would suggest install latest OFED or Mellanox OFED. On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota wrote: > I get "ofed_info: command not found". Note that I don't install the entire > OFED, but do a component wise installation by doing "apt-get install > infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and > utilities. > > Hi, > what ofed version do you use? > (ofed_info -s) > > > On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota wrote: > >> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the >> following warning upon execution, which did not appear before the upgrade. >> >> WARNING: It appears that your OpenFabrics subsystem is configured to only >> allow registering part of your physical memory. This can cause MPI jobs to >> run with erratic performance, hang, and/or crash. >> >> Everything that I could find on google suggests to change log_num_mtt, >> but I cannot do this for the following reasons: >> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/ >> 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf >> doesn't seem to change anything >> 3. I am not sure how I can restart the driver because there is no >> "/etc/init.d/openibd" file (I've rebooted the system but it didn't do >> anything to create log_num_mtt) >> >> [Template information] >> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install >> infiniband-diags ibutils ibverbs-utils libmlx4-dev" >> 2. OS is Ubuntu 14.04 LTS >> 3. Subnet manager is from the Ubuntu distribution using "apt-get install >> opensm" >> 4. Output of ibv_devinfo is: >> hca_id: mlx4_0 >> transport: InfiniBand (0) >> fw_ver: 2.10.600 >> node_guid: 0002:c903:003d:52b0 >> sys_image_guid: 0002:c903:003d:52b3 >> vendor_id: 0x02c9 >> vendor_part_id: 4099 >> hw_ver: 0x0 >> board_id: MT_1100120019 >> phys_port_cnt: 1 >> port: 1 >> state: PORT_ACTIVE (4) >> max_mtu:4096 (5) >> active_mtu: 4096 (5) >> sm_lid: 1 >> port_lid: 1 >> port_lmc: 0x00 >> link_layer: InfiniBand >> 5. Output of ifconfig for IB is >> ib0 Link encap:UNSPEC HWaddr >> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00 >> inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0 >> inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link >> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 >> RX packets:26 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:34 errors:0 dropped:16 overruns:0 carrier:0 >> collisions:0 txqueuelen:256 >> RX bytes:5843 (5.8 KB) TX bytes:4324 (4.3 KB) >> 6. ulimit -l is "unlimited" >> >> Thanks, >> Rio >> ___ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2014/08/25048.php >> > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25049.php > > > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25062.php >
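A quick way to confirm this on the affected machine, using only the commands already mentioned in this thread, is to run "modinfo mlx4_core | grep -i mtt" and "ls /sys/module/mlx4_core/parameters/". If log_num_mtt shows up in neither listing, the mlx4_core driver that is actually loaded does not expose that parameter (consistent with items 1 and 2 of the original report), and only a driver that does expose it, such as the Mellanox OFED one suggested above, will honor an "options mlx4_core log_num_mtt=24" line in /etc/modprobe.d.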
[OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one.

I reduced the code to the minimum that reproduces the bug. I have pasted it here:
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI, cudaMallocs some memory, then frees the memory and finalizes MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following stack trace:
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***

The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).

I know this is more likely a problem with CUDA than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for CUDA.

Thanks,

--
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
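In case the pastebin link goes stale: the message describes the program as nothing but MPI_Init, a cudaMalloc, the corresponding frees, and MPI_Finalize, so a minimal reproducer consistent with that description looks roughly like the sketch below (the exact code is a reconstruction, not the original paste; the compile line and the CUDA include path are assumptions):

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0;
    double *d_buf = NULL;
    cudaError_t err;

    /* Initialize MPI first, as described above */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* First CUDA runtime call; it triggers cuInit(), which is where
       the multi-node stack trace above ends up. */
    err = cudaMalloc((void **)&d_buf, 1 << 20);
    printf("rank %d: cudaMalloc returned %s\n", rank, cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaFree(d_buf);

    MPI_Finalize();
    return 0;
}

Built with something along the lines of:

mpicc cudampi_simple.c -o cudampi_simple -I/software-gpu/cuda/6.0.37/include -L/software-gpu/cuda/6.0.37/lib64 -lcudart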
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>Sent: Monday, August 18, 2014 4:23 PM
>To: Open MPI Users
>Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
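For reference, a rebuild matching that request would be configured along these lines; the install prefix below is only illustrative (it mirrors the naming pattern used elsewhere in this thread), and CUDA support stays out simply by not passing --with-cuda:

./configure --prefix=/software-gpu/mpi/openmpi/1.8.1_gcc4.8_nocuda_debug --enable-debug
make -j8 && make install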
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Try the following:

export MALLOC_CHECK_=1

and then run it again

Kind regards,
Alex Granovsky

-Original Message-
From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Same thing:

[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Signal: Segmentation fault (11)
[gpu-k20-07:47628] Signal code: Address not mapped (1)
[gpu-k20-07:47628] Failing at address: 0x8
[gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710]
[gpu-k20-07:47628] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf]
[gpu-k20-07:47628] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83]
[gpu-k20-07:47628] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da]
[gpu-k20-07:47628] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933]
[gpu-k20-07:47628] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965]
[gpu-k20-07:47628] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a]
[gpu-k20-07:47628] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b]
[gpu-k20-07:47628] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532]
[gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:47628] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d]
[gpu-k20-07:47628] [11] cudampi_simple[0x400699]
[gpu-k20-07:47628] *** End of error message ***
... (same segfault from the other node)

Maxime

Le 2014-08-18 16:52, Alex A. Granovsky a écrit :
Try the following:
export MALLOC_CHECK_=1
and then run it again

Kind regards,
Alex Granovsky
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
It's building... to be continued tomorrow morning.

Le 2014-08-18 16:45, Rolf vandeVaart a écrit :
Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?

--
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique