Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Mike Dubman
Hi,
what ofed version do you use?
(ofed_info -s)


On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota  wrote:

> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the
> following warning upon execution, which did not appear before the upgrade.
>
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory. This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
>
> Everything that I could find on google suggests to change log_num_mtt, but
> I cannot do this for the following reasons:
> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
> 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf
> doesn't seem to change anything
> 3. I am not sure how I can restart the driver because there is no
> "/etc/init.d/openibd" file (I've rebooted the system but it didn't do
> anything to create log_num_mtt)
>
> [Template information]
> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install
> infiniband-diags ibutils ibverbs-utils libmlx4-dev"
> 2. OS is Ubuntu 14.04 LTS
> 3. Subnet manager is from the Ubuntu distribution using "apt-get install
> opensm"
> 4. Output of ibv_devinfo is:
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.10.600
> node_guid:  0002:c903:003d:52b0
> sys_image_guid: 0002:c903:003d:52b3
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id:   MT_1100120019
> phys_port_cnt:  1
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid:   1
> port_lmc:   0x00
> link_layer: InfiniBand
> 5. Output of ifconfig for IB is
> ib0   Link encap:UNSPEC  HWaddr
> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>   RX packets:26 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:34 errors:0 dropped:16 overruns:0 carrier:0
>   collisions:0 txqueuelen:256
>   RX bytes:5843 (5.8 KB)  TX bytes:4324 (4.3 KB)
> 6. ulimit -l is "unlimited"
>
> Thanks,
> Rio
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25048.php
>


Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault

Hi,
I just compiled without CUDA, and the result is the same: no output, and it
exits with code 65.


[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 =>  (0x7fff3ab31000)
libmpi.so.1 => 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 
(0x7fab9ec2a000)

libpthread.so.0 => /lib64/libpthread.so.0 (0x00381c00)
libc.so.6 => /lib64/libc.so.6 (0x00381bc0)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00381c80)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00381c40)
libopen-rte.so.7 => 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 
(0x7fab9e932000)

libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x00391820)
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x003917e0)
libz.so.1 => /lib64/libz.so.1 (0x00381cc0)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00382100)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x00382300)
libopen-pal.so.6 => 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 
(0x7fab9e64a000)

libdl.so.2 => /lib64/libdl.so.2 (0x00381b80)
librt.so.1 => /lib64/librt.so.1 (0x0035b360)
libm.so.6 => /lib64/libm.so.6 (0x003c25a0)
libutil.so.1 => /lib64/libutil.so.1 (0x003f7100)
/lib64/ld-linux-x86-64.so.2 (0x00381b40)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003917a0)
libgcc_s.so.1 => 
/software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x7fab9e433000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 
(0x00382240)

libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00382140)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00381e40)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00382180)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 
(0x003821c0)

libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00382200)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00381dc0)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00381d00)
[mboisson@helios-login1 examples]$ mpiexec ring_c
[mboisson@helios-login1 examples]$ echo $?
65


Maxime


On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:

Just out of curiosity, I saw that one of the segv stack traces involved the 
cuda stack.

Can you try a build without CUDA and see if that resolves the problem?



On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault 
 wrote:


Hi Jeff,

On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:

On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault 
 wrote:


Correct.

Can it be because torque (pbs_mom) is not running on the head node and mpiexec 
attempts to contact it ?

Not for Open MPI's mpiexec, no.

Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM 
stuff (i.e., Torque stuff) if it sees the environment variable markers 
indicating that it's inside a Torque job.  If not, it just uses rsh/ssh (or 
localhost launch in your case, since you didn't specify any hosts).
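
For example, one quick way to check for those markers (a sketch, assuming a 
standard Torque/PBS environment that exports PBS_* variables such as PBS_JOBID):

   # Inside a Torque job this lists the PBS_* variables; outside a job it prints
   # nothing, and mpirun falls back to rsh/ssh or a localhost launch.
   env | grep -E '^PBS_'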

If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI 
"hostname" command from Linux), then something is seriously borked with your Open MPI 
installation.

mpirun -np 4 hostname works fine :
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0


Try running with:

 mpirun -np 4 --mca plm_base_verbose 10 hostname

This should show the steps OMPI is trying to take to launch the 4 copies of 
"hostname" and potentially give some insight into where it's hanging.

Also, just to make sure, you have ensured that you're compiling everything with 
a single compiler toolchain, and the support libraries from that specific 
compiler toolchain are available on any server on which you're running (to 
include the head node and compute nodes), right?

Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the 
same results). Almost all software (compilers, toolchain, etc.) is installed on 
Lustre, built from source, and is the same on both the login (head) node and 
the compute nodes.

The few differences between the head node and the compute nodes:
1) Compute nodes run from RAMFS - the login node is installed on disk
2) Compute nodes and the login node have different hardware configurations 
(compute nodes have GPUs, the head node does not).
3) The login node has MORE CentOS 6 packages than the compute nodes (such as the 
-devel packages, some fonts/X11 libraries, etc.), but all the packages that are 
on the compute nodes are also on the login node.


And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the 
Open MPI installation that you expect it to point to.  E.g., if you "ldd ring_c", it 
shows

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Joshua Ladd
Maxime,

Can you run with:

mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples//ring_c


On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:

> Hi,
> I just did compile without Cuda, and the result is the same. No output,
> exits with code 65.
>
> [mboisson@helios-login1 examples]$ ldd ring_c
> linux-vdso.so.1 =>  (0x7fff3ab31000)
> libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.
> 2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x7fab9ec2a000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00381c00)
> libc.so.6 => /lib64/libc.so.6 (0x00381bc0)
> librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00381c80)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00381c40)
> libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.
> 2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x7fab9e932000)
> libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x00391820)
> libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x003917e0)
> libz.so.1 => /lib64/libz.so.1 (0x00381cc0)
> libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00382100)
> libssl.so.10 => /usr/lib64/libssl.so.10 (0x00382300)
> libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.
> 2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x7fab9e64a000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00381b80)
> librt.so.1 => /lib64/librt.so.1 (0x0035b360)
> libm.so.6 => /lib64/libm.so.6 (0x003c25a0)
> libutil.so.1 => /lib64/libutil.so.1 (0x003f7100)
> /lib64/ld-linux-x86-64.so.2 (0x00381b40)
> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003917a0)
> libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1
> (0x7fab9e433000)
> libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2
> (0x00382240)
> libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00382140)
> libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00381e40)
> libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00382180)
> libkrb5support.so.0 => /lib64/libkrb5support.so.0
> (0x003821c0)
> libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00382200)
> libresolv.so.2 => /lib64/libresolv.so.2 (0x00381dc0)
> libselinux.so.1 => /lib64/libselinux.so.1 (0x00381d00)
>
> [mboisson@helios-login1 examples]$ mpiexec ring_c
> [mboisson@helios-login1 examples]$ echo $?
> 65
>
>
> Maxime
>
>
> On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:
>
>  Just out of curiosity, I saw that one of the segv stack traces involved
>> the cuda stack.
>>
>> Can you try a build without CUDA and see if that resolves the problem?
>>
>>
>>
>> On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault > calculquebec.ca> wrote:
>>
>>  Hi Jeff,
>>>
>>> On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
>>>
 On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault <
 maxime.boissonnea...@calculquebec.ca> wrote:

  Correct.
>
> Can it be because torque (pbs_mom) is not running on the head node and
> mpiexec attempts to contact it ?
>
 Not for Open MPI's mpiexec, no.

 Open MPI's mpiexec (mpirun -- they're the same to us) will only try to
 use TM stuff (i.e., Torque stuff) if it sees the environment variable
 markers indicating that it's inside a Torque job.  If not, it just uses
 rsh/ssh (or localhost launch in your case, since you didn't specify any
 hosts).

 If you are unable to run even "mpirun -np 4 hostname" (i.e., the
 non-MPI "hostname" command from Linux), then something is seriously borked
 with your Open MPI installation.

>>> mpirun -np 4 hostname works fine :
>>> [mboisson@helios-login1 ~]$ which mpirun
>>> /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
>>> [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
>>> helios-login1
>>> helios-login1
>>> helios-login1
>>> helios-login1
>>> 0
>>>
>>>  Try running with:

  mpirun -np 4 --mca plm_base_verbose 10 hostname

 This should show the steps OMPI is trying to take to launch the 4
 copies of "hostname" and potentially give some insight into where it's
 hanging.

 Also, just to make sure, you have ensured that you're compiling
 everything with a single compiler toolchain, and the support libraries from
 that specific compiler toolchain are available on any server on which
 you're running (to include the head node and compute nodes), right?

>>> Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6
>>> with the same results). Almost every software (that is compiler, toolchain,
>>> etc.) is installed on lustre, from sources and is the same on both the
>>> login (head) node and the compute.
>>>
>>> The few differences between th

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault

Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:
mpirun -np 4 --mca plm_base_verbose 10 
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 
10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm 
components
[helios-login1:27853] mca: base: components_register: found loaded 
component isolated
[helios-login1:27853] mca: base: components_register: component isolated 
has no register or open function
[helios-login1:27853] mca: base: components_register: found loaded 
component rsh
[helios-login1:27853] mca: base: components_register: component rsh 
register function successful
[helios-login1:27853] mca: base: components_register: found loaded 
component tm
[helios-login1:27853] mca: base: components_register: component tm 
register function successful

[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated 
open function successful

[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open 
function successful

[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open 
function successful

[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component 
[isolated] set priority to 0

[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10

[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module

[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime


Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:

> Here it is
> On 2014-08-18 12:30, Joshua Ladd wrote:
>> mpirun -np 4 --mca plm_base_verbose 10 
> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 
> ring_c
> [helios-login1:27853] mca: base: components_register: registering plm 
> components
> [helios-login1:27853] mca: base: components_register: found loaded component 
> isolated
> [helios-login1:27853] mca: base: components_register: component isolated has 
> no register or open function
> [helios-login1:27853] mca: base: components_register: found loaded component 
> rsh
> [helios-login1:27853] mca: base: components_register: component rsh register 
> function successful
> [helios-login1:27853] mca: base: components_register: found loaded component 
> tm
> [helios-login1:27853] mca: base: components_register: component tm register 
> function successful
> [helios-login1:27853] mca: base: components_open: opening plm components
> [helios-login1:27853] mca: base: components_open: found loaded component 
> isolated
> [helios-login1:27853] mca: base: components_open: component isolated open 
> function successful
> [helios-login1:27853] mca: base: components_open: found loaded component rsh
> [helios-login1:27853] mca: base: components_open: component rsh open function 
> successful
> [helios-login1:27853] mca: base: components_open: found loaded component tm
> [helios-login1:27853] mca: base: components_open: component tm open function 
> successful
> [helios-login1:27853] mca:base:select: Auto-selecting plm components
> [helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
> [helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
> [helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
> [helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
> failed to return a module
> [helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
> [helios-login1:27853] mca: base: close: component isolated closed
> [helios-login1:27853] mca: base: close: unloading component isolated
> [helios-login1:27853] mca: base: close: component tm closed
> [helios-login1:27853] mca: base: close: unloading component tm
> [helios-login1:27853] mca: base: close: component rsh closed
> [helios-login1:27853] mca: base: close: unloading component rsh
> [mboisson@helios-login1 examples]$ echo $?
> 65
> 
> 
> Maxime
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25052.php



Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault

This is all on one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt



Maxime

On 2014-08-18 12:48, Ralph Castain wrote:

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25053.php



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



output_ringc_verbose.txt.gz
Description: GNU Zip compressed data


Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:

> This is all one one node indeed.
> 
> Attached is the output of
> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
> state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
> output_ringc_verbose.txt
> 
> 
> Maxime
> 
> On 2014-08-18 12:48, Ralph Castain wrote:
>> This is all on one node, yes?
>> 
>> Try adding the following:
>> 
>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
>> 
>> Lot of garbage, but should tell us what is going on.
>> 
>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Here it is
>>> On 2014-08-18 12:30, Joshua Ladd wrote:
 mpirun -np 4 --mca plm_base_verbose 10
>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 
>>> ring_c
>>> [helios-login1:27853] mca: base: components_register: registering plm 
>>> components
>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>> component isolated
>>> [helios-login1:27853] mca: base: components_register: component isolated 
>>> has no register or open function
>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>> component rsh
>>> [helios-login1:27853] mca: base: components_register: component rsh 
>>> register function successful
>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>> component tm
>>> [helios-login1:27853] mca: base: components_register: component tm register 
>>> function successful
>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>> [helios-login1:27853] mca: base: components_open: found loaded component 
>>> isolated
>>> [helios-login1:27853] mca: base: components_open: component isolated open 
>>> function successful
>>> [helios-login1:27853] mca: base: components_open: found loaded component rsh
>>> [helios-login1:27853] mca: base: components_open: component rsh open 
>>> function successful
>>> [helios-login1:27853] mca: base: components_open: found loaded component tm
>>> [helios-login1:27853] mca: base: components_open: component tm open 
>>> function successful
>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
>>> [helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] 
>>> set priority to 0
>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
>>> [helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
>>> [helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. 
>>> Query failed to return a module
>>> [helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
>>> [helios-login1:27853] mca: base: close: component isolated closed
>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>> [helios-login1:27853] mca: base: close: component tm closed
>>> [helios-login1:27853] mca: base: close: unloading component tm
>>> [helios-login1:27853] mca: base: close: component rsh closed
>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>> [mboisson@helios-login1 examples]$ echo $?
>>> 65
>>> 
>>> 
>>> Maxime
> 
> 
> -- 
> -
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25054.php



Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault

Here it is.

Maxime

On 2014-08-18 12:59, Ralph Castain wrote:

Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt


Maxime

On 2014-08-18 12:48, Ralph Castain wrote:

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime


--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25055.php



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



output_ringc_verbose2.txt.gz
Description: GNU Zip compressed data


Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
[[63019,0],0]


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237

The initial attempt is here

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, and so if that connection got 
rejected the proc would abort. Should we be using a different network? If so, 
telling us via the oob_tcp_if_include param would be the solution.
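
For example (a sketch -- interface names vary by system; lo is the loopback and 
ib0 a typical IPoIB name, so treat them as assumptions):

   # Restrict the runtime's out-of-band TCP traffic to a specific interface,
   # either on the command line:
   mpirun -np 4 --mca oob_tcp_if_include lo ring_c
   # or via the environment for the whole session:
   export OMPI_MCA_oob_tcp_if_include=lo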


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
 wrote:

> Here it is.
> 
> Maxime
> 
> On 2014-08-18 12:59, Ralph Castain wrote:
>> Ah...now that showed the problem. To pinpoint it better, please add
>> 
>> -mca oob_base_verbose 10
>> 
>> and I think we'll have it
>> 
>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> This is all one one node indeed.
>>> 
>>> Attached is the output of
>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
>>> state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
>>> output_ringc_verbose.txt
>>> 
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 12:48, Ralph Castain wrote:
 This is all on one node, yes?
 
 Try adding the following:
 
 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca 
 errmgr_base_verbose 5
 
 Lot of garbage, but should tell us what is going on.
 
 On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
  wrote:
 
> Here it is
> On 2014-08-18 12:30, Joshua Ladd wrote:
>> mpirun -np 4 --mca plm_base_verbose 10
> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 
> ring_c
> [helios-login1:27853] mca: base: components_register: registering plm 
> components
> [helios-login1:27853] mca: base: components_register: found loaded 
> component isolated
> [helios-login1:27853] mca: base: components_register: component isolated 
> has no register or open function
> [helios-login1:27853] mca: base: components_register: found loaded 
> component rsh
> [helios-login1:27853] mca: base: components_register: component rsh 
> register function successful
> [helios-login1:27853] mca: base: components_register: found loaded 
> component tm
> [helios-login1:27853] mca: base: components_register: component tm 
> register function successful
> [helios-login1:27853] mca: base: components_open: opening plm components
> [helios-login1:27853] mca: base: components_open: found loaded component 
> isolated
> [helios-login1:27853] mca: base: components_open: component isolated open 
> function successful
> [helios-login1:27853] mca: base: components_open: found loaded component 
> rsh
> [helios-login1:27853] mca: base: components_open: component rsh open 
> function successful
> [helios-login1:27853] mca: base: components_open: found loaded component 
> tm
> [helios-login1:27853] mca: base: components_open: component tm open 
> function successful
> [helios-login1:27853] mca:base:select: Auto-selecting plm components
> [helios-login1:27853] mca:base:select:(  plm) Querying component 
> [isolated]
> [helios-login1:27853] mca:base:select:(  plm) Query of component 
> [isolated] set priority to 0
> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
> [helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
> [helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
> [helios-login1:27853] mca: base: close: component isolated closed
> [helios-login1:27853] mca: base: close: unloading component isolated
> [helios-login1:27853] mca: base: close: component tm closed
> [helios-login1:27853] mca: base: close: unloading component tm
> [helios-login1:27853] mca: base: close: component rsh closed
> [helios-login1:27853] mca: base: close: unloading component rsh
> [mboisson@helios-login1 examples]$ echo $?
> 65
> 
> 
> Maxime

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault

Indeed, that makes sense now.

Why isn't OpenMPI attempting to connect over the local loopback for the same 
node? This used to work with 1.6.5.


Maxime

On 2014-08-18 13:11, Ralph Castain wrote:

Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
[[63019,0],0]


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237

The initial attempt is here

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, and so if that connection got 
rejected the proc would abort. Should we be using a different network? If so, 
telling us via the oob_tcp_if_include param would be the solution.


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
 wrote:


Here it is.

Maxime

On 2014-08-18 12:59, Ralph Castain wrote:

Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt


Maxime

On 2014-08-18 12:48, Ralph Castain wrote:

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Yeah, there are some issues with the internal connection logic that need to get 
fixed. We haven't had many cases where it's been an issue, but a couple like 
this have cropped up - enough that I need to set aside some time to fix it.

My apologies for the problem.


On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
 wrote:

> Indeed, that makes sense now.
> 
> Why isn't OpenMPI attempting to connect with the local loop for same node ? 
> This used to work with 1.6.5.
> 
> Maxime
> 
> On 2014-08-18 13:11, Ralph Castain wrote:
>> Yep, that pinpointed the problem:
>> 
>> [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
>> [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
>> [[63019,0],0] on socket 11
>> [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
>> connection failed: Connection refused (111)
>> [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
>> state CONNECTING
>> [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
>> [[63019,0],0]
>> 
>> 
>> The apps are trying to connect back to mpirun using the following addresses:
>> 
>> tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
>> 
>> The initial attempt is here
>> 
>> [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
>> 
>> I know there is a failover bug in the 1.8 series, and so if that connection 
>> got rejected the proc would abort. Should we be using a different network? 
>> If so, telling us via the oob_tcp_if_include param would be the solution.
>> 
>> 
>> On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Here it is.
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 12:59, Ralph Castain wrote:
 Ah...now that showed the problem. To pinpoint it better, please add
 
 -mca oob_base_verbose 10
 
 and I think we'll have it
 
 On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
  wrote:
 
> This is all one one node indeed.
> 
> Attached is the output of
> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
> state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
> output_ringc_verbose.txt
> 
> 
> Maxime
> 
> On 2014-08-18 12:48, Ralph Castain wrote:
>> This is all on one node, yes?
>> 
>> Try adding the following:
>> 
>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca 
>> errmgr_base_verbose 5
>> 
>> Lot of garbage, but should tell us what is going on.
>> 
>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Here it is
>>> On 2014-08-18 12:30, Joshua Ladd wrote:
 mpirun -np 4 --mca plm_base_verbose 10
>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 
>>> 10 ring_c
>>> [helios-login1:27853] mca: base: components_register: registering plm 
>>> components
>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>> component isolated
>>> [helios-login1:27853] mca: base: components_register: component 
>>> isolated has no register or open function
>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>> component rsh
>>> [helios-login1:27853] mca: base: components_register: component rsh 
>>> register function successful
>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>> component tm
>>> [helios-login1:27853] mca: base: components_register: component tm 
>>> register function successful
>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>> component isolated
>>> [helios-login1:27853] mca: base: components_open: component isolated 
>>> open function successful
>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>> component rsh
>>> [helios-login1:27853] mca: base: components_open: component rsh open 
>>> function successful
>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>> component tm
>>> [helios-login1:27853] mca: base: components_open: component tm open 
>>> function successful
>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>> [helios-login1:27853] mca:base:select:(  plm) Querying component 
>>> [isolated]
>>> [helios-login1:27853] mca:base:select:(  plm) Query of component 
>>> [isolated] set priority to 0
>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
>>> [helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] 
>>> set priority to 10
>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
>>> [helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. 
>>> Query fail

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault

Ok, I confirm that with
mpiexec -mca oob_tcp_if_include lo ring_c

it works.

It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c

We have 4 interfaces on this node.
- lo, the local loop
- ib0, infiniband
- eth2, a management network
- eth3, the public network

It seems that mpiexec attempts to use the two addresses that do not work 
(eth2, eth3) and does not use the two that do work (ib0 and lo). 
However, according to the logs sent previously, it does see ib0 (despite 
not seeing lo), but does not attempt to use it.



On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I am 
unsure why it works on the compute nodes and not on the login node. The only 
difference is the presence of a public interface on the login node.
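
To avoid typing the workaround on every run, here is a minimal sketch for making 
it persistent (assuming a standard Open MPI installation layout; the ib0,lo 
choice is simply what worked above):

   # site-wide: <openmpi-prefix>/etc/openmpi-mca-params.conf
   # per user:  ~/.openmpi/mca-params.conf
   oob_tcp_if_include = ib0,lo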


Maxime


On 2014-08-18 13:37, Ralph Castain wrote:

Yeah, there are some issues with the internal connection logic that need to get 
fixed. We haven't had many cases where it's been an issue, but a couple like 
this have cropped up - enough that I need to set aside some time to fix it.

My apologies for the problem.


On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
 wrote:


Indeed, that makes sense now.

Why isn't OpenMPI attempting to connect with the local loop for same node ? 
This used to work with 1.6.5.

Maxime

On 2014-08-18 13:11, Ralph Castain wrote:

Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
[[63019,0],0]


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237

The initial attempt is here

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, and so if that connection got 
rejected the proc would abort. Should we be using a different network? If so, 
telling us via the oob_tcp_if_include param would be the solution.


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
 wrote:


Here it is.

Maxime

On 2014-08-18 12:59, Ralph Castain wrote:

Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt


Maxime

On 2014-08-18 12:48, Ralph Castain wrote:

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
prio

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Indeed odd - I'm afraid that this is just the kind of case that has been 
causing problems. I think I've figured out the problem, but have been buried 
with my "day job" for the last few weeks and unable to pursue it.


On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault 
 wrote:

> Ok, I confirm that with
> mpiexec -mca oob_tcp_if_include lo ring_c
> 
> it works.
> 
> It also works with
> mpiexec -mca oob_tcp_if_include ib0 ring_c
> 
> We have 4 interfaces on this node.
> - lo, the local loop
> - ib0, infiniband
> - eth2, a management network
> - eth3, the public network
> 
> It seems that mpiexec attempts to use the two addresses that do not work 
> (eth2, eth3) and does not use the two that do work (ib0 and lo). However, 
> according to the logs sent previously, it does see ib0 (despite not seeing 
> lo), but does not attempt to use it.
> 
> 
> On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I 
> am unsure why it does work on the compute nodes and not on the login nodes. 
> The only difference is the presence of a public interface on the login node.
> 
> Maxime
> 
> 
> On 2014-08-18 13:37, Ralph Castain wrote:
>> Yeah, there are some issues with the internal connection logic that need to 
>> get fixed. We haven't had many cases where it's been an issue, but a couple 
>> like this have cropped up - enough that I need to set aside some time to fix 
>> it.
>> 
>> My apologies for the problem.
>> 
>> 
>> On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Indeed, that makes sense now.
>>> 
>>> Why isn't OpenMPI attempting to connect with the local loop for same node ? 
>>> This used to work with 1.6.5.
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 13:11, Ralph Castain wrote:
 Yep, that pinpointed the problem:
 
 [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
 [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
 [[63019,0],0] on socket 11
 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] 
 tcp_peer_complete_connect: connection failed: Connection refused (111)
 [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
 state CONNECTING
 [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
 [[63019,0],0]
 
 
 The apps are trying to connect back to mpirun using the following 
 addresses:
 
 tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
 
 The initial attempt is here
 
 [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting 
 to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
 
 I know there is a failover bug in the 1.8 series, and so if that 
 connection got rejected the proc would abort. Should we be using a 
 different network? If so, telling us via the oob_tcp_if_include param 
 would be the solution.
 
 
 On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
  wrote:
 
> Here it is.
> 
> Maxime
> 
> On 2014-08-18 12:59, Ralph Castain wrote:
>> Ah...now that showed the problem. To pinpoint it better, please add
>> 
>> -mca oob_base_verbose 10
>> 
>> and I think we'll have it
>> 
>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> This is all one one node indeed.
>>> 
>>> Attached is the output of
>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
>>> state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
>>> output_ringc_verbose.txt
>>> 
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 12:48, Ralph Castain wrote:
 This is all on one node, yes?
 
 Try adding the following:
 
 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca 
 errmgr_base_verbose 5
 
 Lot of garbage, but should tell us what is going on.
 
 On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
  wrote:
 
> Here it is
> On 2014-08-18 12:30, Joshua Ladd wrote:
>> mpirun -np 4 --mca plm_base_verbose 10
> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca 
> plm_base_verbose 10 ring_c
> [helios-login1:27853] mca: base: components_register: registering plm 
> components
> [helios-login1:27853] mca: base: components_register: found loaded 
> component isolated
> [helios-login1:27853] mca: base: components_register: component 
> isolated has no register or open function
> [helios-login1:27853] mca: base: components_register: found loaded 
> component rsh
> [helios-login1:27853] mca: base: components_register: component rsh 
> register function successful
> [helios-login1:27853] mca: base: components_register: found loaded 
> component tm
> [helios-login1:27853] mca: base: components_r

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Rio Yokota
I get "ofed_info: command not found". Note that I don't install the entire 
OFED, but do a component wise installation by doing "apt-get install 
infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and 
utilities.
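
With a distro-packaged stack there is no ofed_info, but the pieces can still be 
checked individually (a sketch using standard Debian/Ubuntu tools):

   # Versions of the packaged InfiniBand userspace components
   dpkg -l | grep -E 'infiniband-diags|ibutils|ibverbs|libmlx4'
   # Version and parameters of the in-kernel mlx4 driver
   modinfo mlx4_core | grep -E '^(version|parm)'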

> Hi,
> what ofed version do you use?
> (ofed_info -s)
> 
> 
> On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota  wrote:
> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the 
> following warning upon execution, which did not appear before the upgrade.
> 
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory. This can cause MPI jobs to
> run with erratic performance, hang, and/or crash.
> 
> Everything that I could find on google suggests to change log_num_mtt, but I 
> cannot do this for the following reasons:
> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
> 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf 
> doesn't seem to change anything
> 3. I am not sure how I can restart the driver because there is no 
> "/etc/init.d/openibd" file (I've rebooted the system but it didn't do 
> anything to create log_num_mtt)
> 
> [Template information]
> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install 
> infiniband-diags ibutils ibverbs-utils libmlx4-dev"
> 2. OS is Ubuntu 14.04 LTS
> 3. Subnet manager is from the Ubuntu distribution using "apt-get install 
> opensm"
> 4. Output of ibv_devinfo is:
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.10.600
> node_guid:  0002:c903:003d:52b0
> sys_image_guid: 0002:c903:003d:52b3
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id:   MT_1100120019
> phys_port_cnt:  1
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid:   1
> port_lmc:   0x00
> link_layer: InfiniBand
> 5. Output of ifconfig for IB is
> ib0   Link encap:UNSPEC  HWaddr 
> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>   RX packets:26 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:34 errors:0 dropped:16 overruns:0 carrier:0
>   collisions:0 txqueuelen:256
>   RX bytes:5843 (5.8 KB)  TX bytes:4324 (4.3 KB)
> 6. ulimit -l is "unlimited"
> 
> Thanks,
> Rio
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25049.php



Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Mike Dubman
Most likely you have installed an old OFED which does not have this parameter.

Try:

#modinfo mlx4_core

and see if the parameter is listed there.
I would suggest installing the latest OFED or Mellanox OFED.


On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota  wrote:

> I get "ofed_info: command not found". Note that I don't install the entire
> OFED, but do a component wise installation by doing "apt-get install
> infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and
> utilities.
>
> Hi,
> what ofed version do you use?
> (ofed_info -s)
>
>
> On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota  wrote:
>
>> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the
>> following warning upon execution, which did not appear before the upgrade.
>>
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> Everything that I could find on google suggests to change log_num_mtt,
>> but I cannot do this for the following reasons:
>> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
>> 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf
>> doesn't seem to change anything
>> 3. I am not sure how I can restart the driver because there is no
>> "/etc/init.d/openibd" file (I've rebooted the system but it didn't do
>> anything to create log_num_mtt)
>>
>> [Template information]
>> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install
>> infiniband-diags ibutils ibverbs-utils libmlx4-dev"
>> 2. OS is Ubuntu 14.04 LTS
>> 3. Subnet manager is from the Ubuntu distribution using "apt-get install
>> opensm"
>> 4. Output of ibv_devinfo is:
>> hca_id: mlx4_0
>> transport:  InfiniBand (0)
>> fw_ver: 2.10.600
>> node_guid:  0002:c903:003d:52b0
>> sys_image_guid: 0002:c903:003d:52b3
>> vendor_id:  0x02c9
>> vendor_part_id: 4099
>> hw_ver: 0x0
>> board_id:   MT_1100120019
>> phys_port_cnt:  1
>> port:   1
>> state:  PORT_ACTIVE (4)
>> max_mtu:4096 (5)
>> active_mtu: 4096 (5)
>> sm_lid: 1
>> port_lid:   1
>> port_lmc:   0x00
>> link_layer: InfiniBand
>> 5. Output of ifconfig for IB is
>> ib0   Link encap:UNSPEC  HWaddr
>> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>>   inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>>   RX packets:26 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:34 errors:0 dropped:16 overruns:0 carrier:0
>>   collisions:0 txqueuelen:256
>>   RX bytes:5843 (5.8 KB)  TX bytes:4324 (4.3 KB)
>> 6. ulimit -l is "unlimited"
>>
>> Thanks,
>> Rio
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/08/25048.php
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25049.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25062.php
>


[OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Maxime Boissonneault

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of 
derailed into two problems, one of which has been addressed, I figured I 
would start a new, more precise and simpler one.


I reduced the code to the minimum that would reproduce the bug. I have 
pasted it here:

http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI, allocates memory with 
cudaMalloc, then frees the memory and finalizes MPI. Nothing else.
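
In case the pastebin link goes away, the gist is roughly the following (a 
sketch from memory, not the exact code I pasted; the buffer size and the 
printf are just for illustration):

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = -1;
    double *d_buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* First CUDA runtime call in this process; this is where the crash shows up. */
    cudaError_t err = cudaMalloc((void **)&d_buf, 1 << 20);
    printf("rank %d: cudaMalloc -> %s\n", rank, cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaFree(d_buf);

    MPI_Finalize();
    return 0;
}

It is compiled with mpicc plus the usual -I/-L flags for CUDA and -lcudart 
(exact paths depend on the installation).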


When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following 
stack trace:

[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] 
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] 
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] 
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] 
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]

[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]

[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) 
or OpenMPI 1.8.1 (CUDA-aware).


I know this is more likely a problem with CUDA than with OpenMPI 
(since it does the same thing for two different versions), but I figured I 
would ask here in case somebody has a clue about what might be going on. I 
have not yet been able to file a bug report on NVIDIA's website for CUDA.



Thanks,


--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Rolf vandeVaart
Just to help reduce the scope of the problem, can you retest with a 
non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the 
configure line to help with the stack trace?
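
Another scope-narrowing idea (just a suggestion, and untested here): take MPI 
out of the code entirely and launch a bare cuInit() on both nodes with the 
same mpiexec mapping, since cuInit is where the backtrace dies. Something 
along these lines:

/* cuinit_only.c -- no MPI calls at all.
 * Build with something like: gcc cuinit_only.c -lcuda
 * (adjust -L for wherever libcuda.so lives on your nodes). */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    /* Same driver entry point that appears in the backtrace. */
    CUresult rc = cuInit(0);
    printf("cuInit returned %d\n", (int)rc);
    return (rc == CUDA_SUCCESS) ? 0 : 1;
}

If that also segfaults when started on the remote node (via mpiexec or plain 
ssh) but works locally, the problem is in the driver stack or the remote 
environment rather than in Open MPI.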


>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Monday, August 18, 2014 4:23 PM
>To: Open MPI Users
>Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
>derailed into two problems, one of which has been addressed, I figured I
>would start a new, more precise and simple one.
>
>I reduced the code to the minimal that would reproduce the bug. I have
>pasted it here :
>http://pastebin.com/1uAK4Z8R
>Basically, it is a program that initializes MPI and cudaMalloc memory, and then
>free memory and finalize MPI. Nothing else.
>
>When I compile and run this on a single node, everything works fine.
>
>When I compile and run this on more than one node, I get the following stack
>trace :
>[gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal:
>Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped
>(1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0]
>/lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>[gpu-k20-07:40041] [ 1]
>/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>[gpu-k20-07:40041] [ 2]
>/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>[gpu-k20-07:40041] [ 3]
>/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>[gpu-k20-07:40041] [ 4]
>/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>[gpu-k20-07:40041] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>[gpu-k20-07:40041] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>[gpu-k20-07:40041] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>[gpu-k20-07:40041] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10]
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>[gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] ***
>End of error message ***
>
>
>The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or
>OpenMPI 1.8.1 (cuda aware).
>
>I know this is more than likely a problem with Cuda than with OpenMPI (since
>it does the same for two different versions), but I figured I would ask here if
>somebody has a clue of what might be going on. I have yet to be able to fill a
>bug report on NVidia's website for Cuda.
>
>
>Thanks,
>
>
>--
>-
>Maxime Boissonneault
>Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2014/08/25064.php


Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Alex A. Granovsky

Try the following:

export MALLOC_CHECK_=1

and then run it again

Kind regards,
Alex Granovsky



-Original Message- 
From: Maxime Boissonneault

Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory,
and then free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following
stack trace :
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware)
or OpenMPI 1.8.1 (cuda aware).

I know this is more than likely a problem with Cuda than with OpenMPI
(since it does the same for two different versions), but I figured I
would ask here if somebody has a clue of what might be going on. I have
yet to be able to fill a bug report on NVidia's website for Cuda.


Thanks,


--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25064.php 





Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Maxime Boissonneault

Same thing:

[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node 
cudampi_simple

malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Signal: Segmentation fault (11)
[gpu-k20-07:47628] Signal code: Address not mapped (1)
[gpu-k20-07:47628] Failing at address: 0x8
[gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710]
[gpu-k20-07:47628] [ 1] 
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf]
[gpu-k20-07:47628] [ 2] 
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83]
[gpu-k20-07:47628] [ 3] 
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da]
[gpu-k20-07:47628] [ 4] 
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933]
[gpu-k20-07:47628] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965]
[gpu-k20-07:47628] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a]
[gpu-k20-07:47628] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b]
[gpu-k20-07:47628] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532]

[gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:47628] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d]

[gpu-k20-07:47628] [11] cudampi_simple[0x400699]
[gpu-k20-07:47628] *** End of error message ***
... (same segfault from the other node)

Maxime


On 2014-08-18 16:52, Alex A. Granovsky wrote:

Try the following:

export MALLOC_CHECK_=1

and then run it again

Kind regards,
Alex Granovsky



-Original Message- From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory,
and then free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following
stack trace :
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] 


[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] 


[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] 


[gpu-k20-07:40041] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] 


[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware)
or OpenMPI 1.8.1 (cuda aware).

I know this is more than likely a problem with Cuda than with OpenMPI
(since it does the same for two different versions), but I figured I
would ask here if somebody has a clue of what might be going on. I have
yet to be able to fill a bug report on NVidia's website for Cuda.


Thanks,





--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Maxime Boissonneault

It's building... to be continued tomorrow morning.

On 2014-08-18 16:45, Rolf vandeVaart wrote:

Just to help reduce the scope of the problem, can you retest with a non 
CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the 
configure line to help with the stack trace?



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory, and then
free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following stack
trace :
[gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal:
Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped
(1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8]
/software-
gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] ***
End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or
OpenMPI 1.8.1 (cuda aware).

I know this is more than likely a problem with Cuda than with OpenMPI (since
it does the same for two different versions), but I figured I would ask here if
somebody has a clue of what might be going on. I have yet to be able to fill a
bug report on NVidia's website for Cuda.


Thanks,


--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-
mpi.org/community/lists/users/2014/08/25064.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25065.php



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique