[OMPI users] open mpi upgrade

2015-08-13 Thread Ehsan Moradi
Hi, my dear friends,

I tried to upgrade my Open MPI version from 1.2.8 to 1.8.8, but after
installing it into a different directory, "/opt/openmpi-1.8.8/", mpirun
still reports version 1.2.8, and the "/opt/openmpi-1.8.8/" directory is
empty!! So what should I do to install and use the new version?

My current version was preinstalled on the system and it has a problem
with big jobs: I get errno 110 (time out). I changed the MCA argument too,
but it is still not working.

So if anyone can help me install the latest version, it would be a great help.
OS: openSUSE, 4 nodes.

Thanks a lot


Re: [OMPI users] open mpi upgrade

2015-08-13 Thread Gilles Gouaillardet

Ehsan,

How did you try to install Open MPI?
Shall I assume you downloaded a tarball and ran configure && make install?

Can you post the full commands you ran?

Are you installing as root, or did you run sudo make install?
If not, do you have write access to the /opt/openmpi-1.8.8 directory?

Cheers,

Gilles

On 8/13/2015 2:29 PM, Ehsan Moradi wrote:

hi,
my dear friends
i tried to upgrade my openmpi version from 1.2.8 to 1.8.8
but after installing it on different directory "/opt/openmpi-1.8.8/" 
when i enter mpirun its version is 1.2.8

and  after installing the directory "/opt/openmpi-1.8.8/" is empty!!
so what should i do for installing and using new version

my current version was preinstalled on the system and its have a 
problem with big jobs i got errno110 (time out), i changed the mac 
argument to but steel not working

so if anyone help me to install last version would be a great help.
os: open suse, 4 nodes ,

thanks alot


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27451.php




Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread Gilles Gouaillardet

David,

i guess you do not want to use the ml coll module at all  in openmpi 1.8.8

you can simply do
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so the ml component is not even built

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:
I remember seeing those, but forgot about them. I am curious, though, 
why using '-mca coll ^ml' wouldn't work for me.


We'll watch for the next HPCX release. Is there an ETA on when that 
release may happen? Thank you for the help!

David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because hcoll's symbols conflict with the coll/ml module inside 
OMPI (HCOLL is derived from the ml module). This issue is fixed in the hcoll 
library and the fix will be available in the next HPCX release.


Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader wrote:


Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN
 Could not open the KNEM device file at /dev/knem : No such file
or direc
tory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN
 Could not open the KNEM device file at /dev/knem : No such file
or direc
tory. Won't use knem.
0: Running on host zo-fe1.lanl.gov 
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov
 reporting for duty

This implies to me that some other library is being used instead
of /usr/lib64/libhcoll.so, but I am not sure how that could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried the same tarball on OFED-1.5.4.1 and I could not reproduce
the issue.  Can you do one more quick test with setting
LD_PRELOAD to the hcoll lib?

$ LD_PRELOAD=<path to hcoll lib> mpirun -n 2 -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader
 wrote:

The admin that rolled the hcoll rpm that we're using (and
got it in system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib?  MOFED or HPCX?
what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader
 wrote:

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of
2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65
  MXM  WARN  Could not open the KNEM device file at
/dev/knem : No such file or direc
tory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65
  MXM  WARN  Could not open the KNEM device file at
/dev/knem : No such file or direc
tory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h

pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641

3 0x00056e4c mxm_error_signal_handler()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro

ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616

4 0x000326a0 killpg()  ??:0
5 0x000b82cb
base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()
 ??:0
7 0x00032ee3
hmca_coll_ml_tree_hierarchy_discovery()
 coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
11 0x000f684e mca_coll_base_comm_select()  ??:0
12 0x00073fc4 ompi_mpi_init()  ??:0
13 0x00092ea0 PMPI_Init()  ??:0
14 0x004009b6 main()  ??:0
15 0x0001ed5d __libc_start_main()  ??:0
16 0x004008c9 _start()  ??:0
 

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread Jeff Squyres (jsquyres)
Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI 
as normal, and then rm the following files after installation:

   rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, 
and therefore it won't/can't be used (or interfere with the hcoll plugin).
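
If you want to double-check afterwards, one way (just a sanity check,
assuming the ompi_info from the new install is first in your PATH) is to
ask ompi_info which coll components it still finds:

   ompi_info | grep " coll:"

The ml component should no longer show up in that list.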


> On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:
> 
> David,
> 
> i guess you do not want to use the ml coll module at all  in openmpi 1.8.8
> 
> you can simply do
> touch ompi/mca/coll/ml/.ompi_ignore
> ./autogen.pl
> ./configure ...
> make && make install
> 
> so the ml component is not even built
> 
> Cheers,
> 
> Gilles
> 
> On 8/13/2015 7:30 AM, David Shrader wrote:
>> I remember seeing those, but forgot about them. I am curious, though, why 
>> using '-mca coll ^ml' wouldn't work for me.
>> 
>> We'll watch for the next HPCX release. Is there an ETA on when that release 
>> may happen? Thank you for the help!
>> David
>> 
>> On 08/12/2015 04:04 PM, Deva wrote:
>>> David,
>>> 
>>> This is because of hcoll symbols conflict with ml coll module inside OMPI. 
>>> HCOLL is derived from ml module. This issue is fixed in hcoll library and 
>>> will be available in next HPCX release.
>>> 
>>> Some earlier discussion on this issue:
>>> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
>>> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
>>> 
>>> -Devendar
>>> 
>>> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
>>> Interesting... the seg faults went away:
>>> 
>>> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so 
>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs 
>>> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could 
>>> not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem. 
>>> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could 
>>> not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem. 
>>> 0: Running on host zo-fe1.lanl.gov 
>>> 0: We have 2 processors 
>>> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>>> 
>>> This implies to me that some other library is being used instead of 
>>> /usr/lib64/libhcoll.so, but I am not sure how that could be...
>>> 
>>> Thanks,
>>> David 
>>> 
>>> On 08/12/2015 03:30 PM, Deva wrote:
 Hi David,
 
 I tried same tarball on OFED-1.5.4.1 and I could not reproduce the issue.  
 Can you do one more quick test with seeing LD_PRELOAD to hcoll lib?
 
 $LD_PRELOAD=  mpirun -n 2  -mca coll ^ml 
 ./a.out   
 
 -Devendar
 
 On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  wrote:
 The admin that rolled the hcoll rpm that we're using (and got it in system 
 space) said that she got it from 
 hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
 
 Thanks,
 David
 
 
 On 08/12/2015 10:51 AM, Deva wrote:
> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
> 
> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  wrote:
> Hey Devendar,
> 
> It looks like I still get the error:
> 
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs 
> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could 
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem. 
> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could 
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem. 
> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault) 
> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault) 
>  backtrace  
> 2 0x00056cdc mxm_handle_error()  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>  
> 3 0x00056e4c mxm_error_signal_handler()  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>  
> 4 0x000326a0 killpg()  ??:0 
> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0 
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0 
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  
> coll_ml_module.c:0 
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0 
> 9 0x0006ace9 hcoll_create_context()  ??:0 
> 10 0x000f9706 mca_coll_hcoll_comm_query()

[OMPI users] Trouble with udcm and rdmacm

2015-08-13 Thread Tobias Kloeffel

Hi all,

The configuration might be a bit exotic:

Kernel 4.1.5 vanilla, Mellanox OFED 3.0-2.0.1

ccc174 1 x dual port ConnectX-3
mini4   2 x single port ConnectX-2
mini2   8 x single port ConnectX-2
MIS20025

The following does work:

using the oob connection manager in 1.7.3:
everything works, except latencies are really bad compared to 1.8.8

udcm in 1.8.8:
everything works as long as I exclude mlx4_0:2 by setting:
 --mca btl_openib_if_include 
'mlx4_0:1,mlx4_1:1,mlx4_2:1,mlx4_3:1,mlx4_4:1,mlx4_5:1,mlx4_6:1,mlx4_7:1'

if I include mlx4_0:2 I get:
[mini4][[62272,1],4][connect/btl_openib_connect_udcm.c:1907:udcm_process_messages] 
could not initialize cpc data for endpoint

libibverbs: ibv_create_ah failed to query port.

rdmacm in 1.8.8 only works between ccc174 and mini4, running across all 
three nodes will produce:


mpirun --mca btl_openib_cpc_include rdmacm --mca 
btl_openib_warn_default_gid_prefix  0  --hostfile ~/hostlist -np 40 
./osu_alltoall

--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   mini2
  Local device: mlx4_7
  Local port:   1
  CPCs attempted:   rdmacm
--
[ccc174][[61500,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[61500,1],9]


Any help would be much appreciated.

Regards,
Tobias


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread David Shrader
I don't have that option on the configure command line, but my platform 
file is using "enable_dlopen=no." I imagine that is getting the same 
result. Thank you for the pointer!


Thanks,
David

On 08/12/2015 05:04 PM, Deva wrote:
do you have "-disable-dlopen" in your configure option? This might 
force coll_ml to be loaded first even with -mca coll ^ml.


next HPCX is expected to release by end of Aug.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader wrote:


I remember seeing those, but forgot about them. I am curious,
though, why using '-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when
that release may happen? Thank you for the help!
David


On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of hcoll symbols conflict with ml coll module
inside OMPI. HCOLL is derived from ml module. This issue is fixed
in hcoll library and will be available in next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader wrote:

Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export
LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM
 WARN  Could not open the KNEM device file at /dev/knem : No
such file or direc
tory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM
 WARN  Could not open the KNEM device file at /dev/knem : No
such file or direc
tory. Won't use knem.
0: Running on host zo-fe1.lanl.gov 
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov
 reporting for duty

This implies to me that some other library is being used
instead of /usr/lib64/libhcoll.so, but I am not sure how that
could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried same tarball on OFED-1.5.4.1 and I could not
reproduce the issue.  Can you do one more quick test with
seeing LD_PRELOAD to hcoll lib?

$LD_PRELOAD= mpirun -n 2 -mca
coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader wrote:

The admin that rolled the hcoll rpm that we're using
(and got it in system space) said that she got it from
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

From where did you grab this HCOLL lib? MOFED or HPCX?
what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader wrote:

Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml
./a.out
App launch reported: 1 (out of 1) daemons - 2 (out
of 2) procs
[1439397957.351764] [zo-fe1:14678:0]
shm.c:65   MXM  WARN  Could not open the
KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0]
shm.c:65   MXM  WARN  Could not open the
KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h

pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641

3 0x00056e4c mxm_error_signal_handler()
 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro

ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616

4 0x000326a0 killpg()  ??:0
5 0x000b82cb
base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3
hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread David Shrader

Hey Jeff,

I'm actually not able to find coll_ml related files at that location. 
All I see are the following files:


[dshrader@zo-fe1 openmpi]$ ls 
/usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/

libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

In this particular build, I am using platform files instead of the 
stripped-down debug builds I was doing before. Could something in the 
platform files have moved the coll_ml related files or combined them 
with something else?


Thanks,
David

On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:

Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI 
as normal, and then rm the following files after installation:

rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, 
and therefore it won't/can't be used (or interfere with the hcoll plugin).



On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:

David,

i guess you do not want to use the ml coll module at all  in openmpi 1.8.8

you can simply do
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so the ml component is not even built

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:

I remember seeing those, but forgot about them. I am curious, though, why using 
'-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when that release may 
happen? Thank you for the help!
David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of hcoll symbols conflict with ml coll module inside OMPI. 
HCOLL is derived from ml module. This issue is fixed in hcoll library and will 
be available in next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead of 
/usr/lib64/libhcoll.so, but I am not sure how that could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried same tarball on OFED-1.5.4.1 and I could not reproduce the issue.  Can 
you do one more quick test with seeing LD_PRELOAD to hcoll lib?

$LD_PRELOAD=  mpirun -n 2  -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  wrote:
The admin that rolled the hcoll rpm that we're using (and got it in system 
space) said that she got it from 
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

 From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  wrote:
Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace 
2 0x00056cdc mxm_handle_error()  
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
3 0x00056e4c mxm_error_signal_handler()  
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
4 0x000326a0 killpg()  ??:0
5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
9 0x0006ace9 hcoll_create_context()  ??:0
10 0x00

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread Nathan Hjelm

David, our platform files disable dlopen. That is why you are not seeing
any component files. coll/ml is built into libmpi.so.
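
A quick way to confirm that (just a sanity check; adjust $prefix to your
install path) is to look for the ml symbols directly in libmpi.so:

   nm -D $prefix/lib/libmpi.so | grep mca_coll_ml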

-Nathan

On Thu, Aug 13, 2015 at 09:23:09AM -0600, David Shrader wrote:
> Hey Jeff,
> 
> I'm actually not able to find coll_ml related files at that location. All I
> see are the following files:
> 
> [dshrader@zo-fe1 openmpi]$ ls
> /usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
> libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so
> 
> In this particular build, I am using platform files instead of the stripped
> down debug builds I was doing before. Could something in the platform files
> move or combine with something else the coll_ml related files?
> 
> Thanks,
> David
> 
> On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:
> >Note that this will require you to have fairly recent GNU Autotools 
> >installed.
> >
> >Another workaround for avoiding the coll ml module would be to install Open 
> >MPI as normal, and then rm the following files after installation:
> >
> >rm $prefix/lib/openmpi/mca_coll_ml*
> >
> >This will physically remove the coll ml plugin from the Open MPI 
> >installation, and therefore it won't/can't be used (or interfere with the 
> >hcoll plugin).
> >
> >
> >>On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:
> >>
> >>David,
> >>
> >>i guess you do not want to use the ml coll module at all  in openmpi 1.8.8
> >>
> >>you can simply do
> >>touch ompi/mca/coll/ml/.ompi_ignore
> >>./autogen.pl
> >>./configure ...
> >>make && make install
> >>
> >>so the ml component is not even built
> >>
> >>Cheers,
> >>
> >>Gilles
> >>
> >>On 8/13/2015 7:30 AM, David Shrader wrote:
> >>>I remember seeing those, but forgot about them. I am curious, though, why 
> >>>using '-mca coll ^ml' wouldn't work for me.
> >>>
> >>>We'll watch for the next HPCX release. Is there an ETA on when that 
> >>>release may happen? Thank you for the help!
> >>>David
> >>>
> >>>On 08/12/2015 04:04 PM, Deva wrote:
> David,
> 
> This is because of hcoll symbols conflict with ml coll module inside 
> OMPI. HCOLL is derived from ml module. This issue is fixed in hcoll 
> library and will be available in next HPCX release.
> 
> Some earlier discussion on this issue:
> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
> 
> -Devendar
> 
> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
> Interesting... the seg faults went away:
> 
> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could 
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could 
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> 0: Running on host zo-fe1.lanl.gov
> 0: We have 2 processors
> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
> 
> This implies to me that some other library is being used instead of 
> /usr/lib64/libhcoll.so, but I am not sure how that could be...
> 
> Thanks,
> David
> 
> On 08/12/2015 03:30 PM, Deva wrote:
> >Hi David,
> >
> >I tried same tarball on OFED-1.5.4.1 and I could not reproduce the 
> >issue.  Can you do one more quick test with seeing LD_PRELOAD to hcoll 
> >lib?
> >
> >$LD_PRELOAD=  mpirun -n 2  -mca coll ^ml 
> >./a.out
> >
> >-Devendar
> >
> >On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  
> >wrote:
> >The admin that rolled the hcoll rpm that we're using (and got it in 
> >system space) said that she got it from 
> >hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
> >
> >Thanks,
> >David
> >
> >
> >On 08/12/2015 10:51 AM, Deva wrote:
> >> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
> >>
> >>On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  
> >>wrote:
> >>Hey Devendar,
> >>
> >>It looks like I still get the error:
> >>
> >>[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> >>App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> >>[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  
> >>Could not open the KNEM device file at /dev/knem : No such file or direc
> >>tory. Won't use knem.
> >>[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  
> >>Could not open the KNEM device file at /dev/knem : No such file or direc
> >>tory. Won't use knem.
> >>[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
> >>[zo-fe1:14

Re: [OMPI users] open mpi upgrade

2015-08-13 Thread Gustavo Correa
Hi Ehsan

You didn't give the details of how you configured and installed Open MPI.
However, you must point configure's --prefix to the installation
directory, say:

./configure --prefix=/opt/openmpi-1.8.8

In addition, the installation directory must be *different* from the 
directory where you build Open MPI (say, where you downloaded
and decompressed the Open MPI source code tarball).

Also, make sure you have write permission to both the build 
and installation directories. 
If you don't have write permission to the installation directory, 
you may need to use "sudo" or "su" when you do "make install".

I hope this helps,
Gus Correa


On Aug 13, 2015, at 1:29 AM, Ehsan Moradi wrote:

> hi,
> my dear friends
> i tried to upgrade my openmpi version from 1.2.8 to 1.8.8
> but after installing it on different directory "/opt/openmpi-1.8.8/" when i 
> enter mpirun its version is 1.2.8 
> and  after installing the directory "/opt/openmpi-1.8.8/" is empty!!
> so what should i do for installing and using new version
> 
> my current version was preinstalled on the system and its have a problem with 
> big jobs i got errno110 (time out), i changed the mac argument to but steel 
> not working
> so if anyone help me to install last version would be a great help.
> os: open suse, 4 nodes ,
> 
> thanks alot
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/08/27451.php



Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread Jeff Squyres (jsquyres)
Ah, if you're using --disable-dlopen, then you won't find individual plugin DSOs.

Instead, you can configure this way:

./configure --enable-mca-no-build=coll-ml ...

This will disable the build of the coll/ml component altogether.




> On Aug 13, 2015, at 11:23 AM, David Shrader  wrote:
> 
> Hey Jeff,
> 
> I'm actually not able to find coll_ml related files at that location. All I 
> see are the following files:
> 
> [dshrader@zo-fe1 openmpi]$ ls 
> /usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
> libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so
> 
> In this particular build, I am using platform files instead of the stripped 
> down debug builds I was doing before. Could something in the platform files 
> move or combine with something else the coll_ml related files?
> 
> Thanks,
> David
> 
> On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:
>> Note that this will require you to have fairly recent GNU Autotools 
>> installed.
>> 
>> Another workaround for avoiding the coll ml module would be to install Open 
>> MPI as normal, and then rm the following files after installation:
>> 
>>rm $prefix/lib/openmpi/mca_coll_ml*
>> 
>> This will physically remove the coll ml plugin from the Open MPI 
>> installation, and therefore it won't/can't be used (or interfere with the 
>> hcoll plugin).
>> 
>> 
>>> On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:
>>> 
>>> David,
>>> 
>>> i guess you do not want to use the ml coll module at all  in openmpi 1.8.8
>>> 
>>> you can simply do
>>> touch ompi/mca/coll/ml/.ompi_ignore
>>> ./autogen.pl
>>> ./configure ...
>>> make && make install
>>> 
>>> so the ml component is not even built
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 8/13/2015 7:30 AM, David Shrader wrote:
 I remember seeing those, but forgot about them. I am curious, though, why 
 using '-mca coll ^ml' wouldn't work for me.
 
 We'll watch for the next HPCX release. Is there an ETA on when that 
 release may happen? Thank you for the help!
 David
 
 On 08/12/2015 04:04 PM, Deva wrote:
> David,
> 
> This is because of hcoll symbols conflict with ml coll module inside 
> OMPI. HCOLL is derived from ml module. This issue is fixed in hcoll 
> library and will be available in next HPCX release.
> 
> Some earlier discussion on this issue:
> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
> 
> -Devendar
> 
> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
> Interesting... the seg faults went away:
> 
> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could 
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could 
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> 0: Running on host zo-fe1.lanl.gov
> 0: We have 2 processors
> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
> 
> This implies to me that some other library is being used instead of 
> /usr/lib64/libhcoll.so, but I am not sure how that could be...
> 
> Thanks,
> David
> 
> On 08/12/2015 03:30 PM, Deva wrote:
>> Hi David,
>> 
>> I tried same tarball on OFED-1.5.4.1 and I could not reproduce the 
>> issue.  Can you do one more quick test with seeing LD_PRELOAD to hcoll 
>> lib?
>> 
>> $LD_PRELOAD=  mpirun -n 2  -mca coll ^ml 
>> ./a.out
>> 
>> -Devendar
>> 
>> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  
>> wrote:
>> The admin that rolled the hcoll rpm that we're using (and got it in 
>> system space) said that she got it from 
>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>> 
>> Thanks,
>> David
>> 
>> 
>> On 08/12/2015 10:51 AM, Deva wrote:
>>> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
>>> 
>>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  
>>> wrote:
>>> Hey Devendar,
>>> 
>>> It looks like I still get the error:
>>> 
>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  
>>> Could not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem.
>>> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  
>>> Could not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Wo

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread Nathan Hjelm

David, to modify that option, edit the toss-common file. It is in the
same location as the platform file. We have a number of components we
disable by default; just add coll-ml to the end of the list.
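
For example (a hypothetical excerpt; the actual contents of the list in
toss-common will differ), the platform-file line to extend looks like:

   # keep whatever components are already listed, and append coll-ml
   enable_mca_no_build=crs,filem,coll-ml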

-Nathan

On Thu, Aug 13, 2015 at 05:19:35PM +, Jeff Squyres (jsquyres) wrote:
> Ah, if you're disable-dlopen, then you won't find individual plugin DSOs.
> 
> Instead, you can configure this way:
> 
> ./configure --enable-mca-no-build=coll-ml ...
> 
> This will disable the build of the coll/ml component altogether.
> 
> 
> 
> 
> > On Aug 13, 2015, at 11:23 AM, David Shrader  wrote:
> > 
> > Hey Jeff,
> > 
> > I'm actually not able to find coll_ml related files at that location. All I 
> > see are the following files:
> > 
> > [dshrader@zo-fe1 openmpi]$ ls 
> > /usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
> > libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so
> > 
> > In this particular build, I am using platform files instead of the stripped 
> > down debug builds I was doing before. Could something in the platform files 
> > move or combine with something else the coll_ml related files?
> > 
> > Thanks,
> > David
> > 
> > On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:
> >> Note that this will require you to have fairly recent GNU Autotools 
> >> installed.
> >> 
> >> Another workaround for avoiding the coll ml module would be to install 
> >> Open MPI as normal, and then rm the following files after installation:
> >> 
> >>rm $prefix/lib/openmpi/mca_coll_ml*
> >> 
> >> This will physically remove the coll ml plugin from the Open MPI 
> >> installation, and therefore it won't/can't be used (or interfere with the 
> >> hcoll plugin).
> >> 
> >> 
> >>> On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  
> >>> wrote:
> >>> 
> >>> David,
> >>> 
> >>> i guess you do not want to use the ml coll module at all  in openmpi 1.8.8
> >>> 
> >>> you can simply do
> >>> touch ompi/mca/coll/ml/.ompi_ignore
> >>> ./autogen.pl
> >>> ./configure ...
> >>> make && make install
> >>> 
> >>> so the ml component is not even built
> >>> 
> >>> Cheers,
> >>> 
> >>> Gilles
> >>> 
> >>> On 8/13/2015 7:30 AM, David Shrader wrote:
>  I remember seeing those, but forgot about them. I am curious, though, 
>  why using '-mca coll ^ml' wouldn't work for me.
>  
>  We'll watch for the next HPCX release. Is there an ETA on when that 
>  release may happen? Thank you for the help!
>  David
>  
>  On 08/12/2015 04:04 PM, Deva wrote:
> > David,
> > 
> > This is because of hcoll symbols conflict with ml coll module inside 
> > OMPI. HCOLL is derived from ml module. This issue is fixed in hcoll 
> > library and will be available in next HPCX release.
> > 
> > Some earlier discussion on this issue:
> > http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> > http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
> > 
> > -Devendar
> > 
> > On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  
> > wrote:
> > Interesting... the seg faults went away:
> > 
> > [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
> > [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> > App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> > [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  
> > Could not open the KNEM device file at /dev/knem : No such file or direc
> > tory. Won't use knem.
> > [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  
> > Could not open the KNEM device file at /dev/knem : No such file or direc
> > tory. Won't use knem.
> > 0: Running on host zo-fe1.lanl.gov
> > 0: We have 2 processors
> > 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
> > 
> > This implies to me that some other library is being used instead of 
> > /usr/lib64/libhcoll.so, but I am not sure how that could be...
> > 
> > Thanks,
> > David
> > 
> > On 08/12/2015 03:30 PM, Deva wrote:
> >> Hi David,
> >> 
> >> I tried same tarball on OFED-1.5.4.1 and I could not reproduce the 
> >> issue.  Can you do one more quick test with seeing LD_PRELOAD to hcoll 
> >> lib?
> >> 
> >> $LD_PRELOAD=  mpirun -n 2  -mca coll 
> >> ^ml ./a.out
> >> 
> >> -Devendar
> >> 
> >> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  
> >> wrote:
> >> The admin that rolled the hcoll rpm that we're using (and got it in 
> >> system space) said that she got it from 
> >> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
> >> 
> >> Thanks,
> >> David
> >> 
> >> 
> >> On 08/12/2015 10:51 AM, Deva wrote:
> >>> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
> >>> 
> >>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  
> >>> wrote:
> >>> Hey Devend

[OMPI users] connection time out (110)

2015-08-13 Thread Ehsan Moradi
Hi my friends,

I am getting a connection time out (110) error even after:

echo 1 > /proc/sys/net/core/somaxconn

echo 10 > /proc/sys/net/core/netdev_max_backlog

mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 my_mpi_program

My program works on 2 nodes only; if I add one more node it breaks and
shows connection time out (110).
What should I do?
I checked all 4 nodes and they can ping each other in about 0.080 s,
and I can run a sample program like hello world on all nodes,
but when the program gets bigger it does not work.
Thank you guys, please help me.


Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-13 Thread Howard Pritchard
Hi Nate,

The odls output helps some.  You have a really big CLASSPATH.   Also there
might be a small chance that the shmem.jar is causing problems.
Could you try undefining your CLASSPATH just to run the test case?

If the little test case still doesn't work, could you reconfigure the mpi
build to not build oshmem?  --disable-oshmem?

We've never tested the oshmem jar.
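
Something along these lines would cover both checks (the install prefix is
just a placeholder for wherever you install):

  unset CLASSPATH
  mpirun -np 1 java MPITestBroke data/

and, if that still segfaults, rebuild without oshmem:

  ./configure --prefix=<install dir> --enable-mpi-java --disable-oshmem CC=gcc
  make all install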

Thanks,

Howard


2015-08-12 18:19 GMT-06:00 Nate Chambers :

> *I appreciate you trying to help! I put the Java and its compiled .class
> file on Dropbox. The directory contains the .java and .class files, as well
> as a data/ directory:*
>
> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
>
> *You can run it with and without MPI:*
>
> >  java MPITestBroke data/
> >  mpirun -np 1 java MPITestBroke data/
>
> *Attached is a text file of what I see when I run it with mpirun and your
> debug flag. Lots of debug lines.*
>
>
> Nate
>
>
>
>
>
> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard 
> wrote:
>
>> Hi Nate,
>>
>> Sorry for the delay in getting back to you.
>>
>> We're somewhat stuck on how to help you, but here are two suggestions.
>>
>> Could you add the following to your launch command line
>>
>> --mca odls_base_verbose 100
>>
>> so we can see exactly what arguments are being feed to java when launching
>> your app.
>>
>> Also, if you could put your MPITestBroke.class file somewhere (like
>> google drive)
>> where we could get it and try to run locally or at NERSC, that might help
>> us
>> narrow down the problem.Better yet, if you have the class or jar file
>> for
>> the entire app plus some data sets, we could try that out as well.
>>
>> All the config outputs, etc. you've sent so far indicate a correct
>> installation
>> of open mpi.
>>
>> Howard
>>
>>
>> On Aug 6, 2015 1:54 PM, "Nate Chambers"  wrote:
>>
>>> Howard,
>>>
>>> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still
>>> segfaults as before. I must admit I am new to MPI, so is it possible I'm
>>> just configuring or running incorrectly? Let me list my steps for you, and
>>> maybe something will jump out? Also attached is my config.log.
>>>
>>>
>>> CONFIGURE
>>> ./configure --prefix= --enable-mpi-java CC=gcc
>>>
>>> MAKE
>>> make all install
>>>
>>> RUN
>>> /mpirun -np 1 java MPITestBroke twitter/
>>>
>>>
>>> DEFAULT JAVA AND GCC
>>>
>>> $ java -version
>>> java version "1.7.0_21"
>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>
>>> $ gcc --v
>>> Using built-in specs.
>>> Target: x86_64-redhat-linux
>>> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
>>> --infodir=/usr/share/info --with-bugurl=
>>> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
>>> --enable-threads=posix --enable-checking=release --with-system-zlib
>>> --enable-__cxa_atexit --disable-libunwind-exceptions
>>> --enable-gnu-unique-object
>>> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
>>> --enable-java-awt=gtk --disable-dssi
>>> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
>>> --enable-libgcj-multifile --enable-java-maintainer-mode
>>> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
>>> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
>>> --build=x86_64-redhat-linux
>>> Thread model: posix
>>> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard 
>>> wrote:
>>>
Hi Nate,

We're trying this out on a Mac running Mavericks and a Cray XC system.
The Mac has Java 8 while the Cray XC has Java 7.

 We could not get the code to run just using the java launch command,
 although we noticed if you add

catch (NoClassDefFoundError e) {
    System.out.println("Not using MPI its out to lunch for now");
}

 as one of the catches after the try for firing up MPI, you can get
 further.

 Instead we tried on the two systems using

 mpirun -np 1 java MPITestBroke tweets repeat.txt

 and, you guessed it, we can't reproduce the error, at least using
 master.

 Would you mind trying to get a copy of nightly master build off of

 http://www.open-mpi.org/nightly/master/

 and install that version and give it a try.

 If that works, then I'd suggest using master (or v2.0) for now.

 Howard




 2015-08-05 14:41 GMT-06:00 Nate Chambers :

> Howard,
>
> Thanks for looking at all this. Adding System.gc() did not cause it to
> segfault. The segfault still comes much later in the processing.
>
> I was able to reduce my code to a single test file without other
> dependencies. It is attached. This code simply opens a text file and reads
> its lines, one by one. Once finished, it closes and opens the same file 
> and
> reads the li

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-13 Thread David Shrader
Interestingly enough, I have found that using --disable-dlopen causes 
the seg fault whether or not --enable-mca-no-build=coll-ml is used. That 
is, the following configure line generates a build of Open MPI that will 
*not* seg fault when running a simple hello world program:


./configure --prefix=/tmp/dshrader-ompi-1.8.8-install 
--enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll


While the following configure line will produce a build of Open MPI that 
*will* seg fault with the same error I mentioned before:


./configure --prefix=/tmp/dshrader-ompi-1.8.8-install 
--enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll --disable-dlopen


I'm not sure why this would be.

Thanks,
David

On 08/13/2015 11:19 AM, Jeff Squyres (jsquyres) wrote:

Ah, if you're disable-dlopen, then you won't find individual plugin DSOs.

Instead, you can configure this way:

 ./configure --enable-mca-no-build=coll-ml ...

This will disable the build of the coll/ml component altogether.

 




On Aug 13, 2015, at 11:23 AM, David Shrader  wrote:

Hey Jeff,

I'm actually not able to find coll_ml related files at that location. All I see 
are the following files:

[dshrader@zo-fe1 openmpi]$ ls 
/usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

In this particular build, I am using platform files instead of the stripped 
down debug builds I was doing before. Could something in the platform files 
move or combine with something else the coll_ml related files?

Thanks,
David

On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:

Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI 
as normal, and then rm the following files after installation:

rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, 
and therefore it won't/can't be used (or interfere with the hcoll plugin).



On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet  wrote:

David,

i guess you do not want to use the ml coll module at all  in openmpi 1.8.8

you can simply do
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so the ml component is not even built

Cheers,

Gilles

On 8/13/2015 7:30 AM, David Shrader wrote:

I remember seeing those, but forgot about them. I am curious, though, why using 
'-mca coll ^ml' wouldn't work for me.

We'll watch for the next HPCX release. Is there an ETA on when that release may 
happen? Thank you for the help!
David

On 08/12/2015 04:04 PM, Deva wrote:

David,

This is because of hcoll symbols conflict with ml coll module inside OMPI. 
HCOLL is derived from ml module. This issue is fixed in hcoll library and will 
be available in next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader  wrote:
Interesting... the seg faults went away:

[dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
0: Running on host zo-fe1.lanl.gov
0: We have 2 processors
0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty

This implies to me that some other library is being used instead of 
/usr/lib64/libhcoll.so, but I am not sure how that could be...

Thanks,
David

On 08/12/2015 03:30 PM, Deva wrote:

Hi David,

I tried same tarball on OFED-1.5.4.1 and I could not reproduce the issue.  Can 
you do one more quick test with seeing LD_PRELOAD to hcoll lib?

$LD_PRELOAD=  mpirun -n 2  -mca coll ^ml ./a.out

-Devendar

On Wed, Aug 12, 2015 at 12:52 PM, David Shrader  wrote:
The admin that rolled the hcoll rpm that we're using (and got it in system 
space) said that she got it from 
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David


On 08/12/2015 10:51 AM, Deva wrote:

 From where did you grab this HCOLL lib?  MOFED or HPCX? what version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader  wrote:
Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or direc
tory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN