Re: [OMPI users] OpenMPI + InfiniBand

2016-12-23 Thread gilles
 Serguei,

this looks like a very different issue: orted cannot be remotely started.

that typically occurs if orted cannot find some dependencies

(the Open MPI libs and/or the compiler runtime)

for example, from a node, ssh <node> orted should not fail because 
of unresolved dependencies.

a simple trick is to replace

mpirun ...

with

`which mpirun` ...

a better option (as long as you do not plan to relocate Open MPI install 
dir) is to configure with

--enable-mpirun-prefix-by-default
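to illustrate why the `which mpirun` trick works, here is a minimal sketch 
that runs anywhere (no cluster needed; "ls" stands in for mpirun, since 
"command -v" on a shell builtin would not return a path): when the launcher 
is invoked via an absolute path, Open MPI can derive the install prefix and 
pass it to the remote orted, instead of relying on the remote shell's 
non-interactive PATH.

```shell
#!/bin/sh
# Sketch of the `which mpirun` trick. "ls" stands in for mpirun so the
# example runs anywhere; on a real system you would use:
#   MPIRUN=$(command -v mpirun)
MPIRUN=$(command -v ls)

# An absolute path lets mpirun export its prefix to the remote orted;
# a bare name forces the remote shell to resolve it from its own PATH.
case "$MPIRUN" in
    /*) echo "absolute path ($MPIRUN): remote orted inherits the prefix" ;;
    *)  echo "bare name ($MPIRUN): remote PATH must already be correct" ;;
esac
```

the same effect is what --enable-mpirun-prefix-by-default bakes in at 
configure time.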

Cheers,

Gilles

- Original Message -

Hi All !

As there have been no positive changes with the "UDSM + IPoIB" problem 
since my previous post,
we installed IPoIB on the cluster, and the "No OpenFabrics connection..." 
error no longer appears.
But now Open MPI reports another problem:

In app ERROR OUTPUT stream:

[node2:14142] [[37935,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/plm_base_launch_support.c at line 1035

In app OUTPUT stream:


--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

--

When I run the task on a single node, everything works properly.
But when I specify "run on 2 nodes", the problem appears.

I tried to ping using the IPoIB addresses: all hosts are resolved 
properly,
and ping requests and replies go over IB without any problems.
So all nodes (including the head node) see each other via IPoIB.
But the MPI app still fails.

The same test task works perfectly on all nodes when run over the Ethernet 
transport instead of InfiniBand.

P.S. We use Torque resource manager to enqueue MPI tasks.

Best regards,
Sergei.



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-12-23 Thread Paul Kapinos

Hi all,

we discussed this issue with Intel compiler support and it looks like they now 
know what the issue is and how to work around it. It is a known issue resulting 
from a backwards incompatibility in an OS/glibc update, cf. 
https://sourceware.org/bugzilla/show_bug.cgi?id=20019


Affected versions of the Intel compilers: 16.0.3, 16.0.4
Not affected versions: 16.0.2, 17.0
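a site could guard against the affected versions in a wrapper or module 
script. The sketch below hard-codes a sample version string so it runs 
standalone; a real check would parse the actual compiler output (e.g. 
"ifort --version"), and the parsing shown in the comment is an assumption 
about that output's format.

```shell
#!/bin/sh
# Sketch: flag the Intel compiler versions affected by glibc bug 20019.
# "ver" is a hard-coded stand-in; a real check might use something like:
#   ver=$(ifort --version 2>/dev/null | awk 'NR==1 {print $3}')
ver="16.0.3"

case "$ver" in
    16.0.3*|16.0.4*) status=affected ;;
    16.0.2*|17.*)    status=ok ;;
    *)               status=unknown ;;
esac
echo "Intel compiler $ver: $status"
```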

So, simply do not use the affected versions (and hope for a bugfix update in 
the 16.x series if, like us, you cannot immediately upgrade to 17.x, although 
upgrading is the option Intel favours).


Have a nice Christmas time!

Paul Kapinos

On 12/14/16 13:29, Paul Kapinos wrote:

Hello all,
we seem to run into the same issue: 'mpif90' sigsegvs immediately for Open MPI
1.10.4 compiled using Intel compilers 16.0.4.258 and 16.0.3.210, while it works
fine when compiled with 16.0.2.181.

It seems to be a compiler issue (more exactly, a library issue with the libs
delivered with the 16.0.4.258 and 16.0.3.210 versions). Changing the loaded
compiler version back to 16.0.2.181 (=> a change of the dynamically loaded
libs) lets the previously-failing binary (compiled with the newer compilers)
work properly.

Compiling with -O0 does not help. As the issue is likely in the Intel libs
(as said, swapping these out resolves/reintroduces the issue), we will fall
back to the 16.0.2.181 compiler version. We will try to open a case with
Intel - let's see...

Have a nice day,

Paul Kapinos



On 05/06/16 14:10, Jeff Squyres (jsquyres) wrote:

Ok, good.

I asked that question because typically when we see errors like this, it is
usually either a busted compiler installation or inadvertently mixing the
run-times of multiple different compilers in some kind of incompatible way.
Specifically, the mpifort (aka mpif90) application is a fairly simple program
-- there's no reason it should segv, especially with a stack trace that you
sent that implies that it's dying early in startup, potentially even before it
has hit any Open MPI code (i.e., it could even be pre-main).

BTW, you might be able to get a more complete stack trace from the debugger
that comes with the Intel compiler (idb?  I don't remember offhand).

Since you are able to run simple programs compiled by this compiler, it sounds
like the compiler is working fine.  Good!

The next thing to check is to see if somehow the compiler and/or run-time
environments are getting mixed up.  E.g., the apps were compiled for one
compiler/run-time but are being used with another.  Also ensure that any
compiler/linker flags that you are passing to Open MPI's configure script are
native and correct for the platform for which you're compiling (e.g., don't
pass in flags that optimize for a different platform; that may result in
generating machine code instructions that are invalid for your platform).
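one way to spot such run-time mixing is to inspect which libraries the 
dynamic loader would actually pull in. The sketch below greps a canned 
ldd-style sample (the paths and library names are hypothetical) so it runs 
standalone; on the real system you would inspect live output, e.g. 
"ldd $(which mpif90)", and check that every Intel runtime line resolves to 
a single version's directory.

```shell
#!/bin/sh
# Sketch: detect mixed Intel runtime versions in ldd-style output.
# "sample" is canned, hypothetical output; in practice you would use:
#   sample=$(ldd "$(command -v mpif90)")
sample='libimf.so => /opt/intel/16.0.2.181/lib/libimf.so
libifcore.so => /opt/intel/16.0.4.258/lib/libifcore.so'

# Extract every Intel version number that appears in the library paths.
versions=$(printf '%s\n' "$sample" | grep -o '16\.0\.[0-9.]*' | sort -u)
n=$(printf '%s\n' "$versions" | wc -l)

if [ "$n" -gt 1 ]; then
    echo "mixed Intel runtimes detected:"
    printf '%s\n' "$versions"
else
    echo "single runtime: $versions"
fi
```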

Try recompiling/re-installing Open MPI from scratch, and if it still doesn't
work, then send all the information listed here:

https://www.open-mpi.org/community/help/



On May 6, 2016, at 3:45 AM, Giacomo Rossi  wrote:

Yes, I've tried three simple "Hello world" programs in Fortran, C, and C++,
and they compile and run with Intel 16.0.3. The problem is with the Open MPI
compiled from source.

Giacomo Rossi Ph.D., Space Engineer

Research Fellow at Dept. of Mechanical and Aerospace Engineering, "Sapienza"
University of Rome
p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com

Member of Fortran-FOSS-programmers


2016-05-05 11:15 GMT+02:00 Giacomo Rossi :
 gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no
debugging symbols found)...done.
(gdb) r -v
Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v

Program received signal SIGSEGV, Segmentation fault.
0x76858f38 in ?? ()
(gdb) bt
#0  0x76858f38 in ?? ()
#1  0x77de5828 in _dl_relocate_object () from
/lib64/ld-linux-x86-64.so.2
#2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
#3  0x77df029c in _dl_sysdep_start () from /lib64/ld-linux-x86-64.so.2
#4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
#5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
#6  0x0002 in ?? ()
#7  0x7fffaa8a in ?? ()
#8  0x7f

Re: [OMPI users] OpenMPI + InfiniBand

2016-12-23 Thread r...@open-mpi.org
Also check to ensure you are using the same version of OMPI on all nodes - this 
message usually means that a different version was used on at least one node.
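a quick consistency check along those lines can compare what every node 
reports. The sketch below works on canned sample output (the node names 
and version strings are hypothetical) so it runs standalone; on a real 
cluster you would gather live data, e.g. 
"for n in node1 node2; do ssh $n ompi_info --version; done".

```shell
#!/bin/sh
# Sketch: verify all nodes report the same Open MPI version.
# "reported" is canned, hypothetical "<node> <version>" output.
reported='node1 1.10.4
node2 1.10.4
node3 1.10.2'

# Count how many distinct versions appear across the nodes.
distinct=$(printf '%s\n' "$reported" | awk '{print $2}' | sort -u | wc -l)

if [ "$distinct" -eq 1 ]; then
    echo "OK: all nodes run the same Open MPI version"
else
    echo "MISMATCH: nodes disagree on the Open MPI version"
    printf '%s\n' "$reported"
fi
```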

> On Dec 23, 2016, at 1:58 AM, gil...@rist.or.jp wrote:
> 
>  Serguei,
> 
>  
> this looks like a very different issue, orted cannot be remotely started.
> 
>  
> that typically occurs if orted cannot find some dependencies
> 
> (the Open MPI libs and/or the compiler runtime)
> 
>  
> for example, from a node, ssh <node> orted should not fail because of 
> unresolved dependencies.
> 
> a simple trick is to replace
> 
> mpirun ...
> 
> with
> 
> `which mpirun` ...
> 
>  
> a better option (as long as you do not plan to relocate Open MPI install dir) 
> is to configure with
> 
> --enable-mpirun-prefix-by-default
> 
>  
> Cheers,
> 
>  
> Gilles
> 
> - Original Message -
> 
> Hi All !
> As there have been no positive changes with the "UDSM + IPoIB" problem since 
> my previous post, 
> we installed IPoIB on the cluster, and the "No OpenFabrics connection..." 
> error no longer appears.
> But now Open MPI reports another problem:
> 
> In app ERROR OUTPUT stream:
> 
> [node2:14142] [[37935,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space 
> in file base/plm_base_launch_support.c at line 1035
> 
> In app OUTPUT stream:
> 
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --
> 
> When I run the task on a single node, everything works properly.
> But when I specify "run on 2 nodes", the problem appears.
> 
> I tried to run ping using IPoIB addresses and all hosts are resolved 
> properly, 
> ping requests and replies are going over IB without any problems.
> So all nodes (including head) see each other via IPoIB.
> But MPI app fails.
> 
> The same test task works perfectly on all nodes when run over the Ethernet 
> transport instead of InfiniBand.
> 
> P.S. We use Torque resource manager to enqueue MPI tasks.
> 
> Best regards,
> Sergei.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-12-23 Thread Howard Pritchard
Hi Paul,

Thanks very much for the Christmas present.

The Open MPI README has been updated
to include a note about issues with the Intel 16.0.3-4 compiler suites.

Enjoy the holidays,

Howard


2016-12-23 3:41 GMT-07:00 Paul Kapinos :

> Hi all,
>
> we discussed this issue with Intel compiler support and it looks like they
> now know what the issue is and how to work around it. It is a known issue
> resulting from a backwards incompatibility in an OS/glibc update, cf.
> https://sourceware.org/bugzilla/show_bug.cgi?id=20019
>
> Affected versions of the Intel compilers: 16.0.3, 16.0.4
> Not affected versions: 16.0.2, 17.0
>
> So, simply do not use the affected versions (and hope for a bugfix update in
> the 16.x series if, like us, you cannot immediately upgrade to 17.x, although
> upgrading is the option Intel favours).
>
> Have a nice Christmas time!
>
> Paul Kapinos
>
> On 12/14/16 13:29, Paul Kapinos wrote:
>
>> Hello all,
>> we seem to run into the same issue: 'mpif90' sigsegvs immediately for
>> Open MPI
>> 1.10.4 compiled using Intel compilers 16.0.4.258 and 16.0.3.210, while it
>> works
>> fine when compiled with 16.0.2.181.
>>
>> It seems to be a compiler issue (more exactly: library issue on libs
>> delivered
>> with 16.0.4.258 and 16.0.3.210 versions). Changing the version of compiler
>> loaded back to 16.0.2.181 (=> change of dynamically loaded libs) lets the
>> previously-failing binary (compiled with newer compilers) work
>> properly.
>>
>> Compiling with -O0 does not help. As the issue is likely in the Intel
>> libs (as
>> said, swapping these out resolves/reintroduces the issue), we will fall back
>> to the 16.0.2.181 compiler version. We will try to open a case with Intel -
>> let's see...
>>
>> Have a nice day,
>>
>> Paul Kapinos
>>
>>
>>
>> On 05/06/16 14:10, Jeff Squyres (jsquyres) wrote:
>>
>>> Ok, good.
>>>
>>> I asked that question because typically when we see errors like this, it
>>> is
>>> usually either a busted compiler installation or inadvertently mixing the
>>> run-times of multiple different compilers in some kind of incompatible
>>> way.
>>> Specifically, the mpifort (aka mpif90) application is a fairly simple
>>> program
>>> -- there's no reason it should segv, especially with a stack trace that
>>> you
>>> sent that implies that it's dying early in startup, potentially even
>>> before it
>>> has hit any Open MPI code (i.e., it could even be pre-main).
>>>
>>> BTW, you might be able to get a more complete stack trace from the
>>> debugger
>>> that comes with the Intel compiler (idb?  I don't remember offhand).
>>>
>>> Since you are able to run simple programs compiled by this compiler, it
>>> sounds
>>> like the compiler is working fine.  Good!
>>>
>>> The next thing to check is to see if somehow the compiler and/or run-time
>>> environments are getting mixed up.  E.g., the apps were compiled for one
>>> compiler/run-time but are being used with another.  Also ensure that any
>>> compiler/linker flags that you are passing to Open MPI's configure
>>> script are
>>> native and correct for the platform for which you're compiling (e.g.,
>>> don't
>>> pass in flags that optimize for a different platform; that may result in
>>> generating machine code instructions that are invalid for your platform).
>>>
>>> Try recompiling/re-installing Open MPI from scratch, and if it still
>>> doesn't
>>> work, then send all the information listed here:
>>>
>>> https://www.open-mpi.org/community/help/
>>>
>>>
>>> On May 6, 2016, at 3:45 AM, Giacomo Rossi  wrote:

 Yes, I've tried three simple "Hello world" programs in Fortran, C, and
 C++, and
 they compile and run with Intel 16.0.3. The problem is with the Open MPI
 compiled from source.

 Giacomo Rossi Ph.D., Space Engineer

 Research Fellow at Dept. of Mechanical and Aerospace Engineering,
 "Sapienza"
 University of Rome
 p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com

 Member of Fortran-FOSS-programmers


 2016-05-05 11:15 GMT+02:00 Giacomo Rossi :
  gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
 GNU gdb (GDB) 7.11
 Copyright (C) 2016 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later <
 http://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type "show
 copying"
 and "show warranty" for details.
 This GDB was configured as "x86_64-pc-linux-gnu".
 Type "show configuration" for configuration details.
 For bug reporting instructions, please see:
 .
 Find the GDB manual and other documentation resources online at:
 .
 For help, type "help".
 Type "apropos word" to search for commands related to "word"...
 Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no
 debugging symbols found)