Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Jeff Squyres (jsquyres)
Ah, I see the "sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or directory" message now -- I was looking for something like that when I replied before and missed it. I really wish I understood why the heck that is happening; it doesn't seem to make sense. Matt: Random

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Ralph Castain
I can answer that for you right now. The launch of the orteds is what is failing, and they are "silently" failing at this time. The reason is simple: 1. we are failing due to truncation of the HNP URI at the first semicolon. This causes the orted to emit an ORTE_ERROR_LOG message and then
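For context, an HNP URI is a single string combining the job identifier and the daemon's contact addresses; an illustrative example (made-up job id, addresses taken from the error message quoted above) looks like

    660602880.0;tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686

Cutting that string at the first semicolon strips the transport portion, and a leftover "tcp://..." fragment reaching a shell would be consistent with the "sh: tcp://...: No such file or directory" message reported earlier in the thread.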

Re: [OMPI users] How does binding option affect network traffic?

2014-09-02 Thread Jeff Squyres (jsquyres)
Ah, ok -- I think I missed this part of the thread: each of your individual MPI processes sucks up huge gobs of memory. So just to be clear, in general: you don't intend to run more MPI processes than cores per server, *and* you intend to run fewer MPI processes per server than would consume
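As a concrete illustration of that kind of launch (the executable name and counts below are placeholders, not from the thread), capping the rank count per node keeps the aggregate memory footprint on each server bounded:

    mpirun -np 16 --map-by ppr:2:node --bind-to core ./my_app

Here ppr:2:node places at most two ranks on each node, i.e. fewer processes than cores.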

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-09-02 Thread Ralph Castain
Argh - yeah, I got confused as things context switched a few too many times. The 1.8.2 release should certainly understand that arrangement, and --hetero-nodes. The only way it wouldn't see the latter is if you configure it --without-hwloc, or hwloc refused to build. Since there was a question
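A quick way to confirm that an installation actually includes hwloc support (a suggested check, not part of the original reply) is to look for the hwloc framework in the ompi_info output:

    ompi_info | grep -i hwloc

If nothing is listed, the build was likely configured --without-hwloc or hwloc failed to build, and per the above, options such as --hetero-nodes will not be understood.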

Re: [OMPI users] Open MPI 1.6.5 or 1.8.1 Please respond to swa...@us.ibm.com

2014-09-02 Thread Jeff Squyres (jsquyres)
Please send the information listed here: http://www.open-mpi.org/community/help/ On Sep 2, 2014, at 2:10 PM, Swamy Kandadai wrote: > Hi: > While building OpenMPI (1.6.5 or 1.8.1) using openib on our power8 cluster > with Mellanox IB (FDR) I get the following error: >

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-09-02 Thread Lane, William
Ralph, These latest issues (since 8/28/14) all occurred after we upgraded our cluster to OpenMPI 1.8.2. Maybe I should've created a new thread rather than tacking these issues onto my existing thread. -Bill Lane From: users

Re: [OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Ralph Castain
The difficulty here is that you have bundled several errors again into a single message, making it hard to keep the conversation from getting terribly confused. I was trying to address the segfault errors on cleanup, which have nothing to do with the accept being rejected. It looks like those

[OMPI users] Open MPI 1.6.5 or 1.8.1 Please respond to swa...@us.ibm.com

2014-09-02 Thread Swamy Kandadai
Hi: While building OpenMPI (1.6.5 or 1.8.1) using openib on our power8 cluster with Mellanox IB (FDR) I get the following error: configure: WARNING: infiniband/verbs.h: present but cannot be compiled configure: WARNING: infiniband/verbs.h: check for missing prerequisite headers? configure:
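The compiler error behind a "present but cannot be compiled" warning is captured in config.log; a minimal way to reproduce it outside of configure (throwaway file name, and assuming gcc is the compiler configure picked) is:

    echo '#include <infiniband/verbs.h>' > conftest.c
    gcc -c conftest.c

As the second warning hints, the usual culprit is a missing prerequisite header or a compiler/flag mismatch rather than a missing library.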

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-09-02 Thread Ralph Castain
On Sep 2, 2014, at 10:48 AM, Lane, William wrote: > Ralph, > > There are at least three different permutations of CPU configurations in the > cluster > involved. Some are blades that have two sockets with two cores per Intel CPU > (and not all > sockets are filled).

Re: [OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Ralph Castain
I don't see any line numbers on the errors I flagged - all I see are the usual memory offsets in bytes, which is of little help. I'm afraid I don't know what you'd have to do under SunOS to get line numbers, but I can't do much without them. On Sep 2, 2014, at 10:26 AM, Siegmar Gross

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-09-02 Thread Lane, William
Ralph, There are at least three different permutations of CPU configurations in the cluster involved. Some are blades that have two sockets with two cores per Intel CPU (and not all sockets are filled). Some are IBM x3550 systems having two sockets with three cores per Intel CPU (and not all

Re: [OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Ralph Castain
Hi Siegmar Could you please configure this OMPI install with --enable-debug so that gdb will provide line numbers where the error is occurring? Otherwise, I'm having a hard time chasing this problem down. Thanks Ralph On Sep 2, 2014, at 6:01 AM, Siegmar Gross
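For reference, a debug build would be configured along these lines (the install prefix is a placeholder; --enable-debug is the important part):

    ./configure --prefix=$HOME/openmpi-1.9-debug --enable-debug
    make all install

so that gdb can translate the raw addresses in the backtrace into file and line numbers.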

Re: [OMPI users] SIGSEGV with Java, openmpi-1.8.2, and Sun C and gcc-4.9.0

2014-09-02 Thread Ralph Castain
I believe this was fixed in the trunk and is now scheduled to come across to 1.8.3 On Sep 2, 2014, at 4:21 AM, Siegmar Gross wrote: > Hi, > > yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc > (tyr), Solaris 10 x86_64 (sunpc0), and

Re: [OMPI users] same problems and bus error with openmpi-1.9a1r32657 and gcc

2014-09-02 Thread Ralph Castain
Would you please try r32662? I believe I finally found and fixed this problem. On Sep 2, 2014, at 6:12 AM, Siegmar Gross wrote: > Hi, > > yesterday I installed openmpi-1.9a1r32657 on my machines (Solaris > 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and

Re: [OMPI users] How does binding option affect network traffic?

2014-09-02 Thread McGrattan, Kevin B. Dr.
Thanks for the advice. Our jobs vary in size, from just a few MPI processes to about 64. Jobs are submitted at random, which is why I want to map by socket. If the cluster is empty, and someone submits a job with 16 MPI processes, I would think it would run most efficiently if it used 8 nodes,
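The kind of invocation being discussed would look something like this (hypothetical executable name):

    mpirun -np 16 --map-by socket --bind-to core ./solver

Whether the 16 ranks actually end up spread one per socket across 8 nodes depends on the mapping policy and on what the scheduler allocated, which is part of what this thread is sorting out.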

Re: [hwloc-users] setting memory bindings

2014-09-02 Thread Aulwes, Rob
great, thanks Brice!

[OMPI users] same problems and bus error with openmpi-1.9a1r32657 and gcc

2014-09-02 Thread Siegmar Gross
Hi, yesterday I installed openmpi-1.9a1r32657 on my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64 (linpc0)) with Sun C 5.12 and gcc-4.9.0. I have the following problems with my gcc version. First once more my problems with Java and below my problems

[OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Siegmar Gross
Hi, yesterday I installed openmpi-1.9a1r32657 on my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64 (linpc0)) with Sun C 5.12 and gcc-4.9.0. I have the following problems with my Sun C version. First my problem with Java and below my problem with C.

[OMPI users] Java problem with openmpi-1.8.3a1r32641

2014-09-02 Thread Siegmar Gross
Hi, yesterday I installed openmpi-1.8.3a1r32641 on my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64 (linpc0)) with Sun C 5.12 and gcc-4.9.0. A small Java program breaks with SIGSEGV. gdb shows the following backtrace for the Sun C version. tyr java

Re: [OMPI users] bus error with openmpi-1.8.2 and gcc-4.9.0

2014-09-02 Thread Siegmar Gross
Hi Takahiro, > I forgot to follow the previous report, sorry. > The patch I suggested is not included in Open MPI 1.8.2. > The backtrace Siegmar reported points to the problem that I fixed > in the patch. > > http://www.open-mpi.org/community/lists/users/2014/08/24968.php > > Siegmar: > Could

[OMPI users] SIGSEGV with Java, openmpi-1.8.2, and Sun C and gcc-4.9.0

2014-09-02 Thread Siegmar Gross
Hi, yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64 (linpc0)) with Sun C 5.12. A small Java program works on Linux, but breaks with a segmentation fault on Solaris 10. tyr java 172 where mpijavac mpijavac is

[OMPI users] bus error with openmpi-1.8.2 and gcc-4.9.0

2014-09-02 Thread Siegmar Gross
Hi, yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64 (linpc0)) with gcc-4.9.0. A small program works on some machines, but breaks with a bus error on Solaris 10 Sparc. tyr small_prog 118 which mpicc

Re: [hwloc-users] setting memory bindings

2014-09-02 Thread Brice Goglin
Hello, I am coming back to this thread to fix things before releasing v1.10. Regarding your question below, there's already an answer with hwloc_topology_get_support(), which returns things like support->membind->replicate_membind (set to 0 or 1 depending on whether the policy is supported).
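A minimal sketch of that query (illustrative only, assuming an hwloc 1.x installation; compile with something like gcc query.c -o query $(pkg-config --cflags --libs hwloc)):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    const struct hwloc_topology_support *support;

    /* Build the topology for the current machine. */
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Ask hwloc which memory-binding features the OS supports. */
    support = hwloc_topology_get_support(topology);
    printf("set this-process membind: %d\n", support->membind->set_thisproc_membind);
    printf("bind policy supported:    %d\n", support->membind->bind_membind);
    printf("interleave supported:     %d\n", support->membind->interleave_membind);
    printf("replicate supported:      %d\n", support->membind->replicate_membind);

    hwloc_topology_destroy(topology);
    return 0;
}

Each field is 0 or 1 depending on whether the corresponding policy is supported, as described above.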