Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-11-30 Thread Götz Waschk
Dear Jeff,
I'm using openmpi as shipped by OpenHPC, so I'll upgrade 1.10 to 1.10.7 when they do. But it isn't 1.10 that is failing for me but openmpi 3.0.0.
Regards, Götz
On Thu, Nov 30, 2017 at 4:24 PM, Jeff Squyres (jsquyres) wrote:
> Can you upgrade to 1.10.7? That's

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-11-30 Thread Jeff Squyres (jsquyres)
Can you upgrade to 1.10.7? That's the last release in the v1.10 series, and has all the latest bug fixes.
> On Nov 30, 2017, at 9:53 AM, Götz Waschk wrote:
>
> Hi everyone,
>
> I have managed to solve the first part of this problem. It was caused
> by the quota on

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-11-30 Thread Götz Waschk
Hi everyone, I have managed to solve the first part of this problem. It was caused by the quota on /tmp, which is where the session directory of openmpi was stored. There's an XFS default quota of 100MB to prevent users from filling up /tmp. Instead of an over quota message, the result was the
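One possible workaround for a quota-limited /tmp is to point the session directory somewhere else. The following is only a dry-run sketch: the thread does not say which fix was applied, the scratch path is hypothetical, and IMB-MPI1 with 1024 processes is assumed from the original report. It relies on the orte_tmpdir_base MCA parameter, which as far as I know is available in the 1.10 through 3.x series.

```shell
#!/bin/sh
# Dry-run sketch: print an mpirun command line that moves the Open MPI
# session directory off /tmp. Nothing is executed here.
# SCRATCH is a hypothetical directory that is not subject to the /tmp quota.
SCRATCH="${SCRATCH:-/scratch/${USER:-nobody}}"

session_dir_cmd() {
    # $1: base directory under which the session directory is created.
    # The process count and benchmark binary are assumptions, not taken
    # from a command line shown in the thread.
    echo "mpirun --mca orte_tmpdir_base $1 -np 1024 ./IMB-MPI1"
}

session_dir_cmd "$SCRATCH"
```

Running `df -h /tmp` on a compute node before and during the job is a quick way to see whether the session directory is approaching the limit.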

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-29 Thread Gilles Gouaillardet
Hi, yes, please open an issue on GitHub, and post your configure and mpirun command lines. Ideally, could you try the latest v1.10.6 and v2.1.0? If you can reproduce the issue with a smaller number of MPI tasks, that would be great too. Cheers, Gilles On 3/28/2017 11:19 PM, Götz

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-28 Thread Götz Waschk
Hi everyone, how do I proceed with this problem? Do you need more information? Should I open a bug report on GitHub? Regards, Götz Waschk ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Götz Waschk
On Thu, Mar 23, 2017 at 2:37 PM, Götz Waschk wrote:
> I have also tried mpirun --mca coll ^tuned --mca btl tcp,openib , this
> finished fine, but was quite slow. I am currently testing with mpirun
> --mca coll ^tuned
This one also ran fine.

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Götz Waschk
Hi Gilles,
On Thu, Mar 23, 2017 at 10:33 AM, Gilles Gouaillardet wrote:
> mpirun --mca btl openib,self ...
Looks like this didn't finish; I had to terminate the job during the Gather step with 32 processes.
> Then can you try
> mpirun --mca coll ^tuned --mca btl

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Götz Waschk
Hi Gilles,
I'm currently testing and here are some preliminary results:
On Thu, Mar 23, 2017 at 10:33 AM, Gilles Gouaillardet wrote:
> Can you please try
> mpirun --mca btl tcp,self ...
This failed to produce the program output; there were lots of errors like

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Gilles Gouaillardet
Can you please try
mpirun --mca btl tcp,self ...
And if it works
mpirun --mca btl openib,self ...
Then can you try
mpirun --mca coll ^tuned --mca btl tcp,self ...
That will help figure out whether the error is in the pml or the coll framework/module.
Cheers, Gilles
On Thursday, March 23,
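The three runs suggested in this message can be written down as a small dry-run script. This is a minimal sketch rather than the exact commands used in the thread: the process count and the IMB-MPI1 binary are assumptions standing in for the elided "..." arguments, and the script only prints the command lines instead of launching anything.

```shell
#!/bin/sh
# Dry-run sketch: print the three mpirun variants from the message above.
# NP and BENCH are assumptions that replace the elided "..." arguments.
NP="${NP:-1024}"
BENCH="${BENCH:-./IMB-MPI1}"

print_isolation_runs() {
    # 1. TCP transport only (plus self for loopback); the openib BTL is not used.
    echo "mpirun -np $NP --mca btl tcp,self $BENCH"
    # 2. If that works: InfiniBand (openib) transport only.
    echo "mpirun -np $NP --mca btl openib,self $BENCH"
    # 3. TCP again, but with the tuned collective component excluded.
    echo "mpirun -np $NP --mca coll ^tuned --mca btl tcp,self $BENCH"
}

print_isolation_runs
```

If run 3 succeeds where run 1 fails, the coll framework becomes the more likely suspect. Starting with a smaller NP also follows the later request in the thread to reproduce the problem with fewer MPI tasks.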

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Åke Sandgren
Ok, we have E5-2690v4's and Connect-IB.
On 03/23/2017 10:11 AM, Götz Waschk wrote:
> On Thu, Mar 23, 2017 at 9:59 AM, Åke Sandgren wrote:
>> E5-2697A which version? v4?
> Hi, yes, that one:
> Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz
>
> Regards, Götz
>

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Götz Waschk
On Thu, Mar 23, 2017 at 9:59 AM, Åke Sandgren wrote:
> E5-2697A which version? v4?
Hi, yes, that one:
Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz
Regards, Götz

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Åke Sandgren
E5-2697A which version? v4?
On 03/23/2017 09:53 AM, Götz Waschk wrote:
> Hi Åke,
>
> I have E5-2697A CPUs and Mellanox ConnectX-3 FDR Infiniband. I'm using
> EL7.3 as the operating system.
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden Internet: a...@hpc2n.umu.se Phone: +46

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Götz Waschk
Hi Howard, I had tried to send the config.log of my 2.1.0 build, but I guess it was too big for the list. I'm trying again with a compressed file. I have based it on the OpenHPC package. Unfortunately, it still crashes when the vader btl is disabled with this command line: mpirun --mca btl "^vader"

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Götz Waschk
Hi Åke,
I have E5-2697A CPUs and Mellanox ConnectX-3 FDR Infiniband. I'm using EL7.3 as the operating system.
Regards, Götz Waschk
On Thu, Mar 23, 2017 at 9:28 AM, Åke Sandgren wrote:
> Since i'm seeing similar Bus errors from both openmpi and other places
> on our

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Åke Sandgren
Since I'm seeing similar Bus errors from both openmpi and other places on our system, I'm wondering: what hardware do you have? CPUs, interconnect, etc.
On 03/23/2017 08:45 AM, Götz Waschk wrote:
> Hi Howard,
>
> I have attached my config.log file for version 2.1.0. I have based it
> on the

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Howard Pritchard
Forgot: you probably need an equals sign after the btl arg.
Howard Pritchard wrote on Wed, Mar 22, 2017 at 18:11:
> Hi Goetz
>
> Thanks for trying these other versions. Looks like a bug. Could you post
> the config.log output from your build of the 2.1.0 to the list?
>
> Also

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Howard Pritchard
Hi Goetz
Thanks for trying these other versions. Looks like a bug. Could you post the config.log output from your build of the 2.1.0 to the list?
Also could you try running the job using this extra command line arg to see if the problem goes away?
mpirun --mca btl ^vader (rest of your args)

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Götz Waschk
On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard wrote:
> Hi Goetz,
>
> Would you mind testing against the 2.1.0 release or the latest from the
> 1.10.x series (1.10.6)?
Hi Howard,
after sending my mail I have tested both 1.10.6 and 2.1.0 and I have received the same

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Howard Pritchard
Hi Goetz,
Would you mind testing against the 2.1.0 release or the latest from the 1.10.x series (1.10.6)?
Thanks, Howard
2017-03-22 6:25 GMT-06:00 Götz Waschk:
> Hi everyone,
>
> I'm testing a new machine with 32 nodes of 32 cores each using the IMB
> benchmark. It

[OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Götz Waschk
Hi everyone, I'm testing a new machine with 32 nodes of 32 cores each using the IMB benchmark. It is working fine with 512 processes, but it crashes with 1024 processes after running for a minute:
[pax11-17:16978] *** Process received signal ***
[pax11-17:16978] Signal: Bus error (7)