Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-15 Thread Fischer, Greg A. via users
the network. Unfortunately, the person who knows the most has recently left the organization.) Greg From: Pritchard Jr., Howard Sent: Thursday, October 14, 2021 5:45 PM To: Fischer, Greg A. ; Open MPI Users Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-14 Thread Fischer, Greg A. via users
To: Open MPI Users Cc: Fischer, Greg A. Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success" [External Email] Hi Greg, It's the aging of the openib btl. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-14 Thread Fischer, Greg A. via users
The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old? Greg From: Pritchard Jr., Howard Sent: Thursday, October 14, 2021 3:31 PM To: Fischer, Greg A. ; Open MPI Users Subject: Re
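
For reference, one way to check which OFED/librdmacm build is installed (a sketch; package names and tools vary by distribution, and ofed_info is only present on Mellanox OFED installs):

    # List installed RDMA connection-manager packages (RPM-based systems)
    rpm -qa | grep -i rdmacm
    # On Mellanox OFED installs, print the OFED release string
    ofed_info -s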

[OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-13 Thread Fischer, Greg A. via users
Hello, I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl: WARNING: There was an error initializing an OpenFabrics device. Local host: bl1308 Local device: mlx4_0
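
A quick way to confirm the failure is isolated to openib is to exclude that BTL and rerun (a minimal sketch; the test binary name is illustrative):

    # "^openib" tells Open MPI to use every BTL except openib
    mpirun --mca btl ^openib -np 2 ./ring_c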

[OMPI users] disappearance of the memory registration error in 1.8.x?

2015-03-10 Thread Fischer, Greg A.
Hello, I'm trying to run the "connectivity_c" test on a variety of systems using OpenMPI 1.8.4. The test returns segmentation faults when running across nodes on one particular type of system, and only when using the openib BTL. (The test runs without error if I stipulate "--mca btl
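
The quoted command is truncated; a hedged sketch of the kind of BTL selection being described, assuming tcp/sm/self as the fallback transports:

    # Bypass openib entirely to see whether the segfaults disappear
    mpirun -np 4 --mca btl tcp,sm,self ./connectivity_c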

Re: [OMPI users] poor performance using the openib btl

2014-06-25 Thread Fischer, Greg A.
file, to see if it was enabled or not. Maxime On 2014-06-25 10:46, Fischer, Greg A. wrote: Attached are the results of "grep thread" on my configure output. There appears to be some amount of threading, but is there anything I should look for in particular? I see Mike Dubman's

Re: [OMPI users] poor performance using the openib btl

2014-06-25 Thread Fischer, Greg A.
performance using the openib btl What are your threading options for OpenMPI (when it was built)? I have seen the OpenIB BTL lock up completely before when some level of threading is enabled. Maxime Boissonneault On 2014-06-24 18:18, Fischer, Greg A. wrote: Hello openmpi-users, A few weeks ago, I posted t
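
The build-time thread support of an installed Open MPI can also be checked without the configure logs (a sketch using the standard ompi_info tool):

    # Shows the thread-support level the library was built with
    ompi_info | grep -i thread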

[OMPI users] poor performance using the openib btl

2014-06-24 Thread Fischer, Greg A.
Hello openmpi-users, A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved.
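
The usual fix for a resource manager clamping locked memory is to raise the memlock limit on every compute node before the MOM daemon starts, e.g. via limits.conf (a hedged sketch; the exact file and daemon init script vary by site):

    # /etc/security/limits.conf on each compute node
    *  soft  memlock  unlimited
    *  hard  memlock  unlimited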

Re: [OMPI users] openib segfaults with Torque

2014-06-13 Thread Fischer, Greg A.
work, try using the RDMACM CPC (instead >>>> of >>>> UDCM, which is a pretty recent addition to the openIB BTL) by setting: >>>> >>>> -mca btl_openib_cpc_include rdmacm >>>> >>>> Josh >>>
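
Put together, the suggested workaround looks roughly like this (a sketch; the binary name is illustrative):

    # Use the RDMACM connection manager instead of the newer UDCM
    mpirun -np 2 --mca btl openib,sm,self \
           --mca btl_openib_cpc_include rdmacm ./ring_c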

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Fischer, Greg A.
Is there any other work around that I might try? Something that avoids UDCM? -Original Message- From: Fischer, Greg A. Sent: Tuesday, June 10, 2014 2:59 PM To: Nathan Hjelm Cc: Open MPI Users; Fischer, Greg A. Subject: RE: [OMPI users] openib segfaults with Torque [binf316:fischega

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
[binf316:fischega] $ ulimit -m unlimited Greg -Original Message- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:58 PM To: Fischer, Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Out of curiosity what is the mlock limit

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Yes, this fails on all nodes on the system, except for the head node. The uptime of the system isn't significant. Maybe 1 week, and it's received basically no use. -Original Message- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:49 PM To: Fischer, Greg A. Cc

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Jeff/Nathan, I ran the following with my debug build of OpenMPI 1.8.1 - after opening a terminal on a compute node with "qsub -l nodes=2 -I": mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt Output and backtrace are attached. Let me know if I can

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-10 Thread Fischer, Greg A.
, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote: > Aha!! I found this in our users mailing list archives: > > http://www.open-mpi.org/community/lists/users/2012/01/18091.php > > Looks like this is a known compiler vectorization issue. > > > On Jun 4, 2014, at 1:

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
the launch is complete. Are you able to run this with btl tcp,sm,self? If so, that would confirm that everything else is correct, and the problem truly is limited to the udcm itself...which shouldn't have anything to do with how the proc was launched. On Jun 6, 2014, at 6:47 AM, Fischer, Greg
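
The sanity check being suggested, spelled out (a sketch):

    # If this works, process launch is fine and the problem is in udcm/openib
    mpirun -np 2 --mca btl tcp,sm,self ./ring_c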

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
signal 6 (Aborted). -- From: Fischer, Greg A. Sent: Thursday, June 05, 2014 5:10 PM To: us...@open-mpi.org Cc: Fischer, Greg A. Subject: openib segfaults with Torque OpenMPI Users, After encountering difficulty with the Intel compilers

[OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
OpenMPI Users, After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
- but in the same part of the interceptor, which makes me suspicious. Don't know how much testing we've seen on SLES... On Jun 4, 2014, at 1:18 PM, Fischer, Greg A. <fisch...@westinghouse.com> wrote: > Ralph, > > It segfaults. Here's the backtrace: > > Core was generated by `ring_c'.

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
add the shared memory btl to your mca parameter? >> >> mpirun -np 2 --mca btl openib,sm,self ring_c >> >> As far as I know, sm is the preferred transport layer for intra-node >> communication. >> >> Gus Correa >> >> >> On 06/04/2014 11:13 AM,

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
ify the source On Jun 4, 2014, at 7:55 AM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, and ran a backtrace: Core was generated by `ring_c'. Program terminated

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
From: Fischer, Greg A. Sent: Wednesday, June 04, 2014 10:17 AM To: 'Open MPI Users' Cc: Fischer, Greg A. Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c I recompiled with "--enable-debug" but it doesn't seem to be providing any more information or a core dump.
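
A typical recipe for turning intermittent crashes like these into a usable backtrace (a sketch, assuming a from-source build):

    # Rebuild with debug symbols and no optimization
    ./configure --enable-debug CFLAGS="-g -O0" && make && make install
    # Allow core dumps, reproduce, then inspect the core
    ulimit -c unlimited
    mpirun -np 2 --mca btl openib,self ./ring_c
    gdb ./ring_c core    # then type: bt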

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
number where it is failing? On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Apologies - I forgot to add some of the information requested by the FAQ: 1. OpenFabrics is provided by the Linux distribution: [binf1

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-03 Thread Fischer, Greg A.
is being used - I think OpenSM, but I'll need to check with my administrators. 4. Output of ibv_devinfo is attached. 5. Ifconfig output is attached. 6. Ulimit -l output: [binf102:fischega] $ ulimit -l unlimited Greg From: Fischer, Greg A. Sent: Tuesday, June 03, 2014 12:38
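
Note that limits in an interactive shell can differ from those the ranks actually inherit under a resource manager; a quick hedged check is to run ulimit through the launcher itself:

    # Print the locked-memory limit as seen by processes mpirun starts
    mpirun -np 2 bash -c 'ulimit -l'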

[OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-03 Thread Fischer, Greg A.
Hello openmpi-users, I'm running into a perplexing problem on a new system, whereby I'm experiencing intermittent segmentation faults when I run the ring_c.c example and use the openib BTL. See an example below. Approximately 50% of the time it provides the expected output, but the other 50%

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-24 Thread Fischer, Greg A.
an incompatibility lurking in there somewhere that would trip openmpi-1.6.5 but not openmpi-1.4.3? Greg >-Original Message- >From: Fischer, Greg A. >Sent: Friday, January 24, 2014 11:41 AM >To: 'Open MPI Users' >Cc: Fischer, Greg A. >Subject: RE: [OMPI users] simple

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-24 Thread Fischer, Greg A.
symbol: mca_base_param_reg_int" type of message is almost always an >indicator of two different versions being installed into the same tree. > > >On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A." ><fisch...@westinghouse.com> wrote: > >> Version 1.4.3

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-24 Thread Fischer, Greg A.
Friday, January 24, 2014 11:07 AM >To: Open MPI Users >Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and >consumes all system resources > >On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A." ><fisch...@westinghouse.com> wrote: > >> The reason f

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-22 Thread Fischer, Greg A.
let's get the Fortran out of the way and use just the base C bindings, >and >see what happens. > > >On Jan 19, 2014, at 6:18 PM, "Fischer, Greg A." <fisch...@westinghouse.com> >wrote: > >> I just tried running "hello_f90.f90" and see the same behavior

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-19 Thread Fischer, Greg A.
pick up the OMPI libs you installed - most Linux distros come with an older version, and that can cause problems if you inadvertently pick them up. On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Hello, I have a simple
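
A hedged sketch of the check being suggested (the install prefix shown is hypothetical):

    # Confirm which Open MPI the shell and the binary actually resolve to
    which mpirun
    mpirun --version
    ldd ./ring_c | grep -i mpi
    # Point PATH/LD_LIBRARY_PATH at the intended install if they disagree
    export PATH=/opt/openmpi-1.6.5/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-1.6.5/lib:$LD_LIBRARY_PATH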

[OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-19 Thread Fischer, Greg A.
Hello, I have a simple, 1-process test case that gets stuck on the mpi_finalize call. The test case is a dead-simple calculation of pi - 50 lines of Fortran. The process gradually consumes more and more memory until the system becomes unresponsive and needs to be rebooted, unless the job is
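
The test case itself is not shown in the archive; the following is a hypothetical reconstruction in the base C bindings (as later suggested in this thread), computing pi by the midpoint rule and ending with the MPI_Finalize call that reportedly hangs:

    /* pi.c - minimal sketch of the kind of test case described above */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        const long n = 10000000;                 /* number of intervals */
        double h, sum = 0.0, local, pi;
        long i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        h = 1.0 / (double)n;
        for (i = rank; i < n; i += size) {       /* round-robin interval split */
            double x = h * ((double)i + 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        local = h * sum;
        MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %.12f\n", pi);
        MPI_Finalize();                          /* the reported hang point */
        return 0;
    }

Build and run with: mpicc pi.c -o pi && mpirun -np 1 ./pi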