Hi Luca
Another possibility that comes to mind,
besides the mixed versions mentioned by Gilles,
is the OS resource limits.
Limits can differ from user to user and with user privileges,
which would fit a program that runs as root but not as a regular user.
Large programs tend to require a big stack size (even unlimited),
and typically segfault when the stack is not large enough.
The max number of open files is yet another hurdle.
And if you're using InfiniBand, the max locked memory size should be
unlimited.
Check /etc/security/limits.conf and run "ulimit -a" as the affected user.
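If it is easier to compare numbers side by side, here is a minimal
Python sketch (my own illustration, nothing that ships with Open MPI)
that prints the three limits mentioned above for the current user.
It assumes Linux and uses only the standard resource module; the names
LIMITS, fmt and report_limits are made up for this example. Note that
the stack and locked-memory values are in bytes, while "ulimit -a"
reports them in kbytes.

```python
import resource

# The three limits discussed in this message. The RLIMIT_* constants come
# from Python's standard resource module (Linux assumed).
LIMITS = {
    "stack size": resource.RLIMIT_STACK,        # bytes
    "open files": resource.RLIMIT_NOFILE,       # number of descriptors
    "locked memory": resource.RLIMIT_MEMLOCK,   # bytes
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return {name: (soft, hard)} as strings for the current process."""
    return {
        name: tuple(fmt(v) for v in resource.getrlimit(res))
        for name, res in LIMITS.items()
    }


if __name__ == "__main__":
    for name, (soft, hard) in report_limits().items():
        print(f"{name:15s} soft={soft:>12s} hard={hard:>12s}")
```

Running it once as the unprivileged user and once as root, then
comparing the two outputs, should show quickly whether the limits
differ between the two accounts.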
I hope this helps,
Gus Correa
On 12/10/2014 08:28 AM, Gilles Gouaillardet wrote:
Luca,
Your email mentions Open MPI 1.6.5,
but the gdb output points to Open MPI 1.8.1.
Could the root cause be a mix of versions that does not occur with the
root account?
Which Open MPI version are you expecting?
You can run
pmap <pid>
while your binary is running (and/or stopped under gdb) to confirm which
Open MPI library is really being used.
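On Linux the same mappings are also visible in /proc/<pid>/maps, so a
rough Python sketch of this check could look like the following. The
function name mpi_libraries and the list of library name fragments are
assumptions for illustration; adjust them to your installation.

```python
def mpi_libraries(maps_text, needles=("libmpi", "libopen-pal", "libopen-rte")):
    """Return sorted, de-duplicated paths of mapped files whose file name
    contains one of the given fragments.

    maps_text is the content of /proc/<pid>/maps, where each line reads:
    address perms offset dev inode [path]
    """
    found = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        file_name = path.rsplit("/", 1)[-1]
        if any(needle in file_name for needle in needles):
            found.add(path)
    return sorted(found)
```

Reading /proc/<pid>/maps of the running binary and passing the text to
mpi_libraries() lists each Open MPI library once, with its full path.
If paths from two different installations show up, or the path is not
the one you expect, a version mix is likely.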
Cheers,
Gilles
On Wed, Dec 10, 2014 at 7:21 PM, Luca Fini <lf...@arcetri.astro.it> wrote:
I have a problem running a well-tested MPI-based application.
The program has been used for years with no problems. Suddenly the
executable, which had run many times without trouble, crashed with
SIGSEGV. The very same executable works fine when run with root
privileges. The same happens with other executables and across various
recompilation attempts.
We could not find any relevant change in the OS since a few days
ago, when the program still worked under an unprivileged user ID.
Around the same time we did change the GID of the user
experiencing the fault, but we think this is not relevant because the
same SIGSEGV happens to another user who was not modified. Moreover,
we cannot see how that change could affect the running executable (we
checked all file permissions in the directory tree where the program
is used).
Running the program under GDB we get the trace reported below. The
segfault happens at the very beginning, during MPI initialization.
We can use the program with sudo, but I'd like to find out what
happened so that we can go back to "normal" usage.
I'd appreciate any hint on the issue.
Many thanks,
Luca Fini
==============================
Here follow a few environment details:
Program started with: mpirun -debug -debugger gdb -np 1
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
OPEN-MPI 1.6.5
Linux 2.6.32-431.29.2.2.6.32-431.29.2.el6.x86_64
Intel fortran Compiler: 2011.7.256
=========================
Here follows the stack trace:
Starting program:
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/M51b2_OT_2POINT_RH_v1_mod/PREP_PGD
[Thread debugging using libthread_db enabled]
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
requested_component_names=0x0, include_mode=128, found_components=0x1,
open_dso_components=16)
at mca_base_component_find.c:162
162 OBJ_CONSTRUCT(found_components, opal_list_t);
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64
libgfortran-4.4.7-11.el6.x86_64 libtool-ltdl-2.2.6-15.5.el6.x86_64
openmpi-1.8.1-1.el6.x86_64
(gdb) where
#0 0x00002aaaaaf652c7 in mca_base_component_find (directory=0x0,
type=0x3b914a7fb5 "rte", static_components=0x3b916cb040,
requested_component_names=0x0, include_mode=128, found_components=0x1,
open_dso_components=16)
at mca_base_component_find.c:162
#1 0x0000003b90c4870a in mca_base_framework_components_register ()
from /usr/lib64/openmpi/lib/libopen-pal.so.6
#2 0x0000003b90c48c06 in mca_base_framework_register () from
/usr/lib64/openmpi/lib/libopen-pal.so.6
#3 0x0000003b90c48def in mca_base_framework_open () from
/usr/lib64/openmpi/lib/libopen-pal.so.6
#4 0x0000003b914407e7 in ompi_mpi_init () from
/usr/lib64/openmpi/lib/libmpi.so.1
#5 0x0000003b91463200 in PMPI_Init () from
/usr/lib64/openmpi/lib/libmpi.so.1
#6 0x00002aaaaacd9295 in mpi_init_f (ierr=0x7fffffffd268) at
pinit_f.c:75
#7 0x00000000005bb159 in MODE_MNH_WORLD::init_nmnh_comm_world
(kinfo_ll=Cannot access memory at address 0x0
) at
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_mnh_world.f90:45
#8 0x00000000005939d3 in MODE_IO_LL::initio_ll () at
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_mode_io_ll.f90:107
#9 0x000000000049d02f in prep_pgd () at
/home/lascaux/MNH-V5-1-2/src/dir_obj-LXifortI4-MNH-V5-1-2-OMPI12X-O2/MASTER/spll_prep_pgd.f90:130
#10 0x000000000049cf8c in main ()
--
Luca Fini. INAF - Oss. Astrofisico di Arcetri
L.go E.Fermi, 5. 50125 Firenze. Italy
Tel: +39 055 2752 307 Fax: +39 055 2752 292
Skype: l.fini
Web: http://www.arcetri.inaf.it/~lfini
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/12/25945.php
_______________________________________________
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/12/25946.php