Hello Joseph,
Sorry for the delay but I didn't know if I was missing something yesterday
evening and wanted to double check everything this morning. This is for WRF
but other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb obviously didn't report
any issue).
* I tried compiling with the --enable-debug flag but it was generating
errors during the compilation and never completed.
* I went back to my standard flags for debugging: -g -fbacktrace -ggdb
-fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is
still crashing with little extra info vs yesterday:
Backtrace for this error:
#0 0x7f5a4e54451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#4 0x7f5a4e7d3ef1 in ???
#5 0x7f5a4e8dab7b in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7 0x1879ebd in init_modules_
at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8 0x406fe4 in __module_wrf_top_MOD_wrf_init
at ../main/module_wrf_top.f90:130
#9 0x405ff3 in wrf
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10 0x40605c in main
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163
exited on signal 11 (Segmentation fault).
Any pointers on what might be going on here? This never happened with
OMPI v4. Thanks.
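
As a rough aid for reading the backtrace above: frames #2 to #5 sit between split_communicator and free() and print as "???", meaning no symbol information was found for them. Their addresses are in the 0x7f... range, which on x86_64 Linux usually indicates shared libraries. The sketch below is a hypothetical helper, not part of WRF or Open MPI, that lists such unresolved frames from a gfortran-style backtrace. It assumes each frame line has the form "#N 0xADDR in SYMBOL", as in the output above.

```python
import re

# Matches frame lines such as "#6 0x8f6bbf in __module_dm_MOD_split_communicator".
# The format is assumed from the backtrace in this thread.
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+in\s+(\S+)")


def unresolved_frames(backtrace_text):
    """Return (frame_number, address) pairs whose symbol printed as '???'."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(3) == "???":
            frames.append((int(match.group(1)), match.group(2)))
    return frames


# A few lines copied from the backtrace above, used as sample input.
SAMPLE = """\
#1 0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
"""

if __name__ == "__main__":
    for number, address in unresolved_frames(SAMPLE):
        print(f"frame #{number} at {address} has no symbol information")
```

Comparing the reported addresses against the process's memory map (for example /proc/PID/maps while the process is still alive) would show which library each unresolved frame belongs to, and therefore which library would need to be rebuilt with debug symbols.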
Joseph Schuchart via users wrote:
This looks like memory corruption. Do you have more details on what your
app is doing? I don't see any MPI calls inside the call stack. Could you
rebuild Open MPI with debug information enabled (by adding `--enable-debug`
to configure)? If this error occurs on singleton runs (1 process) then you
can easily attach gdb to it to get a better stack trace. Also, valgrind may
help pin down the problem by telling you which memory block is being freed.
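
A minimal sketch of the singleton valgrind run suggested above, written as a small Python helper that only assembles the command line. The executable name ./wrf.exe and the log prefix are assumptions, not taken from this thread. The options used are mpirun's -np and valgrind's --log-file, where %p expands to the process ID. Memcheck, valgrind's default tool, reports invalid free() calls without extra flags.

```python
import shlex


def valgrind_singleton_command(executable, log_prefix="valgrind"):
    """Build an mpirun command that runs one rank of `executable` under valgrind."""
    return [
        "mpirun", "-np", "1",                # single process (singleton run)
        "valgrind",                          # memcheck is the default tool
        f"--log-file={log_prefix}.%p.log",   # one log per PID; %p is expanded by valgrind
        executable,
    ]


if __name__ == "__main__":
    # "./wrf.exe" is an assumed path to the application binary.
    print(shlex.join(valgrind_singleton_command("./wrf.exe")))
```

The printed line can be pasted into a shell; building the application and Open MPI with -g makes the valgrind report considerably more readable.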
On 1/30/24 07:41, afernandez via users wrote:
I upgraded one of the systems to v5.0.1 and have compiled everything
exactly as dozens of previous times with v4. I wasn't expecting any issue
(and the compilations didn't report anything out of the ordinary), but running
several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0xffffffffffffffff in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC 13.2, and before building
OpenMPI, I had previously built the hwloc (2.10.0) library at
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but
the problem seems to be related to memory allocation.
