You need to provide some hints! What we know so far:

1. What we see here is (what looks like) a backtrace from an Open MPI/PMIx 
process failing during startup.
2. Your decision to address this to the Slurm mailing list suggests that you 
think that Slurm might be involved.
3. You have something (a job? a program?) that segfaults when you go from 30 to 
32 processes.

At a minimum, it would help your readers' understanding, and ability to help, 
to know:

a. What operating system?
b. Are you seeing this while running Slurm? What version?
c. What version of Open MPI?
d. Are you building your own PMIx, or are you using what's provided by Open 
MPI and Slurm?
e. What does your hardware configuration look like -- particularly, what cpu 
type(s), and how many cores/node?
f. What does your Slurm configuration look like (assuming you're seeing this 
with Slurm)? I suggest purging your configuration files of node names and IP 
addresses, and including them with your query.
g. What does your command line look like? Especially, are you trying to run 32 
processes on a single node? Spreading them out across 2 or more nodes?
h. Can you reproduce the problem if you substitute `hostname` or `true` for the 
program in the command line? What about a simple MPI-enabled "hello world"?
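For point (h), a minimal sketch of such a "hello world" follows. It only exercises MPI_Init/MPI_Finalize, which is exactly where the posted backtrace dies (inside PMIx_Init), so it isolates the MPI/PMIx stack from the application code. This assumes an Open MPI toolchain with `mpicc` available; the file name and launch flags below are examples, not from the original thread:

```c
/* hello_mpi.c - minimal MPI program to test PMIx/MPI startup at scale. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* This is the call that segfaults in the reported backtrace
     * (MPI_Init -> ompi_mpi_init -> orte_init -> PMIx_Init). */
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc hello_mpi.c -o hello_mpi`, then try e.g. `srun -n 30 ./hello_mpi` versus `srun -n 32 ./hello_mpi` (and `srun -n 32 hostname` or `srun -n 32 true` for the non-MPI substitution test). If the hello world also crashes at 32 ranks, the fault lies in the MPI/PMIx layer rather than in the application.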

Andy

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Diego Zuccato
Sent: Monday, October 5, 2020 7:05 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Segfault with 32 processes, OK with 30 ???

Hello all.

I'm seeing (again) this weird issue.
The same executable crashes immediately when launched with 32 processes,
while it runs flawlessly with only 30.

The reported error is:
[str957-bl0-03:05271] *** Process received signal ***
[str957-bl0-03:05271] Signal: Segmentation fault (11)
[str957-bl0-03:05271] Signal code: Address not mapped (1)
[str957-bl0-03:05271] Failing at address: 0x7f3826fb4008
[str957-bl0-03:05271] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f3825df6730]
[str957-bl0-03:05271] [ 1]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f3824553936]
[str957-bl0-03:05271] [ 2]
/usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f382452a733]
[str957-bl0-03:05271] [ 3]
/usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f38245535b4]
[str957-bl0-03:05271] [ 4]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f382467946e]
[str957-bl0-03:05271] [ 5]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f382463188d]
[str957-bl0-03:05271] [ 6]
/usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f38245edd7c]
[str957-bl0-03:05271] [ 7]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f38246e9fe4]
[str957-bl0-03:05271] [ 8]
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f3826fb9656]
[str957-bl0-03:05271] [ 9]
/usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f3825b8011a]
[str957-bl0-03:05271] [10]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f3825e50e62]
[str957-bl0-03:05271] [11]
/usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f3825e7f17e]
[str957-bl0-03:05271] [12] ./C-GenIC(+0x23b9)[0x55bf9fa8e3b9]
[str957-bl0-03:05271] [13]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f3825c4709b]
[str957-bl0-03:05271] [14] ./C-GenIC(+0x251a)[0x55bf9fa8e51a]
[str957-bl0-03:05271] *** End of error message ***


In the past, just installing gdb to try to debug it made the problem
disappear; obviously that was not a solution...

Any hint?

TIA

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

