Re: [OMPI users] Code failing when requesting all "processors"

2020-10-19 Thread Diego Zuccato via users
On 13/10/20 16:33, Jeff Squyres (jsquyres) wrote:

> That's odd.  What version of Open MPI are you using?

The version is 3.1.3, as packaged in Debian Buster.

I don't know Open MPI (or even MPI in general) very well. Some time ago I
had to add an
mtl = psm2
line to /etc/openmpi/openmpi-mca-params.conf.
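
In case it's useful, this is roughly what that looks like; the per-run
command-line form below is only a sketch (the program name and process
count are placeholders, not what I actually run):
-8<--
# system-wide default in /etc/openmpi/openmpi-mca-params.conf
mtl = psm2

# equivalent one-off override on the command line (illustrative only)
mpirun --mca mtl psm2 -np 4 ./a.out
-8<--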

Another oddity is that I've had the same problem on other nodes, where it
got "solved" (or, more likely, just "masked") by simply installing gdb:
while trying to debug the issue I noticed that once gdb was installed I
could no longer reproduce the problem. Unfortunately, on this server gdb
is already installed, and it hasn't helped in debugging the issue.

>> On Oct 13, 2020, at 6:34 AM, Diego Zuccato via users wrote:
>>
>> Hello all.
>>
>> I have a problem on a server: launching a job with mpirun fails if I
>> request all 32 CPUs (threads, since HT is enabled) but succeeds if I
>> only request 30.
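>>
>> For reference, the launch is more or less like this (a sketch only: the
>> binary name is taken from the backtrace below, and -np here stands in for
>> however the 32 vs. 30 slots are actually requested):
>> -8<--
>> # fails with a segfault during MPI_Init:
>> mpirun -np 32 ./mpitest-debug
>> # completes normally:
>> mpirun -np 30 ./mpitest-debug
>> -8<--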
>>
>> The test code is really minimal:
>> -8<--
>> #include "mpi.h"
>> #include <stdio.h>
>> #include <stdlib.h>
>> #define  MASTER 0
>>
>> int main (int argc, char *argv[])
>> {
>>  int   numtasks, taskid, len;
>>  char hostname[MPI_MAX_PROCESSOR_NAME];
>>  MPI_Init(&argc, &argv);
>> //  int provided=0;
>> //  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>> //printf("MPI provided threads: %d\n", provided);
>>  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>>  MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
>>
>>  if (taskid == MASTER)
>>printf("This is an MPI parallel code for Hello World with no
>> communication\n");
>>  //MPI_Barrier(MPI_COMM_WORLD);
>>
>>
>>  MPI_Get_processor_name(hostname, &len);
>>
>>  printf ("Hello from task %d on %s!\n", taskid, hostname);
>>
>>  if (taskid == MASTER)
>>printf("MASTER: Number of MPI tasks is: %d\n",numtasks);
>>
>>  MPI_Finalize();
>>
>>  printf("END OF CODE from task %d\n", taskid);
>> }
>> -8<--
>> (The commented-out section is a leftover from one of the tests.)
>>
>> The error is:
>> -8<--
>> [str957-bl0-03:19637] *** Process received signal ***
>> [str957-bl0-03:19637] Signal: Segmentation fault (11)
>> [str957-bl0-03:19637] Signal code: Address not mapped (1)
>> [str957-bl0-03:19637] Failing at address: 0x77fac008
>> [str957-bl0-03:19637] [ 0]
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x77e92730]
>> [str957-bl0-03:19637] [ 1]
>> /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7646d936]
>> [str957-bl0-03:19637] [ 2]
>> /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x76444733]
>> [str957-bl0-03:19637] [ 3]
>> /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7646d5b4]
>> [str957-bl0-03:19637] [ 4]
>> /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7659346e]
>> [str957-bl0-03:19637] [ 5]
>> /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7654b88d]
>> [str957-bl0-03:19637] [ 6]
>> /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x76507d7c]
>> [str957-bl0-03:19637] [ 7]
>> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x76603fe4]
>> [str957-bl0-03:19637] [ 8]
>> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x77fb1656]
>> [str957-bl0-03:19637] [ 9]
>> /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x77c1c11a]
>> [str957-bl0-03:19637] [10]
>> /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x77eece62]
>> [str957-bl0-03:19637] [11]
>> /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x77f1b17e]
>> [str957-bl0-03:19637] [12] ./mpitest-debug(+0x11c6)[0x51c6]
>> [str957-bl0-03:19637] [13]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x77ce309b]
>> [str957-bl0-03:19637] [14] ./mpitest-debug(+0x10da)[0x50da]
>> [str957-bl0-03:19637] *** End of error message ***
>> -8<--
>>
>> I'm using Debian stable packages. On other servers there is no problem
>> (but there was in the past, and it got "solved" by just installing gdb).
>>
>> Any hints?
>>
>> TIA
>>
>> -- 
>> Diego Zuccato
>> DIFA - Dip. di Fisica e Astronomia
>> Servizi Informatici
>> Alma Mater Studiorum - Università di Bologna
>> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
>> tel.: +39 051 20 95786


-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [OMPI users] Code failing when requesting all "processors"

2020-10-19 Thread Jeff Squyres (jsquyres) via users
On Oct 14, 2020, at 3:07 AM, Diego Zuccato <diego.zucc...@unibo.it> wrote:

> On 13/10/20 16:33, Jeff Squyres (jsquyres) wrote:
>
>> That's odd.  What version of Open MPI are you using?
>
> The version is 3.1.3, as packaged in Debian Buster.

The 3.1.x series is pretty old.  If you want to stay in the 3.1.x series, you 
might try upgrading to the latest -- 3.1.6.  That has a bunch of bug fixes 
compared to v3.1.3.

Alternatively, the most recent release series is the v4.0.x series: v4.0.5 is 
the latest in that series.
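
Whatever you end up with, it's worth double-checking which install you're
actually picking up at runtime; something like this (assuming the Open MPI
binaries are first in your PATH) should show it:
-8<--
mpirun --version
ompi_info | grep "Open MPI:"
-8<--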

> I don't know Open MPI (or even MPI in general) very well. Some time ago I
> had to add an
> mtl = psm2
> line to /etc/openmpi/openmpi-mca-params.conf.

This implies that you have InfiniPath / Omni-Path (PSM2) networking on your cluster.
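
If you want to confirm what that build actually has available, listing the
compiled-in MTL components should do it (output format varies a bit across
versions):
-8<--
ompi_info | grep "MCA mtl"
-8<--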

> Another oddity is that I've had the same problem on other nodes, where it
> got "solved" (or, more likely, just "masked") by simply installing gdb:
> while trying to debug the issue I noticed that once gdb was installed I
> could no longer reproduce the problem. Unfortunately, on this server gdb
> is already installed, and it hasn't helped in debugging the issue.

I can't imagine what installing gdb would do to mask the problem.  Strange.

--
Jeff Squyres
jsquy...@cisco.com