[OMPI users] Code failing when requesting all "processors"
Hello all.

I have a problem on a server: launching a job with mpirun fails if I
request all 32 CPUs (threads, since HT is enabled) but succeeds if I
only request 30.

The test code is really minimal:
-8<--
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define MASTER 0

int main (int argc, char *argv[])
{
  int numtasks, taskid, len;
  char hostname[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(&argc, &argv);
  // int provided=0;
  // MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  //printf("MPI provided threads: %d\n", provided);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

  if (taskid == MASTER)
    printf("This is an MPI parallel code for Hello World with no communication\n");
  //MPI_Barrier(MPI_COMM_WORLD);

  MPI_Get_processor_name(hostname, &len);

  printf("Hello from task %d on %s!\n", taskid, hostname);

  if (taskid == MASTER)
    printf("MASTER: Number of MPI tasks is: %d\n", numtasks);

  MPI_Finalize();

  printf("END OF CODE from task %d\n", taskid);
}
-8<--
(the commented section is a leftover of one of the tests).

The error is:
-8<--
[str957-bl0-03:19637] *** Process received signal ***
[str957-bl0-03:19637] Signal: Segmentation fault (11)
[str957-bl0-03:19637] Signal code: Address not mapped (1)
[str957-bl0-03:19637] Failing at address: 0x77fac008
[str957-bl0-03:19637] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x77e92730]
[str957-bl0-03:19637] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7646d936]
[str957-bl0-03:19637] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x76444733]
[str957-bl0-03:19637] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7646d5b4]
[str957-bl0-03:19637] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7659346e]
[str957-bl0-03:19637] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7654b88d]
[str957-bl0-03:19637] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x76507d7c]
[str957-bl0-03:19637] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x76603fe4]
[str957-bl0-03:19637] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x77fb1656]
[str957-bl0-03:19637] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x77c1c11a]
[str957-bl0-03:19637] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x77eece62]
[str957-bl0-03:19637] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x77f1b17e]
[str957-bl0-03:19637] [12] ./mpitest-debug(+0x11c6)[0x51c6]
[str957-bl0-03:19637] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x77ce309b]
[str957-bl0-03:19637] [14] ./mpitest-debug(+0x10da)[0x50da]
[str957-bl0-03:19637] *** End of error message ***
-8<--

I'm using Debian stable packages. On other servers there is no problem
(but there was in the past, and it got "solved" by just installing gdb).

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
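[A note on the commented-out leftover above: if that MPI_Init_thread test is ever revived, it is worth checking the thread level the library actually grants, since 'provided' may come back lower than requested. A minimal sketch, not part of the original post:]
-8<--
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int provided = 0;
  /* Request full thread support; the library reports what it grants. */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    printf("MPI provided only thread level %d\n", provided);
  MPI_Finalize();
  return 0;
}
-8<--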
Re: [OMPI users] Code failing when requesting all "processors"
That's odd. What version of Open MPI are you using?

> On Oct 13, 2020, at 6:34 AM, Diego Zuccato via users wrote:
>
> I have a problem on a server: launching a job with mpirun fails if I
> request all 32 CPUs (threads, since HT is enabled) but succeeds if I
> only request 30.
>
> [... full test code and backtrace quoted above ...]

--
Jeff Squyres
jsquy...@cisco.com
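[One way to answer the version question from the program itself, assuming an MPI-3 library: MPI_Get_library_version is one of the few calls that is legal even before MPI_Init. A minimal sketch:]
-8<--
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  char version[MPI_MAX_LIBRARY_VERSION_STRING];
  int len;
  /* Legal before MPI_Init in MPI-3: prints the library identification
     string, e.g. "Open MPI v3.1.x, ...". */
  MPI_Get_library_version(version, &len);
  printf("%s\n", version);
  return 0;
}
-8<--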
Re: [OMPI users] Code failing when requesting all "processors"
Can you use taskid after MPI_Finalize?
Isn't it undefined/deallocated at that point?
Just a question (... or two) ...

Gus Correa

> MPI_Finalize();
>
> printf("END OF CODE from task %d\n", taskid);

On Tue, Oct 13, 2020 at 10:34 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> That's odd. What version of Open MPI are you using?
>
> > On Oct 13, 2020, at 6:34 AM, Diego Zuccato via users wrote:
> >
> > [... full test code and backtrace quoted above ...]
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI users] Code failing when requesting all "processors"
On Oct 13, 2020, at 10:43 AM, Gus Correa via users wrote:
>
> Can you use taskid after MPI_Finalize?

Yes. It's a variable, just like any other.

> Isn't it undefined/deallocated at that point?

No. MPI filled it in during MPI_Comm_rank() and then never touched it
again. So even though MPI may have shut down, the value that it loaded
into taskid is still valid/initialized.

--
Jeff Squyres
jsquy...@cisco.com
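[To make the point concrete, a minimal sketch: the int written by MPI_Comm_rank() is ordinary program memory, so reading it after MPI_Finalize() is fine; only further MPI calls are forbidden, with a few exceptions such as MPI_Finalized.]
-8<--
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int taskid;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid); /* writes into our own int */
  MPI_Finalize();
  /* Still valid: taskid is plain stack memory, untouched by shutdown. */
  printf("END OF CODE from task %d\n", taskid);
  return 0;
}
-8<--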