Re: [OMPI users] Fwd: pmix and srun
Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin folks seem to be off somewhere for a while and haven’t been testing it. Sigh.

I’ll patch the branch and let you know - we’d appreciate the feedback.

Ralph

> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico wrote:
> [...]
Re: [OMPI users] Fwd: pmix and srun
here's the branches i'm using. i did a git clone on the repos and then a git checkout

[ec2-user@labhead bin]$ cd /hpc/src/pmix/
[ec2-user@labhead pmix]$ git branch
  master
* v2.2
[ec2-user@labhead pmix]$ cd ../slurm/
[ec2-user@labhead slurm]$ git branch
* (detached from origin/slurm-18.08)
  master
[ec2-user@labhead slurm]$ cd ../ompi/
[ec2-user@labhead ompi]$ git branch
* (detached from origin/v3.1.x)
  master

attached is the debug output from the run with the debugging turned on

On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote:
> [...]
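A branch name such as v2.2 moves whenever a fix is pushed, so the exact commit is often more useful in a report like this one. The following is a minimal sketch, not part of the original exchange: show_checkout is a hypothetical helper, /hpc/src/pmix is taken from the session above, and the slurm and ompi paths are inferred from the relative cd commands there.

```shell
#!/bin/sh
# Print a ref name (or short commit hash) for the checkout in one source tree.
show_checkout() {
    dir=$1
    if git -C "$dir" rev-parse --git-dir >/dev/null 2>&1; then
        # --all lets describe use branches and remote refs as well as tags;
        # --always falls back to an abbreviated commit hash.
        printf '%s: %s\n' "$dir" "$(git -C "$dir" describe --all --always)"
    else
        printf '%s: not a git checkout\n' "$dir"
    fi
}

# The slurm and ompi paths are assumptions based on "cd ../slurm/" and
# "cd ../ompi/" in the session above.
for tree in /hpc/src/pmix /hpc/src/slurm /hpc/src/ompi; do
    show_checkout "$tree"
done
```

Running it again after pulling a patched branch shows whether the checkout actually moved.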
Re: [OMPI users] Fwd: pmix and srun
Looks strange. I’m pretty sure Mellanox didn’t implement the event notification system in the Slurm plugin, but you should only be trying to call it if OMPI is registering a system-level event code - which OMPI 3.1 definitely doesn’t do.

If you are using PMIx v2.2.0, then please note that there is a bug in it that slipped through our automated testing. I replaced it today with v2.2.1 - you probably should update if that’s the case. However, that wouldn’t necessarily explain this behavior. I’m not that familiar with the Slurm plugin, but you might try adding

PMIX_MCA_pmix_client_event_verbose=5
PMIX_MCA_pmix_server_event_verbose=5
OMPI_MCA_pmix_base_verbose=10

to your environment and see if that provides anything useful.

> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico wrote:
> [...]

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
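One way to apply the three verbosity settings from the reply above is to export them in the shell that launches the job step. This is only a sketch: it assumes srun passes the caller's environment through to the tasks (its default behavior unless --export is changed), and pmix_debug.log is a hypothetical file name chosen here for illustration.

```shell
#!/bin/sh
# Verbosity settings suggested in the reply above.
export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10

# Re-run the failing command from the original report and keep stdout and
# stderr together in one file (pmix_debug.log is a hypothetical name).
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmix_v2 -n 16 xhpl > pmix_debug.log 2>&1 \
        || echo "srun exited with status $?; see pmix_debug.log"
fi
```

Capturing both streams in one file keeps the PMIx diagnostics interleaved with the MPI_Init failure text, which makes the log easier to attach to a reply.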
[OMPI users] Fwd: pmix and srun
i compiled pmix, slurm, and openmpi

---pmix
./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 --disable-debug
---slurm
./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 --with-pmix=/hpc/pmix/2.2
---openmpi
./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external --with-libevent=external --with-slurm=/hpc/slurm/18.08 --with-pmix=/hpc/pmix/2.2

everything seemed to compile fine, but when i do an srun i get the errors below. however, if i salloc and then mpirun it seems to work fine. i'm not quite sure where the breakdown is or how to debug it

---

[ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
[labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
...snipped...
[labcmp6:18355] *** An error occurred in MPI_Init
[labcmp6:18355] *** reported by process [140726281390153,15]
[labcmp6:18355] *** on a NULL communicator
[labcmp6:18355] *** Unknown error
[labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18355] ***    and potentially your MPI job)
[labcmp6:18352] *** An error occurred in MPI_Init
[labcmp6:18352] *** reported by process [1677936713,12]
[labcmp6:18352] *** on a NULL communicator
[labcmp6:18352] *** Unknown error
[labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18352] ***    and potentially your MPI job)
[labcmp6:18354] *** An error occurred in MPI_Init
[labcmp6:18354] *** reported by process [140726281390153,14]
[labcmp6:18354] *** on a NULL communicator
[labcmp6:18354] *** Unknown error
[labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18354] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
[labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--
[labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
srun: error: labcmp6: tasks 12-15: Exited with exit code 1
srun: error: labcmp3: tasks 0-3: Killed
srun: error: labcmp4: tasks 4-7: Killed
srun: error: labcmp5: tasks 8-11: Killed
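Since the failure shows up only under srun and not under salloc plus mpirun, a quick first check is whether the Slurm build actually lists the plugin named on the command line. The sketch below is not from the thread: lists_mpi_type is a hypothetical helper, and the script simply searches the output of srun --mpi=list for the pmix_v2 type used in the report.

```shell
#!/bin/sh
# Succeeds if the plugin listing read from stdin mentions the given MPI type
# as a whole word. lists_mpi_type is a hypothetical helper name.
lists_mpi_type() {
    grep -qw -- "$1"
}

if command -v srun >/dev/null 2>&1; then
    # The listing may be printed on stderr, so merge both streams.
    if srun --mpi=list 2>&1 | lists_mpi_type pmix_v2; then
        echo "pmix_v2 is listed by srun"
    else
        echo "pmix_v2 is not listed; check --with-pmix in the Slurm configure line"
    fi
fi
```

A listed plugin only confirms that Slurm was built against PMIx. It does not rule out a bug inside the plugin or in the PMIx release itself.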