Looks strange. I’m pretty sure Mellanox didn’t implement the event notification system in the Slurm plugin, but you should only be trying to call it if OMPI is registering a system-level event code - which OMPI 3.1 definitely doesn’t do.
If you are using PMIx v2.2.0, then please note that there is a bug in it that slipped through our automated testing. I replaced it today with v2.2.1 - you probably should update if that’s the case. However, that wouldn’t necessarily explain this behavior.

I’m not that familiar with the Slurm plugin, but you might try adding PMIX_MCA_pmix_client_event_verbose=5 PMIX_MCA_pmix_server_event_verbose=5 OMPI_MCA_pmix_base_verbose=10 to your environment and see if that provides anything useful.

> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
>
> i compilied pmix slurm openmpi
>
> ---pmix
> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
> --disable-debug
> ---slurm
> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
> --with-pmix=/hpc/pmix/2.2
> ---openmpi
> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
> --with-libevent=external --with-slurm=/hpc/slurm/18.08
> --with-pmix=/hpc/pmix/2.2
>
> everything seemed to compile fine, but when i do an srun i get the
> below errors, however, if i salloc and then mpirun it seems to work
> fine. i'm not quite sure where the breakdown is or how to debug it
>
> ---
>
> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_interlib_declare
> --> Returned "Would block" (-10) instead of "Success" (0)
> ...snipped...
> [labcmp6:18355] *** An error occurred in MPI_Init
> [labcmp6:18355] *** reported by process [140726281390153,15]
> [labcmp6:18355] *** on a NULL communicator
> [labcmp6:18355] *** Unknown error
> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [labcmp6:18355] *** and potentially your MPI job)
> [labcmp6:18352] *** An error occurred in MPI_Init
> [labcmp6:18352] *** reported by process [1677936713,12]
> [labcmp6:18352] *** on a NULL communicator
> [labcmp6:18352] *** Unknown error
> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [labcmp6:18352] *** and potentially your MPI job)
> [labcmp6:18354] *** An error occurred in MPI_Init
> [labcmp6:18354] *** reported by process [140726281390153,14]
> [labcmp6:18354] *** on a NULL communicator
> [labcmp6:18354] *** Unknown error
> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [labcmp6:18354] *** and potentially your MPI job)
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33
> ***
> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_interlib_declare
> --> Returned "Would block" (-10) instead of "Success" (0)
> --------------------------------------------------------------------------
> [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> srun: error: labcmp6: tasks 12-15: Exited with exit code 1
> srun: error: labcmp3: tasks 0-3: Killed
> srun: error: labcmp4: tasks 4-7: Killed
> srun: error: labcmp5: tasks 8-11: Killed
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
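
For reference, the verbosity suggestion in the reply above could be tried roughly as follows. This is only a sketch: the three variable names and values are the ones given in the reply, the srun command is copied from the quoted report, and the log file name pmix_verbose.log is a made-up placeholder. It also assumes the site keeps srun's default of exporting the caller's environment to the tasks; whether the server-side setting actually reaches the PMIx server inside slurmstepd is an assumption, not something verified here.

```shell
# Verbosity settings suggested in the reply (names and values taken from it).
export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10

# Re-run the failing step from the original report and keep a copy of the output.
# pmix_verbose.log is a hypothetical file name chosen for this example.
# The guard only skips the run on machines where srun is not installed.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmix_v2 -n 16 xhpl 2>&1 | tee pmix_verbose.log
fi
```

Comparing that output with a run under salloc plus mpirun, which reportedly works, may help narrow down which side is issuing the event registration call.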