here are the branches i'm using.  i did a git clone of each repo and
then a git checkout

[ec2-user@labhead bin]$ cd /hpc/src/pmix/
[ec2-user@labhead pmix]$ git branch
  master
* v2.2
[ec2-user@labhead pmix]$ cd ../slurm/
[ec2-user@labhead slurm]$ git branch
* (detached from origin/slurm-18.08)
  master
[ec2-user@labhead slurm]$ cd ../ompi/
[ec2-user@labhead ompi]$ git branch
* (detached from origin/v3.1.x)
  master


attached is the debug output from the run with the debugging turned on
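
for reference, a rough sketch of how the settings suggested below can be
applied before the srun.  the three variables and the srun line come from
this thread; the bash shebang, the guard around srun, and the log file name
are assumptions for illustration, not something pmix or slurm requires:

```shell
#!/usr/bin/env bash
# Debug settings quoted from the reply below.
export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10

# Print the three settings so they are visible alongside the run.
env | grep -E '^(PMIX_MCA_pmix_(client|server)_event_verbose|OMPI_MCA_pmix_base_verbose)=' | sort

# Same srun line as in the original report; only attempted where srun exists.
# The timestamped log name is a hypothetical placeholder.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmix_v2 -n 16 xhpl > "out.$(date +%s)" 2>&1 || true
fi
```

srun normally passes the caller's environment on to the tasks, so exporting
the variables in the submitting shell should be enough; if the site restricts
environment export, they would need to be passed explicitly.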

On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain <r...@open-mpi.org> wrote:
>
> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> notification system in the Slurm plugin, but you should only be trying to 
> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> definitely doesn’t do.
>
> If you are using PMIx v2.2.0, then please note that there is a bug in it that 
> slipped through our automated testing. I replaced it today with v2.2.1 - you 
> probably should update if that’s the case. However, that wouldn’t necessarily 
> explain this behavior. I’m not that familiar with the Slurm plugin, but you 
> might try adding
>
> PMIX_MCA_pmix_client_event_verbose=5
> PMIX_MCA_pmix_server_event_verbose=5
> OMPI_MCA_pmix_base_verbose=10
>
> to your environment and see if that provides anything useful.
>
> > On Jan 18, 2019, at 12:09 PM, Michael Di Domenico <mdidomeni...@gmail.com> 
> > wrote:
> >
> > i compiled pmix, slurm, and openmpi
> >
> > ---pmix
> > ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
> > --disable-debug
> > ---slurm
> > ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
> > --with-pmix=/hpc/pmix/2.2
> > ---openmpi
> > ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
> > --with-libevent=external --with-slurm=/hpc/slurm/18.08
> > --with-pmix=/hpc/pmix/2.2
> >
> > everything seemed to compile fine, but when i do an srun i get the
> > errors below.  however, if i salloc and then mpirun, it seems to work
> > fine.  i'm not quite sure where the breakdown is or how to debug it
> >
> > ---
> >
> > [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> > [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >  ompi_interlib_declare
> >  --> Returned "Would block" (-10) instead of "Success" (0)
> > ...snipped...
> > [labcmp6:18355] *** An error occurred in MPI_Init
> > [labcmp6:18355] *** reported by process [140726281390153,15]
> > [labcmp6:18355] *** on a NULL communicator
> > [labcmp6:18355] *** Unknown error
> > [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18355] ***    and potentially your MPI job)
> > [labcmp6:18352] *** An error occurred in MPI_Init
> > [labcmp6:18352] *** reported by process [1677936713,12]
> > [labcmp6:18352] *** on a NULL communicator
> > [labcmp6:18352] *** Unknown error
> > [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18352] ***    and potentially your MPI job)
> > [labcmp6:18354] *** An error occurred in MPI_Init
> > [labcmp6:18354] *** reported by process [140726281390153,14]
> > [labcmp6:18354] *** on a NULL communicator
> > [labcmp6:18354] *** Unknown error
> > [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18354] ***    and potentially your MPI job)
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 
> > 2019-01-18T20:03:33 ***
> > [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >  ompi_interlib_declare
> >  --> Returned "Would block" (-10) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > srun: error: labcmp6: tasks 12-15: Exited with exit code 1
> > srun: error: labcmp3: tasks 0-3: Killed
> > srun: error: labcmp4: tasks 4-7: Killed
> > srun: error: labcmp5: tasks 8-11: Killed
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users

Attachment: out.1547849064.gz
Description: application/gzip

