Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hello all,

Thank you all for the suggestions. Takahiro's suggestion has gotten me to a point where all of the tests will run, but as soon as it gets to the cleanup step IMB will seg fault again. I opened an issue on IMB's GitHub, but I guess I am not going to be able to get much help from them, so I will have to wait and see what happens next.

Thanks again for all your help,
Adam LeBlanc

On Thu, Feb 21, 2019 at 7:22 AM Peter Kjellström wrote:
> On Wed, 20 Feb 2019 10:46:10 -0500 Adam LeBlanc wrote:
> > Hello,
> >
> > When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
> > --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca
> > pml ob1 --mca btl_openib_allow_ib 1 -np 6
> > -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> >
> > I get this error:
> ...
> > # Benchmarking Reduce_scatter
> ...
> > 2097152 20 8738.08 9340.50 9147.89
> > [pandora:04500] *** Process received signal ***
> > [pandora:04500] Signal: Segmentation fault (11)
>
> This is very likely a bug in IMB, not in OpenMPI. It's been discussed on
> the list before; thread name:
>
> MPI_Reduce_Scatter Segmentation Fault with Intel 2019 Update 1
> Compilers on OPA-1...
>
> You can work around it by using an older IMB version (the bug is in the
> newest version).
>
> /Peter K

___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
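For anyone hitting the same crash: Peter's workaround amounts to building IMB from an older release of Intel's mpi-benchmarks repository rather than the latest one. A rough sketch, with the tag name and build step as assumptions (check `git tag` and the README of the tag you pick):

```shell
# Sketch only: build an older IMB release that predates the
# Reduce_scatter bug. The tag name below is an assumption; run
# `git tag` to list the versions actually available.
git clone https://github.com/intel/mpi-benchmarks.git
cd mpi-benchmarks
git checkout IMB-v2018.1
# The build layout differs between IMB releases; adjust the make
# invocation to the tree layout of the tag you checked out.
make IMB-MPI1
```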
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hello Howard, Thanks for all of the help and suggestions; I will look into them. I also realized that my Ansible wasn't set up properly for handling tar files, so the nightly build didn't even install, but I will do it by hand and will give you an update tomorrow afternoon. Thanks, Adam LeBlanc On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard wrote: > Hello Adam, > > This helps some. Could you post the first 20 lines of your config.log? This > will > help in trying to reproduce. The content of your host file (you can use > generic > names for the nodes if that's an issue to publicize) would also help, as > the number of nodes and number of MPI processes/node impacts the way > the reduce scatter operation works. > > One thing to note about the openib BTL - it is on life support. That's > why you needed to set btl_openib_allow_ib 1 on the mpirun command line. > > You may get much better success by installing UCX > <https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use > UCX. You may actually already have UCX installed on your system if > a recent version of MOFED is installed. > > You can check this by running /usr/bin/ofed_rpm_info. It will show which > ucx version has been installed. > If UCX is installed, you can add --with-ucx to the Open MPI configuration > line and it should build in UCX > support. If Open MPI is built with UCX support, it will by default use > UCX for message transport rather than > the OpenIB BTL. > > thanks, > > Howard > > > Am Mi., 20. Feb.
2019 um 12:49 Uhr schrieb Adam LeBlanc < > alebl...@iol.unh.edu>: > >> On tcp side it doesn't seg fault anymore but will timeout on some tests >> but on the openib side it will still seg fault, here is the output: >> >> [pandora:19256] *** Process received signal *** >> [pandora:19256] Signal: Segmentation fault (11) >> [pandora:19256] Signal code: Address not mapped (1) >> [pandora:19256] Failing at address: 0x7f911c69fff0 >> [pandora:19255] *** Process received signal *** >> [pandora:19255] Signal: Segmentation fault (11) >> [pandora:19255] Signal code: Address not mapped (1) >> [pandora:19255] Failing at address: 0x7ff09cd3fff0 >> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680] >> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0] >> [pandora:19256] [ 2] >> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55] >> [pandora:19256] [ 3] >> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b] >> [pandora:19256] [ 4] [pandora:19255] [ 0] >> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680] >> [pandora:19255] [ 1] >> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7] >> [pandora:19256] [ 5] IMB-MPI1[0x40b83b] >> [pandora:19256] [ 6] IMB-MPI1[0x407155] >> [pandora:19256] [ 7] IMB-MPI1[0x4022ea] >> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0] >> [pandora:19255] [ 2] >> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5] >> [pandora:19256] [ 9] IMB-MPI1[0x401d49] >> [pandora:19256] *** End of error message *** >> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55] >> [pandora:19255] [ 3] >> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b] >> [pandora:19255] [ 4] >> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7] >> [pandora:19255] [ 5] IMB-MPI1[0x40b83b] >> [pandora:19255] [ 6] IMB-MPI1[0x407155] >> 
[pandora:19255] [ 7] IMB-MPI1[0x4022ea] >> [pandora:19255] [ 8] >> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5] >> [pandora:19255] [ 9] IMB-MPI1[0x401d49] >> [pandora:19255] *** End of error message *** >> [phoebe:12418] *** Process received signal *** >> [phoebe:12418] Signal: Segmentation fault (11) >> [phoebe:12418] Signal code: Address not mapped (1) >> [phoebe:12418] Failing at address: 0x7f5ce27dfff0 >> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680] >> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0] >> [phoebe:12418] [ 2] >> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55] >> [phoebe:12418] [ 3] >> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b] >> [phoebe:12418] [ 4] >> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7] >> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b] >
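Howard's UCX suggestion above boils down to roughly the following; the source directory and install prefix are assumptions for this setup, while `--with-ucx` is the standard Open MPI configure option:

```shell
# Check whether MOFED already provides UCX (per Howard's note):
/usr/bin/ofed_rpm_info | grep -i ucx

# Rebuild Open MPI with UCX support; paths here are placeholders,
# adjust to your source tree and install locations.
cd openmpi-4.0.0
./configure --prefix=/opt/openmpi/4.0.0 --with-ucx=/usr
make -j 8 && make install

# With UCX built in, Open MPI prefers it over the openib BTL by
# default; it can also be requested explicitly:
mpirun --mca pml ucx -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
```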
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
On the TCP side it doesn't seg fault anymore, but it will time out on some tests; on the openib side it will still seg fault. Here is the output: [pandora:19256] *** Process received signal *** [pandora:19256] Signal: Segmentation fault (11) [pandora:19256] Signal code: Address not mapped (1) [pandora:19256] Failing at address: 0x7f911c69fff0 [pandora:19255] *** Process received signal *** [pandora:19255] Signal: Segmentation fault (11) [pandora:19255] Signal code: Address not mapped (1) [pandora:19255] Failing at address: 0x7ff09cd3fff0 [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680] [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0] [pandora:19256] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55] [pandora:19256] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b] [pandora:19256] [ 4] [pandora:19255] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680] [pandora:19255] [ 1] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7] [pandora:19256] [ 5] IMB-MPI1[0x40b83b] [pandora:19256] [ 6] IMB-MPI1[0x407155] [pandora:19256] [ 7] IMB-MPI1[0x4022ea] [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0] [pandora:19255] [ 2] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5] [pandora:19256] [ 9] IMB-MPI1[0x401d49] [pandora:19256] *** End of error message *** /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55] [pandora:19255] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b] [pandora:19255] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7] [pandora:19255] [ 5] IMB-MPI1[0x40b83b] [pandora:19255] [ 6] IMB-MPI1[0x407155] [pandora:19255] [ 7] IMB-MPI1[0x4022ea] [pandora:19255] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5] [pandora:19255] [ 9] IMB-MPI1[0x401d49] [pandora:19255] *** End of
error message *** [phoebe:12418] *** Process received signal *** [phoebe:12418] Signal: Segmentation fault (11) [phoebe:12418] Signal code: Address not mapped (1) [phoebe:12418] Failing at address: 0x7f5ce27dfff0 [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680] [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0] [phoebe:12418] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55] [phoebe:12418] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b] [phoebe:12418] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7] [phoebe:12418] [ 5] IMB-MPI1[0x40b83b] [phoebe:12418] [ 6] IMB-MPI1[0x407155] [phoebe:12418] [ 7] IMB-MPI1[0x4022ea] [phoebe:12418] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5] [phoebe:12418] [ 9] IMB-MPI1[0x401d49] [phoebe:12418] *** End of error message *** -- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -- -- mpirun noticed that process rank 0 with PID 0 on node pandora exited on signal 11 (Segmentation fault). -- - Adam LeBlanc On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users < users@lists.open-mpi.org> wrote: > Can you try the latest 4.0.x nightly snapshot and see if the problem still > occurs? 
> > https://www.open-mpi.org/nightly/v4.0.x/ > > > > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc wrote: > > > > I do here is the output: > > > > 2 total processes killed (some possibly by mpirun during cleanup) > > [pandora:12238] *** Process received signal *** > > [pandora:12238] Signal: Segmentation fault (11) > > [pandora:12238] Signal code: Invalid permissions (2) > > [pandora:12238] Failing at address: 0x7f5c8e31fff0 > > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680] > > [pandora:12238] [ 1] [pandora:12237] *** Process received signal *** > > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0] > > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11) > > [pandora:12237] Signal code: Invalid permissions (2) > > [pandora:12237] Failing at address: 0x7f6c4ab3fff0 > > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55] > > [pandora:12238] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b] > > [pandora:12238] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7] > > [pandora:12238] [ 5] IMB-MPI1[0x40b83b] > >
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
I do. Here is the output: 2 total processes killed (some possibly by mpirun during cleanup) [pandora:12238] *** Process received signal *** [pandora:12238] Signal: Segmentation fault (11) [pandora:12238] Signal code: Invalid permissions (2) [pandora:12238] Failing at address: 0x7f5c8e31fff0 [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680] [pandora:12238] [ 1] [pandora:12237] *** Process received signal *** /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0] [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11) [pandora:12237] Signal code: Invalid permissions (2) [pandora:12237] Failing at address: 0x7f6c4ab3fff0 /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55] [pandora:12238] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b] [pandora:12238] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7] [pandora:12238] [ 5] IMB-MPI1[0x40b83b] [pandora:12238] [ 6] IMB-MPI1[0x407155] [pandora:12238] [ 7] IMB-MPI1[0x4022ea] [pandora:12238] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5] [pandora:12238] [ 9] IMB-MPI1[0x401d49] [pandora:12238] *** End of error message *** [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680] [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0] [pandora:12237] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55] [pandora:12237] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b] [pandora:12237] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7] [pandora:12237] [ 5] IMB-MPI1[0x40b83b] [pandora:12237] [ 6] IMB-MPI1[0x407155] [pandora:12237] [ 7] IMB-MPI1[0x4022ea] [pandora:12237] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5] [pandora:12237] [ 9] IMB-MPI1[0x401d49] [pandora:12237] *** End of error message *** [phoebe:07408] *** Process
received signal *** [phoebe:07408] Signal: Segmentation fault (11) [phoebe:07408] Signal code: Invalid permissions (2) [phoebe:07408] Failing at address: 0x7f6b9ca9fff0 [titan:07169] *** Process received signal *** [titan:07169] Signal: Segmentation fault (11) [titan:07169] Signal code: Invalid permissions (2) [titan:07169] Failing at address: 0x7fc01295fff0 [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680] [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0] [phoebe:07408] [ 2] [titan:07169] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680] [titan:07169] [ 1] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55] [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0] [titan:07169] [ 2] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b] [phoebe:07408] [ 4] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55] [titan:07169] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7] [phoebe:07408] [ 5] IMB-MPI1[0x40b83b] [phoebe:07408] [ 6] IMB-MPI1[0x407155] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b] [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea] [phoebe:07408] [ 8] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7] [titan:07169] [ 5] IMB-MPI1[0x40b83b] [titan:07169] [ 6] IMB-MPI1[0x407155] [titan:07169] [ 7] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5] [phoebe:07408] [ 9] IMB-MPI1[0x401d49] [phoebe:07408] *** End of error message *** IMB-MPI1[0x4022ea] [titan:07169] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5] [titan:07169] [ 9] IMB-MPI1[0x401d49] [titan:07169] *** End of error message *** -- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. 
-- -- mpirun noticed that process rank 0 with PID 0 on node pandora exited on signal 11 (Segmentation fault). -- - Adam LeBlanc On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard wrote: > Hi Adam, > > As a sanity check, if you try to use --mca btl self,vader,tcp > > do you still see the segmentation fault? > > Howard > > > On Wed, Feb 20, 2019 at 8:50 AM Adam LeBlanc < > alebl...@iol.unh.edu> wrote: > >> Hello, >> >> When I do a run with OpenMPI v4.0.0 on Infiniband with this command: >> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca >> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca >> btl_openib_allow_ib 1 -
[OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
. -- -- mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited on signal 11 (Segmentation fault). -- Also, if I reinstall 3.1.2, I do not have this issue at all. Any thoughts on what could be the issue? Thanks, Adam LeBlanc
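A side note on the flattened backtraces in this thread: they name the frame (`ompi_coll_base_reduce_scatter_intra_ring`) but carry no source lines. One way to narrow such a crash down, sketched here with placeholder paths and assuming IMB and Open MPI are rebuilt with `-g`, is to let the failing rank produce a core file and inspect it:

```shell
# Allow core dumps in the shell that launches mpirun.
ulimit -c unlimited

# Re-run the failing case (command abbreviated; use the full mpirun
# line from the report).
mpirun -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1

# Open the resulting core file; its name and location depend on the
# system's core_pattern setting.
gdb ./IMB-MPI1 core.<pid>
# (gdb) bt full    # file/line info appears if everything is built with -g
```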
Re: [OMPI users] Bug with Open-MPI Processor Count
Hello Ralph, Is there any update on this? Thanks, Adam LeBlanc On Fri, Nov 2, 2018 at 11:06 AM Adam LeBlanc wrote: > Hello Ralph, > > When I do the -np 7 it still fails with "There are not enough slots > available in the system to satisfy the 7 slots that were requested by the > application", but when I do -np 2 it will actually run from a machine that > was failing but will only run on one other machine and in this case it ran > from a machine with 2 processors to a machine with only 1 processor. If I > try to make -np higher then 2 it will also fail. > > -Adam LeBlanc > > On Thu, Nov 1, 2018 at 3:53 PM Ralph H Castain wrote: > >> Hmmm - try adding a value for nprocs instead of leaving it blank. Say >> “-np 7” >> >> Sent from my iPhone >> >> On Nov 1, 2018, at 11:56 AM, Adam LeBlanc wrote: >> >> Hello Ralph, >> >> Here is the output for a failing machine: >> >> [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca >> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 >> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues >> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca >> ras_base_verbose 5 IMB-MPI1 >> >> == ALLOCATED NODES == >> farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP >> hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> = >> -- >> There are not enough slots available in the system to satisfy the 7 slots >> that were requested by the application: >> 10 >> >> Either request fewer slots for your application, or make more slots >> available >> for use. 
>> -- >> >> >> Here is an output of a passing machine: >> >> [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca >> btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 >> --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues >> P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca >> ras_base_verbose 5 IMB-MPI1 >> >> == ALLOCATED NODES == >> hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP >> farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN >> = >> >> >> Yes the hostfile is available on all nodes through an NFS mount for all >> of our home directories. >> >> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc wrote: >> >>> >>> >>> -- Forwarded message - >>> From: Ralph H Castain >>> Date: Thu, Nov 1, 2018 at 2:34 PM >>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count >>> To: Open MPI Users >>> >>> >>> I’m a little under the weather and so will only be able to help a bit at >>> a time. However, a couple of things to check: >>> >>> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought >>> the allocation was >>> >>> * is the hostfile available on every node? >>> >>> Ralph >>> >>> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc wrote: >>> >>> Hello Ralph, >>> >>> Attached below is the verbose output for a failing machine and a passing >>> machine. >>> >>> Thanks, >>> Adam LeBlanc >>> >>> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc >>> wrote: >>> >>>> >>>> >>>> -- Forwarded message - >>>> From: R
Re: [OMPI users] Bug with Open-MPI Processor Count
Hello Ralph, When I do the -np 7 it still fails with "There are not enough slots available in the system to satisfy the 7 slots that were requested by the application", but when I do -np 2 it will actually run from a machine that was failing; it will only run on one other machine, though, and in this case it ran from a machine with 2 processors to a machine with only 1 processor. If I try to make -np higher than 2 it will also fail. -Adam LeBlanc On Thu, Nov 1, 2018 at 3:53 PM Ralph H Castain wrote: > Hmmm - try adding a value for nprocs instead of leaving it blank. Say “-np > 7” > > Sent from my iPhone > > On Nov 1, 2018, at 11:56 AM, Adam LeBlanc wrote: > > Hello Ralph, > > Here is the output for a failing machine: > > [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca > btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 > --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues > P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca > ras_base_verbose 5 IMB-MPI1 > > == ALLOCATED NODES == > farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP > hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > = > -- > There are not enough slots available in the system to satisfy the 7 slots > that were requested by the application: > 10 > > Either request fewer slots for your application, or make more slots > available > for use. 
> -- > > > Here is an output of a passing machine: > > [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca > btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 > --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues > P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca > ras_base_verbose 5 IMB-MPI1 > > == ALLOCATED NODES == > hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP > farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN > ===== > > > Yes the hostfile is available on all nodes through an NFS mount for all of > our home directories. > > On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc wrote: > >> >> >> -- Forwarded message - >> From: Ralph H Castain >> Date: Thu, Nov 1, 2018 at 2:34 PM >> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count >> To: Open MPI Users >> >> >> I’m a little under the weather and so will only be able to help a bit at >> a time. However, a couple of things to check: >> >> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought >> the allocation was >> >> * is the hostfile available on every node? >> >> Ralph >> >> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc wrote: >> >> Hello Ralph, >> >> Attached below is the verbose output for a failing machine and a passing >> machine. 
>> >> Thanks, >> Adam LeBlanc >> >> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc wrote: >> >>> >>> >>> -- Forwarded message - >>> From: Ralph H Castain >>> Date: Thu, Nov 1, 2018 at 1:07 PM >>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count >>> To: Open MPI Users >>> >>> >>> Set rmaps_base_verbose=10 for debugging output >>> >>> Sent from my iPhone >>> >>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc wrote: >>> >>> The version by the way for Open-MPI is 3.1.2. >>> >>> -Adam LeBlanc >>> >>> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc >>> wrote: &
Re: [OMPI users] Bug with Open-MPI Processor Count
Hello Ralph, Here is the output for a failing machine: [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1 == ALLOCATED NODES == farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN = -- There are not enough slots available in the system to satisfy the 7 slots that were requested by the application: 10 Either request fewer slots for your application, or make more slots available for use. -- Here is an output of a passing machine: [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1 == ALLOCATED NODES == hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN = Yes the hostfile is available on all nodes through an NFS mount for all of our home directories. 
On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc wrote: > > > -- Forwarded message - > From: Ralph H Castain > Date: Thu, Nov 1, 2018 at 2:34 PM > Subject: Re: [OMPI users] Bug with Open-MPI Processor Count > To: Open MPI Users > > > I’m a little under the weather and so will only be able to help a bit at a > time. However, a couple of things to check: > > * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought > the allocation was > > * is the hostfile available on every node? > > Ralph > > On Nov 1, 2018, at 10:55 AM, Adam LeBlanc wrote: > > Hello Ralph, > > Attached below is the verbose output for a failing machine and a passing > machine. > > Thanks, > Adam LeBlanc > > On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc wrote: > >> >> >> -- Forwarded message - >> From: Ralph H Castain >> Date: Thu, Nov 1, 2018 at 1:07 PM >> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count >> To: Open MPI Users >> >> >> Set rmaps_base_verbose=10 for debugging output >> >> Sent from my iPhone >> >> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc wrote: >> >> The version by the way for Open-MPI is 3.1.2. >> >> -Adam LeBlanc >> >> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc >> wrote: >> >>> Hello, I am an employee of the UNH InterOperability Lab, and we are in >>> the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have >>> purchased some new hardware that has one processor, and noticed an issue >>> when running mpi jobs on nodes that do not have similar processor counts. >>> If we launch the MPI job from a node that has 2 processors, it will fail >>> and stating there are not enough resources and will not start the run, like >>> so: >>> -- >>> There are not enough slots available in the system to satisfy the 14 slots >>> that were requested by the application: IMB-MPI1 Either request fewer >>> slots for your application, or make more slots available for use. 
>>> -- >>> If we launch the MPI job from the node with one processor, without changing >>> the mpirun command at all, it runs as expected. Here is the command being >>&
Re: [OMPI users] Bug with Open-MPI Processor Count
Hello Ralph, Attached below is the verbose output for a failing machine and a passing machine. Thanks, Adam LeBlanc On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc wrote: > > > -- Forwarded message - > From: Ralph H Castain > Date: Thu, Nov 1, 2018 at 1:07 PM > Subject: Re: [OMPI users] Bug with Open-MPI Processor Count > To: Open MPI Users > > > Set rmaps_base_verbose=10 for debugging output > > Sent from my iPhone > > On Nov 1, 2018, at 9:31 AM, Adam LeBlanc wrote: > > The version by the way for Open-MPI is 3.1.2. > > -Adam LeBlanc > > On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc wrote: > >> Hello, I am an employee of the UNH InterOperability Lab, and we are in >> the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have >> purchased some new hardware that has one processor, and noticed an issue >> when running mpi jobs on nodes that do not have similar processor counts. >> If we launch the MPI job from a node that has 2 processors, it will fail >> and stating there are not enough resources and will not start the run, like >> so: >> -- >> There are not enough slots available in the system to satisfy the 14 slots >> that were requested by the application: IMB-MPI1 Either request fewer >> slots for your application, or make more slots available for use. >> -- >> If we launch the MPI job from the node with one processor, without changing >> the mpirun command at all, it runs as expected. 
Here is the command being >> run: mpirun --mca btl_openib_warn_no_device_params_found 0 --mca >> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca >> btl_openib_receive_queues P,65536,120,64,32 -hostfile >> /home/soesterreich/ce-mpi-hosts IMB-MPI1 Here is the hostfile being used: >> farbauti-ce.ofa.iol.unh.edu slots=1 hyperion-ce.ofa.iol.unh.edu slots=1 >> io-ce.ofa.iol.unh.edu slots=1 jarnsaxa-ce.ofa.iol.unh.edu slots=1 >> rhea-ce.ofa.iol.unh.edu slots=1 tarqeq-ce.ofa.iol.unh.edu slots=1 >> tarvos-ce.ofa.iol.unh.edu slots=1 This seems like a bug and we would >> like some help to explain and fix what is happening. The IBTA plugfest saw >> similar behaviours, so this should be reproduceable. Thanks, Adam LeBlanc >> > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > [0_01:49:28_aleblanc@hyperion]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca rmaps_base_verbose 10 IMB-MPI1 [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: registering framework rmaps components [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded component resilient [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component resilient register function successful [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded component seq [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component seq register function successful [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded component ppr [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component ppr register function successful 
[hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded component mindist [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component mindist register function successful [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded component round_robin [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component round_robin register function successful [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: found loaded component rank_file [hyperion.ofa.iol.unh.edu:05190] mca: base: components_register: component rank_file register function successful [hyperion.ofa.iol.unh.edu:05190] [[63394,0],0] rmaps:base set policy with NULL device NONNULL [hyperion.ofa.iol.unh.edu:05190] mca: base: components_open: opening rmaps components [hyperion.ofa.iol.unh.edu:05190] mca: base: components_open: found loaded component resilient [hyperion.ofa.iol.unh.edu:05190] mca: base: components_open: component resilient open function successful [hyperion.ofa.iol.unh.edu:05190]
Re: [OMPI users] Bug with Open-MPI Processor Count
The Open MPI version, by the way, is 3.1.2. -Adam LeBlanc On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc wrote: > Hello, I am an employee of the UNH InterOperability Lab, and we are in the > process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have > purchased some new hardware that has one processor, and noticed an issue > when running MPI jobs on nodes that do not have similar processor counts. > If we launch the MPI job from a node that has 2 processors, it will fail, > stating there are not enough resources, and will not start the run, like > so: > -- > There are not enough slots available in the system to satisfy the 14 slots > that were requested by the application: IMB-MPI1 Either request fewer > slots for your application, or make more slots available for use. > -- > If we launch the MPI job from the node with one processor, without changing > the mpirun command at all, it runs as expected. Here is the command being > run: mpirun --mca btl_openib_warn_no_device_params_found 0 --mca > orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca > btl_openib_receive_queues P,65536,120,64,32 -hostfile > /home/soesterreich/ce-mpi-hosts IMB-MPI1 Here is the hostfile being used: > farbauti-ce.ofa.iol.unh.edu slots=1 hyperion-ce.ofa.iol.unh.edu slots=1 > io-ce.ofa.iol.unh.edu slots=1 jarnsaxa-ce.ofa.iol.unh.edu slots=1 > rhea-ce.ofa.iol.unh.edu slots=1 tarqeq-ce.ofa.iol.unh.edu slots=1 > tarvos-ce.ofa.iol.unh.edu slots=1 This seems like a bug and we would like > some help to explain and fix what is happening. The IBTA plugfest saw > similar behaviours, so this should be reproducible. Thanks, Adam LeBlanc
[OMPI users] Bug with Open-MPI Processor Count
Hello,

I am an employee of the UNH InterOperability Lab, and we are in the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have purchased some new hardware that has one processor, and noticed an issue when running MPI jobs on nodes that do not have similar processor counts. If we launch the MPI job from a node that has 2 processors, it will fail, stating there are not enough resources, and will not start the run, like so:

--
There are not enough slots available in the system to satisfy the 14 slots that were requested by the application: IMB-MPI1

Either request fewer slots for your application, or make more slots available for use.
--

If we launch the MPI job from the node with one processor, without changing the mpirun command at all, it runs as expected. Here is the command being run:

mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

Here is the hostfile being used:

farbauti-ce.ofa.iol.unh.edu slots=1
hyperion-ce.ofa.iol.unh.edu slots=1
io-ce.ofa.iol.unh.edu slots=1
jarnsaxa-ce.ofa.iol.unh.edu slots=1
rhea-ce.ofa.iol.unh.edu slots=1
tarqeq-ce.ofa.iol.unh.edu slots=1
tarvos-ce.ofa.iol.unh.edu slots=1

This seems like a bug and we would like some help to explain and fix what is happening. The IBTA plugfest saw similar behaviours, so this should be reproducible.

Thanks,
Adam LeBlanc
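A closing note on the slot arithmetic: with a hostfile like the one above, mpirun normally just sums the declared `slots=` values, so seven hosts at one slot each can hold at most seven ranks without oversubscription (this does not explain the launch-node asymmetry reported here; it only illustrates the check). A quick sketch; `--oversubscribe` is a standard mpirun flag shown as one possible workaround:

```shell
# Recreate the hostfile from the report and count its declared slots.
cat > /tmp/ce-mpi-hosts <<'EOF'
farbauti-ce.ofa.iol.unh.edu slots=1
hyperion-ce.ofa.iol.unh.edu slots=1
io-ce.ofa.iol.unh.edu slots=1
jarnsaxa-ce.ofa.iol.unh.edu slots=1
rhea-ce.ofa.iol.unh.edu slots=1
tarqeq-ce.ofa.iol.unh.edu slots=1
tarvos-ce.ofa.iol.unh.edu slots=1
EOF
grep -c 'slots=' /tmp/ce-mpi-hosts   # prints 7: the total ranks that fit
# A 14-rank job then needs either more slots declared per host
# (e.g. "slots=2") or explicit oversubscription:
#   mpirun --oversubscribe -np 14 -hostfile /tmp/ce-mpi-hosts IMB-MPI1
```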