It looks to me like an mxm-related failure?
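
The backtrace dies inside ompi_osc_rdma_component_query() during window
creation, so a minimal one-sided test along these lines (my own sketch, not
the actual IMB_EXT code, and untested on Summitdev) should exercise the same
MPI_Win_create path:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Win  win;
    MPI_Aint size = 4096;
    void    *buf;

    MPI_Init(&argc, &argv);
    buf = malloc(size);

    /* Expose a local buffer through an MPI window; this is the call path
     * shown in the backtrace: PMPI_Win_create -> ompi_win_create ->
     * ompi_osc_base_select -> ompi_osc_rdma_component_query. */
    MPI_Win_create(buf, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    MPI_Win_free(&win);

    free(buf);
    MPI_Finalize();
    return 0;
}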

On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. <valle...@ornl.gov>
wrote:

> Hi,
>
> I ran some tests on Summitdev here at ORNL:
> - the UCX problem is solved and I get the expected results for the tests
> that I am running (netpipe and IMB).
> - without UCX:
>         * the performance numbers are below what I would expect, but at
> this point I believe the slight deficiency is due to other users running
> on other parts of the system.
>         * I also encountered the following problem while running IMB_EXT;
> I now realize that I had the same problem with 2.4.1rc1 but did not catch
> it at the time:
> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
> ==== backtrace ====
>  2 0x0000000000073864 mxm_handle_error()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x0000000000073fa4 mxm_error_signal_handler()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x0000000000017b24 ompi_osc_rdma_component_query()
> osc_rdma_component.c:0
>  5 0x00000000000d4634 ompi_osc_base_select()  ??:0
>  6 0x0000000000065e84 ompi_win_create()  ??:0
>  7 0x00000000000a2488 PMPI_Win_create()  ??:0
>  8 0x000000001000b28c IMB_window()  ??:0
>  9 0x0000000010005764 IMB_init_buffers_iter()  ??:0
> 10 0x0000000010001ef8 main()  ??:0
> 11 0x0000000000024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x0000000000024b74 __libc_start_main()  ??:0
> ===================
> ==== backtrace ====
>  2 0x0000000000073864 mxm_handle_error()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x0000000000073fa4 mxm_error_signal_handler()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x0000000000017b24 ompi_osc_rdma_component_query()
> osc_rdma_component.c:0
>  5 0x00000000000d4634 ompi_osc_base_select()  ??:0
>  6 0x0000000000065e84 ompi_win_create()  ??:0
>  7 0x00000000000a2488 PMPI_Win_create()  ??:0
>  8 0x000000001000b28c IMB_window()  ??:0
>  9 0x0000000010005764 IMB_init_buffers_iter()  ??:0
> 10 0x0000000010001ef8 main()  ??:0
> 11 0x0000000000024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x0000000000024b74 __libc_start_main()  ??:0
> ===================
>
> FYI, the 2.x series is not important to me, so it can stay as is. I will
> move on to testing 3.1.2rc1.
>
> Thanks,
>
>
> > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel <devel@lists.open-mpi.org> wrote:
> >
> > Per our discussion over the weekend and on the weekly webex yesterday,
> > we're releasing v2.1.5.  There are only two changes:
> >
> > 1. A trivial link issue for UCX.
> > 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> >
> > - A subtle race condition bug was discovered in the "vader" BTL
> >  (shared memory communications) that, in rare instances, can cause
> >  MPI processes to crash or incorrectly classify (or effectively drop)
> >  an MPI message sent via shared memory.  If you are using the "ob1"
> >  PML with "vader" for shared memory communication (note that vader is
> >  the default for shared memory communication with ob1), you need to
> >  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
> >  following versions to fix this issue:
> >  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
> >    series
> >  - Open MPI v3.1.2 (expected end of August, 2018) or later
> >
> > This vader fix was deemed serious enough to warrant a 2.1.5 release.
> > This really will be the end of the 2.1.x series.  Trust me; my name is
> > Joe Isuzu.
> >
> > 2.1.5rc1 will be available from the usual location in a few minutes
> > (the website will update in about 7 minutes):
> >
> >    https://www.open-mpi.org/software/ompi/v2.1/
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
