Nathan,

Has this bug always been present in OpenIB, or is it a recent addition? If
this is a regression, I would also be inclined to call it a blocker
for 1.8.4. This is a SIGNIFICANT bug. Both Howard and I were quite
surprised that, in all the time this code has been in use on production
systems at LANL, this issue was never discovered.

Once again, many thanks to Alina for discovering and reporting this. Keep
up the MTT vigilance!

Josh



On Tuesday, November 4, 2014, Joshua Ladd <jladd.m...@gmail.com> wrote:

> Thanks, Nathan. After a bit more investigation yesterday, we reached the
> same conclusion: this is a longstanding bug in the OpenIB BTL, and we only
> started triggering the broken flow after some recent changes to the
> default max_lmc parameter. Let us know if you need anything from our
> end.
>
> Josh
>
> On Mon, Nov 3, 2014 at 6:03 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
>>
>> I see the problem. The openib btl does not properly handle the following
>> call sequence (this is an openib btl bug IMHO):
>>
>> btl_sendi (..., &descriptor);
>> btl_free (..., descriptor);
>>
>> The bug is in the message coalescing code and it looks like extra logic
>> needs to be added to the openib btl's btl_free function for this to work
>> properly. I am working on a fix now.
>>
>> -Nathan
>>
>> On Mon, Nov 03, 2014 at 04:26:10PM +0200, Alina Sklarevich wrote:
>> >    Hi,
>> >    On 1.8.4rc1 we observe the following assert in the osu_mbw_mr test when
>> >    using the openib BTL.
>> >    When compiled in production mode (i.e. no --enable-debug) the test simply
>> >    hangs.
>> >    When using either the tcp BTL or the cm PML, the benchmark completes
>> >    without error.
>> >    The command line to reproduce this is:
>> >    $ mpirun --bind-to core -display-map -mca btl_openib_if_include mlx5_0:1 -np 2 -mca pml ob1 -mca btl openib,self,sm ./osu_mbw_mr
>> >    # OSU MPI Multiple Bandwidth / Message Rate Test v4.4
>> >    # [ pairs: 1 ] [ window size: 64 ]
>> >    # Size                  MB/s        Messages/s
>> >    osu_mbw_mr: ../../../../opal/class/opal_list.h:547: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.
>> >    [vegas15:30395] *** Process received signal ***
>> >    [vegas15:30395] Signal: Aborted (6)
>> >    [vegas15:30395] Signal code:  (-6)
>> >    [vegas15:30395] [ 0] /lib64/libpthread.so.0[0x30bc40f500]
>> >    [vegas15:30395] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x30bc0328a5]
>> >    [vegas15:30395] [ 2] /lib64/libc.so.6(abort+0x175)[0x30bc034085]
>> >    [vegas15:30395] [ 3] /lib64/libc.so.6[0x30bc02ba1e]
>> >    [vegas15:30395] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x30bc02bae0]
>> >    [vegas15:30395] [ 5] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(+0x9087)[0x7ffff3f70087]
>> >    [vegas15:30395] [ 6] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x403)[0x7ffff3f754b3]
>> >    [vegas15:30395] [ 7] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0xf9e)[0x7ffff3f785b4]
>> >    [vegas15:30395] [ 8] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xed08)[0x7ffff3308d08]
>> >    [vegas15:30395] [ 9] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xf8ba)[0x7ffff33098ba]
>> >    [vegas15:30395] [10] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x108)[0x7ffff3309a1f]
>> >    [vegas15:30395] [11] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/libmpi.so.1(MPI_Isend+0x2ec)[0x7ffff7cff5e8]
>> >    [vegas15:30395] [12] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400fa4]
>> >    [vegas15:30395] [13] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x40167d]
>> >    [vegas15:30395] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30bc01ecdd]
>> >    [vegas15:30395] [15] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400db9]
>> >    [vegas15:30395] *** End of error message ***
>> >    --------------------------------------------------------------------------
>> >    mpirun noticed that process rank 0 with PID 30395 on node vegas15 exited
>> >    on signal 6 (Aborted).
>> >    --------------------------------------------------------------------------
>> >    Thanks,
>> >    Alina.
>>
>> > _______________________________________________
>> > devel mailing list
>> > de...@open-mpi.org
>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/11/16142.php
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/11/16159.php
>>
>
>
