I opened #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
and attached a patch for the v1.8 branch.
I ran several tests from the intel_tests test suite and did not observe
any regression.
Please note there are still issues when running with --mca btl
scif,vader,self.
This might be another issue; I will investigate more next week.
Gilles
On 2014/05/09 18:08, Gilles Gouaillardet wrote:
> I ran some more investigations with --mca btl scif,self
>
> I found that the previous patch I posted was complete crap, and I
> apologize for it.
>
> On a brighter side, and IMHO, the issue only occurs if fragments are
> received (and then processed) out of order.
> /* I did not observe this with the tcp btl, but I always see it with
> the scif btl; I guess this can be observed with openib+RDMA too */
>
> In this case only, opal_convertor_generic_simple_position(...) is
> invoked, and it does not set pConvertor->pStack
> as expected by r31496.
>
> I will run some more tests now.
>
> Gilles
>
> On 2014/05/08 2:23, George Bosilca wrote:
>> Strange. The outcome and the timing of this issue seem to highlight a link
>> with the other datatype-related issue you reported earlier and, as Ralph
>> suggested, with Gilles' scif+vader issue.
>>
>> Generally speaking, the mechanism used to split the data in the case of
>> multiple BTLs, is identical to the one used to split the data in fragments.
>> So, if the culprit is in the splitting logic, one might see some weirdness
>> as soon as we force the exclusive usage of the send protocol, with an
>> unconventional fragment size.
>>
>> In other words, using the following flags "--mca btl tcp,self --mca
>> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23
>> --mca btl_tcp_max_send_size 23" should always transfer wrong data, even when
>> only one single BTL is in play.
>>
>> George.
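George's suggested reproducer, restated as a copy-pasteable command line (the quoted version above was mangled by an email client turning `--` into dashes). The test binary `MPI_Isend_ator_c` is the one from the intel_tests suite mentioned earlier; the idea is to force the send protocol (btl_tcp_flags 3) with an unconventional 23-byte fragment size so any bug in the fragment-splitting logic shows up even with a single BTL:

```shell
# Force send-protocol-only TCP transfers with 23-byte fragments; if the
# splitting logic is at fault, data corruption should appear even here.
mpirun -np 2 --mca btl tcp,self \
       --mca btl_tcp_flags 3 \
       --mca btl_tcp_rndv_eager_limit 23 \
       --mca btl_tcp_eager_limit 23 \
       --mca btl_tcp_max_send_size 23 \
       MPI_Isend_ator_c
```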
>>
>> On May 7, 2014, at 13:11, Rolf vandeVaart wrote:
>>
>>> OK. So, I investigated a little more. I only see the issue when I am
>>> running with multiple ports enabled such that I have two openib BTLs
>>> instantiated. In addition, large message RDMA has to be enabled. If those
>>> conditions are not met, then I do not see the problem. For example:
>>> FAILS:
>>> $ mpirun -np 2 -host host1,host2 --mca btl_openib_if_include
>>> mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>> PASS:
>>> $ mpirun -np 2 -host host1,host2 --mca btl_openib_if_include mlx5_0:1 --mca
>>> btl_openib_flags 3 MPI_Isend_ator_c
>>> $ mpirun -np 2 -host host1,host2 --mca
>>> btl_openib_if_include mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 1
>>> MPI_Isend_ator_c
>>>
>>> So we must have some type of issue when we break up the message between the
>>> two openib BTLs. Can someone else confirm my observations?
>>> I was testing against the latest trunk.
>>>