r31904 should fix this issue. Please test it thoroughly and report any issues.
  George.

On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> I opened #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
> and attached a patch for the v1.8 branch.
>
> I ran several tests from the intel_tests test suite and did not observe
> any regression.
>
> Please note there are still issues when running with --mca btl
> scif,vader,self.
>
> This might be another issue; I will investigate more next week.
>
> Gilles
>
> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
>> I ran some more investigations with --mca btl scif,self.
>>
>> I found that the previous patch I posted was complete crap, and I
>> apologize for it.
>>
>> On the brighter side, and IMHO, the issue only occurs if fragments are
>> received (and then processed) out of order.
>> /* I did not observe this with the tcp BTL, but I always see it with
>> the scif BTL; I guess it can be observed with openib+RDMA too */
>>
>> In this case only, opal_convertor_generic_simple_position(...) is
>> invoked, and it does not set pConvertor->pStack as expected by r31496.
>>
>> I will run some more tests now.
>>
>> Gilles
>>
>> On 2014/05/08 2:23, George Bosilca wrote:
>>> Strange. The outcome and the timing of this issue seem to highlight a
>>> link with the other datatype-related issue you reported earlier and, as
>>> suggested by Ralph, with Gilles' scif+vader issue.
>>>
>>> Generally speaking, the mechanism used to split the data across
>>> multiple BTLs is identical to the one used to split the data into
>>> fragments. So, if the culprit is in the splitting logic, one might see
>>> some weirdness as soon as we force the exclusive usage of the send
>>> protocol with an unconventional fragment size.
>>>
>>> In other words, using the following flags "--mca btl tcp,self --mca
>>> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca
>>> btl_tcp_eager_limit 23 --mca btl_tcp_max_send_size 23" should always
>>> transfer wrong data, even when only a single BTL is in play.
>>>
>>> George.
>>>
>>> On May 7, 2014, at 13:11, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>
>>>> OK. So, I investigated a little more. I only see the issue when I am
>>>> running with multiple ports enabled, such that I have two openib BTLs
>>>> instantiated. In addition, large-message RDMA has to be enabled. If
>>>> those conditions are not met, then I do not see the problem. For
>>>> example:
>>>>
>>>> FAILS:
>>>>   mpirun -np 2 -host host1,host2 --mca btl_openib_if_include
>>>>     mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>>
>>>> PASS:
>>>>   mpirun -np 2 -host host1,host2 --mca btl_openib_if_include mlx5_0:1
>>>>     --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>>   mpirun -np 2 -host host1,host2 --mca btl_openib_if_include
>>>>     mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 1 MPI_Isend_ator_c
>>>>
>>>> So we must have some kind of issue when we break up the message between
>>>> the two openib BTLs. Maybe someone else can confirm my observations?
>>>> I was testing against the latest trunk.
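
To make the splitting argument above concrete, here is a minimal,
self-contained C sketch; it is illustrative only, not OMPI code. The
frag_t type, split() helper, and the sizes (which echo the 23-byte
limits from George's repro flags) are all made up for the example.

/*
 * Hypothetical sketch: both the multi-BTL stripe and the per-BTL
 * fragmentation reduce to cutting a message into (offset, length)
 * pieces and copying each piece back to its absolute offset on the
 * receive side.  FRAG_SIZE echoes the 23-byte limits from the thread.
 */
#include <stdio.h>
#include <string.h>

#define MSG_SIZE  100
#define FRAG_SIZE  23

/* One "wire" fragment: absolute offset into the message plus payload. */
typedef struct { size_t offset, length; char data[FRAG_SIZE]; } frag_t;

/* Cut [src, src + total) into FRAG_SIZE-sized pieces. */
static size_t split(const char *src, size_t total, frag_t *out)
{
    size_t n = 0;
    for (size_t off = 0; off < total; off += FRAG_SIZE, n++) {
        out[n].offset = off;
        out[n].length = (total - off < FRAG_SIZE) ? total - off : FRAG_SIZE;
        memcpy(out[n].data, src + off, out[n].length);
    }
    return n;
}

int main(void)
{
    char src[MSG_SIZE], dst[MSG_SIZE] = { 0 };
    frag_t frags[(MSG_SIZE + FRAG_SIZE - 1) / FRAG_SIZE];

    for (int i = 0; i < MSG_SIZE; i++)
        src[i] = (char)i;

    size_t n = split(src, MSG_SIZE, frags);

    /* Reassembly trusts each fragment's offset, never its arrival order. */
    for (size_t i = 0; i < n; i++)
        memcpy(dst + frags[i].offset, frags[i].data, frags[i].length);

    puts(memcmp(src, dst, MSG_SIZE) == 0 ? "data OK" : "data CORRUPT");
    return 0;
}

Compiled and run, this prints "data OK"; the point is that reassembly
keys on each fragment's absolute offset, so any bookkeeping error in
split() corrupts the same bytes whether one BTL or two carry the
fragments.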
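
And a companion sketch of the repositioning problem Gilles describes:
packing a non-contiguous (strided) layout keeps a cursor, and serving a
fragment that arrives out of order forces a seek to an absolute offset.
The cursor_t, reposition(), and pack() names below are hypothetical
stand-ins, not opal_convertor's real API; in OMPI the seek corresponds
to opal_convertor_generic_simple_position() updating pConvertor->pStack.

/*
 * Hypothetical sketch (illustrative names, not opal_convertor's API):
 * if the seek leaves the cursor stale (the behaviour reported above
 * after r31496), every byte packed afterwards comes from the wrong
 * place in the user buffer.
 */
#include <stdio.h>
#include <string.h>

#define BLOCKS   10
#define BLKLEN    7          /* contiguous bytes per block       */
#define STRIDE   16          /* distance between block starts    */
#define TOTAL   (BLOCKS * BLKLEN)
#define FRAG     23          /* eager limit from the repro flags */

typedef struct { size_t block, done; } cursor_t;  /* one-level "pStack" */

/* Recompute the cursor from an absolute packed offset (the "seek"). */
static void reposition(cursor_t *c, size_t offset)
{
    c->block = offset / BLKLEN;
    c->done  = offset % BLKLEN;
}

/* Pack `len` bytes, starting at the cursor, into `out`. */
static void pack(const char *user, cursor_t *c, char *out, size_t len)
{
    while (len > 0) {
        size_t avail = BLKLEN - c->done;
        size_t take  = (len < avail) ? len : avail;
        memcpy(out, user + c->block * STRIDE + c->done, take);
        out += take; len -= take; c->done += take;
        if (c->done == BLKLEN) { c->block++; c->done = 0; }
    }
}

int main(void)
{
    char user[BLOCKS * STRIDE], packed[TOTAL], expect[TOTAL];
    cursor_t c;

    for (size_t i = 0; i < sizeof(user); i++)
        user[i] = (char)i;
    for (size_t b = 0; b < BLOCKS; b++)          /* reference result */
        memcpy(expect + b * BLKLEN, user + b * STRIDE, BLKLEN);

    /* Serve the fragments in reverse order, as the scif BTL might. */
    for (size_t off = (TOTAL / FRAG) * FRAG; ; off -= FRAG) {
        size_t len = (TOTAL - off < FRAG) ? TOTAL - off : FRAG;
        reposition(&c, off);   /* a stale cursor here means corruption */
        pack(user, &c, packed + off, len);
        if (off == 0) break;
    }
    puts(memcmp(packed, expect, TOTAL) == 0 ? "data OK" : "data CORRUPT");
    return 0;
}

Delivering the 23-byte fragments in reverse forces a reposition()
before every pack(); if reposition() left the cursor stale, the program
would print "data CORRUPT", which matches the failure mode reported
with the scif BTL.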