On Jun 15, 2012, at 20:59 , Nathan Hjelm wrote:

> Seems like either a bug in the convertor code or in setting up the send 
> request. r26597 ensures correctness in the case where the btl's sendi does 
> all three of the following: returns an error, changes the convertor, and 
> returns a descriptor.

None of the above. There is a shortcut in the PML that skips creating a 
convertor when the amount of data is zero. This shortcut saves a few tens of 
instructions in the critical path.
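
To make the interaction concrete, here is a minimal sketch in plain C. It is a
toy model under stated assumptions, not the ob1 code: every type and function
name below (toy_convertor_t, toy_send_request_t, toy_request_init,
toy_convertor_rewind, toy_request_reset) is hypothetical. It only illustrates
why rewinding a convertor that was never prepared for a zero-byte send
dereferences a NULL description, and how a size check around the reset avoids
that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the Open MPI types. */
typedef struct {
    const int *desc;     /* datatype description; NULL if never prepared */
    size_t     position; /* current byte offset in the packed stream */
} toy_convertor_t;

typedef struct {
    toy_convertor_t convertor;
    size_t          bytes_packed; /* total payload size of the request */
} toy_send_request_t;

/* Models the shortcut: a zero-byte send never prepares the convertor. */
void toy_request_init(toy_send_request_t *req, const int *desc, size_t bytes)
{
    req->bytes_packed = bytes;
    req->convertor.position = 0;
    req->convertor.desc = (bytes == 0) ? NULL : desc;
}

/* Rewinding reads the description, so it needs a prepared convertor. */
void toy_convertor_rewind(toy_convertor_t *conv)
{
    assert(conv->desc != NULL); /* an unguarded caller would crash here */
    (void)conv->desc[0];        /* stands in for walking the description */
    conv->position = 0;
}

/* Guarded reset: returns 1 if the convertor was rewound, 0 if skipped. */
int toy_request_reset(toy_send_request_t *req)
{
    if (req->bytes_packed == 0) {
        return 0; /* nothing to rewind; the convertor was never set up */
    }
    toy_convertor_rewind(&req->convertor);
    return 1;
}
```

In this model, calling toy_request_reset on a request initialized with zero
bytes returns 0 without touching the convertor, while a request with data is
rewound to position 0. Whether the guard belongs in the reset or in the
caller is a design choice this sketch does not settle.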

  george.



> 
> Until we can find the root cause I pushed a change that protects the reset by 
> checking if size > 0.
> 
> Let me know if that works for you.
> 
> -Nathan
> 
> On Fri, Jun 15, 2012 at 12:34:32PM -0400, Eugene Loh wrote:
>> Backing out r26597 solves my particular test cases.  I'll back it
>> out of the trunk as well unless someone has objections.
>> 
>> I like how you say "same segfault."  In certain cases, I just go on
>> to different segfaults.  E.g.,
>> 
>>  [2] btl_openib_handle_incoming(openib_btl, ep, frag, byte_len =
>> 20U), line 3208 in "btl_openib_component.c"
>>  [3] handle_wc(device, cq = 0, wc), line 3516 in "btl_openib_component.c"
>>  [4] poll_device(device, count = 1), line 3654 in "btl_openib_component.c"
>>  [5] progress_one_device(device), line 3762 in "btl_openib_component.c"
>>  [6] btl_openib_component_progress(), line 3787 in
>> "btl_openib_component.c"
>>  [7] opal_progress(), line 207 in "opal_progress.c"
>>  [8] opal_condition_wait(c, m), line 100 in "condition.h"
>>  [9] ompi_request_default_wait_all(count = 2U, requests, statuses),
>> line 281 in "req_wait.c"
>>  [10] ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0,
>> sdatatype, dest = 0, stag = -16, recvbuf = (nil), rcount = 0,
>> rdatatype, source = 0, rtag = -16, comm, status = (nil)), line 54 in
>> "coll_tuned_util.c"
>>  [11] ompi_coll_tuned_barrier_intra_recursivedoubling(comm,
>> module), line 172 in "coll_tuned_barrier.c"
>>  [12] ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line
>> 207 in "coll_tuned_decision_fixed.c"
>>  [13] PMPI_Barrier(comm = 0x518370), line 62 in "pbarrier.c"
>> 
>> The reg->cbfunc is NULL.  I'm still considering whether that's an
>> artifact of how I build that particular case, though.
>> 
>> On 06/15/12 09:44, George Bosilca wrote:
>>> There should be no datatype attached to the barrier, so it is normal you 
>>> get the zero values in the convertor.
>>> 
>>> Something weird is definitely going on. As there is no data to be sent, 
>>> the opal_convertor_set_position function is supposed to trigger the special 
>>> path, mark the convertor as completed and return successfully. However, 
>>> this seems not to be the case anymore as in your backtrace I see the call 
>>> to opal_convertor_set_position_nocheck, which only happens if the above 
>>> described test fails.
>>> 
>>> I had some doubts about r26597, but I don't have time to check into it 
>>> until Monday. Maybe you can remove it and see if you continue to have the 
>>> same segfault.
>>> 
>>>  george.
>>> 
>>> On Jun 15, 2012, at 01:24 , Eugene Loh wrote:
>>> 
>>>> I see a segfault show up in trunk testing starting with r26598 when tests 
>>>> like
>>>> 
>>>>   ibm  collective/struct_gatherv
>>>>   intel src/MPI_Type_free_[types|pending_msg]_[f|c]
>>>> 
>>>> are run over openib.  Here is a typical stack trace:
>>>> 
>>>>  opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes), 
>>>> line 404 in "opal_convertor.c"
>>>>  opal_convertor_set_position_nocheck(convertor = 0x689730, position), line 
>>>> 423 in "opal_convertor.c"
>>>>  opal_convertor_set_position(convertor = 0x689730, position = 
>>>> 0x7fffc36e0bf0), line 321 in "opal_convertor.h"
>>>>  mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size = 
>>>> 0), line 485 in "pml_ob1_sendreq.c"
>>>>  mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in 
>>>> "pml_ob1_sendreq.h"
>>>>  mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in 
>>>> "pml_ob1_sendreq.h"
>>>>  mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16, 
>>>> sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in 
>>>> "pml_ob1_isend.c"
>>>>  ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype, 
>>>> dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2, 
>>>> rtag = -16, comm, status = (nil)), line 51 in "coll_tuned_util.c"
>>>>  ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172 
>>>> in "coll_tuned_barrier.c"
>>>>  ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in 
>>>> "coll_tuned_decision_fixed.c"
>>>>  PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
>>>>  main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219
>>>> 
>>>> The fact that some derived data types were sent before seems to have 
>>>> something to do with it.  I see this sort of problem cropping up in Cisco 
>>>> and Oracle testing.  Up at the level of pml_ob1_send_request_start_copy, 
>>>> at line 485:
>>>> 
>>>>  MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);
>>>> 
>>>> I see
>>>> 
>>>>   *sendreq->req_send.req_base.req_convertor.use_desc = {
>>>>       length = 0
>>>>       used   = 0
>>>>       desc   = (nil)
>>>>   }
>>>> 
>>>> and I guess that desc=NULL is causing the segfault at opal_convertor.c 
>>>> line 404.
>>>> 
>>>> Anyhow, I'm trudging along, but thought I would share at least that much 
>>>> with you helpful folks in case any of this is ringing a bell.
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

