This looks like a bug either in the convertor code or in how the send request is set up. r26597 ensures correctness when the BTL's sendi does all three of the following: returns an error, changes the convertor, and returns a descriptor.
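For illustration only, here is a minimal C sketch of the kind of guard in question: a reset that rewinds the convertor before the send is retried, skipped for zero-byte messages so that a NULL datatype description is never touched. The toy_convertor_t type and both toy_ functions are hypothetical stand-ins, not the actual ob1 or opal code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a convertor.
 * This is NOT Open MPI's real opal_convertor_t. */
typedef struct {
    const int *desc;         /* datatype description; NULL if none attached */
    size_t     length;       /* number of entries in desc */
    size_t     position;     /* current offset into the packed stream */
    size_t     stack_resets; /* how many times the slow path has run */
} toy_convertor_t;

/* Slow path: rebuilds the position state from desc, so desc must be valid.
 * The assert marks the spot where a NULL desc would be dereferenced. */
static void toy_set_position_nocheck(toy_convertor_t *c, size_t position)
{
    assert(c->desc != NULL);
    c->stack_resets++;
    c->position = position;
}

/* Guarded reset: only rewind when there is data to send.
 * A zero-byte message (for example, a barrier) never reaches the slow path. */
static void toy_reset_if_needed(toy_convertor_t *c, size_t size)
{
    if (size > 0) {
        toy_set_position_nocheck(c, 0);
    }
}
```

With this shape, a zero-byte barrier message leaves the convertor alone, while a non-empty message is still rewound to position 0 before it is packed again.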
Until we can find the root cause, I pushed a change that protects the reset by checking if size > 0. Let me know if that works for you.

-Nathan

On Fri, Jun 15, 2012 at 12:34:32PM -0400, Eugene Loh wrote:
> Backing out r26597 solves my particular test cases. I'll back it
> out of the trunk as well unless someone has objections.
>
> I like how you say "same segfault." In certain cases, I just go on
> to different segfaults. E.g.,
>
>   [2] btl_openib_handle_incoming(openib_btl, ep, frag, byte_len =
>       20U), line 3208 in "btl_openib_component.c"
>   [3] handle_wc(device, cq = 0, wc), line 3516 in "btl_openib_component.c"
>   [4] poll_device(device, count = 1), line 3654 in "btl_openib_component.c"
>   [5] progress_one_device(device), line 3762 in "btl_openib_component.c"
>   [6] btl_openib_component_progress(), line 3787 in
>       "btl_openib_component.c"
>   [7] opal_progress(), line 207 in "opal_progress.c"
>   [8] opal_condition_wait(c, m), line 100 in "condition.h"
>   [9] ompi_request_default_wait_all(count = 2U, requests, statuses),
>       line 281 in "req_wait.c"
>   [10] ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0,
>       sdatatype, dest = 0, stag = -16, recvbuf = (nil), rcount = 0,
>       rdatatype, source = 0, rtag = -16, comm, status = (nil)), line 54 in
>       "coll_tuned_util.c"
>   [11] ompi_coll_tuned_barrier_intra_recursivedoubling(comm,
>       module), line 172 in "coll_tuned_barrier.c"
>   [12] ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line
>       207 in "coll_tuned_decision_fixed.c"
>   [13] PMPI_Barrier(comm = 0x518370), line 62 in "pbarrier.c"
>
> The reg->cbfunc is NULL. I'm still considering whether that's an
> artifact of how I build that particular case, though.
>
> On 06/15/12 09:44, George Bosilca wrote:
> >There should be no datatype attached to the barrier, so it is normal you get
> >the zero values in the convertor.
> >
> >Something weird is definitely going on. As there is no data to be sent,
> >the opal_convertor_set_position function is supposed to trigger the special
> >path, mark the convertor as completed and return successfully. However, this
> >seems not to be the case anymore as in your backtrace I see the call to
> >opal_convertor_set_position_nocheck, which only happens if the above
> >described test fails.
> >
> >I had some doubts about r26597, but I don't have time to check into it until
> >Monday. Maybe you can remove it and see if you continue to have the same
> >segfault.
> >
> >  george.
> >
> >On Jun 15, 2012, at 01:24, Eugene Loh wrote:
> >
> >>I see a segfault show up in trunk testing starting with r26598 when tests
> >>like
> >>
> >>    ibm collective/struct_gatherv
> >>    intel src/MPI_Type_free_[types|pending_msg]_[f|c]
> >>
> >>are run over openib. Here is a typical stack trace:
> >>
> >>  opal_convertor_create_stack_at_begining(convertor = 0x689730, sizes),
> >>      line 404 in "opal_convertor.c"
> >>  opal_convertor_set_position_nocheck(convertor = 0x689730, position),
> >>      line 423 in "opal_convertor.c"
> >>  opal_convertor_set_position(convertor = 0x689730, position =
> >>      0x7fffc36e0bf0), line 321 in "opal_convertor.h"
> >>  mca_pml_ob1_send_request_start_copy(sendreq, bml_btl = 0x6a3ea0, size =
> >>      0), line 485 in "pml_ob1_sendreq.c"
> >>  mca_pml_ob1_send_request_start_btl(sendreq, bml_btl), line 387 in
> >>      "pml_ob1_sendreq.h"
> >>  mca_pml_ob1_send_request_start(sendreq = 0x689680), line 458 in
> >>      "pml_ob1_sendreq.h"
> >>  mca_pml_ob1_isend(buf = (nil), count = 0, datatype, dst = 2, tag = -16,
> >>      sendmode = MCA_PML_BASE_SEND_STANDARD, comm, request), line 87 in
> >>      "pml_ob1_isend.c"
> >>  ompi_coll_tuned_sendrecv_actual(sendbuf = (nil), scount = 0, sdatatype,
> >>      dest = 2, stag = -16, recvbuf = (nil), rcount = 0, rdatatype, source = 2,
> >>      rtag = -16, comm, status = (nil)), line 51 in "coll_tuned_util.c"
> >>  ompi_coll_tuned_barrier_intra_recursivedoubling(comm, module), line 172
> >>      in "coll_tuned_barrier.c"
> >>  ompi_coll_tuned_barrier_intra_dec_fixed(comm, module), line 207 in
> >>      "coll_tuned_decision_fixed.c"
> >>  PMPI_Barrier(comm = 0x5195a0), line 62 in "pbarrier.c"
> >>  main(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x403219
> >>
> >>The fact that some derived data types were sent before seems to have
> >>something to do with it. I see this sort of problem cropping up in Cisco
> >>and Oracle testing. Up at the level of pml_ob1_send_request_start_copy, at
> >>line 485:
> >>
> >>    MCA_PML_OB1_SEND_REQUEST_RESET(sendreq);
> >>
> >>I see
> >>
> >>    *sendreq->req_send.req_base.req_convertor.use_desc = {
> >>        length = 0
> >>        used   = 0
> >>        desc   = (nil)
> >>    }
> >>
> >>and I guess that desc=NULL is causing the segfault at opal_convertor.c line
> >>404.
> >>
> >>Anyhow, I'm trudging along, but thought I would share at least that much
> >>with you helpful folks in case any of this is ringing a bell.