Hi Jeff,

> Does your llp send path preserve MPI matching ordering? E.g. if some prior
> isend is already queued, could the llp send overtake it?

Yes, an LLP send may overtake a queued isend. But we use the correct PML
send_sequence, so the LLP message is simply held as an out-of-sequence
(unexpected) message on the receiver side until the earlier sends arrive,
and I think it's no problem. This is the sender-side code that keeps the
sequence consistent:

> >     rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >                            (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >                            ompi_comm_peer_lookup(comm, dst),
> >                            MCA_PML_OB1_HDR_TYPE_MATCH));
> >
> >     if (rc == OMPI_SUCCESS) {
> >         /* NOTE this is not thread safe */
> >         OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >     }
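On the receiver side, ordering is restored by hdr_seq before MPI matching
happens, so an overtaking LLP message just waits for the gap to fill. A
minimal sketch of that ordering logic (illustrative names and a bounded
reorder window; not the actual OB1 code):

------------------------------------------------

#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16              /* sketch only: bounded reorder window */

typedef struct {
    uint16_t hdr_seq;               /* per-peer sequence set by the sender */
    /* context id, tag, payload, ... omitted */
} frag_t;

typedef struct {
    uint16_t expected_seq;          /* next sequence we may match */
    frag_t  *early[MAX_PENDING];    /* frags that overtook earlier sends */
} peer_state_t;

/* stand-in for the normal MPI matching / unexpected-queue logic */
static void match_or_queue_unexpected(frag_t *frag) { (void)frag; }

static void handle_incoming(peer_state_t *peer, frag_t *frag)
{
    if (frag->hdr_seq != peer->expected_seq) {
        /* Overtaking message: park it, don't match it yet. */
        peer->early[frag->hdr_seq % MAX_PENDING] = frag;
        return;
    }
    match_or_queue_unexpected(frag);
    peer->expected_seq++;
    /* Drain parked frags that are now in sequence. */
    frag_t *next;
    while (NULL != (next = peer->early[peer->expected_seq % MAX_PENDING]) &&
           next->hdr_seq == peer->expected_seq) {
        peer->early[peer->expected_seq % MAX_PENDING] = NULL;
        match_or_queue_unexpected(next);
        peer->expected_seq++;
    }
}

------------------------------------------------

Because the LLP path uses the same send_sequence counter as the normal
path, the parked message is matched exactly where an in-order send would
have been.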
Takahiro Kawashima,
MPI development team,
Fujitsu

> Does your llp send path preserve MPI matching ordering? E.g. if some prior
> isend is already queued, could the llp send overtake it?
>
> Sent from my phone. No type good.
>
> On Jun 29, 2011, at 8:27 AM, "Kawashima" <t-kawash...@jp.fujitsu.com> wrote:
>
> > Hi Jeff,
> >
> >>> First, we created a new BTL component, 'tofu BTL'. It's nothing
> >>> special, just dedicated to our Tofu interconnect. But its latency was
> >>> not enough for us.
> >>>
> >>> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> >>> It bypasses request object creation in the PML and BML/BTL, and sends
> >>> a message immediately if possible.
> >>
> >> Gotcha. Was the sendi pml call not sufficient? (sendi = "send
> >> immediate") This call was designed to be part of a latency reduction
> >> mechanism. I forget offhand what we don't do before calling sendi, but
> >> the rationale was that if the message was small enough, we could skip
> >> some steps in the sending process and "just send it."
> >
> > I know sendi, but its latency was not sufficient for us.
> > Before we even reach the sendi call, we must:
> > - allocate a send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
> > - initialize the send request (MCA_PML_OB1_SEND_REQUEST_INIT)
> > - select a BTL module (mca_pml_ob1_send_request_start)
> > - select a protocol (mca_pml_ob1_send_request_start_btl)
> > We want to eliminate these overheads. We want to send more immediately.
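> >
> > Roughly, the two paths compare like this (a sketch of the call chains,
> > not exact code):
> >
> >     /* standard path (even when it ends in a btl sendi):
> >      *   MPI_Send -> mca_pml_ob1_send
> >      *     -> MCA_PML_OB1_SEND_REQUEST_ALLOC     (request allocation)
> >      *     -> MCA_PML_OB1_SEND_REQUEST_INIT      (request initialization)
> >      *     -> mca_pml_ob1_send_request_start     (BTL selection)
> >      *     -> mca_pml_ob1_send_request_start_btl (protocol selection)
> >      *
> >      * LLP path:
> >      *   MPI_Send -> mca_pml_ob1_send -> MCA_LLP_CALL(send(...))
> >      *   (no request object at all)
> >      */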
> >
> > Here is a code snippet:
> >
> > ------------------------------------------------
> >
> > #if OMPI_ENABLE_LLP
> > static inline int mca_pml_ob1_call_llp_send(void *buf,
> >                                             size_t size,
> >                                             int dst,
> >                                             int tag,
> >                                             ompi_communicator_t *comm)
> > {
> >     int rc;
> >     mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
> >     mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;
> >
> >     match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
> >     match->hdr_common.hdr_flags = 0;
> >     match->hdr_ctx = comm->c_contextid;
> >     match->hdr_src = comm->c_my_rank;
> >     match->hdr_tag = tag;
> >     match->hdr_seq = proc->send_sequence + 1;
> >
> >     rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >                            (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >                            ompi_comm_peer_lookup(comm, dst),
> >                            MCA_PML_OB1_HDR_TYPE_MATCH));
> >
> >     if (rc == OMPI_SUCCESS) {
> >         /* NOTE this is not thread safe */
> >         OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >     }
> >
> >     return rc;
> > }
> > #endif
> >
> > int mca_pml_ob1_send(void *buf,
> >                      size_t count,
> >                      ompi_datatype_t * datatype,
> >                      int dst,
> >                      int tag,
> >                      mca_pml_base_send_mode_t sendmode,
> >                      ompi_communicator_t * comm)
> > {
> >     int rc;
> >     mca_pml_ob1_send_request_t *sendreq;
> >
> > #if OMPI_ENABLE_LLP
> >     /* try to send message via LLP if
> >      * - one of LLP modules is available, and
> >      * - datatype is basic, and
> >      * - data is small, and
> >      * - communication mode is standard, buffered, or ready, and
> >      * - destination is not myself
> >      */
> >     if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
> >         (datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
> >         (sendmode == MCA_PML_BASE_SEND_STANDARD ||
> >          sendmode == MCA_PML_BASE_SEND_BUFFERED ||
> >          sendmode == MCA_PML_BASE_SEND_READY) &&
> >         (dst != comm->c_my_rank)) {
> >         rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst,
> >                                        tag, comm);
> >         if (rc != OMPI_ERR_NOT_AVAILABLE) {
> >             /* successfully sent out via LLP or unrecoverable error
> >              * occurred */
> >             return rc;
> >         }
> >     }
> > #endif
> >
> >     MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
> >     if (rc != OMPI_SUCCESS)
> >         return rc;
> >
> >     MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
> >                                   buf,
> >                                   count,
> >                                   datatype,
> >                                   dst, tag,
> >                                   comm, sendmode, false);
> >
> >     PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
> >                              &(sendreq)->req_send.req_base,
> >                              PERUSE_SEND);
> >
> >     MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
> >     if (rc != OMPI_SUCCESS) {
> >         MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
> >         return rc;
> >     }
> >
> >     ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
> >
> >     rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
> >     ompi_request_free( (ompi_request_t**)&sendreq );
> >     return rc;
> > }
> >
> > ------------------------------------------------
> >
> > mca_pml_ob1_send is the body of MPI_Send in Open MPI. The region guarded
> > by OMPI_ENABLE_LLP was added by us.
> >
> > We don't need a send request if we can "send immediately", so we try to
> > send via LLP first. If the LLP cannot send immediately, because the
> > interconnect is busy or some such, it returns OMPI_ERR_NOT_AVAILABLE and
> > we continue with the normal PML/BML/BTL send(i) path. Since we want to
> > use a simple memcpy instead of the complex convertor, we restrict the
> > datatypes that can go into the LLP.
> >
> > Of course, we cannot use the LLP in MPI_Isend.
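> >
> > For reference, the contract an LLP module has to satisfy is roughly
> > this (a hypothetical module, not the real tofu LLP code; the fabric_*
> > calls stand in for interconnect-specific operations, and the real
> > interface passes the pre-built match header differently):
> >
> >     static int llp_example_send(const void *hdr, size_t hdr_len,
> >                                 const void *buf, size_t size,
> >                                 ompi_proc_t *peer)
> >     {
> >         void *slot = fabric_try_reserve_slot(peer, hdr_len + size);
> >         if (NULL == slot) {
> >             /* Fabric busy right now: this return code makes the PML
> >              * fall back to the normal request-based send path. */
> >             return OMPI_ERR_NOT_AVAILABLE;
> >         }
> >         /* Payload is restricted to contiguous basic datatypes, so a
> >          * plain memcpy replaces the convertor. */
> >         memcpy(slot, hdr, hdr_len);
> >         memcpy((char *)slot + hdr_len, buf, size);
> >         fabric_post(peer, slot, hdr_len + size);   /* kick the NIC */
> >         return OMPI_SUCCESS;
> >     }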
> >
> >> Note, too, that the coll modules can be laid overtop of each other --
> >> e.g., if you only implement barrier (and some others) in tofu coll, then
> >> you can supply NULL for the other function pointers and the coll base
> >> will resolve those functions to other coll modules automatically.
> >
> > Thanks for the info. I've read mca_coll_base_comm_select() and
> > understood. Our implementation was bad.
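> >
> > For a barrier-only module, the selection code should then look roughly
> > like this (a sketch; the mca_coll_tofu_* names are ours, and the exact
> > module struct depends on the Open MPI version):
> >
> >     static mca_coll_base_module_t *
> >     mca_coll_tofu_comm_query(struct ompi_communicator_t *comm,
> >                              int *priority)
> >     {
> >         /* NULL entries are resolved to other coll modules by
> >          * mca_coll_base_comm_select(). */
> >         mca_coll_tofu_module_t *m = OBJ_NEW(mca_coll_tofu_module_t);
> >         if (NULL == m) {
> >             return NULL;
> >         }
> >         *priority = 50;                  /* above the basic fallbacks */
> >         m->super.coll_module_enable = mca_coll_tofu_module_enable;
> >         m->super.coll_barrier = mca_coll_tofu_barrier;  /* ours */
> >         m->super.coll_bcast   = NULL;   /* resolved elsewhere */
> >         m->super.coll_reduce  = NULL;   /* resolved elsewhere */
> >         return &m->super;
> >     }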