Hi Jeff,

> Does your llp send path honor MPI matching ordering?  E.g. if some prior isend
> is already queued, could the llp send overtake it?

Yes, an LLP send may overtake a queued isend.
But we use the correct PML send_sequence, so the LLP message is queued as
an unexpected message on the receiver side, and I think that is not a problem.

> >    rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >                           (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >                           ompi_comm_peer_lookup(comm, dst),
> >                           MCA_PML_OB1_HDR_TYPE_MATCH));
> > 
> >    if (rc == OMPI_SUCCESS) {
> >        /* NOTE this is not thread safe */
> >        OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >    }
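
This is safe because matching on the receiver side is driven by the per-peer
sequence number carried in hdr_seq, so a message that physically overtakes an
earlier one is simply held until its turn. Below is a minimal standalone
sketch of that idea (it is not the actual OB1/LLP matching code; all names
are made up for illustration):

------------------------------------------------

/* Standalone illustration of sequence-ordered matching; not Open MPI code. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MAX_PENDING 8

typedef struct {
    uint16_t    seq;        /* per-peer sequence number, like hdr_seq */
    const char *payload;
    bool        valid;
} msg_t;

/* Per-peer receive-side state: the next sequence number we may match. */
typedef struct {
    uint16_t expected_seq;
    msg_t    pending[MAX_PENDING];  /* arrived out of order, not yet matched */
} peer_recv_state_t;

/* Deliver (match) a message to MPI. */
static void match(const msg_t *m)
{
    printf("matched seq %u: %s\n", (unsigned)m->seq, m->payload);
}

/* Called for every arriving message, in *arrival* order.  Messages whose
 * sequence number is ahead of expected_seq are parked until the gap is
 * filled, so matching always happens in sequence order. */
static void on_arrival(peer_recv_state_t *st, msg_t m)
{
    if (m.seq != st->expected_seq) {
        st->pending[m.seq % MAX_PENDING] = m;      /* park it for later */
        st->pending[m.seq % MAX_PENDING].valid = true;
        return;
    }
    match(&m);
    st->expected_seq++;
    /* drain any parked messages that are now in order */
    for (;;) {
        msg_t *p = &st->pending[st->expected_seq % MAX_PENDING];
        if (!p->valid || p->seq != st->expected_seq) {
            break;
        }
        match(p);
        p->valid = false;
        st->expected_seq++;
    }
}

int main(void)
{
    peer_recv_state_t st = { .expected_seq = 1 };
    /* seq 2 (the "LLP send") physically overtakes seq 1 (the queued isend),
     * but it is matched only after seq 1 has been matched. */
    on_arrival(&st, (msg_t){ .seq = 2, .payload = "LLP eager send" });
    on_arrival(&st, (msg_t){ .seq = 1, .payload = "earlier queued isend" });
    return 0;
}

------------------------------------------------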

Takahiro Kawashima,
MPI development team,
Fujitsu

> Does your llp send path honor MPI matching ordering?  E.g. if some prior isend
> is already queued, could the llp send overtake it?
> 
> Sent from my phone. No type good. 
> 
> On Jun 29, 2011, at 8:27 AM, "Kawashima" <t-kawash...@jp.fujitsu.com> wrote:
> 
> > Hi Jeff,
> > 
> >>> First, we created a new BTL component, 'tofu BTL'. It's nothing special,
> >>> just dedicated to our Tofu interconnect. But its latency was not
> >>> sufficient for us.
> >>> 
> >>> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> >>> It bypasses request object creation in PML and BML/BTL, and sends
> >>> a message immediately if possible.
> >> 
> >> Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send 
> >> immediate")  This call was designed to be part of a latency reduction 
> >> mechanism.  I forget offhand what we don't do before calling sendi, but 
> >> the rationale was that if the message was small enough, we could skip some 
> >> steps in the sending process and "just send it."
> > 
> > I know about sendi, but its latency was not sufficient for us.
> > Before reaching the sendi call, we must:
> >  - allocate a send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
> >  - initialize the send request (MCA_PML_OB1_SEND_REQUEST_INIT)
> >  - select a BTL module (mca_pml_ob1_send_request_start)
> >  - select a protocol (mca_pml_ob1_send_request_start_btl)
> > We want to eliminate these overheads and send even more immediately.
> > 
> > Here is a code snippet:
> > 
> > ------------------------------------------------
> > 
> > #if OMPI_ENABLE_LLP
> > static inline int mca_pml_ob1_call_llp_send(void *buf,
> >                                            size_t size,
> >                                            int dst,
> >                                            int tag,
> >                                            ompi_communicator_t *comm)
> > {
> >    int rc;
> >    mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
> >    mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;
> > 
> >    match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
> >    match->hdr_common.hdr_flags = 0;
> >    match->hdr_ctx = comm->c_contextid;
> >    match->hdr_src = comm->c_my_rank;
> >    match->hdr_tag = tag;
> >    match->hdr_seq = proc->send_sequence + 1;
> > 
> >    rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >                           (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >                           ompi_comm_peer_lookup(comm, dst),
> >                           MCA_PML_OB1_HDR_TYPE_MATCH));
> > 
> >    if (rc == OMPI_SUCCESS) {
> >        /* NOTE this is not thread safe */
> >        OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >    }
> > 
> >    return rc;
> > }
> > #endif
> > 
> > int mca_pml_ob1_send(void *buf,
> >                     size_t count,
> >                     ompi_datatype_t * datatype,
> >                     int dst,
> >                     int tag,
> >                     mca_pml_base_send_mode_t sendmode,
> >                     ompi_communicator_t * comm)
> > {
> >    int rc;
> >    mca_pml_ob1_send_request_t *sendreq;
> > 
> > #if OMPI_ENABLE_LLP
> >    /* try to send message via LLP if
> >     *   - one of LLP modules is available, and
> >     *   - datatype is basic, and
> >     *   - data is small, and
> >     *   - communication mode is standard, buffered, or ready, and
> >     *   - destination is not myself
> >     */
> >    if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
> >        (datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
> >        (sendmode == MCA_PML_BASE_SEND_STANDARD ||
> >         sendmode == MCA_PML_BASE_SEND_BUFFERED ||
> >         sendmode == MCA_PML_BASE_SEND_READY) &&
> >        (dst != comm->c_my_rank)) {
> >        rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst,
> >                                       tag, comm);
> >        if (rc != OMPI_ERR_NOT_AVAILABLE) {
> >            /* successfully sent out via LLP or unrecoverable error
> >             * occurred */
> >            return rc;
> >        }
> >    }
> > #endif
> > 
> >    MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
> >    if (rc != OMPI_SUCCESS)
> >        return rc;
> > 
> >    MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
> >                                  buf,
> >                                  count,
> >                                  datatype,
> >                                  dst, tag,
> >                                  comm, sendmode, false);
> > 
> >    PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
> >                             &(sendreq)->req_send.req_base,
> >                             PERUSE_SEND);
> > 
> >    MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
> >    if (rc != OMPI_SUCCESS) {
> >        MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
> >        return rc;
> >    }
> > 
> >    ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
> > 
> >    rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
> >    ompi_request_free( (ompi_request_t**)&sendreq );
> >    return rc;
> > }
> > 
> > ------------------------------------------------
> > 
> > mca_pml_ob1_send is the body of MPI_Send in Open MPI. The region guarded
> > by OMPI_ENABLE_LLP was added by us.
> > 
> > We don't need a send request if we can "send immediately", so we try to
> > send via LLP first. If LLP cannot send immediately, e.g. because the
> > interconnect is busy, it returns OMPI_ERR_NOT_AVAILABLE and we continue
> > with the normal PML/BML/BTL send(i) path.
> > Since we want to use a simple memcpy instead of the complex convertor,
> > we restrict the datatypes that can go through the LLP.
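> >
> > To make that restriction concrete: a basic, contiguous datatype can be
> > copied into the eager buffer with a single memcpy, while a non-contiguous
> > datatype (e.g. a column of a matrix) would need the convertor, or an
> > explicit pack loop, to gather its elements first. A tiny standalone
> > sketch (not part of our code) of the difference:
> >
> > ------------------------------------------------
> >
> > /* Standalone illustration; not Open MPI / tofu LLP code. */
> > #include <stdio.h>
> > #include <string.h>
> >
> > #define N 4
> >
> > /* Contiguous basic datatype: `count` elements go to the wire buffer
> >  * with one memcpy, which is all the LLP fast path wants to do. */
> > static void pack_contiguous(void *wire, const double *src, size_t count)
> > {
> >     memcpy(wire, src, count * sizeof(double));
> > }
> >
> > /* Non-contiguous data (one column of a row-major N x N matrix): the
> >  * elements must be gathered first, which is the convertor's job in the
> >  * general PML path. */
> > static void pack_strided_column(double *wire, const double (*mat)[N], int col)
> > {
> >     for (int i = 0; i < N; i++) {
> >         wire[i] = mat[i][col];
> >     }
> > }
> >
> > int main(void)
> > {
> >     double vec[N] = { 1, 2, 3, 4 };
> >     double mat[N][N] = { { 1, 2 }, { 3, 4 } };
> >     double wire[N];
> >
> >     pack_contiguous(wire, vec, N);
> >     printf("contiguous: %g %g %g %g\n", wire[0], wire[1], wire[2], wire[3]);
> >
> >     pack_strided_column(wire, mat, 1);
> >     printf("column 1:   %g %g %g %g\n", wire[0], wire[1], wire[2], wire[3]);
> >     return 0;
> > }
> >
> > ------------------------------------------------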
> > 
> > Of course, we cannot use LLP on MPI_Isend.
> > 
> >> Note, too, that the coll modules can be laid overtop of each other -- 
> >> e.g., if you only implement barrier (and some others) in tofu coll, then 
> >> you can supply NULL for the other function pointers and the coll base will 
> >> resolve those functions to other coll modules automatically.
> > 
> > Thanks for the info. I've read mca_coll_base_comm_select() and now
> > understand it. Our implementation was bad.
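> >
> > To check my understanding, the per-function resolution that
> > mca_coll_base_comm_select() performs is roughly the following (a
> > standalone sketch with made-up names, not the real Open MPI code):
> >
> > ------------------------------------------------
> >
> > /* Standalone illustration of per-function coll selection; not Open MPI code. */
> > #include <stddef.h>
> > #include <stdio.h>
> >
> > /* A communicator's collective function table: one slot per operation. */
> > typedef struct {
> >     void (*barrier)(void);
> >     void (*bcast)(void);
> >     void (*allreduce)(void);
> > } coll_table_t;
> >
> > /* What one coll component offers; NULL means "not implemented here". */
> > typedef struct {
> >     const char *name;
> >     int         priority;
> >     void      (*barrier)(void);
> >     void      (*bcast)(void);
> >     void      (*allreduce)(void);
> > } coll_module_t;
> >
> > static void tofu_barrier(void)    { puts("tofu barrier");    }
> > static void tuned_barrier(void)   { puts("tuned barrier");   }
> > static void tuned_bcast(void)     { puts("tuned bcast");     }
> > static void basic_barrier(void)   { puts("basic barrier");   }
> > static void basic_bcast(void)     { puts("basic bcast");     }
> > static void basic_allreduce(void) { puts("basic allreduce"); }
> >
> > /* Fill each slot from the highest-priority module that provides it;
> >  * slots a module leaves NULL are resolved from other modules.
> >  * mods[] is assumed to be sorted by descending priority. */
> > static void comm_select(coll_table_t *t, const coll_module_t *mods, size_t n)
> > {
> >     for (size_t i = 0; i < n; i++) {
> >         if (!t->barrier   && mods[i].barrier)   t->barrier   = mods[i].barrier;
> >         if (!t->bcast     && mods[i].bcast)     t->bcast     = mods[i].bcast;
> >         if (!t->allreduce && mods[i].allreduce) t->allreduce = mods[i].allreduce;
> >     }
> > }
> >
> > int main(void)
> > {
> >     const coll_module_t mods[] = {
> >         /* tofu: highest priority, implements only barrier */
> >         { "tofu",  80, tofu_barrier,  NULL,        NULL            },
> >         { "tuned", 50, tuned_barrier, tuned_bcast, NULL            },
> >         { "basic", 10, basic_barrier, basic_bcast, basic_allreduce },
> >     };
> >     coll_table_t table = { NULL, NULL, NULL };
> >
> >     comm_select(&table, mods, 3);
> >     table.barrier();    /* -> tofu barrier    */
> >     table.bcast();      /* -> tuned bcast     */
> >     table.allreduce();  /* -> basic allreduce */
> >     return 0;
> > }
> >
> > ------------------------------------------------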
