Hi Jeff,

> > First, we created a new BTL component, 'tofu BTL'. It is nothing
> > special, just dedicated to our Tofu interconnect. But its latency
> > was not good enough for us.
> > 
> > So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> > It bypasses request object creation in PML and BML/BTL, and sends
> > a message immediately if possible.
> 
> Gotcha.  Was the sendi pml call not sufficient?  (sendi = "send immediate")  
> This call was designed to be part of a latency reduction mechanism.  I forget 
> offhand what we don't do before calling sendi, but the rationale was that if 
> the message was small enough, we could skip some steps in the sending process 
> and "just send it."

I know sendi, but its latency was not sufficient for us.
To reach the sendi call, we must:
  - allocate a send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
  - initialize the send request (MCA_PML_OB1_SEND_REQUEST_INIT)
  - select a BTL module (mca_pml_ob1_send_request_start)
  - select a protocol (mca_pml_ob1_send_request_start_btl)
We want to eliminate these overheads and send even more immediately.

Here is a code snippet:

------------------------------------------------

#if OMPI_ENABLE_LLP
static inline int mca_pml_ob1_call_llp_send(void *buf,
                                            size_t size,
                                            int dst,
                                            int tag,
                                            ompi_communicator_t *comm)
{
    int rc;
    mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
    mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;

    match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
    match->hdr_common.hdr_flags = 0;
    match->hdr_ctx = comm->c_contextid;
    match->hdr_src = comm->c_my_rank;
    match->hdr_tag = tag;
    match->hdr_seq = proc->send_sequence + 1;

    rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
                           (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
                           ompi_comm_peer_lookup(comm, dst),
                           MCA_PML_OB1_HDR_TYPE_MATCH));

    if (rc == OMPI_SUCCESS) {
        /* NOTE: hdr_seq above read send_sequence without any lock, so
         * this read-then-increment pair is not thread safe */
        OPAL_THREAD_ADD32(&proc->send_sequence, 1);
    }

    return rc;
}
#endif

int mca_pml_ob1_send(void *buf,
                     size_t count,
                     ompi_datatype_t * datatype,
                     int dst,
                     int tag,
                     mca_pml_base_send_mode_t sendmode,
                     ompi_communicator_t * comm)
{
    int rc;
    mca_pml_ob1_send_request_t *sendreq;

#if OMPI_ENABLE_LLP
    /* try to send message via LLP if
     *   - one of LLP modules is available, and
     *   - datatype is basic, and
     *   - data is small, and
     *   - communication mode is standard, buffered, or ready, and
     *   - destination is not myself
     */
    if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
        (datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
        (sendmode == MCA_PML_BASE_SEND_STANDARD ||
         sendmode == MCA_PML_BASE_SEND_BUFFERED ||
         sendmode == MCA_PML_BASE_SEND_READY) &&
        (dst != comm->c_my_rank)) {
        rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count,
                                       dst, tag, comm);
        if (rc != OMPI_ERR_NOT_AVAILABLE) {
            /* successfully sent out via LLP or unrecoverable error occurred */
            return rc;
        }
    }
#endif

    MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
    if (rc != OMPI_SUCCESS)
        return rc;

    MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
                                  buf,
                                  count,
                                  datatype,
                                  dst, tag,
                                  comm, sendmode, false);

    PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
                             &(sendreq)->req_send.req_base,
                             PERUSE_SEND);

    MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
    if (rc != OMPI_SUCCESS) {
        MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
        return rc;
    }

    ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);

    rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
    ompi_request_free( (ompi_request_t**)&sendreq );
    return rc;
}

------------------------------------------------

mca_pml_ob1_send is the body of MPI_Send in Open MPI. The region guarded
by OMPI_ENABLE_LLP was added by us.

We don't need a send request if we can "send immediately", so we try to
send via LLP first. If LLP cannot send immediately, for example because
the interconnect is busy, it returns OMPI_ERR_NOT_AVAILABLE and we
continue with the normal PML/BML/BTL send(i) path. Since we want to use
a simple memcpy instead of the complex convertor, we restrict which
datatypes may go through the LLP.

Of course, we cannot use LLP for MPI_Isend.
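
To make the LLP-side contract concrete, here is a self-contained toy
sketch (everything named toy_* is made up for illustration; the real
tofu LLP programs the interconnect rather than a memory buffer):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for OMPI_SUCCESS / OMPI_ERR_NOT_AVAILABLE (values are ours). */
enum { TOY_SUCCESS = 0, TOY_ERR_NOT_AVAILABLE = -1 };

/* Stand-in for an immediate-send window on the interconnect: a fixed-size
 * slot that may or may not be free at the moment we want to send. */
typedef struct {
    bool   busy;           /* e.g. a previous injection is still in flight */
    size_t len;
    char   payload[128];   /* plays the role of llp_max_payload_size */
} toy_llp_channel_t;

/* Sketch of an LLP-style send: copy header + data straight into the
 * window with memcpy (no convertor, no send request).  If the window is
 * not free or the data does not fit, report "not available" so that the
 * caller falls back to the normal request-based path. */
static int toy_llp_send(toy_llp_channel_t *ch,
                        const void *hdr, size_t hdr_len,
                        const void *buf, size_t size)
{
    if (ch->busy || hdr_len + size > sizeof(ch->payload)) {
        return TOY_ERR_NOT_AVAILABLE;
    }
    memcpy(ch->payload, hdr, hdr_len);
    memcpy(ch->payload + hdr_len, buf, size);
    ch->len  = hdr_len + size;
    ch->busy = true;        /* the "hardware" now owns the slot */
    return TOY_SUCCESS;
}

int main(void)
{
    toy_llp_channel_t ch = { .busy = false, .len = 0 };
    char hdr[16] = {0}, data[32] = {0};

    /* First send: the window is free, so the fast path wins. */
    if (TOY_SUCCESS == toy_llp_send(&ch, hdr, sizeof hdr, data, sizeof data)) {
        printf("sent via fast path\n");
    }
    /* Second send: the window is busy, so the caller falls through to
     * MCA_PML_OB1_SEND_REQUEST_ALLOC/INIT/START, as mca_pml_ob1_send does. */
    if (TOY_ERR_NOT_AVAILABLE ==
            toy_llp_send(&ch, hdr, sizeof hdr, data, sizeof data)) {
        printf("fast path declined; use the request-based path\n");
    }
    return 0;
}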

> Note, too, that the coll modules can be laid overtop of each other -- e.g., 
> if you only implement barrier (and some others) in tofu coll, then you can 
> supply NULL for the other function pointers and the coll base will resolve 
> those functions to other coll modules automatically.

Thanks for the info. I've read mca_coll_base_comm_select() and
understand it now. Our implementation was doing this badly.
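
To check my understanding, a comm_query() along those lines would look
roughly like the sketch below (the tofu_coll_* names and the priority
value are only illustrative, and I am writing the coll module fields
from memory, so please treat it as approximate):

#include "opal/class/opal_object.h"
#include "ompi/mca/coll/coll.h"

/* Only barrier is implemented by this illustrative component. */
static int tofu_coll_barrier(struct ompi_communicator_t *comm,
                             mca_coll_base_module_t *module);

static mca_coll_base_module_t *
tofu_coll_comm_query(struct ompi_communicator_t *comm, int *priority)
{
    mca_coll_base_module_t *module = OBJ_NEW(mca_coll_base_module_t);
    if (NULL == module) {
        return NULL;
    }
    *priority = 30;                       /* illustrative priority only */

    /* Provide only what we implement; every pointer left NULL is
     * resolved by mca_coll_base_comm_select() from other coll modules. */
    module->coll_barrier   = tofu_coll_barrier;
    module->coll_bcast     = NULL;
    module->coll_allreduce = NULL;
    /* ...all other coll_* function pointers likewise stay NULL... */

    return module;
}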

Regards,

Takahiro Kawashima,
MPI development team,
Fujitsu
