On Wed, Feb 20, 2008 at 04:08:46PM -0500, George Bosilca wrote:
> So I tracked this issue and it seems that the new behavior was
> introduced one year ago by commit 12433. Starting from this commit,
Except that the log message of this commit says:

   Fix regression from v1.1.
   1) make the code do what comment says
   2) if memory is prepinned don't send multiple PUT messages.

And to be absolutely sure I checked v1.1, and of course there is no
pipeline for TCP BTLs there either.

> there was no pipeline in the RDMA protocol. That makes sense, as we usually 
> don't use NetPipe all the time to check the performance of the message 
> logging (we use real applications). However, last week we ran NetPipe, 
> and that's how we realized the lack of pipelining in the RDMA case.
Perhaps at the time you wrote the message logging you relied on buggy
behaviour that was later fixed.

>
> I would be in favor of having a consistent behavior everywhere. In other 
> words, don't ask the user to know whether or not there is an mpool associated 
> with a particular device in order to figure out which protocol we use 
> internally. Actually, it's not only for users; it might help us as well.
>
The user indeed shouldn't care what protocol we use as long as performance
is good. The pipeline is needed to improve the performance of some "insane"
interconnects that require memory pinning. The heuristic in OB1 is very simple:
if the send and receive buffers are pinned, do not use the pipeline (no matter
which interconnect is in use); otherwise use the pipeline protocol to hide the
pinning cost. The only assumption OB1 makes is that if a BTL has no mpool, then
all memory is always pinned. Think of the pipeline as the slow path and no
pipeline as the fast path.
For InfiniBand we use every dirty trick in the book (registration cache +
ptmalloc) to take the fast path, and you want TCP/MX/ELAN to always take the
slow path! This doesn't make sense to me.
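
Here is a minimal standalone sketch of that heuristic (the names and types
below are simplified stand-ins I made up for illustration, not the actual
OB1 code):

    #include <stdbool.h>

    enum rdma_path {
        FAST_PATH_SINGLE_RDMA,  /* everything pinned: one big RDMA */
        SLOW_PATH_PIPELINE      /* pinning needed: pipeline hides its cost */
    };

    /* A BTL without an mpool is assumed to operate on memory that is
     * always "pinned" (nothing to register), so it qualifies for the
     * fast path just like pre-registered buffers do. */
    static enum rdma_path choose_path(bool btl_has_mpool,
                                      bool send_buf_pinned,
                                      bool recv_buf_pinned)
    {
        if (!btl_has_mpool || (send_buf_pinned && recv_buf_pinned))
            return FAST_PATH_SINGLE_RDMA;
        return SLOW_PATH_PIPELINE;
    }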

If you need the pipeline in OB1 to hide the message logging cost, we can add
another config parameter that always enables the pipeline. We don't even have
to expose it to users; it could be set automatically when message logging is
enabled.
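
For illustration only, such a knob could be as simple as a flag that is
consulted before taking the fast path; the "always_pipeline" name below is
hypothetical, not an existing parameter:

    #include <stdbool.h>

    /* Hypothetical sketch: "always_pipeline" is an invented flag, not an
     * existing OB1 parameter. It could be set automatically when the
     * message logging framework is active. */
    struct pml_knobs {
        bool always_pipeline;
    };

    /* Consulted before the fast-path decision sketched earlier. */
    static bool may_use_single_rdma(const struct pml_knobs *knobs,
                                    bool buffers_pinned)
    {
        if (knobs->always_pipeline)
            return false;        /* force the pipelined (slow) path */
        return buffers_pinned;   /* otherwise keep the current heuristic */
    }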


>   Thanks,
>     george.
>
> On Feb 20, 2008, at 4:29 AM, Gleb Natapov wrote:
>
>> On Tue, Feb 19, 2008 at 10:40:46PM -0500, George Bosilca wrote:
>>> Actually, it restores the original behavior. The RDMA operations were
>>> pipelined before the r15247 commit, regardless of whether they had an
>>> mpool or not. We were actively using this behavior in the message
>>> logging framework to hide the cost of the local storage of the payload,
>>> and we were quite surprised when we realized that it had disappeared.
>> I checked v1.2 with the TCP BTL (I can't test mx or elan, but tcp also
>> supports RDMA and has no mpool), and no matter what btl_tcp_max_rdma_size
>> I provide, the whole buffer is sent in one RDMA operation. Here is an
>> explanation of why this happens:
>> 1. If a BTL is RDMA capable but does not provide an mpool,
>> mca_pml_ob1_rdma_btls() assumes that memory is always registered. In our
>> case this function will therefore return a non-zero value for any buffer
>> it is called with.
>>
>> 2. When mca_pml_ob1_send_request_start_btl() chooses which function to
>> use for a rendezvous send, it checks whether the buffer is contiguous;
>> if it is, it checks whether the buffer is already registered by looking
>> at the non-zero value returned by mca_pml_ob1_rdma_btls(). So for BTLs
>> without an mpool, mca_pml_ob1_send_request_start_rdma() is always chosen.
>>
>> 3. The receiver checks whether the local buffer is registered by calling
>> mca_pml_ob1_rdma_btls() on it (see pml_ob1_recvreq.c:259):
>>
>>  recvreq->req_rdma_cnt = mca_pml_ob1_rdma_btls(
>>          bml_endpoint,
>>          (unsigned char*) base,
>>          recvreq->req_recv.req_bytes_packed,
>>          recvreq->req_rdma);
>> So recvreq->req_rdma_cnt is set to a non-zero value (provided the receive
>> buffer is contiguous, of course).
>>
>> 4. The receiver sends PUT messages to the sender in
>> mca_pml_ob1_recv_request_schedule_exclusive(). Here is a code snippet
>> from that function (see pml_ob1_recvreq.c:684):
>>
>>       /* makes sure that we don't exceed BTL max rdma size
>>        * if memory is not pinned already */
>>       if(0 == recvreq->req_rdma_cnt &&
>>             bml_btl->btl_max_rdma_size != 0 &&
>>             size > bml_btl->btl_max_rdma_size)
>>       {
>>
>>           size = bml_btl->btl_max_rdma_size;
>>       }
>> Pay special attention to the comment. If recvreq->req_rdma_cnt is not
>> zero, btl_max_rdma_size is ignored and the message is sent in one big
>> RDMA operation.
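>>
>> Putting steps 1-4 together, the decision boils down to the following
>> simplified sketch (not the actual code; the names are abbreviated
>> stand-ins):
>>
>>   #include <stdbool.h>
>>   #include <stddef.h>
>>
>>   /* Size of each PUT the receiver will schedule. */
>>   static size_t scheduled_put_size(size_t rdma_btls_count, /* steps 1/3 */
>>                                    bool   contiguous,
>>                                    size_t msg_size,
>>                                    size_t btl_max_rdma_size)
>>   {
>>       if (contiguous && rdma_btls_count > 0)
>>           return msg_size;          /* one PUT, pipeline bypassed */
>>       if (btl_max_rdma_size != 0 && msg_size > btl_max_rdma_size)
>>           return btl_max_rdma_size; /* pipelined PUTs */
>>       return msg_size;
>>   }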
>>
>> So what I have shown here is that there was no pipeline for the TCP BTL in
>> v1.2, and that the code was specifically written to behave this way.
>> If you still think that there is a difference in behaviour between v1.2
>> and the trunk, can you explain which code path is executed in v1.2 for
>> your test case and how the trunk behaves differently?
>>
>>>
>>> If a BTL doesn't want to use the pipeline for RDMA operations, it can set
>>> the RDMA fragment size to the max value, and this will automatically
>>> disable the pipeline. However, if the BTL supports the pipeline, with the
>>> trunk version today it is not possible to activate it. Moreover, in the
>>> current version the parameters that define the BTL behavior are blatantly
>>> ignored, as the PML makes high-level assumptions about what they want to do.
>> I am not defending the current behaviour. If you want to change it, we can
>> discuss the exact semantics that you want to see. But before that I want to
>> make sure that the trunk is indeed different from v1.2 in this regard, as
>> you claim it to be. Can you provide me with a test case that works
>> differently in v1.2 and the trunk?
>>
>> --
>>                      Gleb.
>




--
                        Gleb.
