Sailesh Mukil created KUDU-1865:
-----------------------------------

             Summary: Create fast path for RespondSuccess() in KRPC
                 Key: KUDU-1865
                 URL: https://issues.apache.org/jira/browse/KUDU-1865
             Project: Kudu
          Issue Type: Improvement
          Components: rpc
            Reporter: Sailesh Mukil


A lot of RPCs just respond with RespondSuccess(), which returns the same payload every time. This currently takes the same path as any other response, ultimately calling Connection::QueueResponseForCall(), which makes a few small allocations. These small allocations (and their corresponding deallocations) happen very frequently (once for every InboundCall) and end up spending quite some time in the memory allocator (traversing the central free list, spin locks, etc.).
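
For illustration, here is a minimal sketch of that per-response allocation pattern (hypothetical, simplified names; not the actual Kudu code):

{code:cpp}
// Hypothetical simplification of the per-response allocation pattern;
// not the actual Kudu code.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct Slice {
  const uint8_t* data;
  size_t size;
};

struct OutboundTransferTask {
  std::vector<Slice> payload_slices;
};

void QueueResponseForCall() {
  // One 'operator new' per response for the transfer task ...
  auto transfer = std::make_unique<OutboundTransferTask>();
  // ... plus a heap allocation for the slices vector (header + body).
  transfer->payload_slices.reserve(2);
  // Both are freed again as soon as the transfer completes; at high call
  // rates these matching allocations/deallocations are what show up in the
  // tcmalloc frames in the profiles below.
}
{code}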

This was found when [~mmokhtar] ran some profiles of Impala over KRPC on a 20-node cluster. The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:

{code:java}
impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
impalad ! operator delete + 0x329 - [unknown source file]
impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
impalad ! ~vector - stl_vector.h:425                           <---- Deleting 'slices' vector
impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
..
{code}

{code:java}
impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
impalad ! operator new + 0x297 - [unknown source file]        <--- Creating new 'OutboundTransferTask' object
impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
...
{code}

Even creating and deleting the 'RpcContext' takes a lot of time:
{code:java}
impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
impalad ! operator delete + 0x329 - [unknown source file]
impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
impalad ! ~gscoped_ptr - gscoped_ptr.h:318
impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53   <-----
impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
{code}


The stacks above show that allocating and freeing these small objects under moderately heavy load results in heavy contention inside the allocator. We would benefit a lot from a fast path for 'RespondSuccess'.



My suggestion is to allocate all of these small objects up front, together with the 'InboundCall' object when it is created, in a 'RespondSuccess' structure, and to use that structure whenever we want to send 'success' back to the sender. It would already contain the 'OutboundTransferTask', a 'Slice' holding 'success', etc. We can expect that most RPCs respond with 'success' a majority of the time.
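
As a rough sketch of what such a structure could look like (hypothetical names and layout, just to make the idea concrete; the real thing would have to live alongside 'InboundCall'):

{code:cpp}
// Hypothetical sketch of a preallocated success response; not actual Kudu code.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slice {
  const uint8_t* data;
  size_t size;
};

struct OutboundTransferTask {
  std::vector<Slice> payload_slices;
};

// Allocated once, together with the InboundCall, instead of per response.
struct RespondSuccessState {
  OutboundTransferTask transfer;  // embedded: no 'operator new' on respond
  Slice success_payload;          // pre-built 'success' payload

  RespondSuccessState() : success_payload{nullptr, 0} {
    transfer.payload_slices.reserve(2);  // header + body, sized up front
  }
};

struct InboundCall {
  // RespondSuccess() would only fill in 'success_payload' and hand
  // 'success.transfer' to the reactor thread; no allocation on the hot path.
  RespondSuccessState success;
};
{code}

Responses other than 'success' (errors, responses with sidecars, etc.) would simply fall back to the existing path, so this structure only has to cover the common case.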

The benefit is that we no longer go back and forth to the allocator for these small objects on every response; instead we pay for all of them at once while creating the 'InboundCall' object.

I just wanted to start a discussion about this, so even if the suggestion above seems a little off, hopefully we can move forward with it on some level.


