[ https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294467#comment-16294467 ]
Michael Ho edited comment on KUDU-1865 at 12/19/17 10:04 PM:
-------------------------------------------------------------

While working on IMPALA-5528, I noticed that some of the overhead has to do with the way protobuf handles fields which aren't primitive types. Normally, we'd use the {{mutable_<field_name>()}} interface to get a handle to a field in a protobuf message in order to set it to some value. That implicitly creates an object on the heap. In the example above, {{TransmitDataResponsePb}} will always allocate a {{StatusPB}} object in the fast path just to serialize the value {{Status::OK()}} into. This puts extra pressure on the thread caches in TCMalloc.

Part of this is due to a limitation of the {{RpcContext}} interface. In particular, {{RpcContext}} always allocates and owns the response protobuf buffer. Upon calling {{RpcContext::RespondSuccess()}}, the {{RpcContext}} object is deleted, and so is the response protobuf buffer, making it impossible to stack-allocate the {{StatusPB}}, as protobuf internally assumes that it owns the {{StatusPB}} buffer.

I notice that KRPC in general induces quite a lot of pressure on the thread cache. I tried the following, and it seems to reduce the pressure on the TCMalloc thread cache:
- Introduce a new interface {{RpcContext::RespondSuccess(const MessageLite& message)}} to allow callers to stack-allocate the response protobuf buffer.
- Introduce a virtual function {{ServiceIf::AllocResponsePB()}} to allow each KRPC-based service to override the way the response protobuf buffer is allocated. This complements point 1, as services which only use the new interface to stack-allocate the response buffer don't need a response protobuf buffer allocated on the heap.
- Recycle {{InboundCall}} objects with an object pool and embed {{RpcContext}} into {{InboundCall}}.

Other things which may help:
- Allow {{Trace}} objects to be recycled. This seems tricky with the current way {{ObjectPool}} is implemented.
[~tlipcon], what do you think?
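To illustrate point 1 above, here is a minimal sketch of the ownership change being proposed. All types here ({{FakeResponsePB}}, {{RpcContextToday}}, {{RpcContextProposed}}, {{HandleCallProposed}}) are hypothetical stand-ins, not the actual protobuf or Kudu classes; the real interface would take a {{const MessageLite&}}.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a generated response message like StatusPB /
// TransmitDataResponsePb; only serialization matters for this sketch.
struct FakeResponsePB {
  std::string status;
  std::string SerializeAsString() const { return status; }
};

// Current pattern: the context heap-allocates and owns the response, so every
// call pays a new/delete pair even on the Status::OK() fast path.
struct RpcContextToday {
  std::unique_ptr<FakeResponsePB> response{new FakeResponsePB};
  std::string RespondSuccess() {
    response->status = "OK";
    return response->SerializeAsString();  // 'response' dies with the context
  }
};

// Proposed pattern: RespondSuccess(const Message&) serializes a response the
// caller owns, so the handler can keep it on the stack and skip the heap.
struct RpcContextProposed {
  std::string RespondSuccess(const FakeResponsePB& response) {
    // Serialize before the caller's stack frame unwinds; the context never
    // takes ownership of 'response'.
    return response.SerializeAsString();
  }
};

// A service handler using the proposed interface: no heap allocation for the
// response message on the success fast path.
std::string HandleCallProposed() {
  FakeResponsePB response;  // lives on the handler's stack
  response.status = "OK";
  return RpcContextProposed().RespondSuccess(response);
}
```

The key constraint from the comment above is visible in the sketch: the stack-allocated message is only safe because serialization completes inside {{RespondSuccess()}}, before the handler returns.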
> Create fast path for RespondSuccess() in KRPC
> ---------------------------------------------
>
>                 Key: KUDU-1865
>                 URL: https://issues.apache.org/jira/browse/KUDU-1865
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Sailesh Mukil
>              Labels: perfomance, rpc
>         Attachments: alloc-pattern.py, cross-thread.txt
>
>
> A lot of RPCs just respond with RespondSuccess() which returns the exact payload every time. This takes the same path as any other response by ultimately calling Connection::QueueResponseForCall() which has a few small allocations. These small allocations (and their corresponding deallocations) happen quite frequently (once for every IncomingCall) and end up taking quite some time in the kernel (traversing the free list, spin locks, etc.)
> This was found when [~mmokhtar] ran some profiles on Impala over KRPC on a 20 node cluster and found the following:
> The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
> impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425    <---- Deleting 'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ..
> {code}
> {code:java}
> impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
> impalad ! operator new + 0x297 - [unknown source file]    <--- Creating new 'OutboundTransferTask' object.
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ...
> {code}
> Even creating and deleting the 'RpcContext' takes a lot of time:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
> impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
> impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
> impalad ! ~gscoped_ptr - gscoped_ptr.h:318
> impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53    <-----
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
> {code}
> The above shows that creating these small objects under moderately heavy load results in heavy contention in the kernel. We will benefit a lot if we create a fast path for 'RespondSuccess'.
> My suggestion is to create all these small objects at once, along with the 'InboundCall' object while it is being created, in a 'RespondSuccess' structure, and just use that structure when we want to send 'success' back to the sender. This would already contain the 'OutboundTransferTask', a 'Slice' with 'success', etc. We would expect that most RPCs respond with 'success' a majority of the time.
> How this would benefit us is that we don't go back and forth every time to allocate and deallocate memory for these small objects; instead we do it all at once while creating the 'InboundCall' object.
> I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
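The recycling idea from the description and comment above (keep the small per-call objects alive and reuse them, rather than allocating and deallocating them for every call) can be sketched as a simple free-list pool. This is only an illustration of the technique; {{PooledCall}} and {{CallPool}} are hypothetical names, not Kudu's actual {{InboundCall}} or {{ObjectPool}}, and a real pool would need locking since KRPC calls cross threads.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the bundle of small per-call objects (the InboundCall plus its
// embedded RpcContext, slices, OutboundTransferTask buffer, etc.).
struct PooledCall {
  std::vector<char> payload;
  // Clear per-call state but keep the allocated capacity for reuse.
  void Reset() { payload.clear(); }
};

// Free-list pool: Acquire() reuses a recycled object when one is available,
// so steady-state traffic stops hitting the TCMalloc thread cache for these
// objects. NOTE: not thread-safe; a real implementation needs a lock or
// per-thread free lists.
class CallPool {
 public:
  std::unique_ptr<PooledCall> Acquire() {
    if (free_list_.empty()) {
      return std::unique_ptr<PooledCall>(new PooledCall);  // slow path
    }
    std::unique_ptr<PooledCall> call = std::move(free_list_.back());
    free_list_.pop_back();
    return call;  // fast path: no allocation
  }

  void Release(std::unique_ptr<PooledCall> call) {
    call->Reset();
    free_list_.push_back(std::move(call));
  }

  std::size_t free_count() const { return free_list_.size(); }

 private:
  std::vector<std::unique_ptr<PooledCall>> free_list_;
};
```

Recycling through {{Release()}} rather than deleting is what avoids the `operator new`/`operator delete` pairs visible in the stacks above; the Trace-recycling point in the comment is the same idea applied to {{Trace}} objects.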